Commit graph

31 commits

Author SHA1 Message Date
Emilio G. Cota 23a55a277f
tcg: enable multiple TCG contexts in softmmu
This enables parallel TCG code generation. However, we do not take
advantage of it yet since tb_lock is still held during tb_gen_code.

In user-mode we use a single TCG context; see the documentation
added to tcg_region_init for the rationale.

Note that targets do not need any conversion: targets initialize a
TCGContext (e.g. defining TCG globals), and after this initialization
has finished, the context is cloned by the vCPU threads, each of
them keeping a separate copy.

TCG threads claim one entry in tcg_ctxs[] by atomically increasing
n_tcg_ctxs. Do not be too annoyed by the subsequent atomic_read's
of that variable and tcg_ctxs; they are there just to play nice with
analysis tools such as thread sanitizer.

Note that we do not allocate an array of contexts (we allocate
an array of pointers instead) because when tcg_context_init
is called, we do not know yet how many contexts we'll use since
the bool behind qemu_tcg_mttcg_enabled() isn't set yet.

Previous patches folded some TCG globals into TCGContext. The non-const
globals remaining are only set at init time, i.e. before the TCG
threads are spawned. Here is a list of these set-at-init-time globals
under tcg/:

Only written by tcg_context_init:
- indirect_reg_alloc_order
- tcg_op_defs
Only written by tcg_target_init (called from tcg_context_init):
- tcg_target_available_regs
- tcg_target_call_clobber_regs
- arm: arm_arch, use_idiv_instructions
- i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt,
have_movbe, have_popcnt
- mips: use_movnz_instructions, use_mips32_instructions,
use_mips32r2_instructions, got_sigill (tcg_target_detect_isa)
- ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr
- s390: tb_ret_addr, s390_facilities
- sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines),
use_vis3_instructions

Only written by tcg_prologue_init:
- 'struct jit_code_entry one_entry'
- aarch64: tb_ret_addr
- arm: tb_ret_addr
- i386: tb_ret_addr, guest_base_flags
- ia64: tb_ret_addr
- mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr

Backports commit 3468b59e18b179bc63c7ce934de912dfa9596122 from qemu
2018-03-14 14:32:34 -04:00
Emilio G. Cota 078c9e7e3b
tcg: take tb_ctx out of TCGContext
Groundwork for supporting multiple TCG contexts.

Backports commit 44ded3d04821bec57407cc26a8b4db620da2be04 from qemu
2018-03-14 09:18:12 -04:00
Emilio G. Cota 6bc05eeee4
tb hash: track translated blocks with qht
Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.

Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.

The appended fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:

- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
MRU is implemented here by adding an intermediate struct
that contains the u32 hash and a pointer to the TB; this
allows us, on an MRU promotion, to copy said struct (that is not
at the head), and put this new copy at the head. After a grace
period, the original non-head struct can be eliminated, and
after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
no MRU for lookups; MRU for inserts.

The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
http://imgur.com/a/Awgnq

Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.

Host: Intel Xeon E5-2690

28 ++------------+-------------+-------------+-------------+------------++
A***** + + + master **A*** +
27 ++ * xxhash ##B###++
| A******A****** xxhash-rcu $$C$$$ |
26 C$$ A******A****** qht-fixed-nomru*%%D%%%++
D%%$$ A******A******A*qht-dyn-mru A*E****A
25 ++ %%$$ qht-dyn-nomru &&F&&&++
B#####% |
24 ++ #C$$$$$ ++
| B### $ |
| ## C$$$$$$ |
23 ++ # C$$$$$$ ++
| B###### C$$$$$$ %%%D
22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
| D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@@@E
21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
+ E@@@ F&&& + E@ + F&&& + +
20 ++------------+-------------+-------------+-------------+------------++
14 16 18 20 22 24
log2 number of buckets

Host: Intel i7-4790K

14.5 ++------------+------------+-------------+------------+------------++
A** + + + master **A*** +
14 ++ ** xxhash ##B###++
13.5 ++ ** xxhash-rcu $$C$$$++
| qht-fixed-nomru %%D%%% |
13 ++ A****** qht-dyn-mru @@E@@@++
| A*****A******A****** qht-dyn-nomru &&F&&& |
12.5 C$$ A******A******A*****A****** ***A
12 ++ $$ A*** ++
D%%% $$ |
11.5 ++ %% ++
B### %C$$$$$$ |
11 ++ ## D%%%%% C$$$$$ ++
| # % C$$$$$$ |
10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $$$C
10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
+ F&& D%%%%%%B######B######B#####B###@@@D%%% +
9.5 ++------------+------------+-------------+------------+------------++
14 16 18 20 22 24
log2 number of buckets

Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.

xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be do consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.

qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.

However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.

Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.

The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:

- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time

We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.

A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot); it makes sense to grow the hash table as
more code blocks are translated. This also avoids the complication of
having to build downsizing hysteresis logic into qht.

Backports commit 909eaac9bbc2ed4f3a82ce38e905b87d478a3e00 from qemu
2018-03-13 14:16:26 -04:00
Alexey Kardashevskiy b90333a531
memory: Share special empty FlatView
This shares an cached empty FlatView among address spaces. The empty
FV is used every time when a root MR renders into a FV without memory
sections which happens when MR or its children are not enabled or
zero-sized. The empty_view is not NULL to keep the rest of memory
API intact; it also has a dispatch tree for the same reason.

On POWER8 with 255 CPUs, 255 virtio-net, 40 PCI bridges guest this halves
the amount of FlatView's in use (557 -> 260) and dispatch tables
(~800000 -> ~370000). In an unrelated experiment with 112 non-virtio
devices on x86 ("-M pc"), only 4 FlatViews are alive, and about ~2000
are created at startup.

Backports commit 092aa2fc65b7a35121616aad8f39d47b8f921618 from qemu
2018-03-11 22:34:28 -04:00
Alexey Kardashevskiy f2c72dc278
memory: Share FlatView's and dispatch trees between address spaces
This allows sharing flat views between address spaces (AS) when
the same root memory region is used when creating a new address space.
This is done by walking through all ASes and caching one FlatView per
a physical root MR (i.e. not aliased).

This removes search for duplicates from address_space_init_shareable() as
FlatViews are shared elsewhere and keeping as::ref_count correct seems
an unnecessary and useless complication.

This should cause no change and memory use or boot time yet.

Backports commit 967dc9b1194a9281124b2e1ce67b6c3359a2138f from qemu
2018-03-11 22:05:44 -04:00
Lioncash 1591f208c0
memory: Move AddressSpaceDispatch from AddressSpace to FlatView
As we are going to share FlatView's between AddressSpace's,
and AddressSpaceDispatch is a structure to perform quick lookup
in FlatView, this moves ASD to FlatView.

After previosly open coded ASD rendering, we can also remove
as->next_dispatch as the new FlatView pointer is stored
on a stack and set to an AS atomically.

flatview_destroy() is executed under RCU instead of
address_space_dispatch_free() now.

This makes mem_begin/mem_commit to work with ASD and mem_add with FV
as later on mem_add will be taking FV as an argument anyway.

This should cause no behavioural change.

Backports commit 66a6df1dc6d5b28cc3e65db0d71683fbdddc6b62 from qemu
2018-03-11 20:40:24 -04:00
Lioncash cc8fd90124
unicorn_common: Eliminate memory leaks 2018-03-11 20:20:29 -04:00
Peter Crosthwaite ce997e1caf
qom/cpu: Add MemoryRegion property
Add a MemoryRegion property, which if set is used to construct
the CPU's initial (default) AddressSpace.

Backports commit 6731d864f80938e404dc3e5eb7f6b76b891e3e43 from qemu
2018-02-18 21:54:50 -05:00
Lioncash 6d5f465449
uc: Handle freeing of multiple address spaces 2018-02-18 21:36:50 -05:00
xorstream 1aeaf5c40d This code should now build the x86_x64-softmmu part 2. 2017-01-19 22:50:28 +11:00
Chris Eagle fccbcfd4c2 revert to use of g_free to make future qemu integrations easier (#695)
* revert to use of g_free to make future qemu integrations easier

* bracing
2016-12-21 22:28:36 +08:00
Chris Eagle e46545f722 remove glib dependency by provide compatible replacements 2016-12-18 14:56:58 -08:00
danghvu bb8f894872 windows: Remove unnecessary mman inclusion (issue #587) 2016-07-11 13:35:49 -05:00
Hoang-Vu Dang b9a10152f1 memleak: code_gen_buffer using g_free for non-linux 2016-07-11 10:13:13 -05:00
danghvu 117a318188 memleak: missing from refactoring 2016-07-08 12:49:43 -05:00
danghvu 6b9f17f2f7 memleak: refactor unicorn_common.h, move stuff to uc_close 2016-07-08 11:16:23 -05:00
Hoang-Vu Dang de5786f98d Fix memleak: code_gen_buffer 2016-07-05 23:48:02 -05:00
Nguyen Anh Quynh 3a742fb6f6 fix conflicts when merging no-thread to master 2016-04-23 10:06:57 +08:00
Chris Eagle 9467254fc0 strip out per cpu thread code 2016-03-25 17:24:28 -07:00
Nguyen Anh Quynh cfaac6921b c89 2016-02-01 12:05:46 +08:00
danghvu 36e53ad8a1 Fix arm & arm64 memleaks 2016-01-31 16:22:20 -06:00
Nguyen Anh Quynh 580bc7b56a cleanup 2016-01-10 23:10:00 +08:00
farmdve 036763d6ae Fix memory leaks as reported by DrMemory and Valgrind.
ARM and probably the rest of the arches have significant memory leaks as
they have no release interface.

Additionally, DrMemory does not have 64-bit support and thus I can't
test the 64-bit version under Windows. Under Linux valgrind supports
both 32-bit and 64-bit but there are different macros and code for Linux
and Windows.
2016-01-08 01:42:56 +02:00
Ryan Hileman 6d21ebabea implement host-controlled memory mapping for #261 2015-11-27 23:30:36 -08:00
Nguyen Anh Quynh 9f9d57e84f cleaning & indentation 2015-09-03 18:16:49 +08:00
Chris Eagle b27e987932 Add target_page_size member to uc_struct to track TARGET_PAGE_SIZE 2015-08-31 01:00:44 -07:00
Chris Eagle 6beb1b8a13 intermediate commit, working unmap of complete blocks, still need sub-blocks, and cross block 2015-08-29 21:17:30 -07:00
Nguyen Anh Quynh 3b5df362d7 chmod -x <some source code> 2015-08-28 18:12:56 +08:00
Chris Eagle 00944b6cde Add ability to mark memory are read only. Add new API uc_mem_map_ex to allow permissions to be passed. Change MemoryBlock to track created MemoryRegions. Add regress/ro_mem_test.c 2015-08-26 13:29:54 -07:00
pancake c5d99777f4 Use const in uc_mem_write and derivates 2015-08-24 17:02:14 +02:00
Nguyen Anh Quynh 344d016104 import 2015-08-21 15:04:50 +08:00