Commit graph

4541 commits

Author SHA1 Message Date
Emilio G. Cota f41747c9ba
translate-all: use qemu_mprotect_rwx/none helpers
The helpers require the address and size to be page-aligned, so
do that before calling them.
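
A minimal sketch of that alignment step (illustrative, not the exact
patch; it assumes the host page size is a power of two):

    /* Round the start down and the end up to host page boundaries
     * before handing the region to the mprotect helpers. */
    static void protect_code_rwx(void *addr, size_t size, size_t pagesz)
    {
        uintptr_t start = (uintptr_t)addr & ~(pagesz - 1);
        uintptr_t end = ((uintptr_t)addr + size + pagesz - 1)
                        & ~(pagesz - 1);

        qemu_mprotect_rwx((void *)start, end - start);
    }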

Backports commit f51f315a676ec913a55ac27be4ef857f9f7ddc5c from qemu
2018-03-14 12:10:28 -04:00
Emilio G. Cota 3fe9866ffe
osdep: introduce qemu_mprotect_rwx/none
Backports commit 5fa64b3130af9a45e7e2a904bde1f8cfb72be5c9 from qemu
2018-03-14 12:10:28 -04:00
Emilio G. Cota 5ad6116f20
tcg: allocate optimizer temps with tcg_malloc
Groundwork for supporting multiple TCG contexts.

While at it, also allocate temps_used directly as a bitmap of the
required size, instead of using a bitmap of TCG_MAX_TEMPS via
TCGTempSet.
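
A hedged sketch of the allocation described above (BITS_TO_LONGS and
bitmap_zero are QEMU's bitmap helpers; the surrounding code is
illustrative):

    unsigned long *temps_used = tcg_malloc(BITS_TO_LONGS(nb_temps) *
                                           sizeof(unsigned long));
    bitmap_zero(temps_used, nb_temps);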

Performance-wise we lose about 1.12% in a translation-heavy workload
such as booting+shutting down debian-arm:

Performance counter stats for 'taskset -c 0 arm-softmmu/qemu-system-arm \
-machine type=virt -nographic -smp 1 -m 4096 \
-netdev user,id=unet,hostfwd=tcp::2222-:22 \
-device virtio-net-device,netdev=unet \
-drive file=die-on-boot.qcow2,id=myblock,index=0,if=none \
-device virtio-blk-device,drive=myblock \
-kernel kernel.img -append console=ttyAMA0 root=/dev/vda1 \
-name arm,debug-threads=on -smp 1' (10 runs):

             exec time (s)  Relative slowdown wrt original (%)
---------------------------------------------------------------
original     20.213321616   0.
tcg_malloc   20.441130078   1.1270214
TCGContext   20.477846517   1.3086662
g_malloc     20.780527895   2.8061013

The other two alternatives shown in the table are:
- TCGContext: embed temps[TCG_MAX_TEMPS] and TCGTempSet used_temps
in TCGContext. This is simple enough but it isn't faster than using
tcg_malloc; moreover, it wastes memory.
- g_malloc: allocate/deallocate both temps and used_temps every time
tcg_optimize is executed.

Backports commit 34184b071817b4f9edbfd1aa2225c196f05a0947 from qemu
2018-03-14 12:10:28 -04:00
Emilio G. Cota 1be7b55bb4
tcg: introduce **tcg_ctxs to keep track of all TCGContexts
Groundwork for supporting multiple TCG contexts.

Note that having n_tcg_ctxs is unnecessary. However, it is
convenient to have it, since it will simplify iterating over the
array: we'll have just a for loop instead of having to iterate
over a NULL-terminated array (which would require n+1 elems)
or having to check with ifdef's for usermode/softmmu.
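
For illustration, the loop this enables (variable names assumed):

    unsigned int i;

    for (i = 0; i < n_tcg_ctxs; i++) {
        TCGContext *s = tcg_ctxs[i];
        /* ... operate on s ... */
    }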

Backports commit df2cce2968069526553d82331ce9817eaca6b03a from qemu
2018-03-14 12:10:25 -04:00
Emilio G. Cota 9d5b378475
tcg: define tcg_init_ctx and make tcg_ctx a pointer
Groundwork for supporting multiple TCG contexts.

The core of this patch is this change to tcg/tcg.h:

> -extern TCGContext tcg_ctx;
> +extern TCGContext tcg_init_ctx;
> +extern TCGContext *tcg_ctx;

Note that for now we set *tcg_ctx to whatever TCGContext is passed
to tcg_context_init -- in this case &tcg_init_ctx.
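
A hedged sketch of that initialization (heavily simplified):

    void tcg_context_init(TCGContext *s)
    {
        /* ... existing initialization of *s ... */
        tcg_ctx = s;    /* for now always &tcg_init_ctx */
    }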

Backports commit b1311c4acf503dc9c1a310cc40b64f05b08833dc from qemu
2018-03-14 09:43:58 -04:00
Emilio G. Cota 078c9e7e3b
tcg: take tb_ctx out of TCGContext
Groundwork for supporting multiple TCG contexts.

Backports commit 44ded3d04821bec57407cc26a8b4db620da2be04 from qemu
2018-03-14 09:18:12 -04:00
Emilio G. Cota 16113cbd3c
translate-all: report correct avg host TB size
Since commit 6e3b2bfd6 ("tcg: allocate TB structs before the
corresponding translated code") we are not fully utilizing
code_gen_buffer for translated code, and therefore are
incorrectly reporting the amount of translated code as well as
the average host TB size. Address this by:

- Making the conscious choice of misreporting the total translated code;
doing otherwise would mislead users into thinking "-tb-size" is not
honoured.

- Expanding tb_tree_stats to accurately count the bytes of translated code on
the host, and using this for reporting the average tb host size,
as well as the expansion ratio.

In the future we might want to consider reporting the accurate numbers for
the total translated code, together with a "bookkeeping/overhead" field to
account for the TB structs.
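
A hedged sketch of the accounting this implies (field and variable
names assumed, not taken from the patch):

    struct tb_tree_stats {
        size_t host_size;    /* bytes of translated host code */
        size_t target_size;  /* bytes of guest code covered */
        size_t nb_tbs;       /* number of TBs */
    };

    /* avg host TB size = host_size / nb_tbs
     * expansion ratio  = host_size / target_size */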

Backports commit f19c6cc6fc356dab7a766b471ec5eb3058f0afc1 from qemu
2018-03-13 16:22:24 -04:00
Emilio G. Cota 66fa401871
exec-all: rename tb_free to tb_remove
We don't really free anything in this function anymore; we just remove
the TB from the binary search tree.
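
A hedged sketch of what remains (assuming the GTree from the
binary-search-tree commit below):

    void tb_remove(TranslationBlock *tb)
    {
        /* Drop the TB from the search tree only; its space in
         * code_gen_buffer is not reclaimed. */
        g_tree_remove(tb_ctx.tb_tree, &tb->tc);
    }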

Backports commit be1e01171b556807198c84feac7cf4bca0d904c2 from qemu
2018-03-13 16:20:41 -04:00
Emilio G. Cota f7c984d21f
translate-all: use a binary search tree to track TBs in TBContext
This is a prerequisite for supporting multiple TCG contexts, since
we will have threads generating code in separate regions of
code_gen_buffer.

For this we need a new field (.size) in struct tb_tc to keep
track of the size of the translated code. This field uses a size_t
to avoid adding a hole to the struct, although really an unsigned
int would have been enough.
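
The layout in question, sketched:

    struct tb_tc {
        void *ptr;    /* pointer to the translated code */
        size_t size;  /* size_t rather than unsigned int to avoid a hole */
    };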

The comparison function we use is optimized for the common case:
insertions. Profiling shows that upon booting debian-arm, 98%
of comparisons are between existing tb's (i.e. a->size and b->size
are both !0), which happens during insertions (and removals, but
those are rare). The remaining cases are lookups. From reading the glib
sources we see that the first key is always the lookup key. The code
does not assume this is always the case, since the glib docs do not
guarantee that behaviour; we do, however, embed this knowledge in the
code as a branch hint for the compiler.
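
A hedged sketch of such a comparator (not the exact QEMU code; likely()
is QEMU's branch-hint wrapper around __builtin_expect):

    static gint tb_tc_cmp(gconstpointer ap, gconstpointer bp)
    {
        const struct tb_tc *a = ap;
        const struct tb_tc *b = bp;

        /* Common case, hinted to the compiler: both sizes are set,
         * i.e. we are comparing two existing TBs on insert/remove. */
        if (likely(a->size && b->size)) {
            if (a->ptr == b->ptr) {
                return 0;
            }
            return a->ptr < b->ptr ? -1 : 1;
        } else {
            /* Lookup: the sizeless key matches any entry whose
             * [ptr, ptr + size) range contains its address. */
            const struct tb_tc *key = a->size == 0 ? a : b;
            const struct tb_tc *ent = a->size == 0 ? b : a;
            int sign = key == a ? 1 : -1;
            const char *p = key->ptr;
            const char *lo = ent->ptr;

            if (p < lo) {
                return -sign;
            } else if (p >= lo + ent->size) {
                return sign;
            }
            return 0;
        }
    }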

Note that tb_free does not free space in the code_gen_buffer anymore,
since we cannot easily know whether the tb is the last one inserted
in code_gen_buffer. The next patch in this series renames tb_free
to tb_remove to reflect this.

Performance-wise, lookups in tb_find_pc are the same as before:
O(log n). However, insertions are O(log n) instead of O(1), which
results in a small slowdown when booting debian-arm:

Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
-machine type=virt -nographic -smp 1 -m 4096 \
-netdev user,id=unet,hostfwd=tcp::2222-:22 \
-device virtio-net-device,netdev=unet \
-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
-device virtio-blk-device,drive=myblock \
-kernel img/arm/aarch32-current-linux-kernel-only.img \
-append console=ttyAMA0 root=/dev/vda1 \
-name arm,debug-threads=on -smp 1' (10 runs):

- Before:

8048.598422 task-clock (msec) # 0.931 CPUs utilized ( +- 0.28% )
16,974 context-switches # 0.002 M/sec ( +- 0.12% )
0 cpu-migrations # 0.000 K/sec
10,125 page-faults # 0.001 M/sec ( +- 1.23% )
35,144,901,879 cycles # 4.367 GHz ( +- 0.14% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
65,758,252,643 instructions # 1.87 insns per cycle ( +- 0.33% )
10,871,298,668 branches # 1350.707 M/sec ( +- 0.41% )
192,322,212 branch-misses # 1.77% of all branches ( +- 0.32% )

8.640869419 seconds time elapsed ( +- 0.57% )

- After:
8146.242027 task-clock (msec) # 0.923 CPUs utilized ( +- 1.23% )
17,016 context-switches # 0.002 M/sec ( +- 0.40% )
0 cpu-migrations # 0.000 K/sec
18,769 page-faults # 0.002 M/sec ( +- 0.45% )
35,660,956,120 cycles # 4.378 GHz ( +- 1.22% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
65,095,366,607 instructions # 1.83 insns per cycle ( +- 1.73% )
10,803,480,261 branches # 1326.192 M/sec ( +- 1.95% )
195,601,289 branch-misses # 1.81% of all branches ( +- 0.39% )

8.828660235 seconds time elapsed ( +- 0.38% )

Backports commit 2ac01d6dafabd4a726254eea98824c798d416ee4 from qemu
2018-03-13 16:18:29 -04:00
Richard Henderson 35e551dc45
tcg: Remove CF_IGNORE_ICOUNT
Now that we have curr_cflags, we can include CF_USE_ICOUNT
early and then remove it as necessary.
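
A hedged sketch of what curr_cflags plausibly looks like after this
change:

    static inline uint32_t curr_cflags(void)
    {
        return (parallel_cpus ? CF_PARALLEL : 0)
             | (use_icount ? CF_USE_ICOUNT : 0);
    }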

Backports commit 416986d3f97329655e30da7271a2d11c6d707b06 from qemu
2018-03-13 15:28:47 -04:00
Richard Henderson f04beeea78
tcg: Add CF_LAST_IO + CF_USE_ICOUNT to CF_HASH_MASK
These flags are used by target/*/translate.c,
and affect code generation.

Backports commit 0cf8a44c2f56ba884c2f6db47d27fbb24975daa3 from qemu
2018-03-13 15:25:09 -04:00
Emilio G. Cota 4f7dcf149e
cpu-exec: lookup/generate TB outside exclusive region during step_atomic
Now that all code generation has been converted to check CF_PARALLEL, we can
generate !CF_PARALLEL code without having yet set !parallel_cpus --
and therefore without having to be in the exclusive region during
cpu_exec_step_atomic.

While at it, merge cpu_exec_step into cpu_exec_step_atomic.
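
A hedged sketch of the resulting flow (lookup_or_generate_tb is a
hypothetical stand-in for the actual lookup/generation helpers):

    void cpu_exec_step_atomic(CPUState *cpu)
    {
        /* !CF_PARALLEL lookup/generation happens here, outside the
         * exclusive region. */
        uint32_t cflags = curr_cflags() & ~CF_PARALLEL;
        TranslationBlock *tb = lookup_or_generate_tb(cpu, cflags);

        start_exclusive();
        /* execute tb in the exclusive region */
        end_exclusive();
    }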

Backports commit ac03ee5331612e44beb393df2b578c951d27dc0d from qemu
2018-03-13 15:24:02 -04:00
Emilio G. Cota f593db445a
tcg: check CF_PARALLEL instead of parallel_cpus
Thereby decoupling the resulting translated code from the current state
of the system.

The tb->cflags field is not passed to tcg generation functions. So
we add a field to TCGContext, storing there a copy of tb->cflags.

Most architectures have <= 32 registers, which results in a 4-byte hole
in TCGContext. Use this hole for the new field.
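
Sketched (the exact neighbouring fields differ; this only illustrates
the idea):

    struct TCGContext {
        /* ... */
        uint32_t tb_cflags;  /* copy of tb->cflags; fills the 4-byte hole */
        /* ... */
    };

    /* Code generation then tests the copy, not the global: */
    if (s->tb_cflags & CF_PARALLEL) {
        /* emit the parallel (atomic-safe) variant */
    }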

Backports commit e82d5a2460b0e176128027651ff9b104e4bdf5cc from qemu
2018-03-13 15:17:59 -04:00
Emilio G. Cota 915a8a92c8
target/sparc: check CF_PARALLEL instead of parallel_cpus
Thereby decoupling the resulting translated code from the current state
of the system.

Backports commit 87d757d60d66d5ee1608460b0f1e07e2b758db9c from qemu
2018-03-13 15:14:37 -04:00
Emilio G. Cota 3825687e9f
target/m68k: check CF_PARALLEL instead of parallel_cpus
Thereby decoupling the resulting translated code from the current state
of the system.

Backports commit f0ddf11b23260f0af84fb529486a8f9ba2d19401 from qemu
2018-03-13 15:13:35 -04:00
Emilio G. Cota 5c1dbf456b
target/i386: check CF_PARALLEL instead of parallel_cpus
Thereby decoupling the resulting translated code from the current state
of the system.

Backports commit b5e3b4c2aca8eb5a9cfeedfb273af623f17c3731 from qemu
2018-03-13 15:10:44 -04:00
Lioncash 0bfa6bae59
msvc: Add missing qht references 2018-03-13 15:09:30 -04:00
Emilio G. Cota b71769fa5f
target/arm: check CF_PARALLEL instead of parallel_cpus
Thereby decoupling the resulting translated code from the current state
of the system.

Backports commit 2399d4e7cec22ecf1c51062d2ebfd45220dbaace from qemu
2018-03-13 15:05:45 -04:00
Emilio G. Cota c384da2f47
tcg: convert tb->cflags reads to tb_cflags(tb)
Convert all existing readers of tb->cflags to tb_cflags, so that we
use atomic_read and therefore avoid undefined behaviour in C11.
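
The accessor is plausibly a one-liner along these lines:

    static inline uint32_t tb_cflags(const TranslationBlock *tb)
    {
        return atomic_read(&tb->cflags);
    }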

Note that the remaining setters/getters of the field are protected
by tb_lock, and therefore do not need conversion.

Luckily all readers access the field via 'tb->cflags' (so no foo.cflags,
bar->cflags in the code base), which makes the conversion easily
scriptable:

FILES=$(git grep 'tb->cflags' target include/exec/gen-icount.h \
accel/tcg/translator.c | cut -f1 -d':' | sort | uniq)

perl -pi -e 's/([^.>])tb->cflags/$1tb_cflags(tb)/g' $FILES
perl -pi -e 's/([a-z->.]*)(->|\.)tb->cflags/tb_cflags($1$2tb)/g' $FILES

Then manually fixed the few errors that checkpatch reported.

Compile-tested for all targets.

Backports commit c5a49c63fa26e8825ad101dfe86339ae4c216539 from qemu
2018-03-13 14:57:51 -04:00
Richard Henderson d6ca4d59dc
tcg: Include CF_COUNT_MASK in CF_HASH_MASK
Backports commit cdfef1715c779eb528d633e8b76cbc8a10e71ac8 from qemu
2018-03-13 14:42:42 -04:00
Richard Henderson 5d360366e9
tcg: Add CPUState cflags_next_tb
We were generating code during tb_invalidate_phys_page_range,
check_watchpoint, cpu_io_recompile, and (seemingly) discarding
the TB, assuming that it would magically be picked up during
the next iteration through the cpu_exec loop.

Instead, record the desired cflags in CPUState so that we request
the proper TB so that there is no more magic.
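
A hedged sketch of how the exec loop consumes the recorded value
(assuming -1 marks "no request pending"):

    uint32_t cflags = cpu->cflags_next_tb;
    if (cflags == -1) {
        cflags = curr_cflags();    /* no explicit request */
    } else {
        cpu->cflags_next_tb = -1;  /* consume the request once */
    }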

Backports commit 9b990ee5a3cc6aa38f81266fb0c6ef37a36c45b9 from qemu
2018-03-13 14:39:43 -04:00
Emilio G. Cota b5961a139b
tcg: define CF_PARALLEL and use it for TB hashing along with CF_COUNT_MASK
This will enable us to decouple code translation from the value
of parallel_cpus at any given time. It will also help us minimize
TB flushes when generating code via EXCP_ATOMIC.

Note that the declaration of parallel_cpus is brought into exec-all.h
so that the "curr_cflags" inline can be defined there.
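
A hedged sketch of that inline; TB hash lookups then mix these bits
(together with CF_COUNT_MASK) into the lookup mask:

    static inline uint32_t curr_cflags(void)
    {
        return parallel_cpus ? CF_PARALLEL : 0;
    }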

Backports commit 4e2ca83e71b51577b06b1468e836556912bd5b6e from qemu
2018-03-13 14:32:43 -04:00
Emilio G. Cota 6bc05eeee4
tb hash: track translated blocks with qht
Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.

Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.

The appended patch fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:

- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
MRU is implemented here by adding an intermediate struct
that contains the u32 hash and a pointer to the TB; this
allows us, on an MRU promotion, to copy said struct (that is not
at the head), and put this new copy at the head. After a grace
period, the original non-head struct can be eliminated, and
after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
no MRU for lookups; MRU for inserts.

The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
http://imgur.com/a/Awgnq

Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.

Host: Intel Xeon E5-2690

[Plot: boot time in seconds (Y axis, roughly 20-28 s) vs. log2 number of
buckets, 14-24 (X axis), one curve per solution: master, xxhash,
xxhash-rcu, qht-fixed-nomru, qht-dyn-mru and qht-dyn-nomru. See the PNG
link above for the readable version.]

Host: Intel i7-4790K

[Plot: boot time in seconds (Y axis, roughly 9.5-14.5 s) vs. log2 number
of buckets, 14-24 (X axis), same six curves. See the PNG link above for
the readable version.]

Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.

xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be to consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.

qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.

However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.

Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.

The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:

- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time

We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.

A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot), and the table can simply grow again as more
code blocks are translated. This also avoids the complication of
having to build downsizing hysteresis logic into qht.
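
A hedged usage sketch of the resulting lookup/insert path (identifier
names follow the qht/TB-hash code of this era and may differ in
detail):

    /* Lookups only read, so they scale across reader threads;
     * inserts take qht's internal per-bucket locks. */
    uint32_t h = tb_hash_func(phys_pc, pc, flags);   /* xxhash-based */
    TranslationBlock *tb = qht_lookup(&tb_ctx.htable, tb_cmp, &desc, h);

    if (tb == NULL) {
        tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
        qht_insert(&tb_ctx.htable, tb, h);
    }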

Backports commit 909eaac9bbc2ed4f3a82ce38e905b87d478a3e00 from qemu
2018-03-13 14:16:26 -04:00
Lioncash e45c294405
Backport qht hashtable 2018-03-13 13:55:30 -04:00
Philippe Mathieu-Daudé 4eeb4f7faf
accel/tcg: move atomic_template.h to accel/tcg/ 2018-03-13 12:28:50 -04:00
Thomas Huth 975924bb2e
accel/tcg: move softmmu_template.h to accel/tcg/
The header is only used by accel/tcg/cputlb.c so we can
move it to the accel/tcg/ folder, too.

Backports commit da1849c1eba50aa372f87c7945d7b230eb2b2fb2 from qemu
2018-03-13 12:27:04 -04:00
Lioncash 035f1afa7d
tcg: move tcg backend files into accel/tcg/
Move tcg-runtime.c, translate-all.(ch) and translate-common.c into the
accel/tcg/ subdirectory and update the related trace-events file.

Backports commit 244f144134d0dd182f1af8654e7f9a79fe770368 and applies
relevant changes made in db432672dc50ed86dda17ac821b7eb07411a90af and
d9bb58e51068dfc48746c6af0179926c8dc05bce from qemu
2018-03-13 11:48:15 -04:00
Lioncash 99dbbf1571
tcg/optimize: Perform comparison pass with qemu
Keeps formatting and code synced
2018-03-12 18:06:29 -04:00
Lioncash 21b0afe218
tcg: Perform comparison pass with qemu
Makes formatting and code consistent with qemu
2018-03-12 18:03:06 -04:00
Lioncash 95d50a02a1
target/mips/translate: Perform comparison pass with qemu
Keeps code and formatting in sync
2018-03-12 17:52:56 -04:00
Lioncash 7db1bff993
target/mips/op_helper: Perform comparison pass with qemu
Keeps code and formatting in sync
2018-03-12 15:25:08 -04:00
Lioncash 48429b2bcb
target/mips/msa_helper: Perform comparison pass with qemu
Keeps code and formatting in sync
2018-03-12 15:15:42 -04:00
Lioncash 4e8a1f8d6b
target/mips/internal: Perform comparison pass with qemu
Keeps code and formatting in sync with qemu
2018-03-12 15:13:17 -04:00
Lioncash 05089ecb12
target/mips/helper: Perform comparison pass with qemu
Keeps code and formatting in sync with qemu
2018-03-12 15:11:52 -04:00
Lioncash 56675f5215
cpu-exec: Resolve potential compilation errors
We need to pass 'uc' to CPU_GET_CLASS
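
Sketched under the assumption (hypothetical, this fork's convention)
that the object macros carry the uc handle:

    /* Illustration only: unlike upstream QEMU, CPU_GET_CLASS here
     * takes the uc_struct handle as its first argument. */
    CPUClass *cc = CPU_GET_CLASS(cpu->uc, cpu);
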
2018-03-12 14:59:21 -04:00
Lioncash e9d9ed5eaa
target/i386/bpt_helper: Perform comparison pass with qemu
Keep formatting and code in sync where applicable
2018-03-12 13:28:50 -04:00
Lioncash fc7eaf7f77
target/i386/svm_helper: Perform comparison pass with qemu
Keep code and formatting in sync where applicable
2018-03-12 13:27:03 -04:00
Lioncash 27c283bb3c
target/i386/smm_helper: Perform comparison pass with qemu
Ensure code and formatting stay in sync where relevant
2018-03-12 13:25:37 -04:00
Lioncash 73426a7e79
target/i386/seg_helper: Perform comparison pass against qemu
Ensure formatting and code stay in sync where relevant
2018-03-12 13:24:36 -04:00
Lioncash a1910954cd
target/i386/mem_helper: Perform comparison pass against qemu
Ensure formatting and relevant code are in order
2018-03-12 13:19:05 -04:00
Lioncash 995ae229a3
target/i386/excp_helper: remove unnecessary comment 2018-03-12 13:16:53 -04:00
Lioncash c1e72be68d
target/i386/fpu_helper: Perform comparison pass against qemu 2018-03-12 13:15:51 -04:00
Lioncash 0d0dd2ba98
target/i386/translate: Perform comparison pass against qemu
Ensure code and formatting match qemu where applicable
2018-03-12 13:12:01 -04:00
Lioncash 83b35aa797
target/sparc/win_helper: Perform comparison pass against qemu
Ensure code and formatting are consistent with qemu
2018-03-12 12:46:59 -04:00
Lioncash 0215431990
target/sparc/mmu_helper: Perform comparison pass against qemu
Ensure code and formatting match qemu
2018-03-12 12:45:18 -04:00
Lioncash 83c0769d90
target/sparc/ldst_helper: Perform comparison pass against qemu
Ensure code and formatting are consistent with qemu
2018-03-12 12:43:14 -04:00
Lioncash a228660860
target/sparc/fop_helper: Perform comparison pass against qemu
Ensure formatting and code remain consistent with qemu after the backporting
2018-03-12 12:38:21 -04:00
Lioncash 2114d28f7e
target/sparc/cc_helper: Perform a comparison pass against qemu 2018-03-12 12:36:51 -04:00
Lioncash bcc8bc5c18
target/sparc/translate: Perform comparison pass against main qemu repo
Ensure that formatting and relevant code is organized like qemu
2018-03-12 12:34:49 -04:00
Lioncash b92dd8d299
target/m68k/op_helper: Adjust formatting to be in sync with qemu 2018-03-12 12:26:53 -04:00