We already handle this in the backends, and the lifetime datum
for the TCGOp is already large enough.
Backports commit 1df3caa946e08b387511dfba3a37d78910e51796 from qemu
When SEV is enabled, CPUID 0x8000_001F should provide additional
information regarding the feature (such as which page table bit is used
to mark the pages as encrypted etc).
The details for memory encryption CPUID is available in AMD APM
(https://support.amd.com/TechDocs/24594.pdf) Section E.4.17
Backports relevant parts of commit 6cb8f2a663a47c6e0da17fc4fb9e06abfda2bd48 from qemu
This MSR returns the number of #SMIs that occurred on
CPU since boot.
KVM commit 52797bf9a875 ("KVM: x86: Add emulation of MSR_SMI_COUNT")
introduced support for emulating this MSR.
This commit adds support for QEMU to save/load this
MSR for migration purposes.
Backports the relevant parts of commit e13713db5b609d9a83c9cfc8ba389d4215d4ba29 from qemu
Using a local m68k floatx80_cosh()
[copied from previous:
Written by Andreas Grabher for Previous, NeXT Computer Emulator.]
Backports commit 02f9124ebe26c36f0f7ed58085bd963e4372b2cd from qemu
Using a local m68k floatx80_sinh()
[copied from previous:
Written by Andreas Grabher for Previous, NeXT Computer Emulator.]
Backports commit eee6b892a6063c2807ecf33a2f62a8d7cca7652c from qemu
Using local m68k floatx80_tanh() and floatx80_etoxm1()
[copied from previous:
Written by Andreas Grabher for Previous, NeXT Computer Emulator.]
Backports commit 9937b02965c2a7dbc4b21d98e29b082bab095aa5 from qemu
Using a local m68k floatx80_atanh()
[copied from previous:
Written by Andreas Grabher for Previous, NeXT Computer Emulator.]
Backports commit e3655afa137b2e0999537eef273a2845ba21d68c from qemu
Using a local m68k floatx80_acos()
[copied from previous:
Written by Andreas Grabher for Previous, NeXT Computer Emulator.]
Backports commit c84813b807fc82c68ff6d72387f95b15ad283bf6 from qemu
Using a local m68k floatx80_asin()
[copied from previous:
Written by Andreas Grabher for Previous, NeXT Computer Emulator.]
Backports commit bc20b34e03b51725d7f008551b5f56f1da07ab6a from qemu
Using a local m68k floatx80_atan()
[copied from previous:
Written by Andreas Grabher for Previous, NeXT Computer Emulator.]
Backports commit 8c992abc892c90caf1d4dd5b4482cda052a280ba from qemu
Using a local m68k floatx80_cos()
[copied from previous:
Written by Andreas Grabher for Previous, NeXT Computer Emulator.]
Backports commit 68d0ed37866de2c5cafc4e2589e263961b2e8cd6 from qemu
Using a local m68k floatx80_sin()
[copied from previous:
Written by Andreas Grabher for Previous, NeXT Computer Emulator.]
Backports commit 5add1ac42faffd3d3639101fa778dced693a65a3 from qemu
Using a local m68k floatx80_tan()
[copied from previous:
Written by Andreas Grabher for Previous, NeXT Computer Emulator.]
Backports commit 273401809c8a8330e5430f2c958467efa7079b2c from qemu
Add Intel Processor Trace related definition. It also add
corresponding part to kvm_get/set_msr and vmstate.
Backports commit b77146e9a129bcdb60edc23639211679ae846a92 from qemu
Expose Intel Processor Trace feature to guest.
To make Intel PT live migration safe and get same CPUID information
with same CPU model on diffrent host. CPUID[14] is constant in this
patch. Intel PT use EPT is first supported in IceLake, the CPUID[14]
get on this machine as default value. Intel PT would be disabled
if any machine don't support this minial feature list.
Backports commit e37a5c7fa459558b5020588994707fe3fdd6616e from qemu
Add KVM_HINTS_DEDICATED performance hint, guest checks this feature bit
to determine if they run on dedicated vCPUs, allowing optimizations such
as usage of qspinlocks.
Backports commit be7773268d98176489483a315d3e2323cb0615b9 from qemu
Two or more threads might race while invalidating the same TB. We currently
do not check for this at all despite taking tb_lock, which means we would
wrongly invalidate the same TB more than once. This bug has actually been
hit by users: I recently saw a report on IRC, although I have yet to see
the corresponding test case.
Fix this by using qht_remove as the synchronization point; if it fails,
that means the TB has already been invalidated, and therefore there
is nothing left to do in tb_phys_invalidate.
Note that this solution works now that we still have tb_lock, and will
continue working once we remove tb_lock.
Backports commit cc689485ee3e9dca05765326ee8fd619a6ec48f0 from qemu
This is identical for each target. So, move the initialization to
common code. Move the variable itself out of tcg_ctx and name it
cpu_env to minimize changes within targets.
This also means we can remove tcg_global_reg_new_{ptr,i32,i64},
since there are no longer global-register temps created by targets.
Backports commit 1c2adb958fc07e5b3e81ed21b801c04a15f41f4f from qemu
This enables parallel TCG code generation. However, we do not take
advantage of it yet since tb_lock is still held during tb_gen_code.
In user-mode we use a single TCG context; see the documentation
added to tcg_region_init for the rationale.
Note that targets do not need any conversion: targets initialize a
TCGContext (e.g. defining TCG globals), and after this initialization
has finished, the context is cloned by the vCPU threads, each of
them keeping a separate copy.
TCG threads claim one entry in tcg_ctxs[] by atomically increasing
n_tcg_ctxs. Do not be too annoyed by the subsequent atomic_read's
of that variable and tcg_ctxs; they are there just to play nice with
analysis tools such as thread sanitizer.
Note that we do not allocate an array of contexts (we allocate
an array of pointers instead) because when tcg_context_init
is called, we do not know yet how many contexts we'll use since
the bool behind qemu_tcg_mttcg_enabled() isn't set yet.
Previous patches folded some TCG globals into TCGContext. The non-const
globals remaining are only set at init time, i.e. before the TCG
threads are spawned. Here is a list of these set-at-init-time globals
under tcg/:
Only written by tcg_context_init:
- indirect_reg_alloc_order
- tcg_op_defs
Only written by tcg_target_init (called from tcg_context_init):
- tcg_target_available_regs
- tcg_target_call_clobber_regs
- arm: arm_arch, use_idiv_instructions
- i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt,
have_movbe, have_popcnt
- mips: use_movnz_instructions, use_mips32_instructions,
use_mips32r2_instructions, got_sigill (tcg_target_detect_isa)
- ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr
- s390: tb_ret_addr, s390_facilities
- sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines),
use_vis3_instructions
Only written by tcg_prologue_init:
- 'struct jit_code_entry one_entry'
- aarch64: tb_ret_addr
- arm: tb_ret_addr
- i386: tb_ret_addr, guest_base_flags
- ia64: tb_ret_addr
- mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr
Backports commit 3468b59e18b179bc63c7ce934de912dfa9596122 from qemu
This is groundwork for supporting multiple TCG contexts.
The naive solution here is to split code_gen_buffer statically
among the TCG threads; this however results in poor utilization
if translation needs are different across TCG threads.
What we do here is to add an extra layer of indirection, assigning
regions that act just like pages do in virtual memory allocation.
(BTW if you are wondering about the chosen naming, I did not want
to use blocks or pages because those are already heavily used in QEMU).
We use a global lock to serialize allocations as well as statistics
reporting (we now export the size of the used code_gen_buffer with
tcg_code_size()). Note that for the allocator we could just use
a counter and atomic_inc; however, that would complicate the gathering
of tcg_code_size()-like stats. So given that the region operations are
not a fast path, a lock seems the most reasonable choice.
The effectiveness of this approach is clear after seeing some numbers.
I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark.
Note that I'm evaluating this after enabling per-thread TCG (which
is done by a subsequent commit).
* -smp 1, 1 region (entire buffer):
qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357
qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363
qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364
qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373
qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373
qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360
qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370
qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367
That is, 8 flushes.
* -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]:
qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356
qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361
qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361
qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375
qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375
qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360
qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365
qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368
Again, 8 flushes. Note how buffer utilization is not 100%, but it
is close. Smaller region sizes would yield higher utilization,
but we want region allocation to be rare (it acquires a lock), so
we do not want to go too small.
* -smp 8, static partitioning of 8 regions (10 MB per region):
qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354
qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370
qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365
qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377
qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358
qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367
qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364
qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358
qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362
qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372
qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374
qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376
qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374
qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372
qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359
qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362
qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368
qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378
qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367
qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364
That is, 20 flushes. Note how a static partitioning approach uses
the code buffer poorly, leading to many unnecessary flushes.
Backports commit e8feb96fcc6c16eab8923332e86ff4ef0e2ac276 from qemu
The helpers require the address and size to be page-aligned, so
do that before calling them.
Backports commit f51f315a676ec913a55ac27be4ef857f9f7ddc5c from qemu
Groundwork for supporting multiple TCG contexts.
While at it, also allocate temps_used directly as a bitmap of the
required size, instead of using a bitmap of TCG_MAX_TEMPS via
TCGTempSet.
Performance-wise we lose about 1.12% in a translation-heavy workload
such as booting+shutting down debian-arm:
Performance counter stats for 'taskset -c 0 arm-softmmu/qemu-system-arm \
-machine type=virt -nographic -smp 1 -m 4096 \
-netdev user,id=unet,hostfwd=tcp::2222-:22 \
-device virtio-net-device,netdev=unet \
-drive file=die-on-boot.qcow2,id=myblock,index=0,if=none \
-device virtio-blk-device,drive=myblock \
-kernel kernel.img -append console=ttyAMA0 root=/dev/vda1 \
-name arm,debug-threads=on -smp 1' (10 runs):
exec time (s) Relative slowdown wrt original (%)
---------------------------------------------------------------
original 20.213321616 0.
tcg_malloc 20.441130078 1.1270214
TCGContext 20.477846517 1.3086662
g_malloc 20.780527895 2.8061013
The other two alternatives shown in the table are:
- TCGContext: embed temps[TCG_MAX_TEMPS] and TCGTempSet used_temps
in TCGContext. This is simple enough but it isn't faster than using
tcg_malloc; moreover, it wastes memory.
- g_malloc: allocate/deallocate both temps and used_temps every time
tcg_optimize is executed.
Backports commit 34184b071817b4f9edbfd1aa2225c196f05a0947 from qemu
Groundwork for supporting multiple TCG contexts.
Note that having n_tcg_ctxs is unnecessary. However, it is
convenient to have it, since it will simplify iterating over the
array: we'll have just a for loop instead of having to iterate
over a NULL-terminated array (which would require n+1 elems)
or having to check with ifdef's for usermode/softmmu.
Backports commit df2cce2968069526553d82331ce9817eaca6b03a from qemu
Groundwork for supporting multiple TCG contexts.
The core of this patch is this change to tcg/tcg.h:
> -extern TCGContext tcg_ctx;
> +extern TCGContext tcg_init_ctx;
> +extern TCGContext *tcg_ctx;
Note that for now we set *tcg_ctx to whatever TCGContext is passed
to tcg_context_init -- in this case &tcg_init_ctx.
Backports commit b1311c4acf503dc9c1a310cc40b64f05b08833dc from qemu
Since commit 6e3b2bfd6 ("tcg: allocate TB structs before the
corresponding translated code") we are not fully utilizing
code_gen_buffer for translated code, and therefore are
incorrectly reporting the amount of translated code as well as
the average host TB size. Address this by:
- Making the conscious choice of misreporting the total translated code;
doing otherwise would mislead users into thinking "-tb-size" is not
honoured.
- Expanding tb_tree_stats to accurately count the bytes of translated code on
the host, and using this for reporting the average tb host size,
as well as the expansion ratio.
In the future we might want to consider reporting the accurate numbers for
the total translated code, together with a "bookkeeping/overhead" field to
account for the TB structs.
Backports commit f19c6cc6fc356dab7a766b471ec5eb3058f0afc1 from qemu
We don't really free anything in this function anymore; we just remove
the TB from the binary search tree.
Backports commit be1e01171b556807198c84feac7cf4bca0d904c2 from qemu
This is a prerequisite for supporting multiple TCG contexts, since
we will have threads generating code in separate regions of
code_gen_buffer.
For this we need a new field (.size) in struct tb_tc to keep
track of the size of the translated code. This field uses a size_t
to avoid adding a hole to the struct, although really an unsigned
int would have been enough.
The comparison function we use is optimized for the common case:
insertions. Profiling shows that upon booting debian-arm, 98%
of comparisons are between existing tb's (i.e. a->size and b->size
are both !0), which happens during insertions (and removals, but
those are rare). The remaining cases are lookups. From reading the glib
sources we see that the first key is always the lookup key. However,
the code does not assume this to always be the case because this
behaviour is not guaranteed in the glib docs. However, we embed
this knowledge in the code as a branch hint for the compiler.
Note that tb_free does not free space in the code_gen_buffer anymore,
since we cannot easily know whether the tb is the last one inserted
in code_gen_buffer. The next patch in this series renames tb_free
to tb_remove to reflect this.
Performance-wise, lookups in tb_find_pc are the same as before:
O(log n). However, insertions are O(log n) instead of O(1), which
results in a small slowdown when booting debian-arm:
Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
-machine type=virt -nographic -smp 1 -m 4096 \
-netdev user,id=unet,hostfwd=tcp::2222-:22 \
-device virtio-net-device,netdev=unet \
-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
-device virtio-blk-device,drive=myblock \
-kernel img/arm/aarch32-current-linux-kernel-only.img \
-append console=ttyAMA0 root=/dev/vda1 \
-name arm,debug-threads=on -smp 1' (10 runs):
- Before:
8048.598422 task-clock (msec) # 0.931 CPUs utilized ( +- 0.28% )
16,974 context-switches # 0.002 M/sec ( +- 0.12% )
0 cpu-migrations # 0.000 K/sec
10,125 page-faults # 0.001 M/sec ( +- 1.23% )
35,144,901,879 cycles # 4.367 GHz ( +- 0.14% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
65,758,252,643 instructions # 1.87 insns per cycle ( +- 0.33% )
10,871,298,668 branches # 1350.707 M/sec ( +- 0.41% )
192,322,212 branch-misses # 1.77% of all branches ( +- 0.32% )
8.640869419 seconds time elapsed ( +- 0.57% )
- After:
8146.242027 task-clock (msec) # 0.923 CPUs utilized ( +- 1.23% )
17,016 context-switches # 0.002 M/sec ( +- 0.40% )
0 cpu-migrations # 0.000 K/sec
18,769 page-faults # 0.002 M/sec ( +- 0.45% )
35,660,956,120 cycles # 4.378 GHz ( +- 1.22% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
65,095,366,607 instructions # 1.83 insns per cycle ( +- 1.73% )
10,803,480,261 branches # 1326.192 M/sec ( +- 1.95% )
195,601,289 branch-misses # 1.81% of all branches ( +- 0.39% )
8.828660235 seconds time elapsed ( +- 0.38% )
Backports commit 2ac01d6dafabd4a726254eea98824c798d416ee4 from qemu
Now that we have curr_cflags, we can include CF_USE_ICOUNT
early and then remove it as necessary.
Backports commit 416986d3f97329655e30da7271a2d11c6d707b06 from qemu
Now that all code generation has been converted to check CF_PARALLEL, we can
generate !CF_PARALLEL code without having yet set !parallel_cpus --
and therefore without having to be in the exclusive region during
cpu_exec_step_atomic.
While at it, merge cpu_exec_step into cpu_exec_step_atomic.
Backports commit ac03ee5331612e44beb393df2b578c951d27dc0d from qemu
Thereby decoupling the resulting translated code from the current state
of the system.
The tb->cflags field is not passed to tcg generation functions. So
we add a field to TCGContext, storing there a copy of tb->cflags.
Most architectures have <= 32 registers, which results in a 4-byte hole
in TCGContext. Use this hole for the new field.
Backports commit e82d5a2460b0e176128027651ff9b104e4bdf5cc from qemu
Thereby decoupling the resulting translated code from the current state
of the system.
Backports commit 87d757d60d66d5ee1608460b0f1e07e2b758db9c from qemu
Thereby decoupling the resulting translated code from the current state
of the system.
Backports commit f0ddf11b23260f0af84fb529486a8f9ba2d19401 from qemu
Thereby decoupling the resulting translated code from the current state
of the system.
Backports commit b5e3b4c2aca8eb5a9cfeedfb273af623f17c3731 from qemu
Thereby decoupling the resulting translated code from the current state
of the system.
Backports commit 2399d4e7cec22ecf1c51062d2ebfd45220dbaace from qemu
Convert all existing readers of tb->cflags to tb_cflags, so that we
use atomic_read and therefore avoid undefined behaviour in C11.
Note that the remaining setters/getters of the field are protected
by tb_lock, and therefore do not need conversion.
Luckily all readers access the field via 'tb->cflags' (so no foo.cflags,
bar->cflags in the code base), which makes the conversion easily
scriptable:
FILES=$(git grep 'tb->cflags' target include/exec/gen-icount.h \
accel/tcg/translator.c | cut -f1 -d':' | sort | uniq)
perl -pi -e 's/([^.>])tb->cflags/$1tb_cflags(tb)/g' $FILES
perl -pi -e 's/([a-z->.]*)(->|\.)tb->cflags/tb_cflags($1$2tb)/g' $FILES
Then manually fixed the few errors that checkpatch reported.
Compile-tested for all targets.
Backports commit c5a49c63fa26e8825ad101dfe86339ae4c216539 from qemu
We were generating code during tb_invalidate_phys_page_range,
check_watchpoint, cpu_io_recompile, and (seemingly) discarding
the TB, assuming that it would magically be picked up during
the next iteration through the cpu_exec loop.
Instead, record the desired cflags in CPUState so that we request
the proper TB so that there is no more magic.
Backports commit 9b990ee5a3cc6aa38f81266fb0c6ef37a36c45b9 from qemu
This will enable us to decouple code translation from the value
of parallel_cpus at any given time. It will also help us minimize
TB flushes when generating code via EXCP_ATOMIC.
Note that the declaration of parallel_cpus is brought to exec-all.h
to be able to define there the "curr_cflags" inline.
Backports commit 4e2ca83e71b51577b06b1468e836556912bd5b6e from qemu