In user-mode emulation Translation Block can consist of 2 guest pages.
In that case QEMU also mprotects 2 host pages that are dedicated for
guest memory, containing instructions. QEMU detects self-modifying code
with SEGFAULT signal processing.
In case if instruction in 1st page is modifying memory of 2nd
page (or vice versa) QEMU will mark 2nd page with PAGE_WRITE,
invalidate TB, generate new TB contatining 1 guest instruction and
exit to CPU loop. QEMU won't call mprotect, and new TB will cause
same SEGFAULT. Page will have both PAGE_WRITE_ORG and PAGE_WRITE
flags, so QEMU will handle the signal as guest binary problem,
and exit with guest SEGFAULT.
Solution is to do following: In case if current TB was invalidated
continue to invalidate TBs from remaining guest pages and mark pages
as PAGE_WRITE. After that disable host page protection with mprotect.
If current tb was invalidated longjmp to main loop. That is more
efficient, since we won't get SEGFAULT when executing new TB.
Backports commit 7399a337e4126f7c8c8af3336726f001378c4798 from qemu
Information is tracked inside the TCGContext structure, and later used
by tracing events with the 'tcg' and 'vcpu' properties.
The 'cpu' field is used to check tracing of translation-time
events ("*_trans"). The 'tcg_env' field is used to pass it to
execution-time events ("*_exec").
Backports commit 7c2550432abe62f53e6df878ceba6ceaf71f0e7e from qemu
For some workloads such as arm bootup, tb_phys_hash is performance-critical.
The is due to the high frequency of accesses to the hash table, originated
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.
The appended addresses this with two changes:
1) Use xxhash as the hash table's hash function. xxhash is a fast,
high-quality hashing function.
2) Feed the hashing function with not just tb_phys, but also pc and flags.
This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TB's, while others getting very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.
Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though. UPDATE:
On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
> consisting of only a delay slot).
> It may well still turn out to be reasonable to ignore cs_base for hashing.
BTW, after this change the hash table should not be called "tb_hash_phys"
anymore; this is addressed later in this series.
This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time
Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.
Backports commit 42bd32287f3a18d823f2258b813824a39ed7c6d9 from qemu
The function cpu_resume_from_signal() is now always called with a
NULL puc argument, and is rather misnamed since it is never called
from a signal handler. It is essentially forcing an exit to the
top level cpu loop but without raising any exception, so rename
it to cpu_loop_exit_noexc() and drop the useless unused argument.
Backports commit 6886b98036a8f8f5bce8b10756ce080084cef11b from qemu
Since the only caller of page_unprotect() which might cause it to
need to call cpu_resume_from_signal() is handle_cpu_signal() in
the user-mode code, push the longjump handling out to that function.
Since this is the only caller of cpu_resume_from_signal() which
passes a non-NULL puc argument, split the non-NULL handling into
a new cpu_exit_tb_from_sighandler() function. This allows us
to merge the softmmu and usermode implementations of the
cpu_resume_from_signal() function, which are now identical.
Backports commit f213e72f2356b77768b9cb73814a3b26ad5a0099 from qemu
The user-mode-only function tb_invalidate_phys_page() is only
called from two places:
* page_unprotect(), which passes in a non-zero pc, a puc pointer
and the value 'true' for the locked argument
* page_set_flags(), which passes in a zero pc, a NULL puc pointer
and a 'false' locked argument
If the pc is non-zero then we may call cpu_resume_from_signal(),
which does a longjmp out of the calling code (and out of the
signal handler); this is to cover the case of a target CPU with
"precise self-modifying code" (currently only x86) executing
a store instruction which modifies code in the same TB as the
store itself. Rather than doing the longjump directly here,
return a flag to the caller which indicates whether the current
TB was modified, and move the longjump to page_unprotect.
Backports commit 75809229bbf28b371afce14921ff5be98ddc5faa from qemu
mr->ram_block->offset is already aligned to both host and target size
(see qemu_ram_alloc_internal). Remove further masking as it is
unnecessary.
Backports commit e4e697940dff612b789b0858270c20a8b680f78d from qemu
exec-all.h contains TCG-specific definitions. It is not needed outside
TCG-specific files such as translate.c, exec.c or *helper.c.
One generic function had snuck into include/exec/exec-all.h; move it to
include/qom/cpu.h.
Backports commit 63c915526d6a54a95919ebece83fa9ca631b2508 from qemu
This field was used for telling cpu_interrupt() to unlink a chain of TBs
being executed when it worked that way. Now, cpu_interrupt() don't do
this anymore. So we don't need this field anymore.
Backports commit 3213525f8ab48742db09dab18cb9ae6f36a6c921 from qemu
'tb_invalidated_flag' was meant to catch two events:
* some TB has been invalidated by tb_phys_invalidate();
* the whole translation buffer has been flushed by tb_flush().
Then it was checked:
* in cpu_exec() to ensure that the last executed TB can be safely
linked to directly call the next one;
* in cpu_exec_nocache() to decide if the original TB should be provided
for further possible invalidation along with the temporarily
generated TB.
It is always safe to patch an invalidated TB since it is not going to be
used anyway. It is also safe to call tb_phys_invalidate() for an already
invalidated TB. Thus, setting this flag in tb_phys_invalidate() is
simply unnecessary. Moreover, it can prevent from pretty proper linking
of TBs, if any arbitrary TB has been invalidated. So just don't touch it
in tb_phys_invalidate().
If this flag is only used to catch whether tb_flush() has been called
then rename it to 'tb_flushed'. Declare it as 'bool' and stick to using
only 'true' and 'false' to set its value. Also, instead of setting it in
tb_gen_code(), just after tb_flush() has been called, do it right inside
of tb_flush().
In cpu_exec(), this flag is used to track if tb_flush() has been called
and have made 'next_tb' (a reference to the last executed TB) invalid
for linking it to directly call the next TB. tb_flush() can be called
during the CPU execution loop from tb_gen_code(), during TB execution or
by another thread while 'tb_lock' is released. Catch for translation
buffer flush reliably by resetting this flag once before first TB lookup
and each time we find it set before trying to add a direct jump. Don't
touch in in tb_find_physical().
Each vCPU has its own execution loop in multithreaded mode and thus
should have its own copy of the flag to be able to reset it with its own
'next_tb' and don't affect any other vCPU execution thread. So make this
flag per-vCPU and move it to CPUState.
In cpu_exec_nocache(), we only need to check if tb_flush() has been
called from tb_gen_code() called by cpu_exec_nocache() itself. To do
this reliably, preserve the old value of the flag, reset it before
calling tb_gen_code(), check afterwards, and combine the saved value
back to the flag.
This patch is based on the patch "tcg: move tb_invalidated_flag to
CPUState" from Paolo Bonzini <pbonzini@redhat.com>.
Backports commit 6f789be56d3f38e9214dafcfab3bf9be7191f370 from qemu
Unify the code of this function with tb_jmp_remove_from_list(). Making
these functions similar improves their readability. Also this could be a
step towards making this function thread-safe.
Backports commit f9c5b66f487a04d3747dc6997b1503f9258df945 from qemu
Move the code for removing jumps to a TB out of tb_phys_invalidate() to
a separate static inline function tb_jmp_unlink(). This simplifies
tb_phys_invalidate() and improves code structure.
Backports commit 89bba496322d4cf996d42cdd4bb0912231656c3d from qemu
tb_jmp_remove() was only used to remove the TB from a list of all TBs
jumping to the same TB which is n-th jump destination of the given TB.
Put a comment briefly describing the function behavior and rename it to
better reflect its purpose.
Backports commit 133626783aa5a1bf86332fa3e6f7b8efe005f924 from qemu
Initialize TB's direct jump list data fields and reset the jumps before
tb_link_page() puts it into the physical hash table and the physical
page list. So TB is completely initialized before it becomes visible.
This is pure rearrangement of code to a more suitable place, though it
could be a preparation for relaxing the locking scheme in future.
Backports commit 901bc3deb43bf37c85e43955905d003be7ae5fa5 from qemu
These fields do not contain pure pointers to a TranslationBlock
structure. So uintptr_t is the most appropriate type for them.
Also put some asserts to assure that the two least significant bits of
the pointer are always zero before assigning it to jmp_list_first.
Backports commit c37e6d7e3589ecb96914faa21025ad7ba6654aea from qemu
Briefly describe in a comment how direct block chaining is done. It
should help in understanding of the following data fields.
Rename some fields in TranslationBlock and TCGContext structures to
better reflect their purpose (dropping excessive 'tb_' prefix in
TranslationBlock but keeping it in TCGContext):
tb_next_offset => jmp_reset_offset
tb_jmp_offset => jmp_insn_offset
tb_next => jmp_target_addr
jmp_next => jmp_list_next
jmp_first => jmp_list_first
Avoid using a magic constant as an invalid offset which is used to
indicate that there's no n-th jump generated.
Backports commit f309101c26b59641fc1aa8fb2a98a5441cdaea03 from qemu
The setting of tcg_ctx.code_gen_buffer_size is done by the only caller of
size_code_gen_buffer(), which is code_gen_alloc():
$ git grep size_code_gen_buffer
translate-all.c:static inline size_t size_code_gen_buffer(size_t tb_size)
translate-all.c: tcg_ctx.code_gen_buffer_size = size_code_gen_buffer(tb_size);
Backports commit 835154b6e2200460f04719d0028716a37c178368 from qemu
We are inconsistent with the type of tb->flags: usage varies loosely
between int and uint64_t. Settle to uint32_t everywhere, which is
superior to both: at least one target (aarch64) uses the most significant
bit in the u32, and uint64_t is wasteful.
Compile-tested for all targets.
Backports commit 89fee74a0f066dfd73830a7b5fa137e87888c870 from qemu
Since 5e5f07e08 "TCG: Move translation block variables
to new context inside tcg_ctx: tb_ctx" on Feb 1 2013, compilation
of usermode + TB_DEBUG_CHECK has been broken. Fix it.
Backports commit 7e6bd36d61129feb7f667cb09ffec1b7b54b971c from qemu
qemu-log: dfilter-ise exec, out_asm, op and opt_op
This ensures the code generation debug code will honour -dfilter if set.
For the "exec" tracing I've added a new inline macro for efficiency's
sake.
Backports commit d977e1c2dbc9e63454b2000f91954d02543bf43b from qemu
My later debugging patches need access to the origin PC which is held in
the TranslationBlock structure. Pass down the whole structure as it also
holds the information about the code start point.
Backports commit 5bd2ec3d7b47b2252745882795d79aef36380fb7 from qemu
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes.
Backports commit d38ea87ac54af64ef611de434d07c12dc0399216 from qemu
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes.
Backports commit 7b31bbc2e68605ab2f10dc609dd54cf4c7b5f49a from qemu
The TARGET_HAS_ICE #define is intended to indicate whether a target-*
guest CPU implementation supports the breakpoint handling. However,
all our guest CPUs have that support (the only two which do not
define TARGET_HAS_ICE are unicore32 and openrisc, and in both those
cases the bp support is present and the lack of the #define is just
a bug). So remove the #define entirely: all new guest CPU support
should include breakpoint handling as part of the basic implementation.
Backports commit ec53b45bcd1f74f7a4c31331fa6d50b402cd6d26 from qemu
Anthony reported that >4GB guests on Xen with 32bit QEMU broke after
commit 4ed023c ("Round up RAMBlock sizes to host page sizes", 2015-11-05).
In that patch sizes are masked against qemu_host_page_size/mask which
are uintptr_t, and thus 32bit on a 32bit QEMU, even though the ram space
might be bigger than 4GB on Xen.
Since ram_addr_t is not available on user-mode emulation targets, ensure
that we get a sign extension when masking away the low bits of the address.
Remove the ~10 year old scary comment that the type of these variables
is probably wrong, with another equally scary comment. The new comment
however does not have "???" in it, which is arguably an improvement.
For completeness use the alignment macros in linux-user and bsd-user
instead of manually doing an &. linux-user and bsd-user are not affected
by the Xen issue, however.
Backports commit 0c2d70c448b7853a91cfa63659aa3cc6630fb9be from qemu
Restrict the size of code_gen_buffer to 2GB on ppc64, which
lets us assert that everything is reachable with addis+addi
from tb_ret_addr. This lets us use a max of 4 insns for goto_tb
instead of 7.
Emit the indirect branch portion of goto_tb up front, which
means we only have to update two insns to update any link.
With a 64-bit store, we can update the link atomically, which
may be required in future.
Backports commit 5bfd75a35c11dd3aa61c73d0d2cd88137c31519c from qemu
We currently pre-compute an worst case code size for any TB, which
works out to be 122kB. Since the average TB size is near 1kB, this
wastes quite a lot of storage.
Instead, check for overflow in between generating code for each opcode.
The overhead of the check isn't measurable and wastage is minimized.
Backports commit b125f9dc7bd68cd4c57189db4da83b0620b28a72 from qemu
This will catch any overflow of the buffer.
Add a native win32 alternative for alloc_code_gen_buffer;
remove the malloc alternative.
Backports commit f293709c6af7a65a9bcec09cdba7a60183657a3e from qemu
By putting the prologue at the end, we risk overwriting the
prologue should our estimate of maximum TB size. Given the
two different placements of the call to tcg_prologue_init,
move the high water mark computation into tcg_prologue_init.
Backports commit 8163b74938d8b7d12e70597c4553dd0dc49443d5 from qemu
In this case, QEMU might longjmp out of cpu-exec.c and miss the final
cleanup in cpu_exec_nocache. Do this manually through a new compile
flag.
Backports commit d8a499f17ee5f05407874f29f69f0e3e3198a853 from qemu
The gen_opc_* arrays are already redundant with the data stored in
the insn_start arguments. Transition restore_state_to_opc to use
data from the latter.
Backports commit bad729e272387de7dbfa3ec4319036552fc6c107 from qemu
Move this function to common code. It has no arch specific
dependencies. Prepares support for multi-arch where the translate-all
interface needs to be virtualised. One less thing to virtualise.
Backports commit 9b68a7754a892d8deb7696cfe609fe2ec3c6034a from qemu
There is some iffy lock hierarchy going on in translate-all.c. To
fix it, we need to take the mmap_lock in cpu-exec.c. Make the
functions globally available.
Backports commit 8fd19e6cfd5b6cdf028c6ac2ff4157ed831ea3a6 from qemu
All of the core-code usages of this API have the cpu pointer handy so
pass it in. There are only 3 architecture specific usages (2 of which
are commented out) which can just use ENV_GET_CPU() locally to get the
cpu pointer. The reduces core code usage of the CPU env, which brings
us closer to common-obj'ing these core files.
Backports commit bbd77c180d7ff1b04a7661bb878939b2e1d23798 from qemu
This is one of very few things in exec-all with a genuine CPU
architecture dependency. Move these hashing helpers to a new
header to trim exec-all.h down to a near architecture-agnostic
header.
The defs are only used by cpu-exec and translate-all which are both
arch-obj's so the new tb-hash.h has no core code usage.
Backports commit e1b89321bafea9fb33d87852fc91fee579d17dfe from qemu
The tb_check_watchpoint function currently assumes that all memory
access is done either directly through the TCG code or through an
helper which knows its return address. This is obviously wrong as the
helpers use cpu_ldxx/stxx_data functions to access the memory.
Instead of aborting in that case, don't try to retranslate the code, but
assume that the CPU state (and especially the program counter) has been
saved before calling the helper. Then invalidate the TB based on this
address.
Backports commit 8d302e76755b8157373073d7107e31b0b13f80c1 from qemu
is_cpu_write_access is only set if tb_invalidate_phys_page_range is called
from tb_invalidate_phys_page_fast, and hence from notdirty_mem_write.
However:
- the code bitmap can be built directly in tb_invalidate_phys_page_fast
(unconditionally, since is_cpu_write_access would always be passed as 1);
- the virtual address is not needed to mark the page as "not containing
code" (dirty code bitmap = 1), so we can also remove that use of
is_cpu_write_access. For calls of tb_invalidate_phys_page_range
that do not come from notdirty_mem_write, the next call to
notdirty_mem_write will notice that the page does not contain code
anymore, and will fix up the TLB entry.
The parameter needs to remain in order to guard accesses to cpu->mem_io_pc.
Backports commit fc377bcf617a48233a99a9fe0a26247c38b5cb76 from qemu
These days modification of the TLB is done in notdirty_mem_write,
so the virtual address and env pointer as unnecessary.
The new name of the function, tlb_unprotect_code, is consistent with
tlb_protect_code.
Backports commit 9564f52da7eb061326956ed9a468935e3352512d from qemu
Correct MIPS16/microMIPS branch size calculation in PC adjustment
needed:
- to set the value of CP0.ErrorEPC at the entry to the reset exception,
- for the purpose of branch reexecution in the context of device I/O.
Follow the approach taken in `exception_resume_pc' for ordinary, Debug
and NMI exceptions.
MIPS16 and microMIPS branches can be 2 or 4 bytes in size and that has
to be reflected in calculation. Original MIPS ISA branches, which is
where this code originates from, are always 4 bytes long, just as all
original MIPS ISA instructions.
Backports commit c3577479815f5bcf9d38993967bca2115af245d8 from qemu