Commit graph

71 commits

Author SHA1 Message Date
Tony Nguyen a95927de1d cputlb: Byte swap memory transaction attribute
Notice the new byte-swap memory transaction attribute and force the
transaction through the memory slow path.

Required by architectures that can invert the endianness of memory
transactions, e.g. SPARC64 with its Invert Endian TTE bit.

Backports commit a26fc6f5152b47f1d7ed928f9c9d462d01ff1624 from qemu
2020-01-07 19:15:33 -05:00
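
As a sketch of the mechanism (attrs.byte_swap is the new attribute; the
TTE bit name and the use of TLB_MMIO to force the slow path here are
illustrative stand-ins, not the exact backported code):

    /* Latch the SPARC64 Invert Endian TTE bit into the transaction
     * attributes when installing the TLB entry... */
    if (tte & TTE_INVERT_ENDIAN_BIT) {   /* hypothetical bit name */
        attrs.byte_swap = 1;             /* the new attribute */
    }
    /* ...and force such accesses through the memory slow path,
     * where the extra byte swap can be applied. */
    if (attrs.byte_swap) {
        address |= TLB_MMIO;             /* illustrative slow-path flag */
    }
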
Tony Nguyen 103d6f51c8 memory: Single byte swap along the I/O path
Now that MemOp has been pushed down into the memory API, and
callers are encoding endianness, we can collapse byte swaps
along the I/O path into the accelerator- and target-independent
adjust_endianness.

Collapsing byte swaps along the I/O path enables additional endian
inversion logic, e.g. SPARC64 Invert Endian TTE bit, with redundant
byte swaps cancelling out.

Backports commit 9bf825bf3df4ebae3af51566c8088e3f1249a910 from qemu
2020-01-07 19:12:04 -05:00
Tony Nguyen ad8957a4c3 cputlb: Replace size and endian operands for MemOp
Preparation for collapsing the two byte swaps adjust_endianness and
handle_bswap into the former.

Backports commit be5c4787e9a6eed12fd765d9e890f7cc6cd63220 from qemu
2020-01-07 19:03:51 -05:00
Tony Nguyen da98d0da4e memory: Access MemoryRegion with endianness
Preparation for collapsing the two byte swaps adjust_endianness and
handle_bswap into the former.

Call memory_region_dispatch_{read|write} with endianness encoded into
the "MemOp op" operand.

This patch does not change any behaviour as
memory_region_dispatch_{read|write} is yet to handle the endianness.

Once it does handle endianness, callers with byte swaps can collapse
them into adjust_endianness.

Backports commit d5d680cacc66ef7e3c02c81dc8f3a34eabce6dfe from qemu
2020-01-07 18:54:11 -05:00
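
The shape of a converted call site, with endianness folded into the
MemOp operand (a sketch; exact call sites vary, and MO_TE denotes
target endianness):

    /* Before: dispatch by size, then swap bytes by hand.
     * After: size and endianness travel together in the MemOp. */
    MemTxResult r = memory_region_dispatch_read(mr, addr, &val,
                                                size_memop(size) | MO_TE,
                                                attrs);
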
Tony Nguyen 3b777a2332 cputlb: Access MemoryRegion with MemOp
The memory_region_dispatch_{read|write} operand "unsigned size" is
being converted into a "MemOp op".

Convert interfaces by using no-op size_memop.

After all interfaces are converted, size_memop will be implemented
and the memory_region_dispatch_{read|write} operand "unsigned size"
will be converted into a "MemOp op".

As size_memop is a no-op, this patch does not change any behaviour.

Backports commit 4cbb198eefef41bbca703605c78875fd4fec6ef6 from qemu
2020-01-07 18:26:29 -05:00
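
For reference, once the conversion is complete the non-no-op
implementation amounts to a size-to-MemOp log2 conversion, roughly:

    static inline MemOp size_memop(unsigned size)
    {
        /* Power-of-2 sizes map onto the MemOp size bits by log2:
         * 1 -> MO_8, 2 -> MO_16, 4 -> MO_32, 8 -> MO_64. */
        return ctz32(size);
    }
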
Tony Nguyen f75368cd0f
tcg: TCGMemOp is now accelerator independent MemOp
Preparation for collapsing the two byte swaps, adjust_endianness and
handle_bswap, along the I/O path.

Target-dependent attributes are conditionalized upon NEED_CPU_H.

Backports commit 14776ab5a12972ea439c7fb2203a4c15a09094b4 from qemu
2019-11-28 03:01:12 -05:00
Emilio G. Cota f4be234ab8
atomic_template: fix indentation in GEN_ATOMIC_HELPER
Backports commit 358f6348df5ad785c7c18be659d4ff9a2174635f from qemu
2019-11-28 02:38:07 -05:00
Lioncash 802c626145
Revert "cputlb: Filter flushes on already clean tlbs"
This reverts commit 5ab9723787.
2019-06-30 19:21:20 -04:00
Richard Henderson a1396b12f6
tcg: Fix typos in helper_gvec_sar{8,32,64}v
The loop is written with scalars, not vectors.
Use the correct type when incrementing.

Fixes: 5ee5c14cacd

Backports commit 899f08ad1d1231dbbfa67298413f05ed2679fb02 from qemu
2019-06-13 16:09:16 -04:00
Alex Bennée 938f8465a0
cputlb: cast size_t to target_ulong before using for address masks
While size_t is defined to be wide enough to span the biggest host
object, that is not wide enough when generating masks for 64-bit guests
on 32-bit hosts; without the cast we end up truncating the address when
we fall back to our unaligned helper.

Fixes: https://bugs.launchpad.net/qemu/+bug/1831545

Backports commit ab7a2009df66241a3742cbdfe8f9a1f66c6af21f from qemu
2019-06-13 16:07:01 -04:00
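
The failure is ordinary C arithmetic: on a 32-bit host size_t is 32
bits, so a mask derived from it silently clears the upper half of a
64-bit guest address. A standalone illustration, with uint64_t standing
in for target_ulong:

    #include <stdint.h>
    #include <stddef.h>

    uint64_t align_down(uint64_t addr, size_t size)
    {
        /* Buggy: ~(size - 1) is a 32-bit mask on a 32-bit host and
         * zero-extends, truncating addr to its low 32 bits:
         *     return addr & ~(size - 1);
         * Fixed: widen to the guest address width first. */
        return addr & ~((uint64_t)size - 1);
    }
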
Alex Bennée 9aef73f5fb
cputlb: use uint64_t for interim values for unaligned load
When running on 32 bit TCG backends a wide unaligned load ends up
truncating data before returning to the guest. We specifically have
the return type as uint64_t to avoid any premature truncation so we
should use the same for the interim types.

Fixes: https://bugs.launchpad.net/qemu/+bug/1830872
Fixes: eed5664238e

Backports commit 8c79b288513587e960b6b7257a9d955d5592f209 from qemu
2019-06-13 16:06:22 -04:00
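
The shape of the fix in the unaligned path: both halves and the merge
are carried in 64-bit temporaries, otherwise the shifts below truncate
on a 32-bit backend (a sketch; the helper names are illustrative):

    /* Interim values as uint64_t, matching the helper's return type.
     * addr is unaligned here, so 0 < shift < size * 8. */
    uint64_t r1 = load_aligned(env, addr1);   /* hypothetical helpers */
    uint64_t r2 = load_aligned(env, addr2);
    unsigned shift = (addr & (size - 1)) * 8;
    uint64_t res = big_endian
                 ? (r1 << shift) | (r2 >> ((size * 8) - shift))
                 : (r1 >> shift) | (r2 << ((size * 8) - shift));
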
Richard Henderson d7ea41c3a3
cpu: Move icount_decr to CPUNegativeOffsetState
Amusingly, we had already ignored the comment to keep this value
at the end of CPUState. This restores the minimum negative offset
from TCG_AREG0 for code generation.

For the couple of uses within qom/cpu.c, without NEED_CPU_H, add
a pointer from the CPUState object to the IcountDecr object within
CPUNegativeOffsetState.

Backports commit 5e1401969b25f676fee6b1c564441759cf967a43 from qemu
2019-06-13 15:34:28 -04:00
Richard Henderson fbf91a6535
cpu: Replace ENV_GET_CPU with env_cpu
Now that we have both ArchCPU and CPUArchState, we can define
this generically instead of via macro in each target's cpu.h.

Backports commit 29a0af618ddd21f55df5753c3e16b0625f534b3c from qemu
2019-06-12 11:16:16 -04:00
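
The generic definition reduces to pointer arithmetic on the unified
layout, approximately:

    static inline CPUState *env_cpu(CPUArchState *env)
    {
        /* ArchCPU embeds CPUState, so no per-target macro is needed
         * to get from the env pointer back to the CPUState. */
        return &env_archcpu(env)->parent_obj;
    }
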
Lioncash 5ab9723787
cputlb: Filter flushes on already clean tlbs
Especially for guests with large numbers of tlbs, like ARM or PPC,
we may well not use all of them in between flush operations.
Remember which tlbs have been used since the last flush, and
avoid any useless flushing.

Backports much of 3d1523ced6060cdfe9e768a814d064067ccabfe5 from qemu
along with a bunch of updating changes.
2019-06-10 20:42:15 -04:00
Richard Henderson ca58be9cb4
tcg: Add support for vector bitwise select
This operation performs d = (b & a) | (c & ~a), and is present
on a majority of host vector units. Include gvec expanders.

Backports commit 38dc12947ec9106237f9cdbd428792c985cd86ae from qemu
2019-05-24 18:15:10 -04:00
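
Per lane the operation is the classic bit-select; a standalone scalar
model of what the vector op computes:

    #include <stdint.h>

    /* d = (b & a) | (c & ~a): each set bit of 'a' selects the
     * corresponding bit from 'b', each clear bit from 'c'. */
    static inline uint64_t bitsel(uint64_t a, uint64_t b, uint64_t c)
    {
        return (b & a) | (c & ~a);
    }
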
Richard Henderson 2a4a7b9391
tcg: Use tlb_fill probe from tlb_vaddr_to_host
Most of the existing users would continue around a loop which
would fault the tlb entry in via a normal load/store.

But for AArch64 SVE we have an existing emulation bug wherein we
would mark the first element of a no-fault vector load as faulted
(within the FFR, not via exception) just because we did not have
its address in the TLB. Now we can properly only mark it as faulted
if there really is no valid, readable translation, while still not
raising an exception. (Note that beyond the first element of the
vector, the hardware may report a fault for any reason whatsoever;
with at least one element loaded, forward progress is guaranteed.)

Backports commit 4811e9095c0491bc6f5450e5012c9c4796b9e59d from qemu
2019-05-16 18:27:03 -04:00
Richard Henderson dab0061a0d
tcg: Use CPUClass::tlb_fill in cputlb.c
We can now use the CPUClass hook instead of a named function.

Create a static tlb_fill function to avoid other changes within
cputlb.c. This also isolates the asserts within. Remove the
named tlb_fill function from all of the targets.

Backports commit c319dc13579a92937bffe02ad2c9f1a550e73973 from qemu
2019-05-16 17:35:37 -04:00
Richard Henderson 6d5e7856ff
tcg: Add support for vector absolute value
Backports commit bcefc90208f8a1d6f619d61c2647281d92277015 from qemu
2019-05-16 16:33:43 -04:00
Richard Henderson 8c17687934
tcg: Add gvec expanders for variable shift
The gvec expanders perform a modulo on the shift count. If the target
requires alternate behaviour, then it cannot use the generic gvec
expanders anyway, and will have to have its own custom code.

Backports commit 5ee5c14cacda27e904cd6b0d9e7ffe1acff42838 from qemu
2019-05-16 15:51:09 -04:00
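
A scalar model of the semantics, here for 32-bit elements: the
per-element shift count is reduced modulo the element width.

    #include <stdint.h>

    static inline uint32_t shl32_element(uint32_t value, uint32_t shift)
    {
        /* Modulo the element width; a target wanting different
         * out-of-range behaviour must provide custom code. */
        return value << (shift & 31);
    }
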
Richard Henderson 9a02741c13
cputlb: Do unaligned store recursion to outermost function
This is less tricky than for loads, because we always fall
back to single byte stores to implement unaligned stores.

Backports commit 4601f8d10d7628bcaf2a8179af36e04b42879e91 from qemu
2019-05-14 07:45:15 -04:00
Richard Henderson bcab6f1719
cputlb: Do unaligned load recursion to outermost function
If we attempt to recurse from load_helper back to load_helper,
even via intermediary, we do not get all of the constants
expanded away as desired.

But if we recurse back to the original helper (or a shim that
has a consistent function signature), the operands are folded
away as desired.

Backports commit 2dd926067867c2dd19e66d31a7990e8eea7258f6 from qemu
2019-05-14 07:43:31 -04:00
Richard Henderson f12f36aebd
cputlb: Drop attribute flatten
Going to approach this problem via __attribute__((always_inline))
instead, but full conversion will take several steps.

Backports commit fc1bc777910dc14a3db4e2ad66f3e536effc297d from qemu
2019-05-14 07:33:39 -04:00
Richard Henderson 7991cd601f
cputlb: Move TLB_RECHECK handling into load/store_helper
Having this in io_readx/io_writex meant that we forgot to
re-compute index after tlb_fill. It also means we can use
the normal aligned memory load path. It also fixes a bug
in that we had cached a use of index across a tlb_fill.

Backports commit f1be36969de2fb9b6b64397db1098f115210fcd9 from qemu
2019-05-14 07:28:15 -04:00
Alex Bennée ccee796272
accel/tcg: demacro cputlb
Instead of expanding a series of macros to generate the load/store
helpers we move stuff into common functions and rely on the compiler
to eliminate the dead code for each variant.

Backports commit eed5664238ea5317689cf32426d9318686b2b75c from qemu
2019-05-14 07:28:11 -04:00
Richard Henderson 8fdd009a9d
tcg: Remove CF_IGNORE_ICOUNT
Now that we have curr_cflags, we can include CF_USE_ICOUNT
early and then remove it as necessary.

Backports commit 416986d3f97329655e30da7271a2d11c6d707b06 from qemu
2019-05-06 00:57:09 -04:00
Emilio G. Cota b1b069e8ad
cpu-exec: lookup/generate TB outside exclusive region during step_atomic
Now that all code generation has been converted to check CF_PARALLEL, we can
generate !CF_PARALLEL code without having yet set !parallel_cpus --
and therefore without having to be in the exclusive region during
cpu_exec_step_atomic.

While at it, merge cpu_exec_step into cpu_exec_step_atomic.

Backports commit ac03ee5331612e44beb393df2b578c951d27dc0d from qemu
2019-05-06 00:52:43 -04:00
Emilio G. Cota c1e26c4e35
tcg: check CF_PARALLEL instead of parallel_cpus
Thereby decoupling the resulting translated code from the current state
of the system.

The tb->cflags field is not passed to tcg generation functions. So
we add a field to TCGContext, storing there a copy of tb->cflags.

Most architectures have <= 32 registers, which results in a 4-byte hole
in TCGContext. Use this hole for the new field.

Backports commit e82d5a2460b0e176128027651ff9b104e4bdf5cc from qemu
2019-05-06 00:52:08 -04:00
Richard Henderson 30c0950567
tcg: Add CPUState cflags_next_tb
We were generating code during tb_invalidate_phys_page_range,
check_watchpoint, cpu_io_recompile, and (seemingly) discarding
the TB, assuming that it would magically be picked up during
the next iteration through the cpu_exec loop.

Instead, record the desired cflags in CPUState so that we request
the proper TB so that there is no more magic.

Backports commit 9b990ee5a3cc6aa38f81266fb0c6ef37a36c45b9 from qemu
2019-05-04 22:30:22 -04:00
Richard Henderson ee1ddf4a92
tcg: define CF_PARALLEL and use it for TB hashing along with CF_COUNT_MASK
This will enable us to decouple code translation from the value
of parallel_cpus at any given time. It will also help us minimize
TB flushes when generating code via EXCP_ATOMIC.

Note that the declaration of parallel_cpus is brought to exec-all.h
to be able to define there the "curr_cflags" inline.

Backports commit 4e2ca83e71b51577b06b1468e836556912bd5b6e from qemu
2019-05-04 22:22:06 -04:00
Shahab Vahedi 7f59d62f4a
cputlb: Fix io_readx() to respect the access_type
This change adapts io_readx() to its input access_type. Currently
io_readx() treats any memory access as a read, although it has an
input argument "MMUAccessType access_type". This results in:

1) Calling the tlb_fill() only with MMU_DATA_LOAD
2) Considering only entry->addr_read as the tlb_addr

Buglink: https://bugs.launchpad.net/qemu/+bug/1825359

Backports commit ef5dae6805cce7b59d129d801bdc5db71bcbd60d from qemu
2019-04-30 10:11:11 -04:00
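
The shape of the fix (a sketch): both the fill and the comparator now
follow access_type instead of hard-coding the read path.

    /* Fill the TLB for the actual access type, not always a load... */
    tlb_fill(cpu, addr, size, access_type, mmu_idx, retaddr);
    /* ...and pick the matching comparator from the entry. */
    tlb_addr = (access_type == MMU_INST_FETCH) ? entry->addr_code
                                               : entry->addr_read;
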
Richard Henderson 434b3ab9ec
tcg: Restart after TB code generation overflow
If a TB generates too much code, try again with fewer insns.

Fixes: https://bugs.launchpad.net/bugs/1824853

Backports commit 6e6c4efed995d9eca6ae0cfdb2252df830262f50 from qemu
2019-04-30 09:52:57 -04:00
Richard Henderson bca82cde84
tcg: Hoist max_insns computation to tb_gen_code
In order to handle TBs that translate to too much code, we need to
place control of the length of the translation in the hands of the
code gen master loop.

Backports commit 8b86d6d25807e13a63ab6ea879f976b9f18cc45a from qemu
2019-04-30 09:49:57 -04:00
Lioncash 9dfe2b527b
cpu-exec: Synchronize with qemu
2019-04-26 16:07:51 -04:00
Lioncash 5daabe55a4
cputlb: Synchronize with qemu
Synchronizes the code with Qemu to reduce a few differences.
2019-04-26 15:48:45 -04:00
Lioncash 4a64ebf95e
tcg: Synchronize with qemu
2019-04-26 09:32:20 -04:00
Lioncash d844d7cc9d
exec: Backport tb_cflags accessor
2019-04-22 06:12:59 -04:00
Lioncash ccf16bc572
softmmu_template: Fix invalid argument to tlb_fill in helper_be_st_name
This should be passing in the page2 value like in the little-endian
handler.
2019-04-18 08:43:14 -04:00
Emilio G. Cota f31764dd5b
cputlb: update TLB entry/index after tlb_fill
We are failing to take into account that tlb_fill() can cause a
TLB resize, which renders prior TLB entry pointers/indices stale.
Fix it by re-doing the TLB entry lookups immediately after tlb_fill.

Fixes: 86e1eff8bc ("tcg: introduce dynamic TLB sizing", 2019-01-28)

Backports commit 6d967cb86d5b4a60ba15b497126b621ce9ca6609 from qemu
2019-02-12 11:48:48 -05:00
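
The shape of the fix: any cached index/entry is recomputed immediately
after the fill, since tlb_fill() may resize the TLB.

    tlb_fill(cpu, addr, size, access_type, mmu_idx, retaddr);
    /* tlb_fill() can trigger a TLB resize, so cached pointers and
     * indices into the old table are stale; look them up again. */
    index = tlb_index(env, mmu_idx, addr);
    entry = tlb_entry(env, mmu_idx, addr);
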
Thomas Huth 85bc48fecd
tcg: Fix LGPL version number
It's either "GNU *Library* General Public version 2" or "GNU Lesser
General Public version *2.1*", but there was no "version 2.0" of the
"Lesser" library. So assume that version 2.1 is meant here.

Backports commit fb0343d5b4dd4b9b9e96e563d913a3e0c709fe4e from qemu
2019-02-03 17:55:28 -05:00
Richard Henderson fb684825c8
tcg: Add opcodes for vector minmax arithmetic
Backports commit dd0a0fcdd8848c2a18970c44a62bd8f394c2b495 from qemu
2019-01-29 16:24:52 -05:00
Richard Henderson e08d0feee4
tcg: Add gvec expanders for nand, nor, eqv
Backports commit f550805d8309500d642f640af8d9928958465478 from qemu
2019-01-29 15:57:28 -05:00
Lioncash 977ad292b3
accel/translate-all: Get rid of variable shadowing
2019-01-28 09:17:37 -05:00
Lioncash ce8697f978
accel/translate-all: Convert a void* cast into an unsigned char* cast
Strictly speaking, as far as the standard is concerned, performing
pointer arithmetic on a void* type is ill-formed; it is a GNU extension
that allows it. Instead, just use unsigned char*, which preserves the
same behavior.
2019-01-28 09:14:33 -05:00
Peter Maydell 1301becdab
tcg: Support MMU protection regions smaller than TARGET_PAGE_SIZE
Add support for MMU protection regions that are smaller than
TARGET_PAGE_SIZE. We do this by marking the TLB entry for those
pages with a flag TLB_RECHECK. This flag causes us to always
take the slow-path for accesses. In the slow path we can then
special case them to always call tlb_fill() again, so we have
the correct information for the exact address being accessed.

This change allows us to handle reading and writing from small
regions; we cannot deal with execution from the small region.

Backports commit 55df6fcf5476b44bc1b95554e686ab3e91d725c5 from qemu
2018-11-16 21:35:54 -05:00
Lioncash 3a0ab1a64a
Partial backport of: exec.c: Handle IOMMUs in address_space_translate_for_iotlb()
We just want the parameter changes here.

Partial backport of commit 1f871c5e6b0f30644a60a81a6a7aadb3afb030ac from
qemu
2018-11-16 21:24:55 -05:00
Emilio G. Cota 1677898a09
cputlb: read CPUTLBEntry.addr_write atomically
Updates can come from other threads, so readers that do not
take tlb_lock must use atomic_read to avoid undefined
behaviour (UB).

This completes the conversion to tlb_lock. This conversion results
on average in no performance loss, as the following experiments
(run on an Intel i7-6700K CPU @ 4.00GHz) show.

1. aarch64 bootup+shutdown test:

- Before:
Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

7487.087786 task-clock (msec) # 0.998 CPUs utilized ( +- 0.12% )
31,574,905,303 cycles # 4.217 GHz ( +- 0.12% )
57,097,908,812 instructions # 1.81 insns per cycle ( +- 0.08% )
10,255,415,367 branches # 1369.747 M/sec ( +- 0.08% )
173,278,962 branch-misses # 1.69% of all branches ( +- 0.18% )

7.504481349 seconds time elapsed ( +- 0.14% )

- After:
Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

7462.441328 task-clock (msec) # 0.998 CPUs utilized ( +- 0.07% )
31,478,476,520 cycles # 4.218 GHz ( +- 0.07% )
57,017,330,084 instructions # 1.81 insns per cycle ( +- 0.05% )
10,251,929,667 branches # 1373.804 M/sec ( +- 0.05% )
173,023,787 branch-misses # 1.69% of all branches ( +- 0.11% )

7.474970463 seconds time elapsed ( +- 0.07% )

2. SPEC06int:
SPEC06int (test set)
[Y axis: Speedup over master]
[ASCII bar chart elided: per-benchmark speedups for the twelve
SPEC06int benchmarks (400.perlbench through 483.xalancbmk) plus the
geomean, comparing tlb-lock-v2 and tlb-lock-v3 against master; see the
png link and the notes below.]

png: https://imgur.com/a/BHzpPTW

Notes:
- tlb-lock-v2 corresponds to an implementation with a mutex.
- tlb-lock-v3 corresponds to the current implementation, i.e.
a spinlock and a single lock acquisition in tlb_set_page_with_attrs.

Backports commit 403f290c0603f35f2d09c982bf5549b6d0803ec1 from qemu
2018-10-23 15:37:43 -04:00
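
The reader-side pattern is one line; writers keep updating the field
under tlb_lock. The shape of it:

    /* Lock-free readers must load addr_write atomically, since a
     * concurrent writer may update it under tlb_lock (plain reads
     * racing with writes are undefined behaviour in C11 terms). */
    target_ulong tlb_addr = atomic_read(&tlbe->addr_write);
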
Richard Henderson d74e00a30a
tcg: Split CONFIG_ATOMIC128
GCC7+ will no longer advertise support for 16-byte __atomic operations
if only cmpxchg is supported, as for x86_64. Fortunately, x86_64 still
has support for __sync_compare_and_swap_16 and we can make use of that.
AArch64 does not have, nor has it ever had, such support, so open-code it.

Backports commit e6cd4bb59b8154fa00da611200beef7eb4e8ec56 from qemu
2018-10-23 15:17:39 -04:00
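
On x86_64 the fallback can lean on the legacy __sync builtin, which GCC
still lowers to lock cmpxchg16b. A sketch, assuming -mcx16 and a
16-byte-aligned pointer (qemu's real Int128 type differs):

    #include <stdint.h>

    typedef __int128 Int128;   /* stand-in for qemu's Int128 */

    static inline Int128 atomic16_cmpxchg(Int128 *ptr, Int128 cmp, Int128 new)
    {
        /* GCC7+ no longer advertises 16-byte __atomic support with
         * only cmpxchg available, but the older __sync builtin still
         * compiles down to cmpxchg16b. */
        return __sync_val_compare_and_swap(ptr, cmp, new);
    }
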
Richard Henderson c911ea7128
tcg: Add tlb_index and tlb_entry helpers
Isolate the computation of an index from an address into a
helper before we change that function.

Backports commit 383beda9cf32f795616c3b93f7d6154d70372d4b from qemu
2018-10-23 15:04:27 -04:00
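
The two helpers factor out one expression each, approximately as
introduced (a sketch of the pre-change TLB layout):

    /* Find the TLB index corresponding to an address. */
    static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx,
                                      target_ulong addr)
    {
        return (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
    }

    /* Find the TLB entry corresponding to an address. */
    static inline CPUTLBEntry *tlb_entry(CPUArchState *env, uintptr_t mmu_idx,
                                         target_ulong addr)
    {
        return &env->tlb_table[mmu_idx][tlb_index(env, mmu_idx, addr)];
    }
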
Emilio G. Cota dfb3954571
exec: introduce tlb_init
Paves the way for the addition of a per-TLB lock.

Backports commit 5005e2537d090bee87aca3b924dcd17920fd146a from qemu
2018-10-23 14:41:29 -04:00
Richard Henderson e01deeb9ba
tcg: Implement CPU_LOG_TB_NOCHAIN during expansion
Rather than test NOCHAIN before linking, do not emit the
goto_tb opcode at all. We already do this for goto_ptr.

Backports commit d7f425fdea991f052241c6479acd9feae834063b from qemu
2018-10-23 14:35:12 -04:00
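
The shape of the change: the expander checks the log mask itself and
emits nothing, so execution falls through to the indirect exit.

    void tcg_gen_goto_tb(unsigned idx)
    {
        /* With CPU_LOG_TB_NOCHAIN set, emit no goto_tb op at all;
         * the translation falls through to the fallback exit,
         * just as goto_ptr already behaves. */
        if (!qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
            tcg_gen_op1i(INDEX_op_goto_tb, idx);
        }
    }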