Commit graph

183 commits

Author SHA1 Message Date
Yang Zhong 1135db176f
tcg: add CONFIG_TCG guards in headers
Add CONFIG_TCG around TLB-related functions and structure declarations.
Some of these functions are defined in ./accel/tcg/cputlb.c, which will
not be linked in if TCG is disabled, and have no stubs; therefore, their
callers will also be compiled out for --disable-tcg.

Backports commit b11ec7f2e44b285a3967d629b55d1a6970b06787 from qemu
2018-03-03 21:37:52 -05:00
Yang Zhong d70c141675
tcg: move page_size_init() function
translate-all.c will be disabled if tcg is disabled in the build,
so page_size_init() function and related variables will be moved
to exec.c file.

Backports commit a0be0c585f5dcc4d50a37f6a20d3d625c5ef3a2c from qemu
2018-03-03 21:30:08 -05:00
Thomas Huth cf5d583ef0
cpu: Introduce a wrapper for tlb_flush() that can be used in common code
Commit 1f5c00cfdb8114c ("qom/cpu: move tlb_flush to cpu_common_reset")
moved the call to tlb_flush() from the target-specific reset handlers
into the common code qom/cpu.c file, and protected the call with
"#ifdef CONFIG_SOFTMMU" to avoid that it is called for linux-user
only targets. But since qom/cpu.c is common code, CONFIG_SOFTMMU is
*never* defined here, so the tlb_flush() was simply never executed
anymore. Fix it by introducing a wrapper for tlb_flush() in a file
that is re-compiled for each target, i.e. in translate-all.c.

Backports commit 2cd53943115be5118b5b2d4b80ee0a39c94c4f73 from qemu
2018-03-03 21:24:55 -05:00
Emilio G. Cota 1a4e5da043
gen-icount: use tcg_ctx.tcg_env instead of cpu_env
We are relying on cpu_env being defined as a global, yet most
targets (i.e. all but arm/a64) have it defined as a local variable.
Luckily all of them use the same "cpu_env" name, but really
compilation shouldn't break if the name of that local variable
changed.

Fix it by using tcg_ctx.tcg_env, which all targets set in their
translate_init function. This change also helps paving the way
for the upcoming "translation loop common to all targets" work.

Backports commit 53f6672bcf57d82b794a2cc3a3469be7d35c8653 from qemu
2018-03-03 21:08:58 -05:00
Richard Henderson 68275ba6f3
tcg/arm: Use indirect branch for goto_tb
Backports commit 3fb53fb4d12f2e7833bd1659e6013237b130ef20 from qemu
2018-03-03 17:11:18 -05:00
Emilio G. Cota d3ada2feb5
tcg: allocate TB structs before the corresponding translated code
Allocating an arbitrarily-sized array of tbs results in either
(a) a lot of memory wasted or (b) unnecessary flushes of the code
cache when we run out of TB structs in the array.

An obvious solution would be to just malloc a TB struct when needed,
and keep the TB array as an array of pointers (recall that tb_find_pc()
needs the TB array to run in O(log n)).

Perhaps a better solution, which is implemented in this patch, is to
allocate TB's right before the translated code they describe. This
results in some memory waste due to padding to have code and TBs in
separate cache lines--for instance, I measured 4.7% of padding in the
used portion of code_gen_buffer when booting aarch64 Linux on a
host with 64-byte cache lines. However, it can allow for optimizations
in some host architectures, since TCG backends could safely assume that
the TB and the corresponding translated code are very close to each
other in memory. See this message by rth for a detailed explanation:

https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html
Subject: Re: GSoC 2017 Proposal: TCG performance enhancements

Backports commit 6e3b2bfd6af488a896f7936e99ef160f8f37e6f2 from qemu
2018-03-03 17:05:49 -05:00
Emilio G. Cota 7d0440dec4
tb-hash: improve tb_jmp_cache hash function in user mode
Optimizations to cross-page chaining and indirect branches make
performance more sensitive to the hit rate of tb_jmp_cache.
The constraint of reserving some bits for the page number
lowers the achievable quality of the hashing function.

However, user-mode does not have this requirement. Thus,
with this change we use for user-mode a hashing function that
is both faster and of better quality than the previous one.

Measurements:

Note: baseline (i.e. speedup == 1x) is QEMU v2.9.0.

- SPECint06 (test set), x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz

2.2x +-+--------------------------------------------------------------------------------------------------------------+-+
| |
| jr |
2x +jr+multhash +....................................................+++++...................................+-+
| jr+hash |$$$ |
| |$+$ |
| ### $ |
1.8x +-+......................................................................#|#.$...................................+-+
| ++#+# $ |
| |# # $ |
1.6x +-+....................................................................***.#.$....................++$$$..........+-+
| $$$ *+* # $ |$+$ |
| ++$$$ ### $ * * # $ +++|$ $ |
| ++###+$ # # $ * * # $ ### ****## $ |
1.4x +-+...................***+#.$.........***.#.$..........................*.*.#.$...........#+#$$.*++*|#.$..........+-+
| *+* # $ * * # $ * * # $ # # $ * *+# $ |
| * * # $ +++++ * * # $ * * # $ *** # $ * * # $ ###$$ |
1.2x +-+...................*.*.#.$.***##$$.*.*.#.$..........................*.*.#.$.........*.*.#.$.*..*.#.$.***+#+$..+-+
| * * # $ *+* # $ * * # $ +++ * * # $ ++###$$ * * # $ * * # $ * * # $ |
| ***##$$ * * # $ * * # $ * * # $ ***##$$ ++### * * # $ *** #+$ * * # $ * * # $ * * # $ |
| *+*+#+$ ***##$$$ * * # $ * * # $ * * # $ *+* # $ ++####$$ ***+# * * # $ * * # $ * * # $ * * # $ * * # $ |
1x +-++-*+*+#+$+*+*+#-+$+*+*-#+$+*+*+#+$+*+*+#+$+*-*+#+$+***++#+$+*+*+#$$+*+*+#+$+*+*+#+$+*+*-#+$+*+-*+#+$+*+*+#+$-++-+
| * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ |
| * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ * * # $ |
0.8x +-+--***##$$-***##$$$-***##$$-***##$$-***##$$-***##$$-***###$$-***##$$-***##$$-***##$$-***##$$-****##$$-***##$$--+-+
astar bzip2 gcc gobmk h264ref hmmlibquantum mcf omnetpperlbench sjengxalancbmk hmean
png: http://imgur.com/4UXTrEc

Here I also tried the hash function suggested by Paolo ("multhash"):

return ((uint64_t) (pc * 2654435761) >> 32) & (TB_JMP_CACHE_SIZE - 1);

As you can see it is just as good as the other new function ("hash"),
which is what I ended up going with.

- SPECint06 (train set), x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz

2.6x +-+--------------------------------------------------------------------------------------------------------------+-+
| |
| jr ### |
2.4x +jr+hash...........................................................................................#.#...........+-+
| # # |
| # # |
2.2x +-+................................................................................................#.#...........+-+
| # # |
| # # |
2x +-+................................................................................................#.#...........+-+
| **** # |
| * * # |
1.8x +-+.............................................................................................*..*.#...........+-+
| +++ * * # |
| #### #### * * # |
1.6x +-+......................................####.............................#..#.****..#..........*..*.#...........+-+
| +++ #++# **** # * * # #### * * # |
| ### # # * * # * * # # # * * # |
1.4x +-+...................****+#..........****..#..........................*..*..#.*..*..#....#..#..*..*.#...........+-+
| *++* # * * # * * # * * # *** # * * # #### |
| * * # #### * * # * * # * * # * * # * * # **** # |
1.2x +-+...................*..*.#..****++#.*..*..#..........................*..*..#.*..*..#..*.*..#..*..*.#..*..*..#..+-+
| ****### * * # * * # * * # * * # * * # * * # * * # * * # |
| * * # ***### * * # * * # * * # ****## * * # * * # * * # * * # * * # |
1x +-+--****###--***###--****##--****###-****###--***###--***###--****##--****###-****###--***###--****##--****###--+-+
astar bzip2 gcc gobmk h264ref hmmlibquantum mcf omnetpperlbench sjengxalancbmk hmean
png: http://imgur.com/ArCbHqo

- NBench, x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz

1.12x +-+-------------------------------------------------------------------------------------------------------------+-+
| |
| jr +++ |
1.1x +jr+hash...........................................................####.........................................+-+
| +++#| # |
| | #++# |
1.08x +-+................................+++................+++.+++..*****..#.........................................+-+
| | +++ | | * | * # |
| | | | | *+++* # |
1.06x +-+................................****###.............|...|...*...*..#.........................+++.............+-+
| *| * |# ****### * * # | |
| *| *++# *| * |# * * # #### |
1.04x +-+................................*++*..#............*|.*.|#..*...*..#........................#.|#.............+-+
| * * # *++*++# * * # +++#++# |
| * * # * * # * * # | # # +++#### |
1.02x +-+................................*..*..#......+++...*..*..#..*...*..#.....................****..#..*****++#...+-+
| +++ * * # +++ | * * # * * # +++ *| * # *+++* # |
| +++ | +++ +++ ++++++ * * # *****### * * # * * # | +++ ++++++ *++* # * * # |
1x +-++-+++++####++****###++++-+####+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-+++####-+*****###++*++*++#++*+-+*++#+-++-+
| *****| # *++* |# *****| # * * # * *++# * * # * * # **** |# * * # * * # * * # |
| * | *| # * *++# * | *++# * * # * * # * * # * * # *| *++# * * # * * # * * # |
0.98x +-+...*.|.*++#..*..*..#..*+++*..#..*..*..#..*...*..#..*..*..#..*...*..#..*++*..#..*...*..#..*..*..#..*...*..#...+-+
| *+++* # * * # * * # * * # * * # * * # * * # * * # * * # * * # * * # |
| * * # * * # * * # * * # * * # * * # * * # * * # * * # * * # * * # |
0.96x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
ASSIGNMENT BITFIELD FOURFP EMULATION HUFFMAN LU DECOMPOSITIONEURAL NNUMERIC SOSTRING SORT hmean
png: http://imgur.com/ZXFX0hJ

- NBench, arm-linux-user. Host: Intel i7-4790K @ 4.00GHz

1.3x +-+-------------------------------------------------------------------------------------------------------------+-+
| #### |
| jr # # +++ |
1.25x +jr+hash.....................#..#...........................................####................................+-+
| # # # # |
| # # # # |
1.2x +-+..........................#..#...........................................#..#................................+-+
| # # # # |
| # # # # |
1.15x +-+..........................#..#...........................................#..#................................+-+
| # # #### # # |
| # # # # # # |
1.1x +-+..........................#..#..................................#..#.....#..#................................+-+
| # # # # # # +++ |
| # # #### # # # # #### |
1.05x +-+..........................#..#...............#..#.....####......#..#.....#..#.........................#..#...+-+
| # # # # # # # # # # +++ # # |
| +++ ***** # #### ***** # # # +++# # **** # ****### # # |
1x +-++-+*****###++****+++++*+-+*++#+-****++#-+*+++*-+#+++++#++#++*****++#+-*++*++#-+*****-++++*++*++#++*****++#+-++-+
| * * # * * | * * # * * # * * # **** # * * # * * # * *### * *++# * * # |
| * * # * *### * * # * * # * * # * * # * * # * * # * * # * * # * * # |
0.95x +-+...*...*..#..*..*.|#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#...+-+
| * * # * * |# * * # * * # * * # * * # * * # * * # * * # * * # * * # |
| * * # * * |# * * # * * # * * # * * # * * # * * # * * # * * # * * # |
0.9x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
ASSIGNMENT BITFIELD FOURFP EMULATION HUFFMAN LU DECOMPOSITIONEURAL NNUMERIC SOSTRING SORT hmean
png: http://imgur.com/FfD27ey

Backports commit 6f1653180f5701c6a8f1b35b89a80b1e3260928e from qemu
2018-03-03 14:11:29 -05:00
Emilio G. Cota 8f4f15e5f5
tcg: Introduce goto_ptr opcode and tcg_gen_lookup_and_goto_ptr
Instead of exporting goto_ptr directly to TCG frontends, export
tcg_gen_lookup_and_goto_ptr(), which calls goto_ptr with the pointer
returned by the lookup_tb_ptr() helper. This is the only use case
we have for goto_ptr and lookup_tb_ptr, so having this function is
very convenient. Furthermore, it trivially allows us to avoid calling
the lookup helper if goto_ptr is not implemented by the backend.

Backports commit cedbcb01529cb6cf9a2289cdbebbc63f6149fc18 from qemu
2018-03-02 21:05:18 -05:00
Peter Xu fce1b469e5
memory: tune last param of iommu_ops.translate()
This patch converts the old "is_write" bool into IOMMUAccessFlags. The
difference is that "is_write" can only express either read/write, but
sometimes what we really want is "none" here (neither read nor write).
Replay is an good example - during replay, we should not check any RW
permission bits since thats not an actual IO at all.

Backports commit bf55b7afce53718ef96f4e6616da62c0ccac37dd from qemu
2018-03-02 18:59:12 -05:00
Paolo Bonzini c27870520a
exec: revert MemoryRegionCache
MemoryRegionCache did not know about virtio support for IOMMUs (because the
two features were developed at the same time). Revert MemoryRegionCache
to "normal" address_space_* operations for 2.9, as it is simpler than
undoing the virtio patches.

Backports commit 90c4fe5fc517a045e7a7cf2f23472e114042ca29 from qemu
2018-03-02 14:30:41 -05:00
Dr. David Alan Gilbert 55d79cf4c0
RAMBlocks: qemu_ram_is_shared
Provide a helper to say whether a RAMBlock was created as a
shared mapping.

Backports commit 463a4ac23bcf0f0b65c850fa66f5ae6e43edd243 from qemu
2018-03-02 13:05:35 -05:00
Dr. David Alan Gilbert 5dfbee8930
memory_region: Fix name comments
The 'name' parameter to memory_region_init_* had been marked as debug
only, however vmstate_region_ram uses it as a parameter to
qemu_ram_set_idstr to set RAMBlock names and these form part of the
migration stream.

Backports commit e8f5fe2de125a0bfbefbaa6a69af81f4817cb7a0 from qemu
2018-03-02 13:01:23 -05:00
Yongji Xie 23f5b17a08
memory: Introduce DEVICE_HOST_ENDIAN for ram device
At the moment ram device's memory regions are DEVICE_NATIVE_ENDIAN. It's
incorrect. This memory region is backed by a MMIO area in host, so the
uint64_t data that MemoryRegionOps read from/write to this area should be
host-endian rather than target-endian. Hence, current code does not work
when target and host endianness are different which is the most common case
on PPC64. To fix it, this introduces DEVICE_HOST_ENDIAN for the ram device.

This has been tested on PPC64 BE/LE host/guest in all possible combinations
including TCG.

Backports commit c99a29e702528698c0ce2590f06ca7ff239f7c39 from qemu
2018-03-02 11:24:32 -05:00
Alex Bennée 454932263c
cputlb and arm/sparc targets: convert mmuidx flushes from varg to bitmap
While the vargs approach was flexible the original MTTCG ended up
having munge the bits to a bitmap so the data could be used in
deferred work helpers. Instead of hiding that in cputlb we push the
change to the API to make it take a bitmap of MMU indexes instead.

For ARM some the resulting flushes end up being quite long so to aid
readability I've tended to move the index shifting to a new line so
all the bits being or-ed together line up nicely, for example:

tlb_flush_page_by_mmuidx(other_cs, pageaddr,
(1 << ARMMMUIdx_S1SE1) |
(1 << ARMMMUIdx_S1SE0));

Backports commit 0336cbf8532935d8e23c2aabf3e2ce2c0697b6ac from qemu
2018-03-02 10:12:40 -05:00
Alex Bennée e3e57ca08e
cputlb: drop flush_global flag from tlb_flush
We have never has the concept of global TLB entries which would avoid
the flush so we never actually use this flag. Drop it and make clear
that tlb_flush is the sledge-hammer it has always been.

Backports commit  d10eb08f5d8389c814b554d01aa2882ac58221bf from qemu
2018-03-01 19:36:04 -05:00
Jason Wang 29932d0719
memory: handle alias in memory_region_is_iommu()
Backports commit 12d37882f0c0def5dee1c21be5d8fea9c21baada from qemu
2018-03-01 13:06:18 -05:00
Jason Wang fdca6292a1
exec: introduce address_space_get_iotlb_entry()
This patch introduces a helper to query the iotlb entry for a
possible iova. This will be used by later device IOTLB API to enable
the capability for a dataplane (e.g vhost) to query the IOTLB.

Backports commit 052c8fa9983f553fdfa0d61034774070dd639c2b from qemu
2018-03-01 13:05:08 -05:00
Paolo Bonzini 81ad780e5e
exec: introduce MemoryRegionCache
Device models often have to perform multiple access to a single
memory region that is known in advance, but would to use "DMA-style"
functions instead of address_space_map/unmap. This can happen
for example when the data has to undergo endianness conversion.
Introduce a new data structure to cache the result of
address_space_translate without forcing usage of a host address
like address_space_map does.

Backports commit 1f4e496e1fc2eb6c8bf377a0f9695930c380bfd3 from qemu
2018-03-01 10:50:30 -05:00
Paolo Bonzini 88ad0f4f6e
exec: introduce memory_ldst.inc.c
Templatize the address_space_* and *_phys functions, so that we can add
similar functions in the next patch that work with a lightweight,
cache-like version of address_space_map/unmap.

Backports commit 0ce265ffef87f19f4dd1ff0663e09a63d66ae408 from qemu
2018-03-01 09:59:34 -05:00
Paolo Bonzini 9404dbf74e
cpu-exec: fix icount out-of-bounds access
When icount is active, tb_add_jump is surprisingly called with an
out of bounds basic block index. I have no idea how that can work,
but it does not seem like a good idea. Clear *last_tb for all
TB_EXIT_ICOUNT_EXPIRED cases, even when all you have to do is
refill icount_extra.

Backports commit d8dea6fbcbed177ca5d23ab77b3834a9437f0e88 from qemu
2018-03-01 09:17:26 -05:00
Bobby Bingham d46e52d9d0
cpu_ldst.h: use correct guest address parameter
In the user emulation code path, tlb_vaddr_to_host erronesously passed
vaddr as the guest address to be translated, instead of addr, the parameter
which actually contained the guest address.

This resulted in incorrect addresses being used when emulating block copy
(mvc/mvpg) and block clear (xc) instructions for the s390x target.

Backports commit c2a85316902e67530da9d6548139fcce73c0cac6 from qemu
2018-03-01 08:56:37 -05:00
Paolo Bonzini 9d64a89acf
tcg: comment on which functions have to be called with tb_lock held
softmmu requires more functions to be thread-safe, because translation
blocks can be invalidated from e.g. notdirty callbacks. Probably the
same holds for user-mode emulation, it's just that no one has ever
tried to produce a coherent locking there.

This patch will guide the introduction of more tb_lock and tb_unlock
calls for system emulation.

Note that after this patch some (most) of the mentioned functions are
still called outside tb_lock/tb_unlock. The next one will rectify this.

Backports commit 7d7500d99895f888f97397ef32bb536bb0df3b74 from qemu
2018-02-28 10:26:28 -05:00
Alex Bennée 7aab0bd9a6
translate-all: add DEBUG_LOCKING asserts
This adds asserts to check the locking on the various translation
engines structures. There are two sets of structures that are protected
by locks.

The first the l1map and PageDesc structures used to track which
translation blocks are associated with which physical addresses. In
user-mode this is covered by the mmap_lock.

The second case are TB context related structures which are protected by
tb_lock which is also user-mode only.

Currently the asserts do nothing in SoftMMU mode but this will change
for MTTCG.

Backports commit 301e40ed8005306c009978be295ed9a4b725178b from qemu
2018-02-28 08:56:15 -05:00
Yongbok Kim 79e4c001a9
softmmu: Add probe_write()
Probe for whether the specified guest write access is permitted.
If it is not permitted then an exception will be taken in the same
way as if this were a real write access (and we will not return).
Otherwise the function will return, and there will be a valid
entry in the TLB for this access.

Backports commit 3b4afc9e75ab1a95f33e41f462921093f8a109c4 from qemu
2018-02-27 12:20:50 -05:00
Richard Henderson e35aacd5ae
tcg: Add EXCP_ATOMIC
When we cannot emulate an atomic operation within a parallel
context, this exception allows us to stop the world and try
again in a serial context.

Backports commit fdbc2b5722f6092e47181a947c90fd4bdcc1c121 from qemu

Also backports parts of commit 02d57ea115b7669f588371c86484a2e8ebc369be
2018-02-27 11:57:58 -05:00
Peter Maydell db8b0a82b1
cpu: Support a target CPU having a variable page size
Support target CPUs having a page size which isn't knownn
at compile time. To use this, the CPU implementation should:
* define TARGET_PAGE_BITS_VARY
* not define TARGET_PAGE_BITS
* define TARGET_PAGE_BITS_MIN to the smallest value it
might possibly want for TARGET_PAGE_BITS
* call set_preferred_target_page_bits() in its realize
function to indicate the actual preferred target page
size for the CPU (and report any error from it)

In CONFIG_USER_ONLY, the CPU implementation should continue
to define TARGET_PAGE_BITS appropriately for the guest
OS page size.

Machines which want to take advantage of having the page
size something larger than TARGET_PAGE_BITS_MIN must
set the MachineClass minimum_page_bits field to a value
which they guarantee will be no greater than the preferred
page size for any CPU they create.

Note that changing the target page size by setting
minimum_page_bits is a migration compatibility break
for that machine.

For debugging purposes, attempts to use TARGET_PAGE_SIZE
before it has been finally confirmed will assert.

Backports commit 20bccb82ff3ea09bcb7c4ee226d3160cab15f7da from qemu
2018-02-26 12:29:08 -05:00
Paolo Bonzini eb75004013
memory: add a per-AddressSpace list of listeners
This speeds up MEMORY_LISTENER_CALL noticeably. Right now,
with many PCI devices you have N regions added to M AddressSpaces
(M = # PCI devices with bus-master enabled) and each call looks
up the whole listener list, with at least M listeners in it.
Because most of the regions in N are BARs, which are also roughly
proportional to M, the whole thing is O(M^3). This changes it
to O(M^2), which is the best we can do without rewriting the
whole thing.

Backports commit 9a54635dcb51a3fcf7507af630168f514a8cd4e7 from qemu
2018-02-26 10:46:50 -05:00
Paolo Bonzini 4b06e8bbb7
memory: eliminate global MemoryListeners
There is none, so just drop the code.

Backports commit d45fa784cd0c111131696808d1168259d66b7519 from qemu
2018-02-26 10:19:28 -05:00
Richard Henderson 66d79ac959
tcg: Merge GETPC and GETRA
The return address argument to the softmmu template helpers was
confused. In the legacy case, we wanted to indicate that there
is no return address, and so passed in NULL. However, we then
immediately subtracted GETPC_ADJ from NULL, resulting in a non-zero
value, indicating the presence of an (invalid) return address.

Push the GETPC_ADJ subtraction down to the only point it's required:
immediately before use within cpu_restore_state_from_tb, after all
NULL pointer checks have been completed.

This makes GETPC and GETRA identical. Remove GETRA as the lesser
used macro, replacing all uses with GETPC.

Backports commit 01ecaf438b1eb46abe23392c8ce5b7628b0c8cf5 from qemu
2018-02-26 02:54:44 -05:00
Paolo Bonzini 30845ae475
tcg: Prepare TB invalidation for lockless TB lookup
When invalidating a translation block, set an invalid flag into the
TranslationBlock structure first. It is also necessary to check whether
the target TB is still valid after acquiring 'tb_lock' but before calling
tb_add_jump() since TB lookup is to be performed out of 'tb_lock' in
future. Note that we don't have to check 'last_tb'; an already invalidated
TB will not be executed anyway and it is thus safe to patch it.

Backports commit 6d21e4208f382dd8ca1f7995a6dd9ea7ca281163 from qemu
2018-02-26 01:48:13 -05:00
Alex Williamson fe66c2e088
memory: Don't use memcpy for ram_device regions
With a vfio assigned device we lay down a base MemoryRegion registered
as an IO region, giving us read & write accessors. If the region
supports mmap, we lay down a higher priority sub-region MemoryRegion
on top of the base layer initialized as a RAM device pointer to the
mmap. Finally, if we have any quirks for the device (ie. address
ranges that need additional virtualization support), we put another IO
sub-region on top of the mmap MemoryRegion. When this is flattened,
we now potentially have sub-page mmap MemoryRegions exposed which
cannot be directly mapped through KVM.

This is as expected, but a subtle detail of this is that we end up
with two different access mechanisms through QEMU. If we disable the
mmap MemoryRegion, we make use of the IO MemoryRegion and service
accesses using pread and pwrite to the vfio device file descriptor.
If the mmap MemoryRegion is enabled and results in one of these
sub-page gaps, QEMU handles the access as RAM, using memcpy to the
mmap. Using either pread/pwrite or the mmap directly should be
correct, but using memcpy causes us problems. I expect that not only
does memcpy not necessarily honor the original width and alignment in
performing a copy, but it potentially also uses processor instructions
not intended for MMIO spaces. It turns out that this has been a
problem for Realtek NIC assignment, which has such a quirk that
creates a sub-page mmap MemoryRegion access.

To resolve this, we disable memory_access_is_direct() for ram_device
regions since QEMU assumes that it can use memcpy for those regions.
Instead we access through MemoryRegionOps, which replaces the memcpy
with simple de-references of standard sizes to the host memory.

With this patch we attempt to provide unrestricted access to the RAM
device, allowing byte through qword access as well as unaligned
access. The assumption here is that accesses initiated by the VM are
driven by a device specific driver, which knows the device
capabilities. If unaligned accesses are not supported by the device,
we don't want them to work in a VM by performing multiple aligned
accesses to compose the unaligned access. A down-side of this
philosophy is that the xp command from the monitor attempts to use
the largest available access weidth, unaware of the underlying
device. Using memcpy had this same restriction, but at least now an
operator can dump individual registers, even if blocks of device
memory may result in access widths beyond the capabilities of a
given device (RTL NICs only support up to dword).

Backports commit 1b16ded6a512809f99c133a97f19026fe612b2de from qemu
2018-02-25 23:06:36 -05:00
Alex Williamson 5db45219c9
memory: Replace skip_dump flag with ram_device
Setting skip_dump on a MemoryRegion allows us to modify one specific
code path, but the restriction we're trying to address encompasses
more than that. If we have a RAM MemoryRegion backed by a physical
device, it not only restricts our ability to dump that region, but
also affects how we should manipulate it. Here we recognize that
MemoryRegions do not change to sometimes allow dumps and other times
not, so we replace setting the skip_dump flag with a new initializer
so that we know exactly the type of region to which we're applying
this behavior.

Backports commit ca83f87a66d19fdaabf23d4f5ebb49396fe232c1 from qemu
2018-02-25 23:00:45 -05:00
Richard Henderson 1547048a22
tcg: Reorg TCGOp chaining
Instead of using -1 as end of chain, use 0, and link through the 0
entry as a fully circular double-linked list.

Backports commit dcb8e75870e2de199db853697f8839cb603beefe from qemu
2018-02-25 21:44:50 -05:00
Igor Mammedov 62c89b9cd4
exec: Reduce CONFIG_USER_ONLY ifdeffenery
Backports commit 1bc7e522d9cf1b58f2de9c8f1737be0bb5129c35 from qemu
2018-02-25 20:57:48 -05:00
Markus Armbruster c2ffbc575d
Clean up decorations and whitespace around header guards
Cleaned up with scripts/clean-header-guards.pl.

Backports commit 175de52487ce0b0c78daa4cdf41a5a465a168a25 from qemu
2018-02-25 04:26:02 -05:00
Markus Armbruster 1275b9b459
Clean up ill-advised or unusual header guards
Cleaned up with scripts/clean-header-guards.pl.

Backports commit 2a6a4076e117113ebec97b1821071afccfdfbc96 from qemu
2018-02-25 04:22:46 -05:00
Markus Armbruster 9ae2fc4d9e
Clean up header guards that don't match their file name
Header guard symbols should match their file name to make guard
collisions less likely. Offenders found with
scripts/clean-header-guards.pl -vn.

Cleaned up with scripts/clean-header-guards.pl, followed by some
renaming of new guard symbols picked by the script to better ones.

Backports commit 121d07125bb6d7079c7ebafdd3efe8c3a01cc440 from qemu
2018-02-25 04:18:42 -05:00
Markus Armbruster 60e8836b74
Use #include "..." for our own headers, <...> for others
Tracked down with an ugly, brittle and probably buggy Perl script.

Also move includes converted to <...> up so they get included before
ours where that's obviously okay.

Backports commit a9c94277f07d19d3eb14f199c3e93491aa3eae0e from qemu
2018-02-25 04:10:33 -05:00
Sergey Sorokin d1e4ac0451
Fix confusing argument names in some common functions
There are functions tlb_fill(), cpu_unaligned_access() and
do_unaligned_access() that are called with access type and mmu index
arguments. But these arguments are named 'is_write' and 'is_user' in their
declarations. The patches fix the arguments to avoid a confusion.

Backports commit b35399bb4e9968296a12303b00f9f2066470e987 from qemu
2018-02-25 03:58:27 -05:00
Sergey Sorokin e4d123caa9
tcg: Improve the alignment check infrastructure
Some architectures (e.g. ARMv8) need the address which is aligned
to a size more than the size of the memory access.
To support such check it's enough the current costless alignment
check implementation in QEMU, but we need to support
an alignment size specifying.

Backports commit 1f00b27f17518a1bcb4cedca49eaec96a4d560bd from qemu
2018-02-25 02:23:28 -05:00
Peter Maydell efc6cc2b83
memory: Assert that memory_region_init_rom_device() ops aren't NULL
It doesn't make sense to pass a NULL ops argument to
memory_region_init_rom_device(), because the effect will
be that if the guest tries to write to the memory region
then QEMU will segfault. Catch the bug earlier by sanity
checking the arguments to this function, and remove the
misleading documentation that suggests that passing NULL
might be sensible.

Backports commit 39e0b03dec518254fabd2acff29548d3f1d2b754 from qemu
2018-02-25 00:29:52 -05:00
Peter Maydell 334e951ec1
memory: Provide memory_region_init_rom()
Provide a new helper function memory_region_init_rom() for memory
regions which are read-only (and unlike those created by
memory_region_init_rom_device() don't have special behaviour
for writes). This has the same behaviour as calling
memory_region_init_ram() and then memory_region_set_readonly()
(which is what we do today in boards with pure ROMs) but is a
more easily discoverable API for the purpose.

Backports commit a1777f7f6462c66e1ee6e98f0d5c431bfe988aa5 from qemu
2018-02-25 00:28:17 -05:00
Alexey Kardashevskiy 7187d77cfa
memory: Add MemoryRegionIOMMUOps.notify_started/stopped callbacks
The IOMMU driver may change behavior depending on whether a notifier
client is present. In the case of POWER, this represents a change in
the visibility of the IOTLB, for other drivers such as intel-iommu and
future AMD-Vi emulation, notifier support is not yet enabled and this
provides the opportunity to flag that incompatibility.

Backports commit d22d8956b185c002b50a4d0883aff61f857347ef from qemu
2018-02-25 00:23:00 -05:00
Alexey Kardashevskiy 096ca207af
memory: Add reporting of supported page sizes
Every IOMMU has some granularity which MemoryRegionIOMMUOps::translate
uses when translating, however this information is not available outside
the translate context for various checks.

This adds a get_min_page_size callback to MemoryRegionIOMMUOps and
a wrapper for it so IOMMU users (such as VFIO) can know the minimum
actual page size supported by an IOMMU.

As IOMMU MR represents a guest IOMMU, this uses TARGET_PAGE_SIZE
as fallback.

This removes vfio_container_granularity() and uses new helper in
memory_region_iommu_replay() when replaying IOMMU mappings on added
IOMMU memory region.

Backports the relevant parts of commit f682e9c244af7166225f4a50cc18ff296bb9d43e from qemu
2018-02-24 19:23:28 -05:00
Emilio G. Cota ae3e22a689
tb hash: hash phys_pc, pc, and flags with xxhash
For some workloads such as arm bootup, tb_phys_hash is performance-critical.
The is due to the high frequency of accesses to the hash table, originated
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html

To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.

The appended addresses this with two changes:

1) Use xxhash as the hash table's hash function. xxhash is a fast,
high-quality hashing function.

2) Feed the hashing function with not just tb_phys, but also pc and flags.

This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TB's, while others getting very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.

Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though. UPDATE:
On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
> consisting of only a delay slot).
> It may well still turn out to be reasonable to ignore cs_base for hashing.

BTW, after this change the hash table should not be called "tb_hash_phys"
anymore; this is addressed later in this series.

This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time

Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.

Backports commit 42bd32287f3a18d823f2258b813824a39ed7c6d9 from qemu
2018-02-24 18:00:14 -05:00
Emilio G. Cota 9ef9de9cf8
exec: add tb_hash_func5, derived from xxhash
This will be used by upcoming changes for hashing the tb hash.

Add this into a separate file to include the copyright notice from
xxhash.

Backports commit dc8b295d05ec35a8c032f9abca421772347ba5d4 from qemu
2018-02-24 17:36:35 -05:00
Peter Maydell d7dccff836
cpu-exec: Rename cpu_resume_from_signal() to cpu_loop_exit_noexc()
The function cpu_resume_from_signal() is now always called with a
NULL puc argument, and is rather misnamed since it is never called
from a signal handler. It is essentially forcing an exit to the
top level cpu loop but without raising any exception, so rename
it to cpu_loop_exit_noexc() and drop the useless unused argument.

Backports commit 6886b98036a8f8f5bce8b10756ce080084cef11b from qemu
2018-02-24 17:25:28 -05:00
Peter Maydell 8d0faac1dc
qemu-common.h: Drop WORDS_ALIGNED define
The WORDS_ALIGNED #define is not used anywhere, and hasn't been since
2013 when commit 612d590ebc6cef rewrote the various ld<type>_<endian>_p
functions to not use it. Remove the #define and the comment describing it.
Also remove the line in the comment about TARGET_WORDS_ALIGNED, since
it has never actually existed.

Backports commit 0d5c21f2b3bf1e0b562a2c74e353d2e03f2f50ef from qemu
2018-02-24 17:01:55 -05:00
Paolo Bonzini 8df5ad80b1
exec: hide mr->ram_addr from qemu_get_ram_ptr users
Let users of qemu_get_ram_ptr and qemu_ram_ptr_length pass in an
address that is relative to the MemoryRegion. This basically means
what address_space_translate returns.

Because the semantics of the second parameter change, rename the
function to qemu_map_ram_ptr.

Backports commit 0878d0e11ba8013dd759c6921cbf05ba6a41bd71 from qemu
2018-02-24 16:17:49 -05:00
Paolo Bonzini b2e1b34bcc
memory: split memory_region_from_host from qemu_ram_addr_from_host
Move the old qemu_ram_addr_from_host to memory_region_from_host and
make it return an offset within the region. For qemu_ram_addr_from_host
return the ram_addr_t directly, similar to what it was before
commit 1b5ec23 ("memory: return MemoryRegion from qemu_ram_addr_from_host",
2013-07-04).

Backports commit 07bdaa4196b51bc7ffa7c3f74e9e4a9dc8a7966a from qemu
2018-02-24 16:06:49 -05:00