unicorn/qemu
Emilio G. Cota 6bc05eeee4
tb hash: track translated blocks with qht
Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.

Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.

The appended fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:

- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
MRU is implemented here by adding an intermediate struct
that contains the u32 hash and a pointer to the TB; this
allows us, on an MRU promotion, to copy said struct (that is not
at the head), and put this new copy at the head. After a grace
period, the original non-head struct can be eliminated, and
after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
no MRU for lookups; MRU for inserts.

The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
http://imgur.com/a/Awgnq

Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.

Host: Intel Xeon E5-2690

28 ++------------+-------------+-------------+-------------+------------++
A***** + + + master **A*** +
27 ++ * xxhash ##B###++
| A******A****** xxhash-rcu $$C$$$ |
26 C$$ A******A****** qht-fixed-nomru*%%D%%%++
D%%$$ A******A******A*qht-dyn-mru A*E****A
25 ++ %%$$ qht-dyn-nomru &&F&&&++
B#####% |
24 ++ #C$$$$$ ++
| B### $ |
| ## C$$$$$$ |
23 ++ # C$$$$$$ ++
| B###### C$$$$$$ %%%D
22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
| D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@@@E
21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
+ E@@@ F&&& + E@ + F&&& + +
20 ++------------+-------------+-------------+-------------+------------++
14 16 18 20 22 24
log2 number of buckets

Host: Intel i7-4790K

14.5 ++------------+------------+-------------+------------+------------++
A** + + + master **A*** +
14 ++ ** xxhash ##B###++
13.5 ++ ** xxhash-rcu $$C$$$++
| qht-fixed-nomru %%D%%% |
13 ++ A****** qht-dyn-mru @@E@@@++
| A*****A******A****** qht-dyn-nomru &&F&&& |
12.5 C$$ A******A******A*****A****** ***A
12 ++ $$ A*** ++
D%%% $$ |
11.5 ++ %% ++
B### %C$$$$$$ |
11 ++ ## D%%%%% C$$$$$ ++
| # % C$$$$$$ |
10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $$$C
10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
+ F&& D%%%%%%B######B######B#####B###@@@D%%% +
9.5 ++------------+------------+-------------+------------+------------++
14 16 18 20 22 24
log2 number of buckets

Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.

xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be do consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.

qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.

However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.

Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.

The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:

- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time

We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.

A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot); it makes sense to grow the hash table as
more code blocks are translated. This also avoids the complication of
having to build downsizing hysteresis logic into qht.

Backports commit 909eaac9bbc2ed4f3a82ce38e905b87d478a3e00 from qemu
2018-03-13 14:16:26 -04:00
..
accel tb hash: track translated blocks with qht 2018-03-13 14:16:26 -04:00
crypto crypto: Clean up includes 2018-02-19 00:47:40 -05:00
default-configs arm64eb: add support for ARM64 big endian. 2017-04-24 23:30:01 +08:00
docs docs: clarify memory region lifecycle 2018-02-12 15:11:21 -05:00
fpu softfloat: fix crash on int conversion of SNaN 2018-03-09 11:40:17 -05:00
hw target/arm: Make 'any' CPU just an alias for 'max' 2018-03-12 10:11:49 -04:00
include tb hash: track translated blocks with qht 2018-03-13 14:16:26 -04:00
qapi qapi: Move qapi-schema.json to qapi/, rename generated files 2018-03-09 11:35:11 -05:00
qobject qdict: Introduce qdict_rename_keys() 2018-03-12 10:11:48 -04:00
qom machine: Set MachineClass::name automatically 2018-03-11 14:38:58 -04:00
scripts qapi: Move qapi-schema.json to qapi/, rename generated files 2018-03-09 11:35:11 -05:00
target target/mips/translate: Perform comparison pass with qemu 2018-03-12 17:52:56 -04:00
tcg tcg: move tcg backend files into accel/tcg/ 2018-03-13 11:48:15 -04:00
util Backport qht hashtable 2018-03-13 13:55:30 -04:00
aarch64.h qdict: Introduce qdict_rename_keys() 2018-03-12 10:11:48 -04:00
aarch64eb.h qdict: Introduce qdict_rename_keys() 2018-03-12 10:11:48 -04:00
accel.c clean-up: removed duplicate #includes 2018-02-28 08:51:56 -05:00
arm.h qdict: Introduce qdict_rename_keys() 2018-03-12 10:11:48 -04:00
armeb.h qdict: Introduce qdict_rename_keys() 2018-03-12 10:11:48 -04:00
CODING_STYLE import 2015-08-21 15:04:50 +08:00
configure tcg: move tcg backend files into accel/tcg/ 2018-03-13 11:48:15 -04:00
COPYING import 2015-08-21 15:04:50 +08:00
COPYING.LIB import 2015-08-21 15:04:50 +08:00
cpus.c Include qapi/error.h exactly where needed 2018-03-07 12:26:38 -05:00
exec.c exec: Drop unnecessary code for unicorn 2018-03-12 10:11:46 -04:00
gen_all_header.sh arm64eb: add support for ARM64 big endian. 2017-04-24 23:30:01 +08:00
glib_compat.c memory: Share FlatView's and dispatch trees between address spaces 2018-03-11 22:05:44 -04:00
HACKING import 2015-08-21 15:04:50 +08:00
header_gen.py target/mips/translate: Perform comparison pass with qemu 2018-03-12 17:52:56 -04:00
ioport.c hw: remove pio_addr_t 2018-02-24 02:43:16 -05:00
LICENSE import 2015-08-21 15:04:50 +08:00
m68k.h qdict: Introduce qdict_rename_keys() 2018-03-12 10:11:48 -04:00
Makefile qapi: Don't create useless directory qapi-generated 2018-03-09 11:36:49 -05:00
Makefile.objs qapi: Move qapi-schema.json to qapi/, rename generated files 2018-03-09 11:35:11 -05:00
Makefile.target tcg: move tcg backend files into accel/tcg/ 2018-03-13 11:48:15 -04:00
memory.c memory: Share special empty FlatView 2018-03-11 22:34:28 -04:00
memory_ldst.inc.c exec: Drop unnecessary code for unicorn 2018-03-12 10:11:46 -04:00
memory_mapping.c include/qemu/osdep.h: Don't include qapi/error.h 2018-02-21 23:08:18 -05:00
mips.h target/mips/translate: Perform comparison pass with qemu 2018-03-12 17:52:56 -04:00
mips64.h target/mips/translate: Perform comparison pass with qemu 2018-03-12 17:52:56 -04:00
mips64el.h target/mips/translate: Perform comparison pass with qemu 2018-03-12 17:52:56 -04:00
mipsel.h target/mips/translate: Perform comparison pass with qemu 2018-03-12 17:52:56 -04:00
powerpc.h qdict: Introduce qdict_rename_keys() 2018-03-12 10:11:48 -04:00
qemu-timer.c timer/cpus: fix some typos and update some comments 2018-02-25 23:21:57 -05:00
rules.mak build-sys: silence make by default or V=0 2018-03-06 08:58:03 -05:00
sparc.h qdict: Introduce qdict_rename_keys() 2018-03-12 10:11:48 -04:00
sparc64.h qdict: Introduce qdict_rename_keys() 2018-03-12 10:11:48 -04:00
unicorn_common.h tb hash: track translated blocks with qht 2018-03-13 14:16:26 -04:00
VERSION import 2015-08-21 15:04:50 +08:00
vl.c machine: Eliminate QEMUMachine and qemu_register_machine() 2018-03-11 15:22:25 -04:00
vl.h import 2015-08-21 15:04:50 +08:00
x86_64.h qdict: Introduce qdict_rename_keys() 2018-03-12 10:11:48 -04:00