Commit graph

6200 commits

Author SHA1 Message Date
Philippe Mathieu-Daudé 24c56c65a3
qemu/compiler: Define QEMU_NONSTRING
GCC 8 introduced the -Wstringop-truncation checker to detect truncation by
the strncat and strncpy functions (closely related to -Wstringop-overflow,
which detect buffer overflow by string-modifying functions declared in
<string.h>).

In tandem of -Wstringop-truncation, the "nonstring" attribute was added:

The nonstring variable attribute specifies that an object or member
declaration with type array of char, signed char, or unsigned char,
or pointer to such a type is intended to store character arrays that
do not necessarily contain a terminating NUL. This is useful in detecting
uses of such arrays or pointers with functions that expect NUL-terminated
strings, and to avoid warnings when such an array or pointer is used as
an argument to a bounded string manipulation function such as strncpy.

From the GCC manual: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-nonstring-variable-attribute

Add the QEMU_NONSTRING macro which checks if the compiler supports this
attribute.

Backports commit 1daff2f8193496b0e5e0ab56dc48c570c81f804e from qemu
2019-01-22 15:06:09 -05:00
Vitaly Kuznetsov b6cc2c4e06
i386/kvm: add a comment explaining why .feat_names are commented out for Hyper-V feature bits
Hyper-V .feat_names are, unlike hardware features, commented out and it is
not obvious why we do that. Document the current status quo.

Backports commit abd5fc4c862d033a989552914149f01c9476bb16 from qemu
2019-01-14 15:02:35 -05:00
Vitaly Kuznetsov 2873612479
i386/kvm: expose HV_CPUID_ENLIGHTMENT_INFO.EAX and HV_CPUID_NESTED_FEATURES.EAX as feature words
It was found that QMP users of QEMU (e.g. libvirt) may need
HV_CPUID_ENLIGHTMENT_INFO.EAX/HV_CPUID_NESTED_FEATURES.EAX information. In
particular, 'hv_tlbflush' and 'hv_evmcs' enlightenments are only exposed in
HV_CPUID_ENLIGHTMENT_INFO.EAX.

HV_CPUID_NESTED_FEATURES.EAX is exposed for two reasons: convenience
(we don't need to export it from hyperv_handle_properties() and as
future-proof for Enlightened MSR-Bitmap, PV EPT invalidation and
direct virtual flush features.

Backports commit a2b107dbbd342ff2077aa5af705efaf68c375459 from qemu
2019-01-14 15:01:13 -05:00
Eduardo Habkost 3eb700bec7
x86: host-phys-bits-limit option
Backports part of commit 258fe08bd341d2e230676228307294e41f33002c from
qemu. Namely, just adding the struct member.
2019-01-14 14:56:57 -05:00
Paolo Bonzini bf6192276b
target/i386: Disable MPX support on named CPU models
MPX support is being phased out by Intel; GCC has dropped it, Linux
is also going to do that. Even though KVM will have special code
to support MPX after the kernel proper stops enabling it in XCR0,
we probably also want to deprecate that in a few years. As a start,
do not enable it by default for any named CPU model starting with
the 4.0 machine types; this include Skylake, Icelake and Cascadelake.

Backports commit ecb85fe48cacb2f8740186e81f2f38a2e02bd963 from qemu
2019-01-14 14:54:40 -05:00
Borislav Petkov 152fdb49de
target-i386: Reenable RDTSCP support on Opteron_G[345] CPU models CPU models
The missing functionality was added ~3 years ago with the Linux commit

46896c73c1a4 ("KVM: svm: add support for RDTSCP")

so reenable RDTSCP support on those CPU models.

Opteron_G2 - being family 15, model 6, doesn't have RDTSCP support
(the real hardware doesn't have it. K8 got RDTSCP support with the NPT
models, i.e., models >= 0x40).

Document the host's minimum required kernel version, while at it.

Backports commit 483c6ad426dbab72d912fe4793d7d558671aa727 from qemu
2019-01-14 14:50:22 -05:00
Marc-André Lureau 585ebf50f7
build-sys: build with Vista API by default
Both qemu & qga build with Vista API by default already, by defining
_WIN32_WINNT 0x0600. Set it globally in osdep.h instead.

This replaces WINVER by _WIN32_WINNT in osdep.h. WINVER doesn't seem
to be really useful these days.
(see also https://blogs.msdn.microsoft.com/oldnewthing/20070411-00/?p=27283)

Backports commit 56cdca1d7a6a9c8ce28287b8c986ac9ea87ba603 from qemu
2019-01-13 20:28:51 -05:00
Marc-André Lureau 9ff8b70682
build-sys: move windows defines in osdep.h header
This removes some clutter in compilation logging, and allows some
easier tweaking per compilation unit/CFLAGS overriding.

Note that we can't move those define in os-win32.h, since they must be
set before the first system headers are included.

Backports commit 007e722c349839f430f10639ba8c94fe43acfe50 from qemu
2019-01-13 20:27:27 -05:00
Roman Bolshakov 03beb4f15a
qemu-thread: Don't block SEGV, ILL and FPE
If any of these signals happen on macOS, they are not delivered to other
threads and signalfd_compat receives nothing. Indeed, POSIX reference
and sigprocmask(2) note that an attempt to block the signals results in
undefined behaviour. SEGV and FPE can't also be received by signalfd(2)
on Linux.

An ability to retrieve SIGBUS via signalfd(2) is used by QEMU for
memory preallocation therefore we can't unblock it without consequences.
But it's important to leave a remark that the signal is lost on macOS.

Backports commit 21a43af0f18335af4abb1959aa28ee9d159a2d43 from qemu
2019-01-13 19:50:32 -05:00
Peter Maydell 55bc017af4
target/arm: Emit barriers for A32/T32 load-acquire/store-release insns
Now that MTTCG is here, the comment in the 32-bit Arm decoder that
"Since the emulation does not have barriers, the acquire/release
semantics need no special handling" is no longer true. Emit the
correct barriers for the load-acquire/store-release insns, as
we already do in the A64 decoder.

Backports commit 96c552958dbb63453b5f02bea6e704006d50e39a from qemu
2019-01-13 19:48:27 -05:00
Richard Henderson 254f882efc
target/arm: SVE brk[ab] merging does not have s bit
While brk[ab] zeroing has a flags setting option, the merging variant
does not. Retain the same argument structure, to share expansion but
force the flag zero and do not decode bit 22.

Backports commit 407e6ce7f1f428cb242d424cd35381a77b5b2071 from qemu
2019-01-13 19:39:34 -05:00
Richard Henderson 4d8b7a9967
target/arm: Convert ARM_TBFLAG_* to FIELDs
Use "register" TBFLAG_ANY to indicate shared state between
A32 and A64, and "registers" TBFLAG_A32 & TBFLAG_A64 for
fields that are specific to the given cpu state.

Move ARM_TBFLAG_BE_DATA to shared state, instead of its current
placement within "Bit usage when in AArch32 state".

Backports commit aad821ac4faad369fad8941d25e59edf2514246b from qemu
2019-01-13 19:21:18 -05:00
Fredrik Noring ee4b59e981
target/mips: Support R5900 three-operand MADD1 and MADDU1 instructions
The three-operand MADD and MADDU are specific to R5900 cores.

Backports commit a95c4c26f1dc233987350e7cb1cf62d46ade5ce5 from qemu
2019-01-05 08:07:56 -05:00
Philippe Mathieu-Daudé 76bc93690f
target/mips: Support R5900 three-operand MADD and MADDU instructions
The three-operand MADD and MADDU are specific to Sony R5900 core,
and Toshiba TX19/TX39/TX79 cores as well.

The "32-Bit TX System RISC TX39 Family Architecture manual"
is available at https://wiki.qemu.org/File:DSAE0022432.pdf

Backports commit 3b948f053fc588154d95228da8a6561c61c66104 from qemu
2019-01-05 08:03:43 -05:00
Aleksandar Markovic 5729c803a7
target/mips: MXU: Add handler for an align instruction
Add translation handler for S32ALNI MXU instruction.

Backports commit 79f5fee7a3c53494c7ca4bc18c72944f5e2d5c2f from qemu
2019-01-05 08:00:09 -05:00
Aleksandar Markovic 94956d81f6
target/mips: MXU: Add handlers for max/min instructions
Add translation handlers for six max/min MXU instructions.

Backports commit bb84cbf38505bd1d800fdddcd81407a99e5c2142 from qemu
2019-01-05 07:55:39 -05:00
Aleksandar Markovic bf7da7bf57
target/mips: MXU: Add handlers for logic instructions
Add translation handlers for four logic MXU instructions.

It should be noted that there is an error in MXU documentation (dated
June 2017) regarding opcodes for this group of instructions. This was
confirmed by running tests on hardware, and also by looking up other
related public source trees (binutils, Android NDK). In initial MXU
patches to QEMU, opcodes for MXU logic instructions were created to
be in accordance with the MXU documentation, therefore the error from
was propagated. This patch corrects that, changing the involved code.
Besides that, as MXU was designed and implemented only for 32-bit
CPUs, corresponding preprosessor conditions were added around MXU
code, which allows more flexible implementation of MXU handlers.

Backports commit b621f0187ef789aeef733cf79e5ac83984752394 from qemu
2019-01-05 07:48:08 -05:00
Aleksandar Markovic ba253dd0d3
target/mips: MXU: Improve the comment containing MXU overview
Improve textual description of MXU extension. These are mostly
comment formatting changes.

Backports commit 84e2c895b12fb7056daeb7e5094656eae7b50d3d from qemu
2019-01-05 07:39:47 -05:00
Aleksandar Markovic 57bb979ce8
target/mips: MXU: Add generic naming for optn2 constants
Add generic naming involving generig suffixes OPTN0, OPTN1, OPTN2,
OPTN3 for four optn2 constants. Existing suffixes WW, LW, HW, XW
are not quite appropriate for some instructions using optn2.
2019-01-05 07:35:49 -05:00
Aleksandar Markovic b5e1ea2e08
target/mips: MXU: Add missing opcodes/decoding for LX* instructions
Add missing opcodes and decoding engine for LXB, LXH, LXW, LXBU,
and LXHU instructions. They were for some reason forgotten in
previous commits. The MXU opcode list and decoding engine should
be now complete.

Backports commit c233bf07af7cf2358b69c38150dbd2e3e4a399b6 from qemu
2019-01-05 07:34:07 -05:00
Paul Burton 1c6732b053
atomics: Set ATOMIC_REG_SIZE=8 for MIPS n32
ATOMIC_REG_SIZE is currently defined as the default sizeof(void *) for
all MIPS host builds, including those using the n32 ABI. n32 is the
MIPS64 ILP32 ABI and as such tcg/mips/tcg-target.h defines
TCG_TARGET_REG_BITS as 64 for n32 builds. If we attempt to build QEMU
for an n32 host with support for a 64b target architecture then
TCG_OVERSIZED_GUEST is 0 and accel/tcg/cputlb.c attempts to use
atomic_* functions. This fails because ATOMIC_REG_SIZE is 4, causing
the calls to QEMU_BUILD_BUG_ON(sizeof(*ptr) > ATOMIC_REG_SIZE) in the
various atomic_* functions to generate errors.

Fix this by defining ATOMIC_REG_SIZE as 8 for all MIPS64 builds, which
will cover both n32 (ILP32) & n64 (LP64) ABIs in much the same was as
we already do for x86_64/x32.

Backports commit c5b00c1684f3317e887c7401b58dde54c2b05354 from qemu
2019-01-05 07:26:14 -05:00
Richard Henderson 6ed82f77b4
tcg: Improve call argument loading
Free the argument register only after we have verified that the
temporary is not already in that register. This case is likely
now that we are back propagating the preferred register.

Backports commit 4250da10923347c9ee907f8d72bd93dfa5ee8742 from qemu
2019-01-05 07:24:08 -05:00
Richard Henderson 64843e8c09
tcg: Record register preferences during liveness
With these preferences, we can arrange for function call arguments to
be computed into the proper registers instead of requiring extra moves.

Backports commit 25f49c5f1508ddf081ce89fa6bbfd87a51eea37b from qemu
2019-01-05 07:22:57 -05:00
Richard Henderson c2be1cee79
tcg: Add TCG_OPF_BB_EXIT
Use this to notice the opcodes that exit the TB, which implies
that local temps are really dead and need not be synced.

Previously we so marked the true end of the TB, but that was
immediately overwritten by the la_bb_end invoked by any
TCG_OPF_BB_END opcode, like exit_tb.

Backports commit ae36a246ed1a0e96c6c4f478f03d047dfa3a8898 from qemu
2019-01-05 07:09:38 -05:00
Richard Henderson 63cf164724
tcg: Split out more subroutines from liveness_pass_1
Backports commit f65a061c39cc4f9d088201031050e42eb23d5b2a from qemu
2019-01-05 07:07:49 -05:00
Richard Henderson c348ceba56
tcg: Rename and adjust liveness_pass_1 helpers
No need for a "tcg_" prefix for a static function; we already
have another "la_" prefix for indicating liveness analysis.
Pass in nb_globals and nb_temps, as we will already have them
in registers for other loops within the parent function.

Backports commit 2616c8082143373e794b62444bf81754f50dbf6b from qemu
2019-01-05 07:05:58 -05:00
Richard Henderson b356212b33
tcg: Dump register preference info with liveness
Backports commit 1894f69a612b35c2a39b44a824da04d74bfe324a from qemu
2019-01-05 07:00:21 -05:00
Richard Henderson 494d802781
tcg: Improve register allocation for matching constraints
Try harder to honor the output_pref. When we're forced to allocate
a second register for the input, it does not need to use the input
constraint; that will be honored by the register we allocate for the
output and a move is already required.

Backports commit d62816f2db439b2dd761c674f0256f21d9dd2ed0 from qemu
2019-01-05 06:57:56 -05:00
Richard Henderson 83a7de2566
tcg: Add output_pref to TCGOp
Allocate storage for, but do not yet fill in, per-opcode
preferences for the output operands. Pass it in to the
register allocation routines for output operands.

Backports commit 69e3706d2b473815e382552e729d12590339e0ac from qemu
2019-01-05 06:54:40 -05:00
Richard Henderson 19bde1a9cf
tcg: Add preferred_reg argument to tcg_reg_alloc_do_movi
Pass this through to temp_sync.

Backports commit ba87719cd267e6f07b17f6cda08246bf483146d4 from qemu
2019-01-05 06:51:55 -05:00
Richard Henderson c3aa567b03
tcg: Add preferred_reg argument to temp_sync
Pass this through to tcg_reg_alloc.

Backports commit 98b4e186c1ccb8f1868c61a33a3be8c2b82654f3 from qemu
2019-01-05 06:50:22 -05:00
Richard Henderson 96b6640f3b
tcg: Add preferred_reg argument to temp_load
Pass this through to tcg_reg_alloc.

Backports commit b722452aefb089e003b16946a4d73bad1fd3b79b from qemu
2019-01-05 06:48:19 -05:00
Richard Henderson 5e73b27607
tcg: Add preferred_reg argument to tcg_reg_alloc
This new argument will aid register allocation by indicating how
the temporary will be used in future. If the preference cannot
be satisfied, fall back to the constraints of the current insn.

Short circuit the preference when it cannot be satisfied or if
it does not further constrain the operation.

With an eye toward optimizing function call sequences, optimize
for the preferred_reg set containing a single register.

For the moment, all users pass 0 for preference.

Backports commit b016486e7baddb43cfc1e51909b05cde9cf82e0c from qemu
2019-01-05 06:45:15 -05:00
Richard Henderson 6aea2880d2
tcg: Add reachable_code_pass
Delete trivially dead code that follows unconditional branches and
noreturn helpers. These can occur either via optimization or via
the structure of a target's translator following an exception.

Backports commit b4fc67c7afd2c338d6e7c73a7f428dfe05ae0603 from qemu
2019-01-05 06:41:16 -05:00
Richard Henderson 26ab4d6560
tcg: Reference count labels
Increment when adding branches, and decrement when removing them.

Backports commit d88a117eaa39b1d0eb1a79fe84c81840a39eb233 from qemu
2019-01-05 06:39:20 -05:00
Richard Henderson 80b4bef1cc
tcg: Add TCG_CALL_NO_RETURN
Remember which helpers have been marked noreturn.

Backports commit 15d7409260498505e991e7b9d87118627165e613 from qemu
2019-01-05 06:35:21 -05:00
Richard Henderson 7dbbf58653
tcg: Renumber TCG_CALL_* flags
Previously, the low 4 bits were used for TCG_CALL_TYPE_MASK,
which was removed in 6a18ae2d2947532d5c26439548afa0481c4529f9.

Backports commit 3b50352b05eeafeb95cccd770f7aaba00bbdf6fe from qemu
2019-01-05 06:32:52 -05:00
Marc-André Lureau ba1f54804a
qapi: fix flat union on uncovered branches conditionals
Default branches variant should use the member conditional.

This fixes compilation with --disable-replication.

Fixes: 335d10cd8e2c3bb6067804b095aaf6371fc1983e

Backports commit ce1a1aec47877a281d69dbc2e65f86bfe8fea231 from qemu
2018-12-19 10:53:29 -05:00
Lioncash f8435ca3a6
Temporarily disable tcg_debug_assert()
Backporting 6fa2cef205a60b5c5c3b058f53852416b885c455 by Thomas Huth
started invoking assertions on clang. This means Unicorn is doing
something silly. This should be tracked down, but in the meantime,
restore behavior to allow tests to still be run.
2018-12-19 10:50:48 -05:00
Emilio G. Cota 8276a4dc66
hardfloat: implement float32/64 comparison
Performance results for fp-bench:

Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
cmp-single: 110.98 MFlops
cmp-double: 107.12 MFlops
- after:
cmp-single: 506.28 MFlops
cmp-double: 524.77 MFlops

Note that flattening both eq and eq_signaling versions
would give us extra performance (695v506, 615v524 Mflops
for single/double, respectively) but this would emit two
essentially identical functions for each eq/signaling pair,
which is a waste.

Aggregate performance improvement for the last few patches:
[ all charts in png: https://imgur.com/a/4yV8p ]

1. Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

qemu-aarch64 NBench score; higher is better
Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

16 +-+-----------+-------------+----===-------+---===-------+-----------+-+
14 +-+..........................@@@&&.=.......@@@&&.=...................+-+
12 +-+..........................@.@.&.=.......@.@.&.=.....+befor=== +-+
10 +-+..........................@.@.&.=.......@.@.&.=.....+ad@@&& = +-+
8 +-+.......................$$$%.@.&.=.......@.@.&.=.....+ @@u& = +-+
6 +-+............@@@&&=+***##.$%.@.&.=***##$$%+@.&.=..###$$%%@i& = +-+
4 +-+.......###$%%.@.&=.*.*.#.$%.@.&.=*.*.#.$%.@.&.=+**.#+$ +@m& = +-+
2 +-+.....***.#$.%.@.&=.*.*.#.$%.@.&.=*.*.#.$%.@.&.=.**.#+$+sqr& = +-+
0 +-+-----***##$%%@@&&=-***##$$%@@&&==***##$$%@@&&==-**##$$%+cmp==-----+-+
FOURIER NEURAL NELU DECOMPOSITION gmean

qemu-aarch64 SPEC06fp (test set) speedup over QEMU 4c2c1015905
Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
error bars: 95% confidence interval

4.5 +-+---+-----+----+-----+-----+-&---+-----+----+-----+-----+-----+----+-----+-----+-----+-----+----+-----+---+-+
4 +-+..........................+@@+...........................................................................+-+
3.5 +-+..............%%@&.........@@..............%%@&............................................+++dsub +-+
2.5 +-+....&&+.......%%@&.......+%%@..+%%&+..@@&+.%%@&....................................+%%&+.+%@&++%%@& +-+
2 +-+..+%%&..+%@&+.%%@&...+++..%%@...%%&.+$$@&..%%@&..%%@&.......+%%&+.%%@&+......+%%@&.+%%&++$$@&++d%@& %%@&+-+
1.5 +-+**#$%&**#$@&**#%@&**$%@**#$%@**#$%&**#$@&**$%@&*#$%@**#$%@**#$%&**#%@&**$%@&*#$%@**#$%&**#$@&*+f%@&**$%@&+-+
0.5 +-+**#$%&**#$@&**#%@&**$%@**#$%@**#$%&**#$@&**$%@&*#$%@**#$%@**#$%&**#%@&**$%@&*#$%@**#$%&**#$@&+sqr@&**$%@&+-+
0 +-+**#$%&**#$@&**#%@&**$%@**#$%@**#$%&**#$@&**$%@&*#$%@**#$%@**#$%&**#%@&**$%@&*#$%@**#$%&**#$@&*+cmp&**$%@&+-+
410.bw416.gam433.434.z435.436.cac437.lesli444.447.de450.so453454.ca459.GemsF465.tont470.lb4482.sphinxgeomean

2. Host: ARM Aarch64 A57 @ 2.4GHz

qemu-aarch64 NBench score; higher is better
Host: Applied Micro X-Gene, Aarch64 A57 @ 2.4 GHz

5 +-+-----------+-------------+-------------+-------------+-----------+-+
4.5 +-+........................................@@@&==...................+-+
3 4 +-+..........................@@@&==........@.@&.=.....+before +-+
3 +-+..........................@.@&.=........@.@&.=.....+ad@@@&== +-+
2.5 +-+.....................##$$%%.@&.=........@.@&.=.....+ @m@& = +-+
2 +-+............@@@&==.***#.$.%.@&.=.***#$$%%.@&.=.***#$$%%d@& = +-+
1.5 +-+.....***#$$%%.@&.=.*.*#.$.%.@&.=.*.*#.$.%.@&.=.*.*#+$ +f@& = +-+
0.5 +-+.....*.*#.$.%.@&.=.*.*#.$.%.@&.=.*.*#.$.%.@&.=.*.*#+$+sqr& = +-+
0 +-+-----***#$$%%@@&==-***#$$%%@@&==-***#$$%%@@&==-***#$$%+cmp==-----+-+
FOURIER NEURAL NLU DECOMPOSITION gmean
2018-12-19 10:45:22 -05:00
Emilio G. Cota f7549fc13e
hardfloat: implement float32/64 square root
Performance results for fp-bench:

Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
sqrt-single: 42.30 MFlops
sqrt-double: 22.97 MFlops
- after:
sqrt-single: 311.42 MFlops
sqrt-double: 311.08 MFlops

Here USE_FP makes a huge difference for f64's, with throughput
going from ~200 MFlops to ~300 MFlops.

Backports commit f131bae8a7b7ed1928cc94c69df291db609c316a from qemu
2018-12-19 10:43:23 -05:00
Emilio G. Cota 3cf836ca83
hardfloat: implement float32/64 fused multiply-add
Performance results for fp-bench:

1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
fma-single: 74.73 MFlops
fma-double: 74.54 MFlops
- after:
fma-single: 203.37 MFlops
fma-double: 169.37 MFlops

2. ARM Aarch64 A57 @ 2.4GHz
- before:
fma-single: 23.24 MFlops
fma-double: 23.70 MFlops
- after:
fma-single: 66.14 MFlops
fma-double: 63.10 MFlops

3. IBM POWER8E @ 2.1 GHz
- before:
fma-single: 37.26 MFlops
fma-double: 37.29 MFlops
- after:
fma-single: 48.90 MFlops
fma-double: 59.51 MFlops

Here having 3FP64 set to 1 pays off for x86_64:
[1] 170.15 vs [0] 153.12 MFlops

Backports commit ccf770ba7396c240ca8a1564740083742dd04c08 from qemu
2018-12-19 10:42:00 -05:00
Emilio G. Cota 95781d2bb5
hardfloat: implement float32/64 division
Performance results for fp-bench:

1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
div-single: 34.84 MFlops
div-double: 34.04 MFlops
- after:
div-single: 275.23 MFlops
div-double: 216.38 MFlops

2. ARM Aarch64 A57 @ 2.4GHz
- before:
div-single: 9.33 MFlops
div-double: 9.30 MFlops
- after:
div-single: 51.55 MFlops
div-double: 15.09 MFlops

3. IBM POWER8E @ 2.1 GHz
- before:
div-single: 25.65 MFlops
div-double: 24.91 MFlops
- after:
div-single: 96.83 MFlops
div-double: 31.01 MFlops

Here setting 2FP64_USE_FP to 1 pays off for x86_64:
[1] 215.97 vs [0] 62.15 MFlops

Backports commit 4a6295613f533a6841de5968c50e1ca36748807e from qemu
2018-12-19 10:40:00 -05:00
Emilio G. Cota 93991714fb
hardfloat: implement float32/64 multiplication
Performance results for fp-bench:

1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
mul-single: 126.91 MFlops
mul-double: 118.28 MFlops
- after:
mul-single: 258.02 MFlops
mul-double: 197.96 MFlops

2. ARM Aarch64 A57 @ 2.4GHz
- before:
mul-single: 37.42 MFlops
mul-double: 38.77 MFlops
- after:
mul-single: 73.41 MFlops
mul-double: 76.93 MFlops

3. IBM POWER8E @ 2.1 GHz
- before:
mul-single: 58.40 MFlops
mul-double: 59.33 MFlops
- after:
mul-single: 60.25 MFlops
mul-double: 94.79 MFlops

Backports commit 2dfabc86e656e835c67954c60e143ecd33e15817 from qemu
2018-12-19 10:38:33 -05:00
Emilio G. Cota 0862d9c462
hardfloat: implement float32/64 addition and subtraction
Performance results (single and double precision) for fp-bench:

1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
add-single: 135.07 MFlops
add-double: 131.60 MFlops
sub-single: 130.04 MFlops
sub-double: 133.01 MFlops
- after:
add-single: 443.04 MFlops
add-double: 301.95 MFlops
sub-single: 411.36 MFlops
sub-double: 293.15 MFlops

2. ARM Aarch64 A57 @ 2.4GHz
- before:
add-single: 44.79 MFlops
add-double: 49.20 MFlops
sub-single: 44.55 MFlops
sub-double: 49.06 MFlops
- after:
add-single: 93.28 MFlops
add-double: 88.27 MFlops
sub-single: 91.47 MFlops
sub-double: 88.27 MFlops

3. IBM POWER8E @ 2.1 GHz
- before:
add-single: 72.59 MFlops
add-double: 72.27 MFlops
sub-single: 75.33 MFlops
sub-double: 70.54 MFlops
- after:
add-single: 112.95 MFlops
add-double: 201.11 MFlops
sub-single: 116.80 MFlops
sub-double: 188.72 MFlops

Note that the IBM and ARM machines benefit from having
HARDFLOAT_2F{32,64}_USE_FP set to 0. Otherwise their performance
can suffer significantly:
- IBM Power8:
add-single: [1] 54.94 vs [0] 116.37 MFlops
add-double: [1] 58.92 vs [0] 201.44 MFlops
- Aarch64 A57:
add-single: [1] 80.72 vs [0] 93.24 MFlops
add-double: [1] 82.10 vs [0] 88.18 MFlops

On the Intel machine, having 2F64 set to 1 pays off, but it
doesn't for 2F32:
- Intel i7-6700K:
add-single: [1] 285.79 vs [0] 426.70 MFlops
add-double: [1] 302.15 vs [0] 278.82 MFlops

Backports commit 1b615d482094e0123d187f0ad3c676ba8eb9d0a3 from qemu
2018-12-19 10:36:55 -05:00
Emilio G. Cota bca8e39e3c
fpu: introduce hardfloat
The appended paves the way for leveraging the host FPU for a subset
of guest FP operations. For most guest workloads (e.g. FP flags
aren't ever cleared, inexact occurs often and rounding is set to the
default [to nearest]) this will yield sizable performance speedups.

The approach followed here avoids checking the FP exception flags register.
See the added comment for details.

This assumes that QEMU is running on an IEEE754-compliant FPU and
that the rounding is set to the default (to nearest). The
implementation-dependent specifics of the FPU should not matter; things
like tininess detection and snan representation are still dealt with in
soft-fp. However, this approach will break on most hosts if we compile
QEMU with flags that break IEEE compatibility. There is no way to detect
all of these flags at compilation time, but at least we check for
-ffast-math (which defines __FAST_MATH__) and disable hardfloat
(plus emit a #warning) when it is set.

This patch just adds common code. Some operations will be migrated
to hardfloat in subsequent patches to ease bisection.

Note: some architectures (at least PPC, there might be others) clear
the status flags passed to softfloat before most FP operations. This
precludes the use of hardfloat, so to avoid introducing a performance
regression for those targets, we add a flag to disable hardfloat.
In the long run though it would be good to fix the targets so that
at least the inexact flag passed to softfloat is indeed sticky.

Backports commit a94b783952cc493cb241aabb1da8c7a830385baa from qemu
2018-12-19 10:32:32 -05:00
Emilio G. Cota 5d3ccde625
softfloat: add float{32,64}_is_zero_or_normal
These will gain some users very soon.

Backports commit 315df0d193929b167b9d7be4665d5f2c0e2427e0 from qemu
2018-12-19 10:31:10 -05:00
Emilio G. Cota a9d9005399
softfloat: rename canonicalize to sf_canonicalize
glibc >= 2.25 defines canonicalize in commit eaf5ad0
(Add canonicalize, canonicalizef, canonicalizel., 2016-10-26).

Given that we'll be including <math.h> soon, prepare
for this by prefixing our canonicalize() with sf_ to avoid
clashing with the libc's canonicalize().

Backports commit f9943c7f766678af36d31076b78e466256f4871b from qemu
2018-12-19 10:30:38 -05:00
Emilio G. Cota 3a8f7d6d84
softfloat: add float{32,64}_is_{de,}normal
This paves the way for upcoming work.

Backports commit 588e6dfd8774e6da56b6995611655fbe59ff564a from qemu
2018-12-19 10:30:33 -05:00
Emilio G. Cota 3d0359c0f5
xxhash: match output against the original xxhash32
Change the order in which we extract a/b and c/d to
match the output of the upstream xxhash32.

Tested with:
https://github.com/cota/xxhash/tree/qemu

Backports commit b7c2cd08a6f68010ad27c9c0bf2fde02fb743a0e from qemu
2018-12-18 06:09:01 -05:00