Convert the Neon VRINT-with-specified-rounding-mode insns to gvec,
and use this to implement the fp16 versions.
Backports 18725916b1438b54d6d6533980833d2251a20b7c
Convert the Neon VCVT with-specified-rounding-mode instructions
to gvec, and use this to implement fp16 support for them.
Backports ca88a6efdf4ce96b646a896059f9bd324c2cebc4
Convert the Neon VCVT float<->fixed-point insns to a
gvec style, in preparation for adding fp16 support.
Backports 7b959c5890deb9a6d71bc6800006a0eae0a84c60
Convert the Neon float-integer VCVT insns to gvec, and use this
to implement fp16 support for them.
Note that unlike the VFP int<->fp16 VCVT insns we converted
earlier and which convert to/from a 32-bit integer, these
Neon insns convert to/from 16-bit integers. So we can use
the existing vfp conversion helpers for the f32<->u32/i32
case but need to provide our own for f16<->u16/i16.
Backports 7782a9afec81d1efe23572135c1ed777691ccde5
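A minimal sketch of the width distinction described above (names are invented, plain float stands in for fp16, and a plain C cast stands in for the saturating softfloat conversions QEMU actually uses):

    #include <stdint.h>

    /* f32 path: the existing VFP helpers convert to/from 32-bit integers */
    static int32_t f32_to_i32(float x) { return (int32_t)x; }

    /* fp16 path: the new helpers convert to/from 16-bit integers instead */
    static int16_t f16_to_i16(float x) { return (int16_t)x; }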
Convert the Neon pairwise fp ops to use a single gvec-style
helper to do the full operation instead of one helper call
for each 32-bit part. This allows us to use the same
framework to implement the fp16 versions.
Backports 1dc587ee9bfe804406eb3e0bacf47a80644d8abc
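A standalone sketch of the pairwise pattern described above, assuming a float element type and invented names (QEMU's helper works on its softfloat types and takes a float_status pointer):

    #include <stdint.h>
    #include <string.h>

    static void pairwise_fadd(float *d, const float *n, const float *m,
                              intptr_t elems)
    {
        float tmp[elems];           /* compute into a temp: d may overlap n or m */
        intptr_t half = elems / 2;

        for (intptr_t i = 0; i < half; i++) {
            tmp[i]        = n[2 * i] + n[2 * i + 1];  /* low half from Vn */
            tmp[half + i] = m[2 * i] + m[2 * i + 1];  /* high half from Vm */
        }
        memcpy(d, tmp, sizeof(tmp));
    }

One call now produces the whole destination vector, so the same loop shape covers fp16 simply by switching the element type.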
Convert the Neon VRSQRTS insn to using a gvec helper,
and use this to implement the fp16 case.
As with VRECPS, we adjust the phrasing of the new implementation
slightly so that the fp32 version parallels the fp16 one.
Backports 40fde72dda2da8d55b820fa6c5efd85814be2023
Convert the Neon VRECPS insn to using a gvec helper, and
use this to implement the fp16 case.
The phrasing of the new float32_recps_nf() is slightly different from
the old recps_f32() so that it parallels the f16 version; for f16 we
can't assume that flush-to-zero is always enabled.
Backports ac8c62c4e5a3f24e6d47f52ec1bfb20994caefa5
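For reference, the per-lane operation VRECPS performs is the Newton-Raphson step value 2 - a * b. A rough C float version (QEMU's float32_recps_nf() and its fp16 counterpart use softfloat and honour the flush-to-zero differences mentioned above):

    #include <math.h>

    static float recps_step(float a, float b)
    {
        /* 0 * inf would otherwise give NaN; the architecture defines it as 2.0 */
        if ((isinf(a) && b == 0.0f) || (a == 0.0f && isinf(b))) {
            return 2.0f;
        }
        return 2.0f - a * b;
    }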
Convert the Neon floating-point vector compare-vs-0 insns VCEQ0,
VCGT0, VCLE0, VCGE0 and VCLT0 to use a gvec helper, and use this to
implement the fp16 case.
Backports 635187aaa92f21ab001e2868e803b3c5460261ca
Convert the Neon floating-point vector operations VFMA and VFMS
to use a gvec helper, and use this to implement the fp16 case.
This is the last use of do_3same_fp() so we can now delete
that function.
Backports commit cf722d75b329ef3f86b869e7e68cbfb1607b3bde
Convert the Neon floating-point VMLA and VMLS insns over to using a
gvec helper, and use this to implement the fp16 case.
Backports e5adc70665ecaf4009c2fb8d66775ea718a85abd
Convert the Neon floating-point VMAXNM and VMINNM insns to
using a gvec helper and use this to implement the fp16 case.
Backports e22705bb941d82d6c2a09e8b2031084326902be3
Convert the Neon floating-point VMAX and VMIN insns over to using
a gvec helper, and use this to implement the fp16 case.
Backports e43268c54b6cbcb197d179409df7126e81f8cd52
Convert the Neon floating-point vector absolute comparison ops
VACGE and VACGT over to using a gvec helper and use this to
implement the fp16 case.
Backports bb2741da186ebaebc7d5189372be4401e1ff9972
Convert the Neon floating-point vector comparison ops VCEQ,
VCGE and VCGT over to using a gvec helper and use this to
implement the fp16 case.
(We put the float16_ceq() etc. functions above the DO_2OP()
macro definition because later when we convert the
compare-against-zero instructions we'll want their
definitions to be visible at that point in the source file.)
Backports ad505db233b89b7fd4b5a98b6f0e8ac8d05b11db
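A minimal illustration of the compare pattern (plain float and a 32-bit mask here; the fp16 helpers such as float16_ceq() produce a 16-bit all-ones/all-zeros lane):

    #include <stdint.h>

    static uint32_t fp_ceq(float a, float b)
    {
        /* vector FP compares set every bit of the result lane when true */
        return -(uint32_t)(a == b);   /* 0xffffffff if equal, 0 otherwise */
    }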
Implement FP16 support for the Neon insns which use the DO_3S_FP_GVEC
macro: VADD, VSUB, VABD, VMUL.
For VABD this requires us to implement a new gvec_fabd_h helper
using the machinery we have already for the other helpers.
Backports e4a6d4a69e239becfd83bdcd996476e7b8e1138d
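The per-lane operation behind the new gvec_fabd_h helper is just an absolute difference; a plain-float sketch (QEMU's version does this in fp16 via softfloat):

    #include <math.h>

    static float fabd(float a, float b)
    {
        return fabsf(a - b);   /* absolute difference in the lane's precision */
    }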
Now that the VFP_CONV_FIX macros can handle fp16's distinction between the
width of the operation and the width of the type used to pass operands,
use the macros rather than the open-coded functions.
This creates an extra six helper functions, all of which we are going
to need for the AArch32 VFP fp16 instructions.
Backports commit 414ba270c4fb758d987adf37ae9bfe531715c604
Implement VFP fp16 for VABS, VNEG and VSQRT. This covers all
the fp16 insns that use the DO_VFP_2OP macro, because there
is no fp16 version of VMOV_reg.
Notes:
* gen_helper_vfp_negh already exists, as we needed to create
it for the fp16 multiply-add insns
* as usual we need to use the f16 version of the fp_status;
this is only relevant for VSQRT
Backports ce2d65a5d191380756cdac7a1fd1ba76bd1621cf
Implement fp16 versions of the VFP VMLA, VMLS, VNMLS, VNMLA, VNMUL
instructions. (These are all the remaining ones which we implement
via do_vfp_3op_[hsd]p().)
Backports commit e7cb0ded52c6d7b86585b09935fe7caeb9e38b69
Implement fp16 support for the simple binary-operator VFP insns VADD,
VSUB, VMUL, VDIV, VMINNM and VMAXNM:
* make the VFP_BINOP() macro generate float16 helpers as well as
float32 and float64
* implement a do_vfp_3op_hp() function similar to the existing
do_vfp_3op_sp()
* add decode for the half-precision insn patterns
Note that using the VFP_BINOP macro creates a couple of unused helper
functions vfp_maxh and vfp_minh, but they're small so it's not worth
splitting the BINOP operations into "needs halfprec" and "no
halfprec" groups.
Backports commit 120a0eb3ea23a5b06fae2f3daebd46a4035864cf
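A sketch of the kind of expansion this macro approach gives, assuming a compiler with the _Float16 extension (names are illustrative; QEMU's helpers take a float_status pointer and do the arithmetic in softfloat):

    #define VFP_BINOP(name, op)                                    \
        static _Float16 vfp_##name##h(_Float16 a, _Float16 b)      \
        { return a op b; }                                         \
        static float    vfp_##name##s(float a, float b)            \
        { return a op b; }                                         \
        static double   vfp_##name##d(double a, double b)          \
        { return a op b; }

    /* one macro use per operation gives the h/s/d helper trio */
    VFP_BINOP(add, +)
    VFP_BINOP(sub, -)
    VFP_BINOP(mul, *)
    VFP_BINOP(div, /)

The real macro pastes the operation name onto the corresponding softfloat function (float16_add() and friends) rather than taking a C operator, which is also how it generates the min/max helpers mentioned above.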
Vector AMOs operate as if the aq and rl bits were zero on each element
with regard to ordering relative to other instructions in the same hart.
Vector AMOs provide no ordering guarantee between element operations
in the same vector AMO instruction.
Backports 268fcca66bde62257960ec8d859de374315a5e3d
The unit-stride fault-only-first load instructions are used to
vectorize loops with data-dependent exit conditions (while loops).
These instructions execute as a regular load except that they
will only take a trap on element 0.
Backports commit 022b4ecf775ffeff522eaea4f0d94edcfe00a0a9 from qemu
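A conceptual sketch of that behaviour, with an invented probe callback standing in for QEMU's memory-probing machinery:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical probe: returns false if the access would fault. */
    typedef bool (*probe_fn)(uint64_t addr);

    /* Returns the new vl for a fault-only-first load. */
    static uint64_t fof_new_vl(uint64_t base, unsigned esz, uint64_t vl,
                               probe_fn probe)
    {
        for (uint64_t i = 0; i < vl; i++) {
            if (!probe(base + i * esz)) {
                /* element 0 traps exactly like a regular load (not modelled
                 * here); a fault on any later element just shrinks vl. */
                return i;
            }
        }
        return vl;
    }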
Vector indexed operations add the contents of each element of the
vector offset operand specified by vs2 to the base effective address
to give the effective address of each element.
Backports f732560e3551c0823cee52efba993fbb8f689a36
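In other words, for element i (a one-line sketch with invented names):

    #include <stdint.h>

    static uint64_t indexed_elem_addr(uint64_t base, const uint64_t *vs2,
                                      uint64_t i)
    {
        return base + vs2[i];   /* base from rs1 plus offset element i from vs2 */
    }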
Vector strided operations access the first memory element at the base address,
and then access subsequent elements at address increments given by the byte
offset contained in the x register specified by rs2.
Vector unit-stride operations access elements stored contiguously in memory
starting from the base effective address. It can be seen as a special
case of strided operations.
Backports 751538d5da557e5c10e5045c2d27639580ea54a7
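A minimal sketch of both address calculations, ignoring segment (nf) handling and using invented names:

    #include <stdint.h>

    /* Strided: element i lives at base + i * stride, stride in bytes from rs2. */
    static uint64_t strided_elem_addr(uint64_t base, uint64_t stride, uint64_t i)
    {
        return base + i * stride;
    }

    /* Unit-stride: the special case where the stride is the element size. */
    static uint64_t unit_stride_elem_addr(uint64_t base, unsigned esz, uint64_t i)
    {
        return strided_elem_addr(base, esz, i);
    }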
vsetvl and vsetvli are the two configuration instructions for vl and vtype.
TB flags should be updated after these configuration instructions. The
(ill, lmul, sew) fields of vtype and the bit for (VSTART == 0 && VL == VLMAX)
will be placed within tb_flags.
Backports 2b7168fc43fb270fb89e1dddc17ef54714712f3a from qemu
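Roughly, the flag packing looks like this (bit positions are invented for the example; QEMU defines the real layout with FIELD() macros):

    #include <stdbool.h>
    #include <stdint.h>

    static uint32_t pack_vector_tb_flags(uint32_t flags, bool vill,
                                         unsigned lmul, unsigned sew,
                                         bool vl_eq_vlmax)
    {
        flags |= (uint32_t)vill        << 0;   /* vtype.ill */
        flags |= (lmul & 0x3u)         << 1;   /* vtype.lmul, 2 bits */
        flags |= (sew & 0x7u)          << 3;   /* vtype.sew, 3 bits */
        flags |= (uint32_t)vl_eq_vlmax << 6;   /* VSTART == 0 && VL == VLMAX */
        return flags;
    }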
The temp that gets assigned to clean_addr has been allocated with
new_tmp_a64, which means that it will be freed at the end of the
instruction. Freeing it earlier leads to an assertion failure.
The loop creates a complication, in which we allocate a new local
temp, which does need freeing, and the final code path is shared
between the loop and non-loop cases.
Fix this complication by adding new_tmp_a64_local so that the new
local temp is freed at the end, and can be treated exactly like
the non-loop path.
Fixes: bba87d0a0f4
Backports commit 4b4dc9750a0aa0b9766bd755bf6512a84744ce8a from qemu
Because the elements are non-sequential, we cannot eliminate many
tests straight away like we can for sequential operations. But
we often have the PTE details handy, so we can test for Tagged.
Backports commit d28d12f008ee44dc2cc2ee5d8f673be9febc951e from qemu
Because the elements are sequential, we can eliminate many tests all
at once when the tag hits TCMA, or if the page(s) are not Tagged.
Backports commit aa13f7c3c378fa41366b9fcd6c29af1c3d81126a from qemu
Because the elements are sequential, we can eliminate many tests all
at once when the tag hits TCMA, or if the page(s) are not Tagged.
Backports commit 71b9f3948c75bb97641a3c8c7de96d1cb47cdc07 from qemu