Rearrange the arithmetic so that we are agnostic about the total size
of the vector and the size of the element. This will allow us to index
up to the 32nd byte and with 16-byte elements.
Backports commit 66f2dbd783d0b6172043e3679171421b2d0bac11 from qemu