cleanup qemu docs

This commit is contained in:
Nguyen Anh Quynh 2017-01-18 15:23:40 +08:00
parent 1f154abfa6
commit 326a9a5fba
44 changed files with 0 additions and 7845 deletions

View file

@ -1,104 +0,0 @@
/*
* This model describes the interaction between aio_set_dispatching()
* and aio_notify().
*
* Author: Paolo Bonzini <pbonzini@redhat.com>
*
* This file is in the public domain. If you really want a license,
* the WTFPL will do.
*
* To simulate it:
* spin -p docs/aio_notify.promela
*
* To verify it:
* spin -a docs/aio_notify.promela
* gcc -O2 pan.c
* ./a.out -a
*/
#define MAX 4
#define LAST (1 << (MAX - 1))
#define FINAL ((LAST << 1) - 1)
bool dispatching;
bool event;
int req, done;
active proctype waiter()
{
int fetch, blocking;
do
:: done != FINAL -> {
// Computing "blocking" is separate from execution of the
// "bottom half"
blocking = (req == 0);
// This is our "bottom half"
atomic { fetch = req; req = 0; }
done = done | fetch;
// Wait for a nudge from the other side
do
:: event == 1 -> { event = 0; break; }
:: !blocking -> break;
od;
dispatching = 1;
// If you are simulating this model, you may want to add
// something like this here:
//
// int foo; foo++; foo++; foo++;
//
// This only wastes some time and makes it more likely
// that the notifier process hits the "fast path".
dispatching = 0;
}
:: else -> break;
od
}
active proctype notifier()
{
int next = 1;
int sets = 0;
do
:: next <= LAST -> {
// generate a request
req = req | next;
next = next << 1;
// aio_notify
if
:: dispatching == 0 -> sets++; event = 1;
:: else -> skip;
fi;
// Test both synchronous and asynchronous delivery
if
:: 1 -> do
:: req == 0 -> break;
od;
:: 1 -> skip;
fi;
}
:: else -> break;
od;
printf("Skipped %d event_notifier_set\n", MAX - sets);
}
#define p (done == FINAL)
never {
do
:: 1 // after an arbitrarily long prefix
:: p -> break // p becomes true
od;
do
:: !p -> accept: break // it then must remains true forever after
od
}

View file

@ -1,352 +0,0 @@
CPUs perform independent memory operations effectively in random order.
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest). Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.
The most basic tool is locking. Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization. Anything else is considerably harder, but it's
also justified more often than one would like. The two tools that
are provided by qemu/atomic.h are memory barriers and atomic operations.
Macros defined by qemu/atomic.h fall in three camps:
- compiler barriers: barrier();
- weak atomic access and manual memory barriers: atomic_read(),
atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_read_barrier_depends();
- sequentially consistent atomic access: everything else.
COMPILER MEMORY BARRIER
=======================
barrier() prevents the compiler from moving the memory accesses either
side of it to the other side. The compiler barrier has no direct effect
on the CPU, which may then reorder things however it wishes.
barrier() is mostly used within qemu/atomic.h itself. On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution. In this
case, qemu/atomic.h will reduce stronger memory barriers to simple
compiler barriers.
Still, barrier() can be useful when writing code that can be interrupted
by signal handlers.
SEQUENTIALLY CONSISTENT ATOMIC ACCESS
=====================================
Most of the operations in the qemu/atomic.h header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".
qemu/atomic.h provides the following set of atomic read-modify-write
operations:
void atomic_inc(ptr)
void atomic_dec(ptr)
void atomic_add(ptr, val)
void atomic_sub(ptr, val)
void atomic_and(ptr, val)
void atomic_or(ptr, val)
typeof(*ptr) atomic_fetch_inc(ptr)
typeof(*ptr) atomic_fetch_dec(ptr)
typeof(*ptr) atomic_fetch_add(ptr, val)
typeof(*ptr) atomic_fetch_sub(ptr, val)
typeof(*ptr) atomic_fetch_and(ptr, val)
typeof(*ptr) atomic_fetch_or(ptr, val)
typeof(*ptr) atomic_xchg(ptr, val
typeof(*ptr) atomic_cmpxchg(ptr, old, new)
all of which return the old value of *ptr. These operations are
polymorphic; they operate on any type that is as wide as an int.
Sequentially consistent loads and stores can be done using:
atomic_fetch_add(ptr, 0) for loads
atomic_xchg(ptr, val) for stores
However, they are quite expensive on some platforms, notably POWER and
ARM. Therefore, qemu/atomic.h provides two primitives with slightly
weaker constraints:
typeof(*ptr) atomic_mb_read(ptr)
void atomic_mb_set(ptr, val)
The semantics of these primitives map to Java volatile variables,
and are strongly related to memory barriers as used in the Linux
kernel (see below).
As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
be reordered with each other, and it is also not possible to reorder
"normal" accesses around them.
However, and this is the important difference between
atomic_mb_read/atomic_mb_set and sequential consistency, it is important
for both threads to access the same volatile variable. It is not the
case that everything visible to thread A when it writes volatile field f
becomes visible to thread B after it reads volatile field g. The store
and load have to "match" (i.e., be performed on the same volatile
field) to achieve the right semantics.
These operations operate on any type that is as wide as an int or smaller.
WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
=============================================
Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
In general, if the algorithm you are writing includes both writes
and reads on the same side, it is generally simpler to use sequentially
consistent primitives.
When using this model, variables are accessed with atomic_read() and
atomic_set(), and restrictions to the ordering of accesses is enforced
using the smp_rmb(), smp_wmb(), smp_mb() and smp_read_barrier_depends()
memory barriers.
atomic_read() and atomic_set() prevents the compiler from using
optimizations that might otherwise optimize accesses out of existence
on the one hand, or that might create unsolicited accesses on the other.
In general this should not have any effect, because the same compiler
barriers are already implied by memory barriers. However, it is useful
to do so, because it tells readers which variables are shared with
other threads, and which are local to the current thread or protected
by other, more mundane means.
Memory barriers control the order of references to shared memory.
They come in four kinds:
- smp_rmb() guarantees that all the LOAD operations specified before
the barrier will appear to happen before all the LOAD operations
specified after the barrier with respect to the other components of
the system.
In other words, smp_rmb() puts a partial ordering on loads, but is not
required to have any effect on stores.
- smp_wmb() guarantees that all the STORE operations specified before
the barrier will appear to happen before all the STORE operations
specified after the barrier with respect to the other components of
the system.
In other words, smp_wmb() puts a partial ordering on stores, but is not
required to have any effect on loads.
- smp_mb() guarantees that all the LOAD and STORE operations specified
before the barrier will appear to happen before all the LOAD and
STORE operations specified after the barrier with respect to the other
components of the system.
smp_mb() puts a partial ordering on both loads and stores. It is
stronger than both a read and a write memory barrier; it implies both
smp_rmb() and smp_wmb(), but it also prevents STOREs coming before the
barrier from overtaking LOADs coming after the barrier and vice versa.
- smp_read_barrier_depends() is a weaker kind of read barrier. On
most processors, whenever two loads are performed such that the
second depends on the result of the first (e.g., the first load
retrieves the address to which the second load will be directed),
the processor will guarantee that the first LOAD will appear to happen
before the second with respect to the other components of the system.
However, this is not always true---for example, it was not true on
Alpha processors. Whenever this kind of access happens to shared
memory (that is not protected by a lock), a read barrier is needed,
and smp_read_barrier_depends() can be used instead of smp_rmb().
Note that the first load really has to have a _data_ dependency and not
a control dependency. If the address for the second load is dependent
on the first load, but the dependency is through a conditional rather
than actually loading the address itself, then it's a _control_
dependency and a full read barrier or better is required.
This is the set of barriers that is required *between* two atomic_read()
and atomic_set() operations to achieve sequential consistency:
| 2nd operation |
|-----------------------------------------|
1st operation | (after last) | atomic_read | atomic_set |
---------------+--------------+-------------+------------|
(before first) | | none | smp_wmb() |
---------------+--------------+-------------+------------|
atomic_read | smp_rmb() | smp_rmb()* | ** |
---------------+--------------+-------------+------------|
atomic_set | none | smp_mb()*** | smp_wmb() |
---------------+--------------+-------------+------------|
* Or smp_read_barrier_depends().
** This requires a load-store barrier. How to achieve this varies
depending on the machine, but in practice smp_rmb()+smp_wmb()
should have the desired effect. For example, on PowerPC the
lwsync instruction is a combined load-load, load-store and
store-store barrier.
*** This requires a store-load barrier. On most machines, the only
way to achieve this is a full barrier.
You can see that the two possible definitions of atomic_mb_read()
and atomic_mb_set() are the following:
1) atomic_mb_read(p) = atomic_read(p); smp_rmb()
atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v); smp_mb()
2) atomic_mb_read(p) = smp_mb() atomic_read(p); smp_rmb()
atomic_mb_set(p, v) = smp_wmb(); atomic_set(p, v);
Usually the former is used, because smp_mb() is expensive and a program
normally has more reads than writes. Therefore it makes more sense to
make atomic_mb_set() the more expensive operation.
There are two common cases in which atomic_mb_read and atomic_mb_set
generate too many memory barriers, and thus it can be useful to manually
place barriers instead:
- when a data structure has one thread that is always a writer
and one thread that is always a reader, manual placement of
memory barriers makes the write side faster. Furthermore,
correctness is easy to check for in this case using the "pairing"
trick that is explained below:
thread 1 thread 1
------------------------- ------------------------
(other writes)
smp_wmb()
atomic_mb_set(&a, x) atomic_set(&a, x)
smp_wmb()
atomic_mb_set(&b, y) atomic_set(&b, y)
=>
thread 2 thread 2
------------------------- ------------------------
y = atomic_mb_read(&b) y = atomic_read(&b)
smp_rmb()
x = atomic_mb_read(&a) x = atomic_read(&a)
smp_rmb()
- sometimes, a thread is accessing many variables that are otherwise
unrelated to each other (for example because, apart from the current
thread, exactly one other thread will read or write each of these
variables). In this case, it is possible to "hoist" the implicit
barriers provided by atomic_mb_read() and atomic_mb_set() outside
a loop. For example, the above definition atomic_mb_read() gives
the following transformation:
n = 0; n = 0;
for (i = 0; i < 10; i++) => for (i = 0; i < 10; i++)
n += atomic_mb_read(&a[i]); n += atomic_read(&a[i]);
smp_rmb();
Similarly, atomic_mb_set() can be transformed as follows:
smp_mb():
smp_wmb();
for (i = 0; i < 10; i++) => for (i = 0; i < 10; i++)
atomic_mb_set(&a[i], false); atomic_set(&a[i], false);
smp_mb();
The two tricks can be combined. In this case, splitting a loop in
two lets you hoist the barriers out of the loops _and_ eliminate the
expensive smp_mb():
smp_wmb();
for (i = 0; i < 10; i++) { => for (i = 0; i < 10; i++)
atomic_mb_set(&a[i], false); atomic_set(&a[i], false);
atomic_mb_set(&b[i], false); smb_wmb();
} for (i = 0; i < 10; i++)
atomic_set(&a[i], false);
smp_mb();
The other thread can still use atomic_mb_read()/atomic_mb_set()
Memory barrier pairing
----------------------
A useful rule of thumb is that memory barriers should always, or almost
always, be paired with another barrier. In the case of QEMU, however,
note that the other barrier may actually be in a driver that runs in
the guest!
For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
both count as read barriers. A read barriers shall pair with a write
barrier or a full barrier; a write barrier shall pair with a read
barrier or a full barrier. A full barrier can pair with anything.
For example:
thread 1 thread 2
=============== ===============
a = 1;
smp_wmb();
b = 2; x = b;
smp_rmb();
y = a;
Note that the "writing" thread are accessing the variables in the
opposite order as the "reading" thread. This is expected: stores
before the write barrier will normally match the loads after the
read barrier, and vice versa. The same is true for more than 2
access and for data dependency barriers:
thread 1 thread 2
=============== ===============
b[2] = 1;
smp_wmb();
x->i = 2;
smp_wmb();
a = x; x = a;
smp_read_barrier_depends();
y = x->i;
smp_read_barrier_depends();
z = b[y];
smp_wmb() also pairs with atomic_mb_read(), and smp_rmb() also pairs
with atomic_mb_set().
COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
============================================
Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:
- atomic operations in Linux are always on a 32-bit int type and
use a boxed atomic_t type; atomic operations in QEMU are polymorphic
and use normal C types.
- atomic_read and atomic_set in Linux give no guarantee at all;
atomic_read and atomic_set in QEMU include a compiler barrier
(similar to the ACCESS_ONCE macro in Linux).
- most atomic read-modify-write operations in Linux return void;
in QEMU, all of them return the old value of the variable.
- different atomic read-modify-write operations in Linux imply
a different set of memory barriers; in QEMU, all of them enforce
sequential consistency, which means they imply full memory barriers
before and after the operation.
- Linux does not have an equivalent of atomic_mb_read() and
atomic_mb_set(). In particular, note that set_mb() is a little
weaker than atomic_mb_set().
SOURCES
=======
* Documentation/memory-barriers.txt from the Linux kernel
* "The JSR-133 Cookbook for Compiler Writers", available at
http://g.oswego.edu/dl/jmm/cookbook.html

View file

@ -1,161 +0,0 @@
Block I/O error injection using blkdebug
----------------------------------------
Copyright (C) 2014 Red Hat Inc
This work is licensed under the terms of the GNU GPL, version 2 or later. See
the COPYING file in the top-level directory.
The blkdebug block driver is a rule-based error injection engine. It can be
used to exercise error code paths in block drivers including ENOSPC (out of
space) and EIO.
This document gives an overview of the features available in blkdebug.
Background
----------
Block drivers have many error code paths that handle I/O errors. Image formats
are especially complex since metadata I/O errors during cluster allocation or
while updating tables happen halfway through request processing and require
discipline to keep image files consistent.
Error injection allows test cases to trigger I/O errors at specific points.
This way, all error paths can be tested to make sure they are correct.
Rules
-----
The blkdebug block driver takes a list of "rules" that tell the error injection
engine when to fail an I/O request.
Each I/O request is evaluated against the rules. If a rule matches the request
then its "action" is executed.
Rules can be placed in a configuration file; the configuration file
follows the same .ini-like format used by QEMU's -readconfig option, and
each section of the file represents a rule.
The following configuration file defines a single rule:
$ cat blkdebug.conf
[inject-error]
event = "read_aio"
errno = "28"
This rule fails all aio read requests with ENOSPC (28). Note that the errno
value depends on the host. On Linux, see
/usr/include/asm-generic/errno-base.h for errno values.
Invoke QEMU as follows:
$ qemu-system-x86_64
-drive if=none,cache=none,file=blkdebug:blkdebug.conf:test.img,id=drive0 \
-device virtio-blk-pci,drive=drive0,id=virtio-blk-pci0
Rules support the following attributes:
event - which type of operation to match (e.g. read_aio, write_aio,
flush_to_os, flush_to_disk). See the "Events" section for
information on events.
state - (optional) the engine must be in this state number in order for this
rule to match. See the "State transitions" section for information
on states.
errno - the numeric errno value to return when a request matches this rule.
The errno values depend on the host since the numeric values are not
standarized in the POSIX specification.
sector - (optional) a sector number that the request must overlap in order to
match this rule
once - (optional, default "off") only execute this action on the first
matching request
immediately - (optional, default "off") return a NULL BlockAIOCB
pointer and fail without an errno instead. This
exercises the code path where BlockAIOCB fails and the
caller's BlockCompletionFunc is not invoked.
Events
------
Block drivers provide information about the type of I/O request they are about
to make so rules can match specific types of requests. For example, the qcow2
block driver tells blkdebug when it accesses the L1 table so rules can match
only L1 table accesses and not other metadata or guest data requests.
The core events are:
read_aio - guest data read
write_aio - guest data write
flush_to_os - write out unwritten block driver state (e.g. cached metadata)
flush_to_disk - flush the host block device's disk cache
See block/blkdebug.c:event_names[] for the full list of events. You may need
to grep block driver source code to understand the meaning of specific events.
State transitions
-----------------
There are cases where more power is needed to match a particular I/O request in
a longer sequence of requests. For example:
write_aio
flush_to_disk
write_aio
How do we match the 2nd write_aio but not the first? This is where state
transitions come in.
The error injection engine has an integer called the "state" that always starts
initialized to 1. The state integer is internal to blkdebug and cannot be
observed from outside but rules can interact with it for powerful matching
behavior.
Rules can be conditional on the current state and they can transition to a new
state.
When a rule's "state" attribute is non-zero then the current state must equal
the attribute in order for the rule to match.
For example, to match the 2nd write_aio:
[set-state]
event = "write_aio"
state = "1"
new_state = "2"
[inject-error]
event = "write_aio"
state = "2"
errno = "5"
The first write_aio request matches the set-state rule and transitions from
state 1 to state 2. Once state 2 has been entered, the set-state rule no
longer matches since it requires state 1. But the inject-error rule now
matches the next write_aio request and injects EIO (5).
State transition rules support the following attributes:
event - which type of operation to match (e.g. read_aio, write_aio,
flush_to_os, flush_to_disk). See the "Events" section for
information on events.
state - (optional) the engine must be in this state number in order for this
rule to match
new_state - transition to this state number
Suspend and resume
------------------
Exercising code paths in block drivers may require specific ordering amongst
concurrent requests. The "breakpoint" feature allows requests to be halted on
a blkdebug event and resumed later. This makes it possible to achieve
deterministic ordering when multiple requests are in flight.
Breakpoints on blkdebug events are associated with a user-defined "tag" string.
This tag serves as an identifier by which the request can be resumed at a later
point.
See the qemu-io(1) break, resume, remove_break, and wait_break commands for
details.

View file

@ -1,69 +0,0 @@
= Block driver correctness testing with blkverify =
== Introduction ==
This document describes how to use the blkverify protocol to test that a block
driver is operating correctly.
It is difficult to test and debug block drivers against real guests. Often
processes inside the guest will crash because corrupt sectors were read as part
of the executable. Other times obscure errors are raised by a program inside
the guest. These issues are extremely hard to trace back to bugs in the block
driver.
Blkverify solves this problem by catching data corruption inside QEMU the first
time bad data is read and reporting the disk sector that is corrupted.
== How it works ==
The blkverify protocol has two child block devices, the "test" device and the
"raw" device. Read/write operations are mirrored to both devices so their
state should always be in sync.
The "raw" device is a raw image, a flat file, that has identical starting
contents to the "test" image. The idea is that the "raw" device will handle
read/write operations correctly and not corrupt data. It can be used as a
reference for comparison against the "test" device.
After a mirrored read operation completes, blkverify will compare the data and
raise an error if it is not identical. This makes it possible to catch the
first instance where corrupt data is read.
== Example ==
Imagine raw.img has 0xcd repeated throughout its first sector:
$ ./qemu-io -c 'read -v 0 512' raw.img
00000000: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
00000010: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
[...]
000001e0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
000001f0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
read 512/512 bytes at offset 0
512.000000 bytes, 1 ops; 0.0000 sec (97.656 MiB/sec and 200000.0000 ops/sec)
And test.img is corrupt, its first sector is zeroed when it shouldn't be:
$ ./qemu-io -c 'read -v 0 512' test.img
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[...]
000001e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
000001f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
read 512/512 bytes at offset 0
512.000000 bytes, 1 ops; 0.0000 sec (81.380 MiB/sec and 166666.6667 ops/sec)
This error is caught by blkverify:
$ ./qemu-io -c 'read 0 512' blkverify:a.img:b.img
blkverify: read sector_num=0 nb_sectors=4 contents mismatch in sector 0
A more realistic scenario is verifying the installation of a guest OS:
$ ./qemu-img create raw.img 16G
$ ./qemu-img create -f qcow2 test.qcow2 16G
$ x86_64-softmmu/qemu-system-x86_64 -cdrom debian.iso \
-drive file=blkverify:raw.img:test.qcow2
If the installation is aborted when blkverify detects corruption, use qemu-io
to explore the contents of the disk image at the sector in question.

View file

@ -1,43 +0,0 @@
= Bootindex property =
Block and net devices have bootindex property. This property is used to
determine the order in which firmware will consider devices for booting
the guest OS. If the bootindex property is not set for a device, it gets
lowest boot priority. There is no particular order in which devices with
unset bootindex property will be considered for booting, but they will
still be bootable.
== Example ==
Let's assume we have a QEMU machine with two NICs (virtio, e1000) and two
disks (IDE, virtio):
qemu -drive file=disk1.img,if=none,id=disk1
-device ide-drive,drive=disk1,bootindex=4
-drive file=disk2.img,if=none,id=disk2
-device virtio-blk-pci,drive=disk2,bootindex=3
-netdev type=user,id=net0 -device virtio-net-pci,netdev=net0,bootindex=2
-netdev type=user,id=net1 -device e1000,netdev=net1,bootindex=1
Given the command above, firmware should try to boot from the e1000 NIC
first. If this fails, it should try the virtio NIC next; if this fails
too, it should try the virtio disk, and then the IDE disk.
== Limitations ==
1. Some firmware has limitations on which devices can be considered for
booting. For instance, the PC BIOS boot specification allows only one
disk to be bootable. If boot from disk fails for some reason, the BIOS
won't retry booting from other disk. It can still try to boot from
floppy or net, though.
2. Sometimes, firmware cannot map the device path QEMU wants firmware to
boot from to a boot method. It doesn't happen for devices the firmware
can natively boot from, but if firmware relies on an option ROM for
booting, and the same option ROM is used for booting from more then one
device, the firmware may not be able to ask the option ROM to boot from
a particular device reliably. For instance with the PC BIOS, if a SCSI HBA
has three bootable devices target1, target3, target5 connected to it,
the option ROM will have a boot method for each of them, but it is not
possible to map from boot method back to a specific target. This is a
shortcoming of the PC BIOS boot specification.

View file

@ -1,181 +0,0 @@
QEMU CCID Device Documentation.
Contents
1. USB CCID device
2. Building
3. Using ccid-card-emulated with hardware
4. Using ccid-card-emulated with certificates
5. Using ccid-card-passthru with client side hardware
6. Using ccid-card-passthru with client side certificates
7. Passthrough protocol scenario
8. libcacard
1. USB CCID device
The USB CCID device is a USB device implementing the CCID specification, which
lets one connect smart card readers that implement the same spec. For more
information see the specification:
Universal Serial Bus
Device Class: Smart Card
CCID
Specification for
Integrated Circuit(s) Cards Interface Devices
Revision 1.1
April 22rd, 2005
Smartcards are used for authentication, single sign on, decryption in
public/private schemes and digital signatures. A smartcard reader on the client
cannot be used on a guest with simple usb passthrough since it will then not be
available on the client, possibly locking the computer when it is "removed". On
the other hand this device can let you use the smartcard on both the client and
the guest machine. It is also possible to have a completely virtual smart card
reader and smart card (i.e. not backed by a physical device) using this device.
2. Building
The cryptographic functions and access to the physical card is done via NSS.
Installing NSS:
In redhat/fedora:
yum install nss-devel
In ubuntu/debian:
apt-get install libnss3-dev
(not tested on ubuntu)
Configuring and building:
./configure --enable-smartcard && make
3. Using ccid-card-emulated with hardware
Assuming you have a working smartcard on the host with the current
user, using NSS, qemu acts as another NSS client using ccid-card-emulated:
qemu -usb -device usb-ccid -device ccid-card-emulated
4. Using ccid-card-emulated with certificates stored in files
You must create the CA and card certificates. This is a one time process.
We use NSS certificates:
mkdir fake-smartcard
cd fake-smartcard
certutil -N -d sql:$PWD
certutil -S -d sql:$PWD -s "CN=Fake Smart Card CA" -x -t TC,TC,TC -n fake-smartcard-ca
certutil -S -d sql:$PWD -t ,, -s "CN=John Doe" -n id-cert -c fake-smartcard-ca
certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (signing)" --nsCertType smime -n signing-cert -c fake-smartcard-ca
certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (encryption)" --nsCertType sslClient -n encryption-cert -c fake-smartcard-ca
Note: you must have exactly three certificates.
You can use the emulated card type with the certificates backend:
qemu -usb -device usb-ccid -device ccid-card-emulated,backend=certificates,db=sql:$PWD,cert1=id-cert,cert2=signing-cert,cert3=encryption-cert
To use the certificates in the guest, export the CA certificate:
certutil -L -r -d sql:$PWD -o fake-smartcard-ca.cer -n fake-smartcard-ca
and import it in the guest:
certutil -A -d /etc/pki/nssdb -i fake-smartcard-ca.cer -t TC,TC,TC -n fake-smartcard-ca
In a Linux guest you can then use the CoolKey PKCS #11 module to access
the card:
certutil -d /etc/pki/nssdb -L -h all
It will prompt you for the PIN (which is the password you assigned to the
certificate database early on), and then show you all three certificates
together with the manually imported CA cert:
Certificate Nickname Trust Attributes
fake-smartcard-ca CT,C,C
John Doe:CAC ID Certificate u,u,u
John Doe:CAC Email Signature Certificate u,u,u
John Doe:CAC Email Encryption Certificate u,u,u
If this does not happen, CoolKey is not installed or not registered with
NSS. Registration can be done from Firefox or the command line:
modutil -dbdir /etc/pki/nssdb -add "CAC Module" -libfile /usr/lib64/pkcs11/libcoolkeypk11.so
modutil -dbdir /etc/pki/nssdb -list
5. Using ccid-card-passthru with client side hardware
on the host specify the ccid-card-passthru device with a suitable chardev:
qemu -chardev socket,server,host=0.0.0.0,port=2001,id=ccid,nowait -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid
on the client run vscclient, built when you built QEMU:
vscclient <qemu-host> 2001
6. Using ccid-card-passthru with client side certificates
This case is not particularly useful, but you can use it to debug
your setup if #4 works but #5 does not.
Follow instructions as per #4, except run QEMU and vscclient as follows:
Run qemu as per #5, and run vscclient from the "fake-smartcard"
directory as follows:
qemu -chardev socket,server,host=0.0.0.0,port=2001,id=ccid,nowait -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid
vscclient -e "db=\"sql:$PWD\" use_hw=no soft=(,Test,CAC,,id-cert,signing-cert,encryption-cert)" <qemu-host> 2001
7. Passthrough protocol scenario
This is a typical interchange of messages when using the passthru card device.
usb-ccid is a usb device. It defaults to an unattached usb device on startup.
usb-ccid expects a chardev and expects the protocol defined in
cac_card/vscard_common.h to be passed over that.
The usb-ccid device can be in one of three modes:
* detached
* attached with no card
* attached with card
A typical interchange is: (the arrow shows who started each exchange, it can be client
originated or guest originated)
client event | vscclient | passthru | usb-ccid | guest event
----------------------------------------------------------------------------------------------
| VSC_Init | | |
| VSC_ReaderAdd | | attach |
| | | | sees new usb device.
card inserted -> | | | |
| VSC_ATR | insert | insert | see new card
| | | |
| VSC_APDU | VSC_APDU | | <- guest sends APDU
client<->physical | | | |
card APDU exchange| | | |
client response ->| VSC_APDU | VSC_APDU | | receive APDU response
...
[APDU<->APDU repeats several times]
...
card removed -> | | | |
| VSC_CardRemove | remove | remove | card removed
...
[(card insert, apdu's, card remove) repeat]
...
kill/quit | | | |
vscclient | | | |
| VSC_ReaderRemove | | detach |
| | | | usb device removed.
8. libcacard
Both ccid-card-emulated and vscclient use libcacard as the card emulator.
libcacard implements a completely virtual CAC (DoD standard for smart
cards) compliant card and uses NSS to retrieve certificates and do
any encryption. The backend can then be a real reader and card, or
certificates stored in files.
For documentation of the library see docs/libcacard.txt.

View file

@ -1,37 +0,0 @@
###########################################################################
#
# You can pass this file directly to qemu using the -readconfig
# command line switch.
#
# This config file creates a EHCI adapter with companion UHCI
# controllers as multifunction device in PCI slot "1d".
#
# Specify "bus=ehci.0" when creating usb devices to hook them up
# there.
#
[device "ehci"]
driver = "ich9-usb-ehci1"
addr = "1d.7"
multifunction = "on"
[device "uhci-1"]
driver = "ich9-usb-uhci1"
addr = "1d.0"
multifunction = "on"
masterbus = "ehci.0"
firstport = "0"
[device "uhci-2"]
driver = "ich9-usb-uhci2"
addr = "1d.1"
multifunction = "on"
masterbus = "ehci.0"
firstport = "2"
[device "uhci-3"]
driver = "ich9-usb-uhci3"
addr = "1d.2"
multifunction = "on"
masterbus = "ehci.0"
firstport = "4"

View file

@ -1,239 +0,0 @@
# Specification for the fuzz testing tool
#
# Copyright (C) 2014 Maria Kustova <maria.k@catit.be>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
Image fuzzer
============
Description
-----------
The goal of the image fuzzer is to catch crashes of qemu-io/qemu-img
by providing to them randomly corrupted images.
Test images are generated from scratch and have valid inner structure with some
elements, e.g. L1/L2 tables, having random invalid values.
Test runner
-----------
The test runner generates test images, executes tests utilizing generated
images, indicates their results and collects all test related artifacts (logs,
core dumps, test images, backing files).
The test means execution of all available commands under test with the same
generated test image.
By default, the test runner generates new tests and executes them until
keyboard interruption. But if a test seed is specified via the '--seed' runner
parameter, then only one test with this seed will be executed, after its finish
the runner will exit.
The runner uses an external image fuzzer to generate test images. An image
generator should be specified as a mandatory parameter of the test runner.
Details about interactions between the runner and fuzzers see "Module
interfaces".
The runner activates generation of core dumps during test executions, but it
assumes that core dumps will be generated in the current working directory.
For comprehensive test results, please, set up your test environment
properly.
Paths to binaries under test (SUTs) qemu-img and qemu-io are retrieved from
environment variables. If the environment check fails the runner will
use SUTs installed in system paths.
qemu-img is required for creation of backing files, so it's mandatory to set
the related environment variable if it's not installed in the system path.
For details about environment variables see qemu-iotests/check.
The runner accepts a JSON array of fields expected to be fuzzed via the
'--config' argument, e.g.
'[["feature_name_table"], ["header", "l1_table_offset"]]'
Each sublist can have one or two strings defining image structure elements.
In the latter case a parent element should be placed on the first position,
and a field name on the second one.
The runner accepts a list of commands under test as a JSON array via
the '--command' argument. Each command is a list containing a SUT and all its
arguments, e.g.
runner.py -c '[["qemu-io", "$test_img", "-c", "write $off $len"]]'
/tmp/test ../qcow2
For variable arguments next aliases can be used:
- $test_img for a fuzzed img
- $off for an offset in the fuzzed image
- $len for a data size
Values for last two aliases will be generated based on a size of a virtual
disk of the generated image.
In case when no commands are specified the runner will execute commands from
the default list:
- qemu-img check
- qemu-img info
- qemu-img convert
- qemu-io -c read
- qemu-io -c write
- qemu-io -c aio_read
- qemu-io -c aio_write
- qemu-io -c flush
- qemu-io -c discard
- qemu-io -c truncate
Qcow2 image generator
---------------------
The 'qcow2' generator is a Python package providing 'create_image' method as
a single public API. See details in 'Test runner/image fuzzer' chapter of
'Module interfaces'.
Qcow2 contains two submodules: fuzz.py and layout.py.
'fuzz.py' contains all fuzzing functions, one per image field. It's assumed
that after code analysis every field will have own constraints for its value.
For now only universal potentially dangerous values are used, e.g. type limits
for integers or unsafe symbols as '%s' for strings. For bitmasks random amount
of bits are set to ones. All fuzzed values are checked on non-equality to the
current valid value of the field. In case of equality the value will be
regenerated.
'layout.py' creates a random valid image, fuzzes a random subset of the image
fields by 'fuzz.py' module and writes a fuzzed image to the file specified.
If a fuzzer configuration is specified, then it has the next interpretation:
1. If a list contains a parent image element only, then some random portion
of fields of this element will be fuzzed every test.
The same behavior is applied for the entire image if no configuration is
used. This case is useful for the test specialization.
2. If a list contains a parent element and a field name, then a field
will be always fuzzed for every test. This case is useful for regression
testing.
The generator can create header fields, header extensions, L1/L2 tables and
refcount table and blocks.
Module interfaces
-----------------
* Test runner/image fuzzer
The runner calls an image generator specifying the path to a test image file,
path to a backing file and its format and a fuzzer configuration.
An image generator is expected to provide a
'create_image(test_img_path, backing_file_path=None,
backing_file_format=None, fuzz_config=None)'
method that creates a test image, writes it to the specified file and returns
the size of the virtual disk.
The file should be created if it doesn't exist or overwritten otherwise.
fuzz_config has a form of a list of lists. Every sublist can have one
or two elements: first element is a name of a parent image element, second one
if exists is a name of a field in this element.
Example,
[['header', 'l1_table_offset'],
['header', 'nb_snapshots'],
['feature_name_table']]
Random seed is set by the runner at every test execution for the regression
purpose, so an image generator is not recommended to modify it internally.
Overall fuzzer requirements
===========================
Input data:
----------
- image template (generator)
- work directory
- action vector (optional)
- seed (optional)
- SUT and its arguments (optional)
Fuzzer requirements:
-------------------
1. Should be able to inject random data
2. Should be able to select a random value from the manually pregenerated
vector (boundary values, e.g. max/min cluster size)
3. Image template should describe a general structure invariant for all
test images (image format description)
4. Image template should be autonomous and other fuzzer parts should not
rely on it
5. Image template should contain reference rules (not only block+size
description)
6. Should generate the test image with the correct structure based on an image
template
7. Should accept a seed as an argument (for regression purpose)
8. Should generate a seed if it is not specified as an input parameter.
9. The same seed should generate the same image for the same action vector,
specified or generated.
10. Should accept a vector of actions as an argument (for test reproducing and
for test case specification, e.g. group of tests for header structure,
group of test for snapshots, etc)
11. Action vector should be randomly generated from the pool of available
actions, if it is not specified as an input parameter
12. Pool of actions should be defined automatically based on an image template
13. Should accept a SUT and its call parameters as an argument or select them
randomly otherwise. As far as it's expected to be rarely changed, the list
of all possible test commands can be available in the test runner
internally.
14. Should support an external cancellation of a test run
15. Seed should be logged (for regression purpose)
16. All files related to a test result should be collected: a test image,
SUT logs, fuzzer logs and crash dumps
17. Should be compatible with python version 2.4-2.7
18. Usage of external libraries should be limited as much as possible.
Image formats:
-------------
Main target image format is qcow2, but support of image templates should
provide an ability to add any other image format.
Effectiveness:
-------------
The fuzzer can be controlled via template, seed and action vector;
it makes the fuzzer itself invariant to an image format and test logic.
It should be able to perform rather complex and precise tests, that can be
specified via an action vector. Otherwise, knowledge about an image structure
allows the fuzzer to generate the pool of all available areas can be fuzzed
and randomly select some of them and so compose its own action vector.
Also complexity of a template defines complexity of the fuzzer, so its
functionality can be varied from simple model-independent fuzzing to smart
model-based one.
Glossary:
--------
Action vector is a sequence of structure elements retrieved from an image
format, each of them will be fuzzed for the test image. It's a subset of
elements of the action pool. Example: header, refcount table, etc.
Action pool is all available elements of an image structure that generated
automatically from an image template.
Image template is a formal description of an image structure and relations
between image blocks.
Test image is an output image of the fuzzer defined by the current seed and
action vector.

View file

@ -1,483 +0,0 @@
This file documents the CAC (Common Access Card) library in the libcacard
subdirectory.
Virtual Smart Card Emulator
This emulator is designed to provide emulation of actual smart cards to a
virtual card reader running in a guest virtual machine. The emulated smart
cards can be representations of real smart cards, where the necessary functions
such as signing, card removal/insertion, etc. are mapped to real, physical
cards which are shared with the client machine the emulator is running on, or
the cards could be pure software constructs.
The emulator is structured to allow multiple replaceable or additional pieces,
so it can be easily modified for future requirements. The primary envisioned
modifications are:
1) The socket connection to the virtual card reader (presumably a CCID reader,
but other ISO-7816 compatible readers could be used). The code that handles
this is in vscclient.c.
2) The virtual card low level emulation. This is currently supplied by using
NSS. This emulation could be replaced by implementations based on other
security libraries, including but not limitted to openssl+pkcs#11 library,
raw pkcs#11, Microsoft CAPI, direct opensc calls, etc. The code that handles
this is in vcard_emul_nss.c.
3) Emulation for new types of cards. The current implementation emulates the
original DoD CAC standard with separate pki containers. This emulator lives in
cac.c. More than one card type emulator could be included. Other cards could
be emulated as well, including PIV, newer versions of CAC, PKCS #15, etc.
--------------------
Replacing the Socket Based Virtual Reader Interface.
The current implementation contains a replaceable module vscclient.c. The
current vscclient.c implements a sockets interface to the virtual ccid reader
on the guest. CCID commands that are pertinent to emulation are passed
across the socket, and their responses are passed back along that same socket.
The protocol that vscclient uses is defined in vscard_common.h and connects
to a qemu ccid usb device. Since this socket runs as a client, vscclient.c
implements a program with a main entry. It also handles argument parsing for
the emulator.
An application that wants to use the virtual reader can replace vscclient.c
with its own implementation that connects to its own CCID reader. The calls
that the CCID reader can call are:
VReaderList * vreader_get_reader_list();
This function returns a list of virtual readers. These readers may map to
physical devices, or simulated devices depending on vcard the back end. Each
reader in the list should represent a reader to the virtual machine. Virtual
USB address mapping is left to the CCID reader front end. This call can be
made any time to get an updated list. The returned list is a copy of the
internal list that can be referenced by the caller without locking. This copy
must be freed by the caller with vreader_list_delete when it is no longer
needed.
VReaderListEntry *vreader_list_get_first(VReaderList *);
This function gets the first entry on the reader list. Along with
vreader_list_get_next(), vreader_list_get_first() can be used to walk the
reader list returned from vreader_get_reader_list(). VReaderListEntries are
part of the list themselves and do not need to be freed separately from the
list. If there are no entries on the list, it will return NULL.
VReaderListEntry *vreader_list_get_next(VReaderListEntry *);
This function gets the next entry in the list. If there are no more entries
it will return NULL.
VReader * vreader_list_get_reader(VReaderListEntry *)
This function returns the reader stored in the reader List entry. Caller gets
a new reference to a reader. The caller must free its reference when it is
finished with vreader_free().
void vreader_free(VReader *reader);
This function frees a reference to a reader. Readers are reference counted
and are automatically deleted when the last reference is freed.
void vreader_list_delete(VReaderList *list);
This function frees the list, all the elements on the list, and all the
reader references held by the list.
VReaderStatus vreader_power_on(VReader *reader, char *atr, int *len);
This function simulates a card power on. A virtual card does not care about
the actual voltage and other physical parameters, but it does care that the
card is actually on or off. Cycling the card causes the card to reset. If
the caller provides enough space, vreader_power_on will return the ATR of
the virtual card. The amount of space provided in atr should be indicated
in *len. The function modifies *len to be the actual length of of the
returned ATR.
VReaderStatus vreader_power_off(VReader *reader);
This function simulates a power off of a virtual card.
VReaderStatus vreader_xfer_bytes(VReader *reader, unsigne char *send_buf,
int send_buf_len,
unsigned char *receive_buf,
int receive_buf_len);
This function sends a raw apdu to a card and returns the card's response.
The CCID front end should return the response back. Most of the emulation
is driven from these APDUs.
VReaderStatus vreader_card_is_present(VReader *reader);
This function returns whether or not the reader has a card inserted. The
vreader_power_on, vreader_power_off, and vreader_xfer_bytes will return
VREADER_NO_CARD.
const char *vreader_get_name(VReader *reader);
This function returns the name of the reader. The name comes from the card
emulator level and is usually related to the name of the physical reader.
VReaderID vreader_get_id(VReader *reader);
This function returns the id of a reader. All readers start out with an id
of -1. The application can set the id with vreader_set_id.
VReaderStatus vreader_get_id(VReader *reader, VReaderID id);
This function sets the reader id. The application is responsible for making
sure that the id is unique for all readers it is actively using.
VReader *vreader_find_reader_by_id(VReaderID id);
This function returns the reader which matches the id. If two readers match,
only one is returned. The function returns NULL if the id is -1.
Event *vevent_wait_next_vevent();
This function blocks waiting for reader and card insertion events. There
will be one event for each card insertion, each card removal, each reader
insertion and each reader removal. At start up, events are created for all
the initial readers found, as well as all the cards that are inserted.
Event *vevent_get_next_vevent();
This function returns a pending event if it exists, otherwise it returns
NULL. It does not block.
----------------
Card Type Emulator: Adding a New Virtual Card Type
The ISO 7816 card spec describes 2 types of cards:
1) File system cards, where the smartcard is managed by reading and writing
data to files in a file system. There is currently only boiler plate
implemented for file system cards.
2) VM cards, where the card has loadable applets which perform the card
functions. The current implementation supports VM cards.
In the case of VM cards, the difference between various types of cards is
really what applets have been installed in that card. This structure is
mirrored in card type emulators. The 7816 emulator already handles the basic
ISO 7186 commands. Card type emulators simply need to add the virtual applets
which emulate the real card applets. Card type emulators have exactly one
public entry point:
VCARDStatus xxx_card_init(VCard *card, const char *flags,
const unsigned char *cert[],
int cert_len[],
VCardKey *key[],
int cert_count);
The parameters for this are:
card - the virtual card structure which will represent this card.
flags - option flags that may be specific to this card type.
cert - array of binary certificates.
cert_len - array of lengths of each of the certificates specified in cert.
key - array of opaque key structures representing the private keys on
the card.
cert_count - number of entries in cert, cert_len, and key arrays.
Any cert, cert_len, or key with the same index are matching sets. That is
cert[0] is cert_len[0] long and has the corresponding private key of key[0].
The card type emulator is expected to own the VCardKeys, but it should copy
any raw cert data it wants to save. It can create new applets and add them to
the card using the following functions:
VCardApplet *vcard_new_applet(VCardProcessAPDU apdu_func,
VCardResetApplet reset_func,
const unsigned char *aid,
int aid_len);
This function creates a new applet. Applet structures store the following
information:
1) the AID of the applet (set by aid and aid_len).
2) a function to handle APDUs for this applet. (set by apdu_func, more on
this below).
3) a function to reset the applet state when the applet is selected.
(set by reset_func, more on this below).
3) applet private data, a data pointer used by the card type emulator to
store any data or state it needs to complete requests. (set by a
separate call).
4) applet private data free, a function used to free the applet private
data when the applet itself is destroyed.
The created applet can be added to the card with vcard_add_applet below.
void vcard_set_applet_private(VCardApplet *applet,
VCardAppletPrivate *private,
VCardAppletPrivateFree private_free);
This function sets the private data and the corresponding free function.
VCardAppletPrivate is an opaque data structure to the rest of the emulator.
The card type emulator can define it any way it wants by defining
struct VCardAppletPrivateStruct {};. If there is already a private data
structure on the applet, the old one is freed before the new one is set up.
passing two NULL clear any existing private data.
VCardStatus vcard_add_applet(VCard *card, VCardApplet *applet);
Add an applet onto the list of applets attached to the card. Once an applet
has been added, it can be selected by its AID, and then commands will be
routed to it VCardProcessAPDU function. This function adopts the applet that
is passed into it. Note: 2 applets with the same AID should not be added to
the same card. It is permissible to add more than one applet. Multiple applets
may have the same VCardPRocessAPDU entry point.
The certs and keys should be attached to private data associated with one or
more appropriate applets for that card. Control will come to the card type
emulators once one of its applets are selected through the VCardProcessAPDU
function it specified when it created the applet.
The signature of VCardResetApplet is:
VCardStatus (*VCardResetApplet) (VCard *card, int channel);
This function will reset the any internal applet state that needs to be
cleared after a select applet call. It should return VCARD_DONE;
The signature of VCardProcessAPDU is:
VCardStatus (*VCardProcessAPDU)(VCard *card, VCardAPDU *apdu,
VCardResponse **response);
This function examines the APDU and determines whether it should process
the apdu directly, reject the apdu as invalid, or pass the apdu on to
the basic 7816 emulator for processing.
If the 7816 emulator should process the apdu, then the VCardProcessAPDU
should return VCARD_NEXT.
If there is an error, then VCardProcessAPDU should return an error
response using vcard_make_response and the appropriate 7816 error code
(see card_7816t.h) or vcard_make_response with a card type specific error
code. It should then return VCARD_DONE.
If the apdu can be processed correctly, VCardProcessAPDU should do so,
set the response value appropriately for that APDU, and return VCARD_DONE.
VCardProcessAPDU should always set the response if it returns VCARD_DONE.
It should always either return VCARD_DONE or VCARD_NEXT.
Parsing the APDU --
Prior to processing calling the card type emulator's VCardProcessAPDU function, the emulator has already decoded the APDU header and set several fields:
apdu->a_data - The raw apdu data bytes.
apdu->a_len - The len of the raw apdu data.
apdu->a_body - The start of any post header parameter data.
apdu->a_Lc - The parameter length value.
apdu->a_Le - The expected length of any returned data.
apdu->a_cla - The raw apdu class.
apdu->a_channel - The channel (decoded from the class).
apdu->a_secure_messaging_type - The decoded secure messaging type
(from class).
apdu->a_type - The decode class type.
apdu->a_gen_type - the generic class type (7816, PROPRIETARY, RFU, PTS).
apdu->a_ins - The instruction byte.
apdu->a_p1 - Parameter 1.
apdu->a_p2 - Parameter 2.
Creating a Response --
The expected result of any APDU call is a response. The card type emulator must
set *response with an appropriate VCardResponse value if it returns VCARD_DONE.
Responses could be as simple as returning a 2 byte status word response, to as
complex as returning a block of data along with a 2 byte response. Which is
returned will depend on the semantics of the APDU. The following functions will
create card responses.
VCardResponse *vcard_make_response(VCard7816Status status);
This is the most basic function to get a response. This function will
return a response the consists solely one 2 byte status code. If that status
code is defined in card_7816t.h, then this function is guaranteed to
return a response with that status. If a cart type specific status code
is passed and vcard_make_response fails to allocate the appropriate memory
for that response, then vcard_make_response will return a VCardResponse
of VCARD7816_STATUS_EXC_ERROR_MEMORY. In any case, this function is
guaranteed to return a valid VCardResponse.
VCardResponse *vcard_response_new(unsigned char *buf, int len,
VCard7816Status status);
This function is similar to vcard_make_response except it includes some
returned data with the response. It could also fail to allocate enough
memory, in which case it will return NULL.
VCardResponse *vcard_response_new_status_bytes(unsigned char sw1,
unsigned char sw2);
Sometimes in 7816 the response bytes are treated as two separate bytes with
split meanings. This function allows you to create a response based on
two separate bytes. This function could fail, in which case it will return
NULL.
VCardResponse *vcard_response_new_bytes(unsigned char *buf, int len,
unsigned char sw1,
unsigned char sw2);
This function is the same as vcard_response_new except you may specify
the status as two separate bytes like vcard_response_new_status_bytes.
Implementing functionality ---
The following helper functions access information about the current card
and applet.
VCARDAppletPrivate *vcard_get_current_applet_private(VCard *card,
int channel);
This function returns any private data set by the card type emulator on
the currently selected applet. The card type emulator keeps track of the
current applet state in this data structure. Any certs and keys associated
with a particular applet is also stored here.
int vcard_emul_get_login_count(VCard *card);
This function returns the the number of remaining login attempts for this
card. If the card emulator does not know, or the card does not have a
way of giving this information, this function returns -1.
VCard7816Status vcard_emul_login(VCard *card, unsigned char *pin,
int pin_len);
This function logs into the card and returns the standard 7816 status
word depending on the success or failure of the call.
void vcard_emul_delete_key(VCardKey *key);
This function frees the VCardKey passed in to xxxx_card_init. The card
type emulator is responsible for freeing this key when it no longer needs
it.
VCard7816Status vcard_emul_rsa_op(VCard *card, VCardKey *key,
unsigned char *buffer,
int buffer_size);
This function does a raw rsa op on the buffer with the given key.
The sample card type emulator is found in cac.c. It implements the cac specific
applets. Only those applets needed by the coolkey pkcs#11 driver on the guest
have been implemented. To support the full range CAC middleware, a complete CAC
card according to the CAC specs should be implemented here.
------------------------------
Virtual Card Emulator
This code accesses both real smart cards and simulated smart cards through
services provided on the client. The current implementation uses NSS, which
already knows how to talk to various PKCS #11 modules on the client, and is
portable to most operating systems. A particular emulator can have only one
virtual card implementation at a time.
The virtual card emulator consists of a series of virtual card services. In
addition to the services describe above (services starting with
vcard_emul_xxxx), the virtual card emulator also provides the following
functions:
VCardEmulError vcard_emul_init(cont VCardEmulOptions *options);
The options structure is built by another function in the virtual card
interface where a string of virtual card emulator specific strings are
mapped to the options. The actual structure is defined by the virtual card
emulator and is used to determine the configuration of soft cards, or to
determine which physical cards to present to the guest.
The vcard_emul_init function will build up sets of readers, create any
threads that are needed to watch for changes in the reader state. If readers
have cards present in them, they are also initialized.
Readers are created with the function.
VReader *vreader_new(VReaderEmul *reader_emul,
VReaderEmulFree reader_emul_free);
The freeFunc is used to free the VReaderEmul * when the reader is
destroyed. The VReaderEmul structure is an opaque structure to the
rest of the code, but defined by the virtual card emulator, which can
use it to store any reader specific state.
Once the reader has been created, it can be added to the front end with the
call:
VReaderStatus vreader_add_reader(VReader *reader);
This function will automatically generate the appropriate new reader
events and add the reader to the list.
To create a new card, the virtual card emulator will call a similar
function.
VCard *vcard_new(VCardEmul *card_emul,
VCardEmulFree card_emul_free);
Like vreader_new, this function takes a virtual card emulator specific
structure which it uses to keep track of the card state.
Once the card is created, it is attached to a card type emulator with the
following function:
VCardStatus vcard_init(VCard *vcard, VCardEmulType type,
const char *flags,
unsigned char *const *certs,
int *cert_len,
VCardKey *key[],
int cert_count);
The vcard is the value returned from vcard_new. The type is the
card type emulator that this card should presented to the guest as.
The flags are card type emulator specific options. The certs,
cert_len, and keys are all arrays of length cert_count. These are the
the same of the parameters xxxx_card_init() accepts.
Finally the card is associated with its reader by the call:
VReaderStatus vreader_insert_card(VReader *vreader, VCard *vcard);
This function, like vreader_add_reader, will take care of any event
notification for the card insert.
VCardEmulError vcard_emul_force_card_remove(VReader *vreader);
Force a card that is present to appear to be removed to the guest, even if
that card is a physical card and is present.
VCardEmulError vcard_emul_force_card_insert(VReader *reader);
Force a card that has been removed by vcard_emul_force_card_remove to be
reinserted from the point of view of the guest. This will only work if the
card is physically present (which is always true fro a soft card).
void vcard_emul_get_atr(Vcard *card, unsigned char *atr, int *atr_len);
Return the virtual ATR for the card. By convention this should be the value
VCARD_ATR_PREFIX(size) followed by several ascii bytes related to this
particular emulator. For instance the NSS emulator returns
{VCARD_ATR_PREFIX(3), 'N', 'S', 'S' }. Do ot return more data then *atr_len;
void vcard_emul_reset(VCard *card, VCardPower power)
Set the state of 'card' to the current power level and reset its internal
state (logout, etc).
-------------------------------------------------------
List of files and their function:
README - This file
card_7816.c - emulate basic 7816 functionality. Parse APDUs.
card_7816.h - apdu and response services definitions.
card_7816t.h - 7816 specific structures, types and definitions.
event.c - event handling code.
event.h - event handling services definitions.
eventt.h - event handling structures and types
vcard.c - handle common virtual card services like creation, destruction, and
applet management.
vcard.h - common virtual card services function definitions.
vcardt.h - comon virtual card types
vreader.c - common virtual reader services.
vreader.h - common virtual reader services definitions.
vreadert.h - comon virtual reader types.
vcard_emul_type.c - manage the card type emulators.
vcard_emul_type.h - definitions for card type emulators.
cac.c - card type emulator for CAC cards
vcard_emul.h - virtual card emulator service definitions.
vcard_emul_nss.c - virtual card emulator implementation for nss.
vscclient.c - socket connection to guest qemu usb driver.
vscard_common.h - common header with the guest qemu usb driver.
mutex.h - header file for machine independent mutexes.
link_test.c - static test to make sure all the symbols are properly defined.

View file

@ -1,58 +0,0 @@
LIVE BLOCK OPERATIONS
=====================
High level description of live block operations. Note these are not
supported for use with the raw format at the moment.
Snapshot live merge
===================
Given a snapshot chain, described in this document in the following
format:
[A] -> [B] -> [C] -> [D]
Where the rightmost object ([D] in the example) described is the current
image which the guest OS has write access to. To the left of it is its base
image, and so on accordingly until the leftmost image, which has no
base.
The snapshot live merge operation transforms such a chain into a
smaller one with fewer elements, such as this transformation relative
to the first example:
[A] -> [D]
Currently only forward merge with target being the active image is
supported, that is, data copy is performed in the right direction with
destination being the rightmost image.
The operation is implemented in QEMU through image streaming facilities.
The basic idea is to execute 'block_stream virtio0' while the guest is
running. Progress can be monitored using 'info block-jobs'. When the
streaming operation completes it raises a QMP event. 'block_stream'
copies data from the backing file(s) into the active image. When finished,
it adjusts the backing file pointer.
The 'base' parameter specifies an image which data need not be streamed from.
This image will be used as the backing file for the active image when the
operation is finished.
In the example above, the command would be:
(qemu) block_stream virtio0 A
Live block copy
===============
To copy an in use image to another destination in the filesystem, one
should create a live snapshot in the desired destination, then stream
into that image. Example:
(qemu) snapshot_blkdev ide0-hd0 /new-path/disk.img qcow2
(qemu) block_stream ide0-hd0

View file

@ -1,296 +0,0 @@
= Migration =
QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations. Saving the state just does
that, saves the state for each device that the guest is running.
Restoring a guest is just the opposite operation: we need to load the
state of each device.
For this to work, QEMU has to be launched with the same arguments the
two times. I.e. it can only restore the state in one guest that has
the same devices that the one it was saved (this last requirement can
be relaxed a bit, but for now we can consider that configuration has
to be exactly the same).
Once that we are able to save/restore a guest, a new functionality is
requested: migration. This means that QEMU is able to start in one
machine and being "migrated" to another machine. I.e. being moved to
another machine.
Next was the "live migration" functionality. This is important
because some guests run with a lot of state (specially RAM), and it
can take a while to move all state from one machine to another. Live
migration allows the guest to continue running while the state is
transferred. Only while the last part of the state is transferred has
the guest to be stopped. Typically the time that the guest is
unresponsive during live migration is the low hundred of milliseconds
(notice that this depends on a lot of things).
=== Types of migration ===
Now that we have talked about live migration, there are several ways
to do migration:
- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using an file descriptor that is
passed to QEMU. QEMU doesn't care how this file descriptor is opened.
All these four migration protocols use the same infrastructure to
save/restore state devices. This infrastructure is shared with the
savevm/loadvm functionality.
=== State Live Migration ===
This is used for RAM and block devices. It is not yet ported to vmstate.
<Fill more information here>
=== What is the common infrastructure ===
QEMU uses a QEMUFile abstraction to be able to do migration. Any type
of migration that wants to use QEMU infrastructure has to create a
QEMUFile with:
QEMUFile *qemu_fopen_ops(void *opaque,
QEMUFilePutBufferFunc *put_buffer,
QEMUFileGetBufferFunc *get_buffer,
QEMUFileCloseFunc *close);
The functions have the following functionality:
This function writes a chunk of data to a file at the given position.
The pos argument can be ignored if the file is only used for
streaming. The handler should try to write all of the data it can.
typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
int64_t pos, int size);
Read a chunk of data from a file at the given position. The pos argument
can be ignored if the file is only be used for streaming. The number of
bytes actually read should be returned.
typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
int64_t pos, int size);
Close a file and return an error code.
typedef int (QEMUFileCloseFunc)(void *opaque);
You can use any internal state that you need using the opaque void *
pointer that is passed to all functions.
The important functions for us are put_buffer()/get_buffer() that
allow to write/read a buffer into the QEMUFile.
=== How to save the state of one device ===
The state of a device is saved using intermediate buffers. There are
some helper functions to assist this saving.
There is a new concept that we have to explain here: device state
version. When we migrate a device, we save/load the state as a series
of fields. Some times, due to bugs or new functionality, we need to
change the state to store more/different information. We use the
version to identify each time that we do a change. Each version is
associated with a series of fields saved. The save_state always saves
the state as the newer version. But load_state sometimes is able to
load state from an older version.
=== Legacy way ===
This way is going to disappear as soon as all current users are ported to VMSTATE.
Each device has to register two functions, one to save the state and
another to load the state back.
int register_savevm(DeviceState *dev,
const char *idstr,
int instance_id,
int version_id,
SaveStateHandler *save_state,
LoadStateHandler *load_state,
void *opaque);
typedef void SaveStateHandler(QEMUFile *f, void *opaque);
typedef int LoadStateHandler(QEMUFile *f, void *opaque, int version_id);
The important functions for the device state format are the save_state
and load_state. Notice that load_state receives a version_id
parameter to know what state format is receiving. save_state doesn't
have a version_id parameter because it always uses the latest version.
=== VMState ===
The legacy way of saving/loading state of the device had the problem
that we have to maintain two functions in sync. If we did one change
in one of them and not in the other, we would get a failed migration.
VMState changed the way that state is saved/loaded. Instead of using
a function to save the state and another to load it, it was changed to
a declarative way of what the state consisted of. Now VMState is able
to interpret that definition to be able to load/save the state. As
the state is declared only once, it can't go out of sync in the
save/load functions.
An example (from hw/input/pckbd.c)
static const VMStateDescription vmstate_kbd = {
.name = "pckbd",
.version_id = 3,
.minimum_version_id = 3,
.fields = (VMStateField[]) {
VMSTATE_UINT8(write_cmd, KBDState),
VMSTATE_UINT8(status, KBDState),
VMSTATE_UINT8(mode, KBDState),
VMSTATE_UINT8(pending, KBDState),
VMSTATE_END_OF_LIST()
}
};
We are declaring the state with name "pckbd".
The version_id is 3, and the fields are 4 uint8_t in a KBDState structure.
We registered this with:
vmstate_register(NULL, 0, &vmstate_kbd, s);
Note: talk about how vmstate <-> qdev interact, and what the instance ids mean.
You can search for VMSTATE_* macros for lots of types used in QEMU in
include/hw/hw.h.
=== More about versions ===
You can see that there are several version fields:
- version_id: the maximum version_id supported by VMState for that device.
- minimum_version_id: the minimum version_id that VMState is able to understand
for that device.
- minimum_version_id_old: For devices that were not able to port to vmstate, we can
assign a function that knows how to read this old state. This field is
ignored if there is no load_state_old handler.
So, VMState is able to read versions from minimum_version_id to
version_id. And the function load_state_old() (if present) is able to
load state from minimum_version_id_old to minimum_version_id. This
function is deprecated and will be removed when no more users are left.
=== Massaging functions ===
Sometimes, it is not enough to be able to save the state directly
from one structure, we need to fill the correct values there. One
example is when we are using kvm. Before saving the cpu state, we
need to ask kvm to copy to QEMU the state that it is using. And the
opposite when we are loading the state, we need a way to tell kvm to
load the state for the cpu that we have just loaded from the QEMUFile.
The functions to do that are inside a vmstate definition, and are called:
- int (*pre_load)(void *opaque);
This function is called before we load the state of one device.
- int (*post_load)(void *opaque, int version_id);
This function is called after we load the state of one device.
- void (*pre_save)(void *opaque);
This function is called before we save the state of one device.
Example: You can look at hpet.c, that uses the three function to
massage the state that is transferred.
If you use memory API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a post_load callback.
Examples of such memory API functions are:
- memory_region_add_subregion()
- memory_region_del_subregion()
- memory_region_set_readonly()
- memory_region_set_enabled()
- memory_region_set_address()
- memory_region_set_alias_offset()
=== Subsections ===
The use of version_id allows to be able to migrate from older versions
to newer versions of a device. But not the other way around. This
makes very complicated to fix bugs in stable branches. If we need to
add anything to the state to fix a bug, we have to disable migration
to older versions that don't have that bug-fix (i.e. a new field).
But sometimes, that bug-fix is only needed sometimes, not always. For
instance, if the device is in the middle of a DMA operation, it is
using a specific functionality, ....
It is impossible to create a way to make migration from any version to
any other version to work. But we can do better than only allowing
migration from older versions to newer ones. For that fields that are
only needed sometimes, we add the idea of subsections. A subsection
is "like" a device vmstate, but with a particularity, it has a Boolean
function that tells if that values are needed to be sent or not. If
this functions returns false, the subsection is not sent.
On the receiving side, if we found a subsection for a device that we
don't understand, we just fail the migration. If we understand all
the subsections, then we load the state with success.
One important note is that the post_load() function is called "after"
loading all subsections, because a newer subsection could change same
value that it uses.
Example:
static bool ide_drive_pio_state_needed(void *opaque)
{
IDEState *s = opaque;
return ((s->status & DRQ_STAT) != 0)
|| (s->bus->error_status & BM_STATUS_PIO_RETRY);
}
const VMStateDescription vmstate_ide_drive_pio_state = {
.name = "ide_drive/pio_state",
.version_id = 1,
.minimum_version_id = 1,
.pre_save = ide_drive_pio_pre_save,
.post_load = ide_drive_pio_post_load,
.fields = (VMStateField[]) {
VMSTATE_INT32(req_nb_sectors, IDEState),
VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
vmstate_info_uint8, uint8_t),
VMSTATE_INT32(cur_io_buffer_offset, IDEState),
VMSTATE_INT32(cur_io_buffer_len, IDEState),
VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
VMSTATE_INT32(elementary_transfer_size, IDEState),
VMSTATE_INT32(packet_transfer_size, IDEState),
VMSTATE_END_OF_LIST()
}
};
const VMStateDescription vmstate_ide_drive = {
.name = "ide_drive",
.version_id = 3,
.minimum_version_id = 0,
.post_load = ide_drive_post_load,
.fields = (VMStateField[]) {
.... several fields ....
VMSTATE_END_OF_LIST()
},
.subsections = (VMStateSubsection []) {
{
.vmsd = &vmstate_ide_drive_pio_state,
.needed = ide_drive_pio_state_needed,
}, {
/* empty */
}
}
};
Here we have a subsection for the pio state. We only need to
save/send this state when we are in the middle of a pio operation
(that is what ide_drive_pio_state_needed() checks). If DRQ_STAT is
not enabled, the values on that fields are garbage and don't need to
be sent.

View file

@ -1,134 +0,0 @@
Copyright (c) 2014 Red Hat Inc.
This work is licensed under the terms of the GNU GPL, version 2 or later. See
the COPYING file in the top-level directory.
This document explains the IOThread feature and how to write code that runs
outside the QEMU global mutex.
The main loop and IOThreads
---------------------------
QEMU is an event-driven program that can do several things at once using an
event loop. The VNC server and the QMP monitor are both processed from the
same event loop, which monitors their file descriptors until they become
readable and then invokes a callback.
The default event loop is called the main loop (see main-loop.c). It is
possible to create additional event loop threads using -object
iothread,id=my-iothread.
Side note: The main loop and IOThread are both event loops but their code is
not shared completely. Sometimes it is useful to remember that although they
are conceptually similar they are currently not interchangeable.
Why IOThreads are useful
------------------------
IOThreads allow the user to control the placement of work. The main loop is a
scalability bottleneck on hosts with many CPUs. Work can be spread across
several IOThreads instead of just one main loop. When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.
The main loop is also deeply associated with the QEMU global mutex, which is a
scalability bottleneck in itself. vCPU threads and the main loop use the QEMU
global mutex to serialize execution of QEMU code. This mutex is necessary
because a lot of QEMU's code historically was not thread-safe.
The fact that all I/O processing is done in a single main loop and that the
QEMU global mutex is contended by all vCPU threads and the main loop explain
why it is desirable to place work into IOThreads.
The experimental virtio-blk data-plane implementation has been benchmarked and
shows these effects:
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf
How to program for IOThreads
----------------------------
The main difference between legacy code and new code that can run in an
IOThread is dealing explicitly with the event loop object, AioContext
(see include/block/aio.h). Code that only works in the main loop
implicitly uses the main loop's AioContext. Code that supports running
in IOThreads must be aware of its AioContext.
AioContext supports the following services:
* File descriptor monitoring (read/write/error on POSIX hosts)
* Event notifiers (inter-thread signalling)
* Timers
* Bottom Halves (BH) deferred callbacks
There are several old APIs that use the main loop AioContext:
* LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
* LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
* LEGACY timer_new_ms() - create a timer
* LEGACY qemu_bh_new() - create a BH
* LEGACY qemu_aio_wait() - run an event loop iteration
Since they implicitly work on the main loop they cannot be used in code that
runs in an IOThread. They might cause a crash or deadlock if called from an
IOThread since the QEMU global mutex is not held.
Instead, use the AioContext functions directly (see include/block/aio.h):
* aio_set_fd_handler() - monitor a file descriptor
* aio_set_event_notifier() - monitor an event notifier
* aio_timer_new() - create a timer
* aio_bh_new() - create a BH
* aio_poll() - run an event loop iteration
The AioContext can be obtained from the IOThread using
iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
Code that takes an AioContext argument works both in IOThreads or the main
loop, depending on which AioContext instance the caller passes in.
How to synchronize with an IOThread
-----------------------------------
AioContext is not thread-safe so some rules must be followed when using file
descriptors, event notifiers, timers, or BHs across threads:
1. AioContext functions can be called safely from file descriptor, event
notifier, timer, or BH callbacks invoked by the AioContext. No locking is
necessary.
2. Other threads wishing to access the AioContext must use
aio_context_acquire()/aio_context_release() for mutual exclusion. Once the
context is acquired no other thread can access it or run event loop iterations
in this AioContext.
aio_context_acquire()/aio_context_release() calls may be nested. This
means you can call them if you're not sure whether #1 applies.
There is currently no lock ordering rule if a thread needs to acquire multiple
AioContexts simultaneously. Therefore, it is only safe for code holding the
QEMU global mutex to acquire other AioContexts.
Side note: the best way to schedule a function call across threads is to create
a BH in the target AioContext beforehand and then call qemu_bh_schedule(). No
acquire/release or locking is needed for the qemu_bh_schedule() call. But be
sure to acquire the AioContext for aio_bh_new() if necessary.
The relationship between AioContext and the block layer
-------------------------------------------------------
The AioContext originates from the QEMU block layer because it provides a
scoped way of running event loop iterations until all work is done. This
feature is used to complete all in-flight block I/O requests (see
bdrv_drain_all()). Nowadays AioContext is a generic event loop that can be
used by any QEMU subsystem.
The block layer has support for AioContext integrated. Each BlockDriverState
is associated with an AioContext using bdrv_set_aio_context() and
bdrv_get_aio_context(). This allows block layer code to process I/O inside the
right AioContext. Other subsystems may wish to follow a similar approach.
Block layer code must therefore expect to run in an IOThread and avoid using
old APIs that implicitly use the main loop. See the "How to program for
IOThreads" above for information on how to do that.
If main loop code such as a QMP function wishes to access a BlockDriverState it
must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure the
IOThread does not run in parallel.
Long-running jobs (usually in the form of coroutines) are best scheduled in the
BlockDriverState's AioContext to avoid the need to acquire/release around each
bdrv_*() call. Be aware that there is currently no mechanism to get notified
when bdrv_set_aio_context() moves this BlockDriverState to a different
AioContext (see bdrv_detach_aio_context()/bdrv_attach_aio_context()), so you
may need to add this if you want to support long-running jobs.

View file

@ -1,102 +0,0 @@
multiseat howto (with some multihead coverage)
==============================================
host side
---------
First you must compile qemu with a user interface supporting
multihead/multiseat and input event routing. Right now this
list includes sdl2 and gtk (both 2+3):
./configure --enable-sdl --with-sdlabi=2.0
or
./configure --enable-gtk
Next put together the qemu command line:
qemu -enable-kvm -usb $memory $disk $whatever \
-display [ sdl | gtk ] \
-vga std \
-device usb-tablet
That is it for the first head, which will use the standard vga, the
standard ps/2 keyboard (implicitly there) and the usb-tablet. Now the
additional switches for the second head:
-device pci-bridge,addr=12.0,chassis_nr=2,id=head.2 \
-device secondary-vga,bus=head.2,addr=02.0,id=video.2 \
-device nec-usb-xhci,bus=head.2,addr=0f.0,id=usb.2 \
-device usb-kbd,bus=usb.2.0,port=1,display=video.2 \
-device usb-tablet,bus=usb.2.0,port=2,display=video.2
This places a pci bridge in slot 12, connects a display adapter and
xhci (usb) controller to the bridge. Then it adds a usb keyboard and
usb mouse, both connected to the xhci and linked to the display.
The "display=video2" sets up the input routing. Any input coming from
the window which belongs to the video.2 display adapter will be routed
to these input devices.
The sdl2 ui will start up with two windows, one for each display
device. The gtk ui will start with a single window and each display
in a separate tab. You can either simply switch tabs to switch heads,
or use the "View / Detach tab" menu item to move one of the displays
to its own window so you can see both display devices side-by-side.
Note on spice: Spice handles multihead just fine. But it can't do
multiseat. For tablet events the event source is sent to the spice
agent. But qemu can't figure it, so it can't do input routing.
Fixing this needs a new or extended input interface between
libspice-server and qemu. For keyboard events it is even worse: The
event source isn't included in the spice protocol, so the wire
protocol must be extended to support this.
guest side
----------
You need a pretty recent linux guest. systemd with loginctl. kernel
3.14+ with CONFIG_DRM_BOCHS enabled. Fedora 20 will do. Must be
fully updated for the new kernel though, i.e. the live iso doesn't cut
it.
Now we'll have to configure the guest. Boot and login. "lspci -vt"
should list the pci bridge with the display adapter and usb controller:
[root@fedora ~]# lspci -vt
-[0000:00]-+-00.0 Intel Corporation 440FX - 82441FX PMC [Natoma]
[ ... ]
\-12.0-[01]--+-02.0 Device 1234:1111
\-0f.0 NEC Corporation USB 3.0 Host Controller
Good. Now lets tell the system that the pci bridge and all devices
below it belong to a separate seat by dropping a file into
/etc/udev/rules.d:
[root@fedora ~]# cat /etc/udev/rules.d/70-qemu-autoseat.rules
SUBSYSTEMS=="pci", DEVPATH=="*/0000:00:12.0", TAG+="seat", ENV{ID_AUTOSEAT}="1"
Reboot. System should come up with two seats. With loginctl you can
check the configuration:
[root@fedora ~]# loginctl list-seats
SEAT
seat0
seat-pci-pci-0000_00_12_0
2 seats listed.
You can use "loginctl seat-status seat-pci-pci-0000_00_12_0" to list
the devices attached to the seat.
Background info is here:
http://www.freedesktop.org/wiki/Software/systemd/multiseat/
Enjoy!
--
Gerd Hoffmann <kraxel@redhat.com>

View file

@ -1,152 +0,0 @@
################################################################
#
# qemu -M q35 creates a bare machine with just the very essential
# chipset devices being present:
#
# 00.0 - Host bridge
# 1f.0 - ISA bridge / LPC
# 1f.2 - SATA (AHCI) controller
# 1f.3 - SMBus controller
#
# This config file documents the other devices and how they are
# created. You can simply use "-readconfig $thisfile" to create
# them all. Here is a overview:
#
# 19.0 - Ethernet controller (not created, our e1000 emulation
# doesn't emulate the ich9 device).
# 1a.* - USB Controller #2 (ehci + uhci companions)
# 1b.0 - HD Audio Controller
# 1c.* - PCI Express Ports
# 1d.* - USB Controller #1 (ehci + uhci companions,
# "qemu -M q35 -usb" creates these too)
# 1e.0 - PCI Bridge
#
[device "ich9-ehci-2"]
driver = "ich9-usb-ehci2"
multifunction = "on"
bus = "pcie.0"
addr = "1a.7"
[device "ich9-uhci-4"]
driver = "ich9-usb-uhci4"
multifunction = "on"
bus = "pcie.0"
addr = "1a.0"
masterbus = "ich9-ehci-2.0"
firstport = "0"
[device "ich9-uhci-5"]
driver = "ich9-usb-uhci5"
multifunction = "on"
bus = "pcie.0"
addr = "1a.1"
masterbus = "ich9-ehci-2.0"
firstport = "2"
[device "ich9-uhci-6"]
driver = "ich9-usb-uhci6"
multifunction = "on"
bus = "pcie.0"
addr = "1a.2"
masterbus = "ich9-ehci-2.0"
firstport = "4"
[device "ich9-hda-audio"]
driver = "ich9-intel-hda"
bus = "pcie.0"
addr = "1b.0"
[device "ich9-pcie-port-1"]
driver = "ioh3420"
multifunction = "on"
bus = "pcie.0"
addr = "1c.0"
port = "1"
chassis = "1"
[device "ich9-pcie-port-2"]
driver = "ioh3420"
multifunction = "on"
bus = "pcie.0"
addr = "1c.1"
port = "2"
chassis = "2"
[device "ich9-pcie-port-3"]
driver = "ioh3420"
multifunction = "on"
bus = "pcie.0"
addr = "1c.2"
port = "3"
chassis = "3"
[device "ich9-pcie-port-4"]
driver = "ioh3420"
multifunction = "on"
bus = "pcie.0"
addr = "1c.3"
port = "4"
chassis = "4"
##
# Example PCIe switch with two downstream ports
#
#[device "pcie-switch-upstream-port-1"]
# driver = "x3130-upstream"
# bus = "ich9-pcie-port-4"
# addr = "00.0"
#
#[device "pcie-switch-downstream-port-1-1"]
# driver = "xio3130-downstream"
# multifunction = "on"
# bus = "pcie-switch-upstream-port-1"
# addr = "00.0"
# port = "1"
# chassis = "5"
#
#[device "pcie-switch-downstream-port-1-2"]
# driver = "xio3130-downstream"
# multifunction = "on"
# bus = "pcie-switch-upstream-port-1"
# addr = "00.1"
# port = "1"
# chassis = "6"
[device "ich9-ehci-1"]
driver = "ich9-usb-ehci1"
multifunction = "on"
bus = "pcie.0"
addr = "1d.7"
[device "ich9-uhci-1"]
driver = "ich9-usb-uhci1"
multifunction = "on"
bus = "pcie.0"
addr = "1d.0"
masterbus = "ich9-ehci-1.0"
firstport = "0"
[device "ich9-uhci-2"]
driver = "ich9-usb-uhci2"
multifunction = "on"
bus = "pcie.0"
addr = "1d.1"
masterbus = "ich9-ehci-1.0"
firstport = "2"
[device "ich9-uhci-3"]
driver = "ich9-usb-uhci3"
multifunction = "on"
bus = "pcie.0"
addr = "1d.2"
masterbus = "ich9-ehci-1.0"
firstport = "4"
[device "ich9-pci-bridge"]
driver = "i82801b11-bridge"
bus = "pcie.0"
addr = "1e.0"

View file

@ -1,590 +0,0 @@
= How to use the QAPI code generator =
QAPI is a native C API within QEMU which provides management-level
functionality to internal/external users. For external
users/processes, this interface is made available by a JSON-based
QEMU Monitor protocol that is provided by the QMP server.
To map QMP-defined interfaces to the native C QAPI implementations,
a JSON-based schema is used to define types and function
signatures, and a set of scripts is used to generate types/signatures,
and marshaling/dispatch code. The QEMU Guest Agent also uses these
scripts, paired with a separate schema, to generate
marshaling/dispatch code for the guest agent server running in the
guest.
This document will describe how the schemas, scripts, and resulting
code are used.
== QMP/Guest agent schema ==
This file defines the types, commands, and events used by QMP. It should
fully describe the interface used by QMP.
This file is designed to be loosely based on JSON although it's technically
executable Python. While dictionaries are used, they are parsed as
OrderedDicts so that ordering is preserved.
There are two basic syntaxes used, type definitions and command definitions.
The first syntax defines a type and is represented by a dictionary. There are
three kinds of user-defined types that are supported: complex types,
enumeration types and union types.
Generally speaking, types definitions should always use CamelCase for the type
names. Command names should be all lower case with words separated by a hyphen.
=== Includes ===
The QAPI schema definitions can be modularized using the 'include' directive:
{ 'include': 'path/to/file.json'}
The directive is evaluated recursively, and include paths are relative to the
file using the directive. Multiple includes of the same file are safe.
=== Complex types ===
A complex type is a dictionary containing a single key whose value is a
dictionary. This corresponds to a struct in C or an Object in JSON. An
example of a complex type is:
{ 'type': 'MyType',
'data': { 'member1': 'str', 'member2': 'int', '*member3': 'str' } }
The use of '*' as a prefix to the name means the member is optional.
The default initialization value of an optional argument should not be changed
between versions of QEMU unless the new default maintains backward
compatibility to the user-visible behavior of the old default.
With proper documentation, this policy still allows some flexibility; for
example, documenting that a default of 0 picks an optimal buffer size allows
one release to declare the optimal size at 512 while another release declares
the optimal size at 4096 - the user-visible behavior is not the bytes used by
the buffer, but the fact that the buffer was optimal size.
On input structures (only mentioned in the 'data' side of a command), changing
from mandatory to optional is safe (older clients will supply the option, and
newer clients can benefit from the default); changing from optional to
mandatory is backwards incompatible (older clients may be omitting the option,
and must continue to work).
On output structures (only mentioned in the 'returns' side of a command),
changing from mandatory to optional is in general unsafe (older clients may be
expecting the field, and could crash if it is missing), although it can be done
if the only way that the optional argument will be omitted is when it is
triggered by the presence of a new input flag to the command that older clients
don't know to send. Changing from optional to mandatory is safe.
A structure that is used in both input and output of various commands
must consider the backwards compatibility constraints of both directions
of use.
A complex type definition can specify another complex type as its base.
In this case, the fields of the base type are included as top-level fields
of the new complex type's dictionary in the QMP wire format. An example
definition is:
{ 'type': 'BlockdevOptionsGenericFormat', 'data': { 'file': 'str' } }
{ 'type': 'BlockdevOptionsGenericCOWFormat',
'base': 'BlockdevOptionsGenericFormat',
'data': { '*backing': 'str' } }
An example BlockdevOptionsGenericCOWFormat object on the wire could use
both fields like this:
{ "file": "/some/place/my-image",
"backing": "/some/place/my-backing-file" }
=== Enumeration types ===
An enumeration type is a dictionary containing a single key whose value is a
list of strings. An example enumeration is:
{ 'enum': 'MyEnum', 'data': [ 'value1', 'value2', 'value3' ] }
=== Union types ===
Union types are used to let the user choose between several different data
types. A union type is defined using a dictionary as explained in the
following paragraphs.
A simple union type defines a mapping from discriminator values to data types
like in this example:
{ 'type': 'FileOptions', 'data': { 'filename': 'str' } }
{ 'type': 'Qcow2Options',
'data': { 'backing-file': 'str', 'lazy-refcounts': 'bool' } }
{ 'union': 'BlockdevOptions',
'data': { 'file': 'FileOptions',
'qcow2': 'Qcow2Options' } }
In the QMP wire format, a simple union is represented by a dictionary that
contains the 'type' field as a discriminator, and a 'data' field that is of the
specified data type corresponding to the discriminator value:
{ "type": "qcow2", "data" : { "backing-file": "/some/place/my-image",
"lazy-refcounts": true } }
A union definition can specify a complex type as its base. In this case, the
fields of the complex type are included as top-level fields of the union
dictionary in the QMP wire format. An example definition is:
{ 'type': 'BlockdevCommonOptions', 'data': { 'readonly': 'bool' } }
{ 'union': 'BlockdevOptions',
'base': 'BlockdevCommonOptions',
'data': { 'raw': 'RawOptions',
'qcow2': 'Qcow2Options' } }
And it looks like this on the wire:
{ "type": "qcow2",
"readonly": false,
"data" : { "backing-file": "/some/place/my-image",
"lazy-refcounts": true } }
Flat union types avoid the nesting on the wire. They are used whenever a
specific field of the base type is declared as the discriminator ('type' is
then no longer generated). The discriminator must be of enumeration type.
The above example can then be modified as follows:
{ 'enum': 'BlockdevDriver', 'data': [ 'raw', 'qcow2' ] }
{ 'type': 'BlockdevCommonOptions',
'data': { 'driver': 'BlockdevDriver', 'readonly': 'bool' } }
{ 'union': 'BlockdevOptions',
'base': 'BlockdevCommonOptions',
'discriminator': 'driver',
'data': { 'raw': 'RawOptions',
'qcow2': 'Qcow2Options' } }
Resulting in this JSON object:
{ "driver": "qcow2",
"readonly": false,
"backing-file": "/some/place/my-image",
"lazy-refcounts": true }
A special type of unions are anonymous unions. They don't form a dictionary in
the wire format but allow the direct use of different types in their place. As
they aren't structured, they don't have any explicit discriminator but use
the (QObject) data type of their value as an implicit discriminator. This means
that they are restricted to using only one discriminator value per QObject
type. For example, you cannot have two different complex types in an anonymous
union, or two different integer types.
Anonymous unions are declared using an empty dictionary as their discriminator.
The discriminator values never appear on the wire, they are only used in the
generated C code. Anonymous unions cannot have a base type.
{ 'union': 'BlockRef',
'discriminator': {},
'data': { 'definition': 'BlockdevOptions',
'reference': 'str' } }
This example allows using both of the following example objects:
{ "file": "my_existing_block_device_id" }
{ "file": { "driver": "file",
"readonly": false,
"filename": "/tmp/mydisk.qcow2" } }
=== Commands ===
Commands are defined by using a list containing three members. The first
member is the command name, the second member is a dictionary containing
arguments, and the third member is the return type.
An example command is:
{ 'command': 'my-command',
'data': { 'arg1': 'str', '*arg2': 'str' },
'returns': 'str' }
=== Events ===
Events are defined with the keyword 'event'. When 'data' is also specified,
additional info will be included in the event. Finally there will be C API
generated in qapi-event.h; when called by QEMU code, a message with timestamp
will be emitted on the wire. If timestamp is -1, it means failure to retrieve
host time.
An example event is:
{ 'event': 'EVENT_C',
'data': { '*a': 'int', 'b': 'str' } }
Resulting in this JSON object:
{ "event": "EVENT_C",
"data": { "b": "test string" },
"timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
== Code generation ==
Schemas are fed into 3 scripts to generate all the code/files that, paired
with the core QAPI libraries, comprise everything required to take JSON
commands read in by a QMP/guest agent server, unmarshal the arguments into
the underlying C types, call into the corresponding C function, and map the
response back to a QMP/guest agent response to be returned to the user.
As an example, we'll use the following schema, which describes a single
complex user-defined type (which will produce a C struct, along with a list
node structure that can be used to chain together a list of such types in
case we want to accept/return a list of this type with a command), and a
command which takes that type as a parameter and returns the same type:
$ cat example-schema.json
{ 'type': 'UserDefOne',
'data': { 'integer': 'int', 'string': 'str' } }
{ 'command': 'my-command',
'data': {'arg1': 'UserDefOne'},
'returns': 'UserDefOne' }
{ 'event': 'MY_EVENT' }
=== scripts/qapi-types.py ===
Used to generate the C types defined by a schema. The following files are
created:
$(prefix)qapi-types.h - C types corresponding to types defined in
the schema you pass in
$(prefix)qapi-types.c - Cleanup functions for the above C types
The $(prefix) is an optional parameter used as a namespace to keep the
generated code from one schema/code-generation separated from others so code
can be generated/used from multiple schemas without clobbering previously
created code.
Example:
$ python scripts/qapi-types.py --output-dir="qapi-generated" \
--prefix="example-" --input-file=example-schema.json
$ cat qapi-generated/example-qapi-types.c
[Uninteresting stuff omitted...]
void qapi_free_UserDefOneList(UserDefOneList *obj)
{
QapiDeallocVisitor *md;
Visitor *v;
if (!obj) {
return;
}
md = qapi_dealloc_visitor_new();
v = qapi_dealloc_get_visitor(md);
visit_type_UserDefOneList(v, &obj, NULL, NULL);
qapi_dealloc_visitor_cleanup(md);
}
void qapi_free_UserDefOne(UserDefOne *obj)
{
QapiDeallocVisitor *md;
Visitor *v;
if (!obj) {
return;
}
md = qapi_dealloc_visitor_new();
v = qapi_dealloc_get_visitor(md);
visit_type_UserDefOne(v, &obj, NULL, NULL);
qapi_dealloc_visitor_cleanup(md);
}
$ cat qapi-generated/example-qapi-types.h
[Uninteresting stuff omitted...]
#ifndef EXAMPLE_QAPI_TYPES_H
#define EXAMPLE_QAPI_TYPES_H
[Builtin types omitted...]
typedef struct UserDefOne UserDefOne;
typedef struct UserDefOneList
{
union {
UserDefOne *value;
uint64_t padding;
};
struct UserDefOneList *next;
} UserDefOneList;
[Functions on builtin types omitted...]
struct UserDefOne
{
int64_t integer;
char *string;
};
void qapi_free_UserDefOneList(UserDefOneList *obj);
void qapi_free_UserDefOne(UserDefOne *obj);
#endif
=== scripts/qapi-visit.py ===
Used to generate the visitor functions used to walk through and convert
a QObject (as provided by QMP) to a native C data structure and
vice-versa, as well as the visitor function used to dealloc a complex
schema-defined C type.
The following files are generated:
$(prefix)qapi-visit.c: visitor function for a particular C type, used
to automagically convert QObjects into the
corresponding C type and vice-versa, as well
as for deallocating memory for an existing C
type
$(prefix)qapi-visit.h: declarations for previously mentioned visitor
functions
Example:
$ python scripts/qapi-visit.py --output-dir="qapi-generated"
--prefix="example-" --input-file=example-schema.json
$ cat qapi-generated/example-qapi-visit.c
[Uninteresting stuff omitted...]
static void visit_type_UserDefOne_fields(Visitor *m, UserDefOne **obj, Error **errp)
{
Error *err = NULL;
visit_type_int(m, &(*obj)->integer, "integer", &err);
if (err) {
goto out;
}
visit_type_str(m, &(*obj)->string, "string", &err);
if (err) {
goto out;
}
out:
error_propagate(errp, err);
}
void visit_type_UserDefOne(Visitor *m, UserDefOne **obj, const char *name, Error **errp)
{
Error *err = NULL;
visit_start_struct(m, (void **)obj, "UserDefOne", name, sizeof(UserDefOne), &err);
if (!err) {
if (*obj) {
visit_type_UserDefOne_fields(m, obj, errp);
}
visit_end_struct(m, &err);
}
error_propagate(errp, err);
}
void visit_type_UserDefOneList(Visitor *m, UserDefOneList **obj, const char *name, Error **errp)
{
Error *err = NULL;
GenericList *i, **prev;
visit_start_list(m, name, &err);
if (err) {
goto out;
}
for (prev = (GenericList **)obj;
!err && (i = visit_next_list(m, prev, &err)) != NULL;
prev = &i) {
UserDefOneList *native_i = (UserDefOneList *)i;
visit_type_UserDefOne(m, &native_i->value, NULL, &err);
}
error_propagate(errp, err);
err = NULL;
visit_end_list(m, &err);
out:
error_propagate(errp, err);
}
$ python scripts/qapi-commands.py --output-dir="qapi-generated" \
--prefix="example-" --input-file=example-schema.json
$ cat qapi-generated/example-qapi-visit.h
[Uninteresting stuff omitted...]
#ifndef EXAMPLE_QAPI_VISIT_H
#define EXAMPLE_QAPI_VISIT_H
[Visitors for builtin types omitted...]
void visit_type_UserDefOne(Visitor *m, UserDefOne **obj, const char *name, Error **errp);
void visit_type_UserDefOneList(Visitor *m, UserDefOneList **obj, const char *name, Error **errp);
#endif
=== scripts/qapi-commands.py ===
Used to generate the marshaling/dispatch functions for the commands defined
in the schema. The following files are generated:
$(prefix)qmp-marshal.c: command marshal/dispatch functions for each
QMP command defined in the schema. Functions
generated by qapi-visit.py are used to
convert QObjects received from the wire into
function parameters, and uses the same
visitor functions to convert native C return
values to QObjects from transmission back
over the wire.
$(prefix)qmp-commands.h: Function prototypes for the QMP commands
specified in the schema.
Example:
$ python scripts/qapi-commands.py --output-dir="qapi-generated"
--prefix="example-" --input-file=example-schema.json
$ cat qapi-generated/example-qmp-marshal.c
[Uninteresting stuff omitted...]
static void qmp_marshal_output_my_command(UserDefOne *ret_in, QObject **ret_out, Error **errp)
{
Error *local_err = NULL;
QmpOutputVisitor *mo = qmp_output_visitor_new();
QapiDeallocVisitor *md;
Visitor *v;
v = qmp_output_get_visitor(mo);
visit_type_UserDefOne(v, &ret_in, "unused", &local_err);
if (local_err) {
goto out;
}
*ret_out = qmp_output_get_qobject(mo);
out:
error_propagate(errp, local_err);
qmp_output_visitor_cleanup(mo);
md = qapi_dealloc_visitor_new();
v = qapi_dealloc_get_visitor(md);
visit_type_UserDefOne(v, &ret_in, "unused", NULL);
qapi_dealloc_visitor_cleanup(md);
}
static void qmp_marshal_input_my_command(QDict *args, QObject **ret, Error **errp)
{
Error *local_err = NULL;
UserDefOne *retval = NULL;
QmpInputVisitor *mi = qmp_input_visitor_new_strict(QOBJECT(args));
QapiDeallocVisitor *md;
Visitor *v;
UserDefOne *arg1 = NULL;
v = qmp_input_get_visitor(mi);
visit_type_UserDefOne(v, &arg1, "arg1", &local_err);
if (local_err) {
goto out;
}
retval = qmp_my_command(arg1, &local_err);
if (local_err) {
goto out;
}
qmp_marshal_output_my_command(retval, ret, &local_err);
out:
error_propagate(errp, local_err);
qmp_input_visitor_cleanup(mi);
md = qapi_dealloc_visitor_new();
v = qapi_dealloc_get_visitor(md);
visit_type_UserDefOne(v, &arg1, "arg1", NULL);
qapi_dealloc_visitor_cleanup(md);
return;
}
static void qmp_init_marshal(void)
{
qmp_register_command("my-command", qmp_marshal_input_my_command, QCO_NO_OPTIONS);
}
qapi_init(qmp_init_marshal);
$ cat qapi-generated/example-qmp-commands.h
[Uninteresting stuff omitted...]
#ifndef EXAMPLE_QMP_COMMANDS_H
#define EXAMPLE_QMP_COMMANDS_H
#include "example-qapi-types.h"
#include "qapi/qmp/qdict.h"
#include "qapi/error.h"
UserDefOne *qmp_my_command(UserDefOne *arg1, Error **errp);
#endif
=== scripts/qapi-event.py ===
Used to generate the event-related C code defined by a schema. The
following files are created:
$(prefix)qapi-event.h - Function prototypes for each event type, plus an
enumeration of all event names
$(prefix)qapi-event.c - Implementation of functions to send an event
Example:
$ python scripts/qapi-event.py --output-dir="qapi-generated"
--prefix="example-" --input-file=example-schema.json
$ cat qapi-generated/example-qapi-event.c
[Uninteresting stuff omitted...]
void qapi_event_send_my_event(Error **errp)
{
QDict *qmp;
Error *local_err = NULL;
QMPEventFuncEmit emit;
emit = qmp_event_get_func_emit();
if (!emit) {
return;
}
qmp = qmp_event_build_dict("MY_EVENT");
emit(EXAMPLE_QAPI_EVENT_MY_EVENT, qmp, &local_err);
error_propagate(errp, local_err);
QDECREF(qmp);
}
const char *EXAMPLE_QAPIEvent_lookup[] = {
"MY_EVENT",
NULL,
};
$ cat qapi-generated/example-qapi-event.h
[Uninteresting stuff omitted...]
#ifndef EXAMPLE_QAPI_EVENT_H
#define EXAMPLE_QAPI_EVENT_H
#include "qapi/error.h"
#include "qapi/qmp/qdict.h"
#include "example-qapi-types.h"
void qapi_event_send_my_event(Error **errp);
extern const char *EXAMPLE_QAPIEvent_lookup[];
typedef enum EXAMPLE_QAPIEvent
{
EXAMPLE_QAPI_EVENT_MY_EVENT = 0,
EXAMPLE_QAPI_EVENT_MAX = 1,
} EXAMPLE_QAPIEvent;
#endif

View file

@ -1,416 +0,0 @@
= How to convert to -device & friends =
=== Specifying Bus and Address on Bus ===
In qdev, each device has a parent bus. Some devices provide one or
more buses for children. You can specify a device's parent bus with
-device parameter bus.
A device typically has a device address on its parent bus. For buses
where this address can be configured, devices provide a bus-specific
property. Examples:
bus property name value format
PCI addr %x.%x (dev.fn, .fn optional)
I2C address %u
SCSI scsi-id %u
IDE unit %u
HDA cad %u
virtio-serial-bus nr %u
ccid-bus slot %u
USB port %d(.%d)* (port.port...)
Example: device i440FX-pcihost is on the root bus, and provides a PCI
bus named pci.0. To put a FOO device into its slot 4, use -device
FOO,bus=/i440FX-pcihost/pci.0,addr=4. The abbreviated form bus=pci.0
also works as long as the bus name is unique.
=== Block Devices ===
A QEMU block device (drive) has a host and a guest part.
In the general case, the guest device is connected to a controller
device. For instance, the IDE controller provides two IDE buses, each
of which can have up to two ide-drive devices, and each ide-drive
device is a guest part, and is connected to a host part.
Except we sometimes lump controller, bus(es) and drive device(s) all
together into a single device. For instance, the ISA floppy
controller is connected to up to two host drives.
The old ways to define block devices define host and guest part
together. Sometimes, they can even define a controller device in
addition to the block device.
The new way keeps the parts separate: you create the host part with
-drive, and guest device(s) with -device.
The various old ways to define drives all boil down to the common form
-drive if=TYPE,bus=BUS,unit=UNIT,OPTS...
TYPE, BUS and UNIT identify the controller device, which of its buses
to use, and the drive's address on that bus. Details depend on TYPE.
Instead of bus=BUS,unit=UNIT, you can also say index=IDX.
In the new way, this becomes something like
-drive if=none,id=DRIVE-ID,HOST-OPTS...
-device DEVNAME,drive=DRIVE-ID,DEV-OPTS...
The old OPTS get split into HOST-OPTS and DEV-OPTS as follows:
* file, format, snapshot, cache, aio, readonly, rerror, werror go into
HOST-OPTS.
* cyls, head, secs and trans go into HOST-OPTS. Future work: they
should go into DEV-OPTS instead.
* serial goes into DEV-OPTS, for devices supporting serial numbers.
For other devices, it goes nowhere.
* media is special. In the old way, it selects disk vs. CD-ROM with
if=ide, if=scsi and if=xen. The new way uses DEVNAME for that.
Additionally, readonly=on goes into HOST-OPTS.
* addr is special, see if=virtio below.
The -device argument differs in detail for each type of drive:
* if=ide
-device DEVNAME,drive=DRIVE-ID,bus=IDE-BUS,unit=UNIT
where DEVNAME is either ide-hd or ide-cd, IDE-BUS identifies an IDE
bus, normally either ide.0 or ide.1, and UNIT is either 0 or 1.
* if=scsi
The old way implicitly creates SCSI controllers as needed. The new
way makes that explicit:
-device lsi53c895a,id=ID
As for all PCI devices, you can add bus=PCI-BUS,addr=DEVFN to
control the PCI device address.
This SCSI controller provides a single SCSI bus, named ID.0. Put a
disk on it:
-device DEVNAME,drive=DRIVE-ID,bus=ID.0,scsi-id=UNIT
where DEVNAME is either scsi-hd, scsi-cd or scsi-generic.
* if=floppy
-global isa-fdc.driveA=DRIVE-ID
-global isa-fdc.driveB=DRIVE-ID
This is -global instead of -device, because the floppy controller is
created automatically, and we want to configure that one, not create
a second one (which isn't possible anyway).
Without any -global isa-fdc,... you get an empty driveA and no
driveB. You can use -nodefaults to suppress the default driveA, see
"Default Devices".
* if=virtio
-device virtio-blk-pci,drive=DRIVE-ID,class=C,vectors=V,ioeventfd=IOEVENTFD
This lets you control PCI device class and MSI-X vectors.
IOEVENTFD controls whether or not ioeventfd is used for virtqueue
notify. It can be set to on (default) or off.
As for all PCI devices, you can add bus=PCI-BUS,addr=DEVFN to
control the PCI device address. This replaces option addr available
with -drive if=virtio.
* if=pflash, if=mtd, if=sd, if=xen are not yet available with -device
For USB devices, the old way is actually different:
-usbdevice disk:format=FMT:FILENAME
Provides much less control than -drive's OPTS... The new way fixes
that:
-device usb-storage,drive=DRIVE-ID,removable=RMB
The removable parameter gives control over the SCSI INQUIRY removable
(RMB) bit. USB thumbdrives usually set removable=on, while USB hard
disks set removable=off.
Bug: usb-storage pretends to be a block device, but it's really a SCSI
controller that can serve only a single device, which it creates
automatically. The automatic creation guesses what kind of guest part
to create from the host part, like -drive if=scsi. Host and guest
part are not cleanly separated.
=== Character Devices ===
A QEMU character device has a host and a guest part.
The old ways to define character devices define host and guest part
together.
The new way keeps the parts separate: you create the host part with
-chardev, and the guest device with -device.
The various old ways to define a character device are all of the
general form
-FOO FOO-OPTS...,LEGACY-CHARDEV
where FOO-OPTS... is specific to -FOO, and the host part
LEGACY-CHARDEV is the same everywhere.
In the new way, this becomes
-chardev HOST-OPTS...,id=CHR-ID
-device DEVNAME,chardev=CHR-ID,DEV-OPTS...
The appropriate DEVNAME depends on the machine type. For type "pc":
* -serial becomes -device isa-serial,iobase=IOADDR,irq=IRQ,index=IDX
This lets you control I/O ports and IRQs.
* -parallel becomes -device isa-parallel,iobase=IOADDR,irq=IRQ,index=IDX
This lets you control I/O ports and IRQs.
* -usbdevice serial:vendorid=VID,productid=PRID becomes
-device usb-serial,vendorid=VID,productid=PRID
* -usbdevice braille doesn't support LEGACY-CHARDEV syntax. It always
uses "braille". With -device, this useful default is gone, so you
have to use something like
-device usb-braille,chardev=braille,vendorid=VID,productid=PRID
-chardev braille,id=braille
* -virtioconsole becomes
-device virtio-serial-pci,class=C,vectors=V,ioeventfd=IOEVENTFD,max_ports=N
-device virtconsole,is_console=NUM,nr=NR,name=NAME
LEGACY-CHARDEV translates to -chardev HOST-OPTS... as follows:
* null becomes -chardev null
* pty, msmouse, braille, stdio likewise
* vc:WIDTHxHEIGHT becomes -chardev vc,width=WIDTH,height=HEIGHT
* vc:<COLS>Cx<ROWS>C becomes -chardev vc,cols=<COLS>,rows=<ROWS>
* con: becomes -chardev console
* COM<NUM> becomes -chardev serial,path=COM<NUM>
* file:FNAME becomes -chardev file,path=FNAME
* pipe:FNAME becomes -chardev pipe,path=FNAME
* tcp:HOST:PORT,OPTS... becomes -chardev socket,host=HOST,port=PORT,OPTS...
* telnet:HOST:PORT,OPTS... becomes
-chardev socket,host=HOST,port=PORT,OPTS...,telnet=on
* udp:HOST:PORT@LOCALADDR:LOCALPORT becomes
-chardev udp,host=HOST,port=PORT,localaddr=LOCALADDR,localport=LOCALPORT
* unix:FNAME becomes -chardev socket,path=FNAME
* /dev/parportN becomes -chardev parport,file=/dev/parportN
* /dev/ppiN likewise
* Any other /dev/FNAME becomes -chardev tty,path=/dev/FNAME
* mon:LEGACY-CHARDEV is special: it multiplexes the monitor onto the
character device defined by LEGACY-CHARDEV. -chardev provides more
general multiplexing instead: you can connect up to four users to a
single host part. You need to pass mux=on to -chardev to enable
switching the input focus.
QEMU uses LEGACY-CHARDEV syntax not just to set up guest devices, but
also in various other places such as -monitor or -net
user,guestfwd=... You can use chardev:CHR-ID in place of
LEGACY-CHARDEV to refer to a host part defined with -chardev.
=== Network Devices ===
Host and guest part of network devices have always been separate.
The old way to define the guest part looks like this:
-net nic,netdev=NET-ID,macaddr=MACADDR,model=MODEL,name=ID,addr=STR,vectors=V
Except for USB it looks like this:
-usbdevice net:netdev=NET-ID,macaddr=MACADDR,name=ID
The new way is -device:
-device DEVNAME,netdev=NET-ID,mac=MACADDR,DEV-OPTS...
DEVNAME equals MODEL, except for virtio you have to name the virtio
device appropriate for the bus (virtio-net-pci for PCI), and for USB
you have to use usb-net.
The old name=ID parameter becomes the usual id=ID with -device.
For PCI devices, you can add bus=PCI-BUS,addr=DEVFN to control the PCI
device address, as usual. The old -net nic provides parameter addr
for that, which is silently ignored when the NIC is not a PCI device.
For virtio-net-pci, you can control whether or not ioeventfd is used for
virtqueue notify by setting ioeventfd= to on or off (default).
-net nic accepts vectors=V for all models, but it's silently ignored
except for virtio-net-pci (model=virtio). With -device, only devices
that support it accept it.
Not all devices are available with -device at this time. All PCI
devices and ne2k_isa are.
Some PCI devices aren't available with -net nic, e.g. i82558a.
To connect to a VLAN instead of an ordinary host part, replace
netdev=NET-ID by vlan=VLAN.
=== Graphics Devices ===
Host and guest part of graphics devices have always been separate.
The old way to define the guest graphics device is -vga VGA. Not all
machines support all -vga options.
The new way is -device. The mapping from -vga argument to -device
depends on the machine type. For machine "pc", it's:
std -device VGA
cirrus -device cirrus-vga
vmware -device vmware-svga
qxl -device qxl-vga
none -nodefaults
disables more than just VGA, see "Default Devices"
As for all PCI devices, you can add bus=PCI-BUS,addr=DEVFN to control
the PCI device address.
-device VGA supports properties bios-offset and bios-size, but they
aren't used with machine type "pc".
For machine "isapc", it's
std -device isa-vga
cirrus not yet available with -device
none -nodefaults
disables more than just VGA, see "Default Devices"
Bug: the new way doesn't work for machine types "pc" and "isapc",
because it violates obscure device initialization ordering
constraints.
=== Audio Devices ===
Host and guest part of audio devices have always been separate.
The old way to define guest audio devices is -soundhw C1,...
The new way is to define each guest audio device separately with
-device.
Map from -soundhw sound card name to -device:
ac97 -device AC97
cs4231a -device cs4231a,iobase=IOADDR,irq=IRQ,dma=DMA
es1370 -device ES1370
gus -device gus,iobase=IOADDR,irq=IRQ,dma=DMA,freq=F
hda -device intel-hda,msi=MSI -device hda-duplex
sb16 -device sb16,iobase=IOADDR,irq=IRQ,dma=DMA,dma16=DMA16,version=V
adlib not yet available with -device
pcspk not yet available with -device
For PCI devices, you can add bus=PCI-BUS,addr=DEVFN to control the PCI
device address, as usual.
=== USB Devices ===
The old way to define a virtual USB device is -usbdevice DRIVER:OPTS...
The new way is -device DEVNAME,DEV-OPTS... Details depend on DRIVER:
* ccid -device usb-ccid
* keyboard -device usb-kbd
* mouse -device usb-mouse
* tablet -device usb-tablet
* wacom-tablet -device usb-wacom-tablet
* host:... See "Host Device Assignment"
* disk:... See "Block Devices"
* serial:... See "Character Devices"
* braille See "Character Devices"
* net:... See "Network Devices"
* bt:... not yet available with -device
=== Watchdog Devices ===
Host and guest part of watchdog devices have always been separate.
The old way to define a guest watchdog device is -watchdog DEVNAME.
The new way is -device DEVNAME. For PCI devices, you can add
bus=PCI-BUS,addr=DEVFN to control the PCI device address, as usual.
=== Host Device Assignment ===
QEMU supports assigning host PCI devices (qemu-kvm only at this time)
and host USB devices.
The old way to assign a host PCI device is
-pcidevice host=ADDR,dma=none,id=ID
The new way is
-device pci-assign,host=ADDR,iommu=IOMMU,id=ID
The old dma=none becomes iommu=off with -device.
The old way to assign a host USB device is
-usbdevice host:auto:BUS.ADDR:VID:PRID
where any of BUS, ADDR, VID, PRID can be the wildcard *.
The new way is
-device usb-host,hostbus=BUS,hostaddr=ADDR,vendorid=VID,productid=PRID
Omitted options match anything, just like the old way's wildcard.
=== Default Devices ===
QEMU creates a number of devices by default, depending on the machine
type.
-device DEVNAME... and global DEVNAME... suppress default devices for
some DEVNAMEs:
default device suppressing DEVNAMEs
CD-ROM ide-cd, ide-drive, scsi-cd
isa-fdc's driveA isa-fdc
parallel isa-parallel
serial isa-serial
VGA VGA, cirrus-vga, vmware-svga
virtioconsole virtio-serial-pci, virtio-serial-s390, virtio-serial
The default NIC is connected to a default part created along with it.
It is *not* suppressed by configuring a NIC with -device (you may call
that a bug). -net and -netdev suppress the default NIC.
-nodefaults suppresses all the default devices mentioned above, plus a
few other things such as default SD-Card drive and default monitor.

View file

@ -1,102 +0,0 @@
; qemupciserial.inf for QEMU, based on MSPORTS.INF
; The driver itself is shipped with Windows (serial.sys). This is
; just a inf file to tell windows which pci id the serial pci card
; emulated by qemu has, and to apply a name tag to it which windows
; will show in the device manager.
; Installing the driver: Go to device manager. You should find a "pci
; serial card" tagged with a yellow question mark. Open properties.
; Pick "update driver". Then "select driver manually". Pick "Ports
; (Com+Lpt)" from the list. Click "Have a disk". Select this file.
; Procedure may vary a bit depending on the windows version.
; This file covers all options: pci-serial, pci-serial-2x, pci-serial-4x
; for both 32 and 64 bit platforms.
[Version]
Signature="$Windows NT$"
Class=MultiFunction
ClassGUID={4d36e971-e325-11ce-bfc1-08002be10318}
Provider=%QEMU%
DriverVer=12/29/2013,1.3.0
[ControlFlags]
ExcludeFromSelect=*
[Manufacturer]
%QEMU%=QEMU,NTx86,NTAMD64
[QEMU.NTx86]
%QEMU-PCI_SERIAL_1_PORT%=ComPort_inst1, PCI\VEN_1B36&DEV_0002
%QEMU-PCI_SERIAL_2_PORT%=ComPort_inst2, PCI\VEN_1B36&DEV_0003
%QEMU-PCI_SERIAL_4_PORT%=ComPort_inst4, PCI\VEN_1B36&DEV_0004
[QEMU.NTAMD64]
%QEMU-PCI_SERIAL_1_PORT%=ComPort_inst1, PCI\VEN_1B36&DEV_0002
%QEMU-PCI_SERIAL_2_PORT%=ComPort_inst2, PCI\VEN_1B36&DEV_0003
%QEMU-PCI_SERIAL_4_PORT%=ComPort_inst4, PCI\VEN_1B36&DEV_0004
[ComPort_inst1]
Include=mf.inf
Needs=MFINSTALL.mf
[ComPort_inst2]
Include=mf.inf
Needs=MFINSTALL.mf
[ComPort_inst4]
Include=mf.inf
Needs=MFINSTALL.mf
[ComPort_inst1.HW]
AddReg=ComPort_inst1.RegHW
[ComPort_inst2.HW]
AddReg=ComPort_inst2.RegHW
[ComPort_inst4.HW]
AddReg=ComPort_inst4.RegHW
[ComPort_inst1.Services]
Include=mf.inf
Needs=MFINSTALL.mf.Services
[ComPort_inst2.Services]
Include=mf.inf
Needs=MFINSTALL.mf.Services
[ComPort_inst4.Services]
Include=mf.inf
Needs=MFINSTALL.mf.Services
[ComPort_inst1.RegHW]
HKR,Child0000,HardwareID,,*PNP0501
HKR,Child0000,VaryingResourceMap,1,00, 00,00,00,00, 08,00,00,00
HKR,Child0000,ResourceMap,1,02
[ComPort_inst2.RegHW]
HKR,Child0000,HardwareID,,*PNP0501
HKR,Child0000,VaryingResourceMap,1,00, 00,00,00,00, 08,00,00,00
HKR,Child0000,ResourceMap,1,02
HKR,Child0001,HardwareID,,*PNP0501
HKR,Child0001,VaryingResourceMap,1,00, 08,00,00,00, 08,00,00,00
HKR,Child0001,ResourceMap,1,02
[ComPort_inst4.RegHW]
HKR,Child0000,HardwareID,,*PNP0501
HKR,Child0000,VaryingResourceMap,1,00, 00,00,00,00, 08,00,00,00
HKR,Child0000,ResourceMap,1,02
HKR,Child0001,HardwareID,,*PNP0501
HKR,Child0001,VaryingResourceMap,1,00, 08,00,00,00, 08,00,00,00
HKR,Child0001,ResourceMap,1,02
HKR,Child0002,HardwareID,,*PNP0501
HKR,Child0002,VaryingResourceMap,1,00, 10,00,00,00, 08,00,00,00
HKR,Child0002,ResourceMap,1,02
HKR,Child0003,HardwareID,,*PNP0501
HKR,Child0003,VaryingResourceMap,1,00, 18,00,00,00, 08,00,00,00
HKR,Child0003,ResourceMap,1,02
[Strings]
QEMU="QEMU"
QEMU-PCI_SERIAL_1_PORT="1x QEMU PCI Serial Card"
QEMU-PCI_SERIAL_2_PORT="2x QEMU PCI Serial Card"
QEMU-PCI_SERIAL_4_PORT="4x QEMU PCI Serial Card"

View file

@ -1,87 +0,0 @@
QEMU Machine Protocol
=====================
Introduction
------------
The QEMU Machine Protocol (QMP) allows applications to operate a
QEMU instance.
QMP is JSON[1] based and features the following:
- Lightweight, text-based, easy to parse data format
- Asynchronous messages support (ie. events)
- Capabilities Negotiation
For detailed information on QMP's usage, please, refer to the following files:
o qmp-spec.txt QEMU Machine Protocol current specification
o qmp-commands.txt QMP supported commands (auto-generated at build-time)
o qmp-events.txt List of available asynchronous events
[1] http://www.json.org
Usage
-----
You can use the -qmp option to enable QMP. For example, the following
makes QMP available on localhost port 4444:
$ qemu [...] -qmp tcp:localhost:4444,server,nowait
However, for more flexibility and to make use of more options, the -mon
command-line option should be used. For instance, the following example
creates one HMP instance (human monitor) on stdio and one QMP instance
on localhost port 4444:
$ qemu [...] -chardev stdio,id=mon0 -mon chardev=mon0,mode=readline \
-chardev socket,id=mon1,host=localhost,port=4444,server,nowait \
-mon chardev=mon1,mode=control,pretty=on
Please, refer to QEMU's manpage for more information.
Simple Testing
--------------
To manually test QMP one can connect with telnet and issue commands by hand:
$ telnet localhost 4444
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
{
"QMP": {
"version": {
"qemu": {
"micro": 50,
"minor": 6,
"major": 1
},
"package": ""
},
"capabilities": [
]
}
}
{ "execute": "qmp_capabilities" }
{
"return": {
}
}
{ "execute": "query-status" }
{
"return": {
"status": "prelaunch",
"singlestep": false,
"running": false
}
}
Please, refer to the qapi-schema.json file for a complete command reference.
QMP wiki page
-------------
http://wiki.qemu-project.org/QMP

View file

@ -1,627 +0,0 @@
QEMU Machine Protocol Events
============================
ACPI_DEVICE_OST
---------------
Emitted when guest executes ACPI _OST method.
- data: ACPIOSTInfo type as described in qapi-schema.json
{ "event": "ACPI_DEVICE_OST",
"data": { "device": "d1", "slot": "0", "slot-type": "DIMM", "source": 1, "status": 0 } }
BALLOON_CHANGE
--------------
Emitted when the guest changes the actual BALLOON level. This
value is equivalent to the 'actual' field return by the
'query-balloon' command
Data:
- "actual": actual level of the guest memory balloon in bytes (json-number)
Example:
{ "event": "BALLOON_CHANGE",
"data": { "actual": 944766976 },
"timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
BLOCK_IMAGE_CORRUPTED
---------------------
Emitted when a disk image is being marked corrupt.
Data:
- "device": Device name (json-string)
- "msg": Informative message (e.g., reason for the corruption) (json-string)
- "offset": If the corruption resulted from an image access, this is the access
offset into the image (json-int)
- "size": If the corruption resulted from an image access, this is the access
size (json-int)
Example:
{ "event": "BLOCK_IMAGE_CORRUPTED",
"data": { "device": "ide0-hd0",
"msg": "Prevented active L1 table overwrite", "offset": 196608,
"size": 65536 },
"timestamp": { "seconds": 1378126126, "microseconds": 966463 } }
BLOCK_IO_ERROR
--------------
Emitted when a disk I/O error occurs.
Data:
- "device": device name (json-string)
- "operation": I/O operation (json-string, "read" or "write")
- "action": action that has been taken, it's one of the following (json-string):
"ignore": error has been ignored
"report": error has been reported to the device
"stop": the VM is going to stop because of the error
Example:
{ "event": "BLOCK_IO_ERROR",
"data": { "device": "ide0-hd1",
"operation": "write",
"action": "stop" },
"timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
Note: If action is "stop", a STOP event will eventually follow the
BLOCK_IO_ERROR event.
BLOCK_JOB_CANCELLED
-------------------
Emitted when a block job has been cancelled.
Data:
- "type": Job type (json-string; "stream" for image streaming
"commit" for block commit)
- "device": Device name (json-string)
- "len": Maximum progress value (json-int)
- "offset": Current progress value (json-int)
On success this is equal to len.
On failure this is less than len.
- "speed": Rate limit, bytes per second (json-int)
Example:
{ "event": "BLOCK_JOB_CANCELLED",
"data": { "type": "stream", "device": "virtio-disk0",
"len": 10737418240, "offset": 134217728,
"speed": 0 },
"timestamp": { "seconds": 1267061043, "microseconds": 959568 } }
BLOCK_JOB_COMPLETED
-------------------
Emitted when a block job has completed.
Data:
- "type": Job type (json-string; "stream" for image streaming
"commit" for block commit)
- "device": Device name (json-string)
- "len": Maximum progress value (json-int)
- "offset": Current progress value (json-int)
On success this is equal to len.
On failure this is less than len.
- "speed": Rate limit, bytes per second (json-int)
- "error": Error message (json-string, optional)
Only present on failure. This field contains a human-readable
error message. There are no semantics other than that streaming
has failed and clients should not try to interpret the error
string.
Example:
{ "event": "BLOCK_JOB_COMPLETED",
"data": { "type": "stream", "device": "virtio-disk0",
"len": 10737418240, "offset": 10737418240,
"speed": 0 },
"timestamp": { "seconds": 1267061043, "microseconds": 959568 } }
BLOCK_JOB_ERROR
---------------
Emitted when a block job encounters an error.
Data:
- "device": device name (json-string)
- "operation": I/O operation (json-string, "read" or "write")
- "action": action that has been taken, it's one of the following (json-string):
"ignore": error has been ignored, the job may fail later
"report": error will be reported and the job canceled
"stop": error caused job to be paused
Example:
{ "event": "BLOCK_JOB_ERROR",
"data": { "device": "ide0-hd1",
"operation": "write",
"action": "stop" },
"timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
BLOCK_JOB_READY
---------------
Emitted when a block job is ready to complete.
Data:
- "type": Job type (json-string; "stream" for image streaming
"commit" for block commit)
- "device": Device name (json-string)
- "len": Maximum progress value (json-int)
- "offset": Current progress value (json-int)
On success this is equal to len.
On failure this is less than len.
- "speed": Rate limit, bytes per second (json-int)
Example:
{ "event": "BLOCK_JOB_READY",
"data": { "device": "drive0", "type": "mirror", "speed": 0,
"len": 2097152, "offset": 2097152 }
"timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
event.
DEVICE_DELETED
--------------
Emitted whenever the device removal completion is acknowledged
by the guest.
At this point, it's safe to reuse the specified device ID.
Device removal can be initiated by the guest or by HMP/QMP commands.
Data:
- "device": device name (json-string, optional)
- "path": device path (json-string)
{ "event": "DEVICE_DELETED",
"data": { "device": "virtio-net-pci-0",
"path": "/machine/peripheral/virtio-net-pci-0" },
"timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
DEVICE_TRAY_MOVED
-----------------
It's emitted whenever the tray of a removable device is moved by the guest
or by HMP/QMP commands.
Data:
- "device": device name (json-string)
- "tray-open": true if the tray has been opened or false if it has been closed
(json-bool)
{ "event": "DEVICE_TRAY_MOVED",
"data": { "device": "ide1-cd0",
"tray-open": true
},
"timestamp": { "seconds": 1265044230, "microseconds": 450486 } }
GUEST_PANICKED
--------------
Emitted when guest OS panic is detected.
Data:
- "action": Action that has been taken (json-string, currently always "pause").
Example:
{ "event": "GUEST_PANICKED",
"data": { "action": "pause" } }
NIC_RX_FILTER_CHANGED
---------------------
The event is emitted once until the query command is executed,
the first event will always be emitted.
Data:
- "name": net client name (json-string)
- "path": device path (json-string)
{ "event": "NIC_RX_FILTER_CHANGED",
"data": { "name": "vnet0",
"path": "/machine/peripheral/vnet0/virtio-backend" },
"timestamp": { "seconds": 1368697518, "microseconds": 326866 } }
}
POWERDOWN
---------
Emitted when the Virtual Machine is powered down through the power
control system, such as via ACPI.
Data: None.
Example:
{ "event": "POWERDOWN",
"timestamp": { "seconds": 1267040730, "microseconds": 682951 } }
QUORUM_FAILURE
--------------
Emitted by the Quorum block driver if it fails to establish a quorum.
Data:
- "reference": device name if defined else node name.
- "sector-num": Number of the first sector of the failed read operation.
- "sectors-count": Failed read operation sector count.
Example:
{ "event": "QUORUM_FAILURE",
"data": { "reference": "usr1", "sector-num": 345435, "sectors-count": 5 },
"timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
QUORUM_REPORT_BAD
-----------------
Emitted to report a corruption of a Quorum file.
Data:
- "error": Error message (json-string, optional)
Only present on failure. This field contains a human-readable
error message. There are no semantics other than that the
block layer reported an error and clients should not try to
interpret the error string.
- "node-name": The graph node name of the block driver state.
- "sector-num": Number of the first sector of the failed read operation.
- "sectors-count": Failed read operation sector count.
Example:
{ "event": "QUORUM_REPORT_BAD",
"data": { "node-name": "1.raw", "sector-num": 345435, "sectors-count": 5 },
"timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
RESET
-----
Emitted when the Virtual Machine is reset.
Data: None.
Example:
{ "event": "RESET",
"timestamp": { "seconds": 1267041653, "microseconds": 9518 } }
RESUME
------
Emitted when the Virtual Machine resumes execution.
Data: None.
Example:
{ "event": "RESUME",
"timestamp": { "seconds": 1271770767, "microseconds": 582542 } }
RTC_CHANGE
----------
Emitted when the guest changes the RTC time.
Data:
- "offset": Offset between base RTC clock (as specified by -rtc base), and
new RTC clock value (json-number)
Example:
{ "event": "RTC_CHANGE",
"data": { "offset": 78 },
"timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
SHUTDOWN
--------
Emitted when the Virtual Machine has shut down, indicating that qemu
is about to exit.
Data: None.
Example:
{ "event": "SHUTDOWN",
"timestamp": { "seconds": 1267040730, "microseconds": 682951 } }
Note: If the command-line option "-no-shutdown" has been specified, a STOP
event will eventually follow the SHUTDOWN event.
SPICE_CONNECTED
---------------
Emitted when a SPICE client connects.
Data:
- "server": Server information (json-object)
- "host": IP address (json-string)
- "port": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
- "client": Client information (json-object)
- "host": IP address (json-string)
- "port": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
Example:
{ "timestamp": {"seconds": 1290688046, "microseconds": 388707},
"event": "SPICE_CONNECTED",
"data": {
"server": { "port": "5920", "family": "ipv4", "host": "127.0.0.1"},
"client": {"port": "52873", "family": "ipv4", "host": "127.0.0.1"}
}}
SPICE_DISCONNECTED
------------------
Emitted when a SPICE client disconnects.
Data:
- "server": Server information (json-object)
- "host": IP address (json-string)
- "port": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
- "client": Client information (json-object)
- "host": IP address (json-string)
- "port": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
Example:
{ "timestamp": {"seconds": 1290688046, "microseconds": 388707},
"event": "SPICE_DISCONNECTED",
"data": {
"server": { "port": "5920", "family": "ipv4", "host": "127.0.0.1"},
"client": {"port": "52873", "family": "ipv4", "host": "127.0.0.1"}
}}
SPICE_INITIALIZED
-----------------
Emitted after initial handshake and authentication takes place (if any)
and the SPICE channel is up and running
Data:
- "server": Server information (json-object)
- "host": IP address (json-string)
- "port": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
- "auth": authentication method (json-string, optional)
- "client": Client information (json-object)
- "host": IP address (json-string)
- "port": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
- "connection-id": spice connection id. All channels with the same id
belong to the same spice session (json-int)
- "channel-type": channel type. "1" is the main control channel, filter for
this one if you want track spice sessions only (json-int)
- "channel-id": channel id. Usually "0", might be different needed when
multiple channels of the same type exist, such as multiple
display channels in a multihead setup (json-int)
- "tls": whevener the channel is encrypted (json-bool)
Example:
{ "timestamp": {"seconds": 1290688046, "microseconds": 417172},
"event": "SPICE_INITIALIZED",
"data": {"server": {"auth": "spice", "port": "5921",
"family": "ipv4", "host": "127.0.0.1"},
"client": {"port": "49004", "family": "ipv4", "channel-type": 3,
"connection-id": 1804289383, "host": "127.0.0.1",
"channel-id": 0, "tls": true}
}}
SPICE_MIGRATE_COMPLETED
-----------------------
Emitted when SPICE migration has completed
Data: None.
Example:
{ "timestamp": {"seconds": 1290688046, "microseconds": 417172},
"event": "SPICE_MIGRATE_COMPLETED" }
STOP
----
Emitted when the Virtual Machine is stopped.
Data: None.
Example:
{ "event": "STOP",
"timestamp": { "seconds": 1267041730, "microseconds": 281295 } }
SUSPEND
-------
Emitted when guest enters S3 state.
Data: None.
Example:
{ "event": "SUSPEND",
"timestamp": { "seconds": 1344456160, "microseconds": 309119 } }
SUSPEND_DISK
------------
Emitted when the guest makes a request to enter S4 state.
Data: None.
Example:
{ "event": "SUSPEND_DISK",
"timestamp": { "seconds": 1344456160, "microseconds": 309119 } }
Note: QEMU shuts down when entering S4 state.
VNC_CONNECTED
-------------
Emitted when a VNC client establishes a connection.
Data:
- "server": Server information (json-object)
- "host": IP address (json-string)
- "service": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
- "auth": authentication method (json-string, optional)
- "client": Client information (json-object)
- "host": IP address (json-string)
- "service": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
Example:
{ "event": "VNC_CONNECTED",
"data": {
"server": { "auth": "sasl", "family": "ipv4",
"service": "5901", "host": "0.0.0.0" },
"client": { "family": "ipv4", "service": "58425",
"host": "127.0.0.1" } },
"timestamp": { "seconds": 1262976601, "microseconds": 975795 } }
Note: This event is emitted before any authentication takes place, thus
the authentication ID is not provided.
VNC_DISCONNECTED
----------------
Emitted when the connection is closed.
Data:
- "server": Server information (json-object)
- "host": IP address (json-string)
- "service": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
- "auth": authentication method (json-string, optional)
- "client": Client information (json-object)
- "host": IP address (json-string)
- "service": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
- "x509_dname": TLS dname (json-string, optional)
- "sasl_username": SASL username (json-string, optional)
Example:
{ "event": "VNC_DISCONNECTED",
"data": {
"server": { "auth": "sasl", "family": "ipv4",
"service": "5901", "host": "0.0.0.0" },
"client": { "family": "ipv4", "service": "58425",
"host": "127.0.0.1", "sasl_username": "luiz" } },
"timestamp": { "seconds": 1262976601, "microseconds": 975795 } }
VNC_INITIALIZED
---------------
Emitted after authentication takes place (if any) and the VNC session is
made active.
Data:
- "server": Server information (json-object)
- "host": IP address (json-string)
- "service": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
- "auth": authentication method (json-string, optional)
- "client": Client information (json-object)
- "host": IP address (json-string)
- "service": port number (json-string)
- "family": address family (json-string, "ipv4" or "ipv6")
- "x509_dname": TLS dname (json-string, optional)
- "sasl_username": SASL username (json-string, optional)
Example:
{ "event": "VNC_INITIALIZED",
"data": {
"server": { "auth": "sasl", "family": "ipv4",
"service": "5901", "host": "0.0.0.0"},
"client": { "family": "ipv4", "service": "46089",
"host": "127.0.0.1", "sasl_username": "luiz" } },
"timestamp": { "seconds": 1263475302, "microseconds": 150772 } }
VSERPORT_CHANGE
---------------
Emitted when the guest opens or closes a virtio-serial port.
Data:
- "id": device identifier of the virtio-serial port (json-string)
- "open": true if the guest has opened the virtio-serial port (json-bool)
Example:
{ "event": "VSERPORT_CHANGE",
"data": { "id": "channel0", "open": true },
"timestamp": { "seconds": 1401385907, "microseconds": 422329 } }
WAKEUP
------
Emitted when the guest has woken up from S3 and is running.
Data: None.
Example:
{ "event": "WAKEUP",
"timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
WATCHDOG
--------
Emitted when the watchdog device's timer is expired.
Data:
- "action": Action that has been taken, it's one of the following (json-string):
"reset", "shutdown", "poweroff", "pause", "debug", or "none"
Example:
{ "event": "WATCHDOG",
"data": { "action": "reset" },
"timestamp": { "seconds": 1267061043, "microseconds": 959568 } }
Note: If action is "reset", "shutdown", or "pause" the WATCHDOG event is
followed respectively by the RESET, SHUTDOWN, or STOP events.

View file

@ -1,273 +0,0 @@
QEMU Machine Protocol Specification
1. Introduction
===============
This document specifies the QEMU Machine Protocol (QMP), a JSON-based protocol
which is available for applications to operate QEMU at the machine-level.
2. Protocol Specification
=========================
This section details the protocol format. For the purpose of this document
"Client" is any application which is using QMP to communicate with QEMU and
"Server" is QEMU itself.
JSON data structures, when mentioned in this document, are always in the
following format:
json-DATA-STRUCTURE-NAME
Where DATA-STRUCTURE-NAME is any valid JSON data structure, as defined by
the JSON standard:
http://www.ietf.org/rfc/rfc4627.txt
For convenience, json-object members and json-array elements mentioned in
this document will be in a certain order. However, in real protocol usage
they can be in ANY order, thus no particular order should be assumed.
2.1 General Definitions
-----------------------
2.1.1 All interactions transmitted by the Server are json-objects, always
terminating with CRLF
2.1.2 All json-objects members are mandatory when not specified otherwise
2.2 Server Greeting
-------------------
Right when connected the Server will issue a greeting message, which signals
that the connection has been successfully established and that the Server is
ready for capabilities negotiation (for more information refer to section
'4. Capabilities Negotiation').
The greeting message format is:
{ "QMP": { "version": json-object, "capabilities": json-array } }
Where,
- The "version" member contains the Server's version information (the format
is the same of the query-version command)
- The "capabilities" member specify the availability of features beyond the
baseline specification
2.3 Issuing Commands
--------------------
The format for command execution is:
{ "execute": json-string, "arguments": json-object, "id": json-value }
Where,
- The "execute" member identifies the command to be executed by the Server
- The "arguments" member is used to pass any arguments required for the
execution of the command, it is optional when no arguments are required
- The "id" member is a transaction identification associated with the
command execution, it is optional and will be part of the response if
provided
2.4 Commands Responses
----------------------
There are two possible responses which the Server will issue as the result
of a command execution: success or error.
2.4.1 success
-------------
The format of a success response is:
{ "return": json-object, "id": json-value }
Where,
- The "return" member contains the command returned data, which is defined
in a per-command basis or an empty json-object if the command does not
return data
- The "id" member contains the transaction identification associated
with the command execution if issued by the Client
2.4.2 error
-----------
The format of an error response is:
{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
Where,
- The "class" member contains the error class name (eg. "GenericError")
- The "desc" member is a human-readable error message. Clients should
not attempt to parse this message.
- The "id" member contains the transaction identification associated with
the command execution if issued by the Client
NOTE: Some errors can occur before the Server is able to read the "id" member,
in these cases the "id" member will not be part of the error response, even
if provided by the client.
2.5 Asynchronous events
-----------------------
As a result of state changes, the Server may send messages unilaterally
to the Client at any time. They are called "asynchronous events".
The format of asynchronous events is:
{ "event": json-string, "data": json-object,
"timestamp": { "seconds": json-number, "microseconds": json-number } }
Where,
- The "event" member contains the event's name
- The "data" member contains event specific data, which is defined in a
per-event basis, it is optional
- The "timestamp" member contains the exact time of when the event occurred
in the Server. It is a fixed json-object with time in seconds and
microseconds
For a listing of supported asynchronous events, please, refer to the
qmp-events.txt file.
3. QMP Examples
===============
This section provides some examples of real QMP usage, in all of them
"C" stands for "Client" and "S" stands for "Server".
3.1 Server greeting
-------------------
S: { "QMP": { "version": { "qemu": { "micro": 50, "minor": 6, "major": 1 },
"package": ""}, "capabilities": []}}
3.2 Simple 'stop' execution
---------------------------
C: { "execute": "stop" }
S: { "return": {} }
3.3 KVM information
-------------------
C: { "execute": "query-kvm", "id": "example" }
S: { "return": { "enabled": true, "present": true }, "id": "example"}
3.4 Parsing error
------------------
C: { "execute": }
S: { "error": { "class": "GenericError", "desc": "Invalid JSON syntax" } }
3.5 Powerdown event
-------------------
S: { "timestamp": { "seconds": 1258551470, "microseconds": 802384 },
"event": "POWERDOWN" }
4. Capabilities Negotiation
----------------------------
When a Client successfully establishes a connection, the Server is in
Capabilities Negotiation mode.
In this mode only the qmp_capabilities command is allowed to run, all
other commands will return the CommandNotFound error. Asynchronous
messages are not delivered either.
Clients should use the qmp_capabilities command to enable capabilities
advertised in the Server's greeting (section '2.2 Server Greeting') they
support.
When the qmp_capabilities command is issued, and if it does not return an
error, the Server enters in Command mode where capabilities changes take
effect, all commands (except qmp_capabilities) are allowed and asynchronous
messages are delivered.
5 Compatibility Considerations
------------------------------
All protocol changes or new features which modify the protocol format in an
incompatible way are disabled by default and will be advertised by the
capabilities array (section '2.2 Server Greeting'). Thus, Clients can check
that array and enable the capabilities they support.
The QMP Server performs a type check on the arguments to a command. It
generates an error if a value does not have the expected type for its
key, or if it does not understand a key that the Client included. The
strictness of the Server catches wrong assumptions of Clients about
the Server's schema. Clients can assume that, when such validation
errors occur, they will be reported before the command generated any
side effect.
However, Clients must not assume any particular:
- Length of json-arrays
- Size of json-objects; in particular, future versions of QEMU may add
new keys and Clients should be able to ignore them.
- Order of json-object members or json-array elements
- Amount of errors generated by a command, that is, new errors can be added
to any existing command in newer versions of the Server
Of course, the Server does guarantee to send valid JSON. But apart from
this, a Client should be "conservative in what they send, and liberal in
what they accept".
6. Downstream extension of QMP
------------------------------
We recommend that downstream consumers of QEMU do *not* modify QMP.
Management tools should be able to support both upstream and downstream
versions of QMP without special logic, and downstream extensions are
inherently at odds with that.
However, we recognize that it is sometimes impossible for downstreams to
avoid modifying QMP. Both upstream and downstream need to take care to
preserve long-term compatibility and interoperability.
To help with that, QMP reserves JSON object member names beginning with
'__' (double underscore) for downstream use ("downstream names"). This
means upstream will never use any downstream names for its commands,
arguments, errors, asynchronous events, and so forth.
Any new names downstream wishes to add must begin with '__'. To
ensure compatibility with other downstreams, it is strongly
recommended that you prefix your downstream names with '__RFQDN_' where
RFQDN is a valid, reverse fully qualified domain name which you
control. For example, a qemu-kvm specific monitor command would be:
(qemu) __org.linux-kvm_enable_irqchip
Downstream must not change the server greeting (section 2.2) other than
to offer additional capabilities. But see below for why even that is
discouraged.
Section '5 Compatibility Considerations' applies to downstream as well
as to upstream, obviously. It follows that downstream must behave
exactly like upstream for any input not containing members with
downstream names ("downstream members"), except it may add members
with downstream names to its output.
Thus, a client should not be able to distinguish downstream from
upstream as long as it doesn't send input with downstream members, and
properly ignores any downstream members in the output it receives.
Advice on downstream modifications:
1. Introducing new commands is okay. If you want to extend an existing
command, consider introducing a new one with the new behaviour
instead.
2. Introducing new asynchronous messages is okay. If you want to extend
an existing message, consider adding a new one instead.
3. Introducing new errors for use in new commands is okay. Adding new
errors to existing commands counts as extension, so 1. applies.
4. New capabilities are strongly discouraged. Capabilities are for
evolving the basic protocol, and multiple diverging basic protocol
dialects are most undesirable.

View file

@ -1,420 +0,0 @@
(RDMA: Remote Direct Memory Access)
RDMA Live Migration Specification, Version # 1
==============================================
Wiki: http://wiki.qemu-project.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
An *exhaustive* paper (2010) shows additional performance details
linked on the QEMU wiki above.
Contents:
=========
* Introduction
* Before running
* Running
* Performance
* RDMA Migration Protocol Description
* Versioning and Capabilities
* QEMUFileRDMA Interface
* Migration of VM's ram
* Error handling
* TODO
Introduction:
=============
RDMA helps make your migration more deterministic under heavy load because
of the significantly lower latency and higher throughput over TCP/IP. This is
because the RDMA I/O architecture reduces the number of interrupts and
data copies by bypassing the host networking stack. In particular, a TCP-based
migration, under certain types of memory-bound workloads, may take a more
unpredicatable amount of time to complete the migration if the amount of
memory tracked during each live migration iteration round cannot keep pace
with the rate of dirty memory produced by the workload.
RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
over Converged Ethernet) as well as Infiniband-based. This implementation of
migration using RDMA is capable of using both technologies because of
the use of the OpenFabrics OFED software stack that abstracts out the
programming model irrespective of the underlying hardware.
Refer to openfabrics.org or your respective RDMA hardware vendor for
an understanding on how to verify that you have the OFED software stack
installed in your environment. You should be able to successfully link
against the "librdmacm" and "libibverbs" libraries and development headers
for a working build of QEMU to run successfully using RDMA Migration.
BEFORE RUNNING:
===============
Use of RDMA during migration requires pinning and registering memory
with the hardware. This means that memory must be physically resident
before the hardware can transmit that memory to another machine.
If this is not acceptable for your application or product, then the use
of RDMA migration may in fact be harmful to co-located VMs or other
software on the machine if there is not sufficient memory available to
relocate the entire footprint of the virtual machine. If so, then the
use of RDMA is discouraged and it is recommended to use standard TCP migration.
Experimental: Next, decide if you want dynamic page registration.
For example, if you have an 8GB RAM virtual machine, but only 1GB
is in active use, then enabling this feature will cause all 8GB to
be pinned and resident in memory. This feature mostly affects the
bulk-phase round of the migration and can be enabled for extremely
high-performance RDMA hardware using the following command:
QEMU Monitor Command:
$ migrate_set_capability rdma-pin-all on # disabled by default
Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.
On the other hand, this will also significantly speed up the bulk round
of the migration, which can greatly reduce the "total" time of your migration.
Example performance of this using an idle VM in the previous example
can be found in the "Performance" section.
Note: for very large virtual machines (hundreds of GBs), pinning all
*all* of the memory of your virtual machine in the kernel is very expensive
may extend the initial bulk iteration time by many seconds,
and thus extending the total migration time. However, this will not
affect the determinism or predictability of your migration you will
still gain from the benefits of advanced pinning with RDMA.
RUNNING:
========
First, set the migration speed to match your hardware's capabilities:
QEMU Monitor Command:
$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
Next, on the destination machine, add the following to the QEMU command line:
qemu ..... -incoming rdma:host:port
Finally, perform the actual migration on the source machine:
QEMU Monitor Command:
$ migrate -d rdma:host:port
PERFORMANCE
===========
Here is a brief summary of total migration time and downtime using RDMA:
Using a 40gbps infiniband link performing a worst-case stress test,
using an 8GB RAM virtual machine:
Using the following command:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep
1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.
EFFECTS of memory registration on bulk phase round:
For example, in the same 8GB RAM example with all 8GB of memory in
active use and the VM itself is completely idle using the same 40 gbps
infiniband link:
1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.
Enabling this feature does *not* have any measurable affect on
migration *downtime*. This is because, without this feature, all of the
memory will have already been registered already in advance during
the bulk round and does not need to be re-registered during the successive
iteration rounds.
RDMA Protocol Description:
==========================
Migration with RDMA is separated into two parts:
1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)
"Everything else" is transmitted using a formal
protocol now, consisting of infiniband SEND messages.
An infiniband SEND message is the standard ibverbs
message used by applications of infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause notifications
to be posted to the completion queue (CQ) on the
infiniband receiver side, whereas RDMA messages (used
for VM's ram) do not (to behave like an actual DMA).
Messages in infiniband require two things:
1. registration of the memory that will be transmitted
2. (SEND only) work requests to be posted on both
sides of the network before the actual transmission
can occur.
RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.
(Memory is not released from pinning until the migration
completes, given that RDMA migrations are very fast.)
SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a control transport for migration of device state.
To begin the migration, the initial connection setup is
as follows (migration-rdma.c):
1. Receiver and Sender are started (command line or libvirt):
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver accept()
6. Check versioning and capabilities (described later)
At this point, we define a control channel on top of SEND messages
which is described by a formal protocol. Each SEND message has a
header portion and a data portion (but together are transmitted
as a single SEND message).
Header:
* Length (of the data portion, uint32, network byte order)
* Type (what command to perform, uint32, network byte order)
* Repeat (Number of commands in data portion, same type only)
The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself
so that the protocol is compatible against multiple versions of QEMU.
Version #1 requires that all server implementations of the protocol must
check this field and register all requests found in the array of commands located
in the data portion and return an equal number of results in the response.
The maximum number of repeats is hard-coded to 4096. This is a conservative
limit based on the maximum size of a SEND message along with empirical
observations on the maximum future benefit of simultaneous page registrations.
The 'type' field has 12 different command values:
1. Unused
2. Error (sent to the source during bad things)
3. Ready (control-channel is available)
4. QEMU File (for sending non-live device state)
5. RAM Blocks request (used right after connection setup)
6. RAM Blocks result (used right after connection setup)
7. Compress page (zap zero page and skip registration)
8. Register request (dynamic chunk registration)
9. Register result ('rkey' to be used by sender)
10. Register finished (registration for current iteration finished)
11. Unregister request (unpin previously registered memory)
12. Unregister finished (confirmation that unpin completed)
A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
one command, then the 'repeat' field will be greater than 1.
After connection setup, message 5 & 6 are used to exchange ram block
information and optionally pin all the memory if requested by the user.
After ram block exchange is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values:
Logically:
qemu_rdma_exchange_recv(header, expected command type)
1. We transmit a READY command to let the sender know that
we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received matches the one we expected.
qemu_rdma_exchange_send(header, data, optional response header & data):
1. Block on the CQ event channel waiting for a READY command
from the receiver to tell us that the receiver
is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
(that we have not yet transmitted), let's post an RQ
work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
unblock us and we immediately post a RQ work request
to replace the one we just used up.
4. Now, we can actually post the work request to SEND
the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
we block again and wait for that response using the additional
work request we previously posted. (This is used to carry
'Register result' commands #6 back to the sender which
hold the rkey need to perform RDMA. Note that the virtual address
corresponding to this rkey was already exchanged at the beginning
of the connection (described below).
All of the remaining command types (not including 'ready')
described above all use the aformentioned two functions to do the hard work:
1. After connection setup, RAMBlock information is exchanged using
this protocol before the actual migration begins. This information includes
a description of each RAMBlock on the server side as well as the virtual addresses
and lengths of each RAMBlock. This is used by the client to determine the
start and stop locations of chunks and how to register them dynamically
before performing the RDMA operations.
2. During runtime, once a 'chunk' becomes full of pages ready to
be sent with RDMA, the registration commands are used to ask the
other side to register the memory for this chunk and respond
with the result (rkey) of the registration.
3. Also, the QEMUFile interfaces also call these functions (described below)
when transmitting non-live state, such as devices or to send
its own protocol information during the migration process.
4. Finally, zero pages are only checked if a page has not yet been registered
using chunk registration (or not checked at all and unconditionally
written if chunk registration is disabled. This is accomplished using
the "Compress" command listed above. If the page *has* been registered
then we check the entire chunk for zero. Only if the entire chunk is
zero, then we send a compress command to zap the page on the other side.
Versioning and Capabilities
===========================
Current version of the protocol is version #1.
The same version applies to both for protocol traffic and capabilities
negotiation. (i.e. There is only one version number that is referred to
by all communication).
librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.
Header:
* Version (protocol version validated before send/recv occurs),
uint32, network byte order
* Flags (bitwise OR of each capability),
uint32, network byte order
There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
is only 192 bytes per the Infiniband specification, so it's not
very useful for data anyway. This structure needs to remain small.
This private data area is a convenient place to check for protocol
versioning because the user does not need to register memory to
transmit a few bytes of version information.
This is also a convenient place to negotiate capabilities
(like dynamic page registration).
If the version is invalid, we throw an error.
If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.
Currently there is only one capability in Version #1: dynamic page registration
Finally: Negotiation happens with the Flags field: If the primary-VM
sets a flag, but the destination does not support this capability, it
will return a zero-bit for that flag and the primary-VM will understand
that as not being an available capability and will thus disable that
capability on the primary-VM side.
QEMUFileRDMA Interface:
=======================
QEMUFileRDMA introduces a couple of new functions:
1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
These two functions are very short and simply use the protocol
describe above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.
Finally, how do we handoff the actual bytes to get_buffer()?
Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from control-channel's SEND
messages in memory.
Each time we receive a complete "QEMU File" control-channel
message, the bytes from SEND are copied into a small local holding area.
Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.
If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.
Migration of VM's ram:
====================
At the beginning of the migration, (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration of main memory. This description includes
a list of all the RAMBlocks, their offsets and lengths, virtual
addresses and possibly includes pre-registered RDMA keys in case dynamic
page registration was disabled on the server-side, otherwise not.
Main memory is not migrated with the aforementioned protocol,
but is instead migrated with normal RDMA Write operations.
Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.
When a chunk is full (or a flush() occurs), the memory backed by
the chunk is registered with librdmacm is pinned in memory on
both sides using the aforementioned protocol.
After pinning, an RDMA Write is generated and transmitted
for the entire chunk.
Chunks are also transmitted in batches: This means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.
Error-handling:
===============
Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode in which
we use for RDMA migration.
If a *single* message fails,
the decision is to abort the migration entirely and
cleanup all the RDMA descriptors and unregister all
the memory.
After cleanup, the Virtual Machine is returned to normal
operation the same way that would happen if the TCP
socket is broken during a non-RDMA based migration.
TODO:
=====
1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
are not compatible with infinband memory pinning and will result in
an aborted migration (but with the source VM left unaffected).
2. Use of the recent /proc/<pid>/pagemap would likely speed up
the use of KSM and ballooning while using RDMA.
3. Also, some form of balloon-device usage tracking would also
help alleviate some issues.
4. Use LRU to provide more fine-grained direction of UNREGISTER
requests for unpinning memory in an overcommitted environment.
5. Expose UNREGISTER support to the user by way of workload-specific
hints about application behavior.

View file

@ -1,24 +0,0 @@
QEMU<->ACPI BIOS CPU hotplug interface
--------------------------------------
QEMU supports CPU hotplug via ACPI. This document
describes the interface between QEMU and the ACPI BIOS.
ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
-----------------------------------------
Generic ACPI GPE block. Bit 2 (GPE.2) used to notify CPU
hot-add/remove event to ACPI BIOS, via SCI interrupt.
CPU present bitmap for:
ICH9-LPC (IO port 0x0cd8-0xcf7, 1-byte access)
PIIX-PM (IO port 0xaf00-0xaf1f, 1-byte access)
---------------------------------------------------------------
One bit per CPU. Bit position reflects corresponding CPU APIC ID.
Read-only.
CPU hot-add/remove notification:
-----------------------------------------------------
QEMU sets/clears corresponding CPU bit on hot-add/remove event.
CPU present map read by ACPI BIOS GPE.2 handler to notify OS of CPU
hot-(un)plug events.

View file

@ -1,44 +0,0 @@
QEMU<->ACPI BIOS memory hotplug interface
--------------------------------------
ACPI BIOS GPE.3 handler is dedicated for notifying OS about memory hot-add
events.
Memory hot-plug interface (IO port 0xa00-0xa17, 1-4 byte access):
---------------------------------------------------------------
0xa00:
read access:
[0x0-0x3] Lo part of memory device phys address
[0x4-0x7] Hi part of memory device phys address
[0x8-0xb] Lo part of memory device size in bytes
[0xc-0xf] Hi part of memory device size in bytes
[0x10-0x13] Memory device proximity domain
[0x14] Memory device status fields
bits:
0: Device is enabled and may be used by guest
1: Device insert event, used to distinguish device for which
no device check event to OSPM was issued.
It's valid only when bit 1 is set.
2-7: reserved and should be ignored by OSPM
[0x15-0x17] reserved
write access:
[0x0-0x3] Memory device slot selector, selects active memory device.
All following accesses to other registers in 0xa00-0xa17
region will read/store data from/to selected memory device.
[0x4-0x7] OST event code reported by OSPM
[0x8-0xb] OST status code reported by OSPM
[0xc-0x13] reserved, writes into it are ignored
[0x14] Memory device control fields
bits:
0: reserved, OSPM must clear it before writing to register
1: if set to 1 clears device insert event, set by OSPM
after it has emitted device check event for the
selected memory device
2-7: reserved, OSPM must clear them before writing to register
Selecting memory device slot beyond present range has no effect on platform:
- write accesses to memory hot-plug registers not documented above are
ignored
- read accesses to memory hot-plug registers not documented above return
all bits set to 1.

View file

@ -1,45 +0,0 @@
QEMU<->ACPI BIOS PCI hotplug interface
--------------------------------------
QEMU supports PCI hotplug via ACPI, for PCI bus 0. This document
describes the interface between QEMU and the ACPI BIOS.
ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
-----------------------------------------
Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject
event to ACPI BIOS, via SCI interrupt.
PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access):
---------------------------------------------------------------
Slot injection notification pending. One bit per slot.
Read by ACPI BIOS GPE.1 handler to notify OS of injection
events. Read-only.
PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
-----------------------------------------------------
Slot removal notification pending. One bit per slot.
Read by ACPI BIOS GPE.1 handler to notify OS of removal
events. Read-only.
PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
----------------------------------------
Write: Used by ACPI BIOS _EJ0 method to request device removal.
One bit per slot.
Read: Hotplug features register. Used by platform to identify features
available. Current base feature set (no bits set):
- Read-only "up" register @0xae00, 4-byte access, bit per slot
- Read-only "down" register @0xae04, 4-byte access, bit per slot
- Read/write "eject" register @0xae08, 4-byte access,
write: bit per slot eject, read: hotplug feature set
- Read-only hotplug capable register @0xae0c, 4-byte access, bit per slot
PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
-----------------------------------------------
Used by ACPI BIOS _RMV method to indicate removability status to OS. One
bit per slot. Read-only

View file

@ -1,96 +0,0 @@
Device Specification for Inter-VM shared memory device
------------------------------------------------------
The Inter-VM shared memory device is designed to share a region of memory to
userspace in multiple virtual guests. The memory region does not belong to any
guest, but is a POSIX memory object on the host. Optionally, the device may
support sending interrupts to other guests sharing the same memory region.
The Inter-VM PCI device
-----------------------
*BARs*
The device supports three BARs. BAR0 is a 1 Kbyte MMIO region to support
registers. BAR1 is used for MSI-X when it is enabled in the device. BAR2 is
used to map the shared memory object from the host. The size of BAR2 is
specified when the guest is started and must be a power of 2 in size.
*Registers*
The device currently supports 4 registers of 32-bits each. Registers
are used for synchronization between guests sharing the same memory object when
interrupts are supported (this requires using the shared memory server).
The server assigns each VM an ID number and sends this ID number to the QEMU
process when the guest starts.
enum ivshmem_registers {
IntrMask = 0,
IntrStatus = 4,
IVPosition = 8,
Doorbell = 12
};
The first two registers are the interrupt mask and status registers. Mask and
status are only used with pin-based interrupts. They are unused with MSI
interrupts.
Status Register: The status register is set to 1 when an interrupt occurs.
Mask Register: The mask register is bitwise ANDed with the interrupt status
and the result will raise an interrupt if it is non-zero. However, since 1 is
the only value the status will be set to, it is only the first bit of the mask
that has any effect. Therefore interrupts can be masked by setting the first
bit to 0 and unmasked by setting the first bit to 1.
IVPosition Register: The IVPosition register is read-only and reports the
guest's ID number. The guest IDs are non-negative integers. When using the
server, since the server is a separate process, the VM ID will only be set when
the device is ready (shared memory is received from the server and accessible via
the device). If the device is not ready, the IVPosition will return -1.
Applications should ensure that they have a valid VM ID before accessing the
shared memory.
Doorbell Register: To interrupt another guest, a guest must write to the
Doorbell register. The doorbell register is 32-bits, logically divided into
two 16-bit fields. The high 16-bits are the guest ID to interrupt and the low
16-bits are the interrupt vector to trigger. The semantics of the value
written to the doorbell depends on whether the device is using MSI or a regular
pin-based interrupt. In short, MSI uses vectors while regular interrupts set the
status register.
Regular Interrupts
If regular interrupts are used (due to either a guest not supporting MSI or the
user specifying not to use them on startup) then the value written to the lower
16-bits of the Doorbell register results is arbitrary and will trigger an
interrupt in the destination guest.
Message Signalled Interrupts
A ivshmem device may support multiple MSI vectors. If so, the lower 16-bits
written to the Doorbell register must be between 0 and the maximum number of
vectors the guest supports. The lower 16 bits written to the doorbell is the
MSI vector that will be raised in the destination guest. The number of MSI
vectors is configurable but it is set when the VM is started.
The important thing to remember with MSI is that it is only a signal, no status
is set (since MSI interrupts are not shared). All information other than the
interrupt itself should be communicated via the shared memory region. Devices
supporting multiple MSI vectors can use different vectors to indicate different
events have occurred. The semantics of interrupt vectors are left to the
user's discretion.
Usage in the Guest
------------------
The shared memory device is intended to be used with the provided UIO driver.
Very little configuration is needed. The guest should map BAR0 to access the
registers (an array of 32-bit ints allows simple writing) and map BAR2 to
access the shared memory region itself. The size of the shared memory region
is specified when the guest (or shared memory server) is started. A guest may
map the whole shared memory region or only part of it.

View file

@ -1,50 +0,0 @@
PCI IDs for qemu
================
Red Hat, Inc. donates a part of its device ID range to qemu, to be used for
virtual devices. The vendor IDs are 1af4 (formerly Qumranet ID) and 1b36.
Contact Gerd Hoffmann <kraxel@redhat.com> to get a device ID assigned
for your devices.
1af4 vendor ID
--------------
The 1000 -> 10ff device ID range is used as follows for virtio-pci devices.
Note that this allocation separate from the virtio device IDs, which are
maintained as part of the virtio specification.
1af4:1000 network device
1af4:1001 block device
1af4:1002 balloon device
1af4:1003 console device
1af4:1004 SCSI host bus adapter device
1af4:1005 entropy generator device
1af4:1009 9p filesystem device
1af4:10f0 Available for experimental usage without registration. Must get
to official ID when the code leaves the test lab (i.e. when seeking
1af4:10ff upstream merge or shipping a distro/product) to avoid conflicts.
1af4:1100 Used as PCI Subsystem ID for existing hardware devices emulated
by qemu.
1af4:1110 ivshmem device (shared memory, docs/specs/ivshmem_device_spec.txt)
All other device IDs are reserved.
1b36 vendor ID
--------------
The 0000 -> 00ff device ID range is used as follows for QEMU-specific
PCI devices (other than virtio):
1b36:0001 PCI-PCI bridge
1b36:0002 PCI serial port (16550A) adapter (docs/specs/pci-serial.txt)
1b36:0003 PCI Dual-port 16550A adapter (docs/specs/pci-serial.txt)
1b36:0004 PCI Quad-port 16550A adapter (docs/specs/pci-serial.txt)
All these devices are documented in docs/specs.
The 0100 device ID is used for the QXL video card device.

View file

@ -1,34 +0,0 @@
QEMU pci serial devices
=======================
There is one single-port variant and two muliport-variants. Linux
guests out-of-the box with all cards. There is a Windows inf file
(docs/qemupciserial.inf) to setup the single-port card in Windows
guests.
single-port card
----------------
Name: pci-serial
PCI ID: 1b36:0002
PCI Region 0:
IO bar, 8 bytes long, with the 16550 uart mapped to it.
Interrupt is wired to pin A.
multiport cards
---------------
Name: pci-serial-2x
PCI ID: 1b36:0003
Name: pci-serial-4x
PCI ID: 1b36:0004
PCI Region 0:
IO bar, with two/four 16550 uart mapped after each other.
The first is at offset 0, second at offset 8, ...
Interrupt is wired to pin A.

View file

@ -1,26 +0,0 @@
pci-test is a device used for testing low level IO
device implements up to two BARs: BAR0 and BAR1.
Each BAR can be memory or IO. Guests must detect
BAR type and act accordingly.
Each BAR size is up to 4K bytes.
Each BAR starts with the following header:
typedef struct PCITestDevHdr {
uint8_t test; <- write-only, starts a given test number
uint8_t width_type; <- read-only, type and width of access for a given test.
1,2,4 for byte,word or long write.
any other value if test not supported on this BAR
uint8_t pad0[2];
uint32_t offset; <- read-only, offset in this BAR for a given test
uint32_t data; <- read-only, data to use for a given test
uint32_t count; <- for debugging. number of writes detected.
uint8_t name[]; <- for debugging. 0-terminated ASCII string.
} PCITestDevHdr;
All registers are little endian.
device is expected to always implement tests 0 to N on each BAR, and to add new
tests with higher numbers. In this way a guest can scan test numbers until it
detects an access type that it does not support on this BAR, then stop.

View file

@ -1,78 +0,0 @@
When used with the "pseries" machine type, QEMU-system-ppc64 implements
a set of hypervisor calls using a subset of the server "PAPR" specification
(IBM internal at this point), which is also what IBM's proprietary hypervisor
adheres too.
The subset is selected based on the requirements of Linux as a guest.
In addition to those calls, we have added our own private hypervisor
calls which are mostly used as a private interface between the firmware
running in the guest and QEMU.
All those hypercalls start at hcall number 0xf000 which correspond
to a implementation specific range in PAPR.
- H_RTAS (0xf000)
RTAS is a set of runtime services generally provided by the firmware
inside the guest to the operating system. It predates the existence
of hypervisors (it was originally an extension to Open Firmware) and
is still used by PAPR to provide various services that aren't performance
sensitive.
We currently implement the RTAS services in QEMU itself. The actual RTAS
"firmware" blob in the guest is a small stub of a few instructions which
calls our private H_RTAS hypervisor call to pass the RTAS calls to QEMU.
Arguments:
r3 : H_RTAS (0xf000)
r4 : Guest physical address of RTAS parameter block
Returns:
H_SUCCESS : Successfully called the RTAS function (RTAS result
will have been stored in the parameter block)
H_PARAMETER : Unknown token
- H_LOGICAL_MEMOP (0xf001)
When the guest runs in "real mode" (in powerpc lingua this means
with MMU disabled, ie guest effective == guest physical), it only
has access to a subset of memory and no IOs.
PAPR provides a set of hypervisor calls to perform cachable or
non-cachable accesses to any guest physical addresses that the
guest can use in order to access IO devices while in real mode.
This is typically used by the firmware running in the guest.
However, doing a hypercall for each access is extremely inefficient
(even more so when running KVM) when accessing the frame buffer. In
that case, things like scrolling become unusably slow.
This hypercall allows the guest to request a "memory op" to be applied
to memory. The supported memory ops at this point are to copy a range
of memory (supports overlap of source and destination) and XOR which
is used by our SLOF firmware to invert the screen.
Arguments:
r3: H_LOGICAL_MEMOP (0xf001)
r4: Guest physical address of destination
r5: Guest physical address of source
r6: Individual element size
0 = 1 byte
1 = 2 bytes
2 = 4 bytes
3 = 8 bytes
r7: Number of elements
r8: Operation
0 = copy
1 = xor
Returns:
H_SUCCESS : Success
H_PARAMETER : Invalid argument

View file

@ -1,39 +0,0 @@
PVPANIC DEVICE
==============
pvpanic device is a simulated ISA device, through which a guest panic
event is sent to qemu, and a QMP event is generated. This allows
management apps (e.g. libvirt) to be notified and respond to the event.
The management app has the option of waiting for GUEST_PANICKED events,
and/or polling for guest-panicked RunState, to learn when the pvpanic
device has fired a panic event.
ISA Interface
-------------
pvpanic exposes a single I/O port, by default 0x505. On read, the bits
recognized by the device are set. Software should ignore bits it doesn't
recognize. On write, the bits not recognized by the device are ignored.
Software should set only bits both itself and the device recognize.
Currently, only bit 0 is recognized, setting it indicates a guest panic
has happened.
ACPI Interface
--------------
pvpanic device is defined with ACPI ID "QEMU0001". Custom methods:
RDPT: To determine whether guest panic notification is supported.
Arguments: None
Return: Returns a byte, bit 0 set to indicate guest panic
notification is supported. Other bits are reserved and
should be ignored.
WRPT: To send a guest panic event
Arguments: Arg0 is a byte, with bit 0 set to indicate guest panic has
happened. Other bits are reserved and should be cleared.
Return: None
The ACPI device will automatically refer to the right port in case it
is modified.

View file

@ -1,362 +0,0 @@
== General ==
A qcow2 image file is organized in units of constant size, which are called
(host) clusters. A cluster is the unit in which all allocations are done,
both for actual guest data and for image metadata.
Likewise, the virtual disk as seen by the guest is divided into (guest)
clusters of the same size.
All numbers in qcow2 are stored in Big Endian byte order.
== Header ==
The first cluster of a qcow2 image contains the file header:
Byte 0 - 3: magic
QCOW magic string ("QFI\xfb")
4 - 7: version
Version number (valid values are 2 and 3)
8 - 15: backing_file_offset
Offset into the image file at which the backing file name
is stored (NB: The string is not null terminated). 0 if the
image doesn't have a backing file.
16 - 19: backing_file_size
Length of the backing file name in bytes. Must not be
longer than 1023 bytes. Undefined if the image doesn't have
a backing file.
20 - 23: cluster_bits
Number of bits that are used for addressing an offset
within a cluster (1 << cluster_bits is the cluster size).
Must not be less than 9 (i.e. 512 byte clusters).
Note: qemu as of today has an implementation limit of 2 MB
as the maximum cluster size and won't be able to open images
with larger cluster sizes.
24 - 31: size
Virtual disk size in bytes
32 - 35: crypt_method
0 for no encryption
1 for AES encryption
36 - 39: l1_size
Number of entries in the active L1 table
40 - 47: l1_table_offset
Offset into the image file at which the active L1 table
starts. Must be aligned to a cluster boundary.
48 - 55: refcount_table_offset
Offset into the image file at which the refcount table
starts. Must be aligned to a cluster boundary.
56 - 59: refcount_table_clusters
Number of clusters that the refcount table occupies
60 - 63: nb_snapshots
Number of snapshots contained in the image
64 - 71: snapshots_offset
Offset into the image file at which the snapshot table
starts. Must be aligned to a cluster boundary.
If the version is 3 or higher, the header has the following additional fields.
For version 2, the values are assumed to be zero, unless specified otherwise
in the description of a field.
72 - 79: incompatible_features
Bitmask of incompatible features. An implementation must
fail to open an image if an unknown bit is set.
Bit 0: Dirty bit. If this bit is set then refcounts
may be inconsistent, make sure to scan L1/L2
tables to repair refcounts before accessing the
image.
Bit 1: Corrupt bit. If this bit is set then any data
structure may be corrupt and the image must not
be written to (unless for regaining
consistency).
Bits 2-63: Reserved (set to 0)
80 - 87: compatible_features
Bitmask of compatible features. An implementation can
safely ignore any unknown bits that are set.
Bit 0: Lazy refcounts bit. If this bit is set then
lazy refcount updates can be used. This means
marking the image file dirty and postponing
refcount metadata updates.
Bits 1-63: Reserved (set to 0)
88 - 95: autoclear_features
Bitmask of auto-clear features. An implementation may only
write to an image with unknown auto-clear features if it
clears the respective bits from this field first.
Bits 0-63: Reserved (set to 0)
96 - 99: refcount_order
Describes the width of a reference count block entry (width
in bits: refcount_bits = 1 << refcount_order). For version 2
images, the order is always assumed to be 4
(i.e. refcount_bits = 16).
This value may not exceed 6 (i.e. refcount_bits = 64).
100 - 103: header_length
Length of the header structure in bytes. For version 2
images, the length is always assumed to be 72 bytes.
Directly after the image header, optional sections called header extensions can
be stored. Each extension has a structure like the following:
Byte 0 - 3: Header extension type:
0x00000000 - End of the header extension area
0xE2792ACA - Backing file format name
0x6803f857 - Feature name table
other - Unknown header extension, can be safely
ignored
4 - 7: Length of the header extension data
8 - n: Header extension data
n - m: Padding to round up the header extension size to the next
multiple of 8.
Unless stated otherwise, each header extension type shall appear at most once
in the same image.
If the image has a backing file then the backing file name should be stored in
the remaining space between the end of the header extension area and the end of
the first cluster. It is not allowed to store other data here, so that an
implementation can safely modify the header and add extensions without harming
data of compatible features that it doesn't support. Compatible features that
need space for additional data can use a header extension.
== Feature name table ==
The feature name table is an optional header extension that contains the name
for features used by the image. It can be used by applications that don't know
the respective feature (e.g. because the feature was introduced only later) to
display a useful error message.
The number of entries in the feature name table is determined by the length of
the header extension data. Each entry look like this:
Byte 0: Type of feature (select feature bitmap)
0: Incompatible feature
1: Compatible feature
2: Autoclear feature
1: Bit number within the selected feature bitmap (valid
values: 0-63)
2 - 47: Feature name (padded with zeros, but not necessarily null
terminated if it has full length)
== Host cluster management ==
qcow2 manages the allocation of host clusters by maintaining a reference count
for each host cluster. A refcount of 0 means that the cluster is free, 1 means
that it is used, and >= 2 means that it is used and any write access must
perform a COW (copy on write) operation.
The refcounts are managed in a two-level table. The first level is called
refcount table and has a variable size (which is stored in the header). The
refcount table can cover multiple clusters, however it needs to be contiguous
in the image file.
It contains pointers to the second level structures which are called refcount
blocks and are exactly one cluster in size.
Given a offset into the image file, the refcount of its cluster can be obtained
as follows:
refcount_block_entries = (cluster_size * 8 / refcount_bits)
refcount_block_index = (offset / cluster_size) % refcount_block_entries
refcount_table_index = (offset / cluster_size) / refcount_block_entries
refcount_block = load_cluster(refcount_table[refcount_table_index]);
return refcount_block[refcount_block_index];
Refcount table entry:
Bit 0 - 8: Reserved (set to 0)
9 - 63: Bits 9-63 of the offset into the image file at which the
refcount block starts. Must be aligned to a cluster
boundary.
If this is 0, the corresponding refcount block has not yet
been allocated. All refcounts managed by this refcount block
are 0.
Refcount block entry (x = refcount_bits - 1):
Bit 0 - x: Reference count of the cluster. If refcount_bits implies a
sub-byte width, note that bit 0 means the least significant
bit in this context.
== Cluster mapping ==
Just as for refcounts, qcow2 uses a two-level structure for the mapping of
guest clusters to host clusters. They are called L1 and L2 table.
The L1 table has a variable size (stored in the header) and may use multiple
clusters, however it must be contiguous in the image file. L2 tables are
exactly one cluster in size.
Given a offset into the virtual disk, the offset into the image file can be
obtained as follows:
l2_entries = (cluster_size / sizeof(uint64_t))
l2_index = (offset / cluster_size) % l2_entries
l1_index = (offset / cluster_size) / l2_entries
l2_table = load_cluster(l1_table[l1_index]);
cluster_offset = l2_table[l2_index];
return cluster_offset + (offset % cluster_size)
L1 table entry:
Bit 0 - 8: Reserved (set to 0)
9 - 55: Bits 9-55 of the offset into the image file at which the L2
table starts. Must be aligned to a cluster boundary. If the
offset is 0, the L2 table and all clusters described by this
L2 table are unallocated.
56 - 62: Reserved (set to 0)
63: 0 for an L2 table that is unused or requires COW, 1 if its
refcount is exactly one. This information is only accurate
in the active L1 table.
L2 table entry:
Bit 0 - 61: Cluster descriptor
62: 0 for standard clusters
1 for compressed clusters
63: 0 for a cluster that is unused or requires COW, 1 if its
refcount is exactly one. This information is only accurate
in L2 tables that are reachable from the the active L1
table.
Standard Cluster Descriptor:
Bit 0: If set to 1, the cluster reads as all zeros. The host
cluster offset can be used to describe a preallocation,
but it won't be used for reading data from this cluster,
nor is data read from the backing file if the cluster is
unallocated.
With version 2, this is always 0.
1 - 8: Reserved (set to 0)
9 - 55: Bits 9-55 of host cluster offset. Must be aligned to a
cluster boundary. If the offset is 0, the cluster is
unallocated.
56 - 61: Reserved (set to 0)
Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
Bit 0 - x: Host cluster offset. This is usually _not_ aligned to a
cluster boundary!
x+1 - 61: Compressed size of the images in sectors of 512 bytes
If a cluster is unallocated, read requests shall read the data from the backing
file (except if bit 0 in the Standard Cluster Descriptor is set). If there is
no backing file or the backing file is smaller than the image, they shall read
zeros for all parts that are not covered by the backing file.
== Snapshots ==
qcow2 supports internal snapshots. Their basic principle of operation is to
switch the active L1 table, so that a different set of host clusters are
exposed to the guest.
When creating a snapshot, the L1 table should be copied and the refcount of all
L2 tables and clusters reachable from this L1 table must be increased, so that
a write causes a COW and isn't visible in other snapshots.
When loading a snapshot, bit 63 of all entries in the new active L1 table and
all L2 tables referenced by it must be reconstructed from the refcount table
as it doesn't need to be accurate in inactive L1 tables.
A directory of all snapshots is stored in the snapshot table, a contiguous area
in the image file, whose starting offset and length are given by the header
fields snapshots_offset and nb_snapshots. The entries of the snapshot table
have variable length, depending on the length of ID, name and extra data.
Snapshot table entry:
Byte 0 - 7: Offset into the image file at which the L1 table for the
snapshot starts. Must be aligned to a cluster boundary.
8 - 11: Number of entries in the L1 table of the snapshots
12 - 13: Length of the unique ID string describing the snapshot
14 - 15: Length of the name of the snapshot
16 - 19: Time at which the snapshot was taken in seconds since the
Epoch
20 - 23: Subsecond part of the time at which the snapshot was taken
in nanoseconds
24 - 31: Time that the guest was running until the snapshot was
taken in nanoseconds
32 - 35: Size of the VM state in bytes. 0 if no VM state is saved.
If there is VM state, it starts at the first cluster
described by first L1 table entry that doesn't describe a
regular guest cluster (i.e. VM state is stored like guest
disk content, except that it is stored at offsets that are
larger than the virtual disk presented to the guest)
36 - 39: Size of extra data in the table entry (used for future
extensions of the format)
variable: Extra data for future extensions. Unknown fields must be
ignored. Currently defined are (offset relative to snapshot
table entry):
Byte 40 - 47: Size of the VM state in bytes. 0 if no VM
state is saved. If this field is present,
the 32-bit value in bytes 32-35 is ignored.
Byte 48 - 55: Virtual disk size of the snapshot in bytes
Version 3 images must include extra data at least up to
byte 55.
variable: Unique ID string for the snapshot (not null terminated)
variable: Name of the snapshot (not null terminated)
variable: Padding to round up the snapshot table entry size to the
next multiple of 8.

View file

@ -1,138 +0,0 @@
=Specification=
The file format looks like this:
+----------+----------+----------+-----+
| cluster0 | cluster1 | cluster2 | ... |
+----------+----------+----------+-----+
The first cluster begins with the '''header'''. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''. L1 and L2 tables are composed of one or more contiguous clusters.
Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster.
All fields are little-endian.
==Header==
Header {
uint32_t magic; /* QED\0 */
uint32_t cluster_size; /* in bytes */
uint32_t table_size; /* for L1 and L2 tables, in clusters */
uint32_t header_size; /* in clusters */
uint64_t features; /* format feature bits */
uint64_t compat_features; /* compat feature bits */
uint64_t autoclear_features; /* self-resetting feature bits */
uint64_t l1_table_offset; /* in bytes */
uint64_t image_size; /* total logical image size, in bytes */
/* if (features & QED_F_BACKING_FILE) */
uint32_t backing_filename_offset; /* in bytes from start of header */
uint32_t backing_filename_size; /* in bytes */
}
Field descriptions:
* ''cluster_size'' must be a power of 2 in range [2^12, 2^26].
* ''table_size'' must be a power of 2 in range [1, 16].
* ''header_size'' is the number of clusters used by the header and any additional information stored before regular clusters.
* ''features'', ''compat_features'', and ''autoclear_features'' are file format extension bitmaps. They work as follows:
** An image with unknown ''features'' bits enabled must not be opened. File format changes that are not backwards-compatible must use ''features'' bits.
** An image with unknown ''compat_features'' bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format.
** An image with unknown ''autoclear_features'' bits enable can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled again by a new program later.
* ''l1_table_offset'' is the offset of the first byte of the L1 table in the image file and must be a multiple of ''cluster_size''.
* ''image_size'' is the block device size seen by the guest and must be a multiple of 512 bytes.
* ''backing_filename_offset'' and ''backing_filename_size'' describe a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints. The string must be stored within the first ''header_size'' clusters. The backing filename may be an absolute path or relative to the image file.
Feature bits:
* QED_F_BACKING_FILE = 0x01. The image uses a backing file.
* QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.
* QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing files are never detected as an image format if they happen to contain magic constants.
There are currently no defined ''compat_features'' or ''autoclear_features'' bits.
Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.
==Tables==
Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
Table {
uint64_t offsets[TABLE_NOFFSETS];
}
The tables are organized as follows:
+----------+
| L1 table |
+----------+
,------' | '------.
+----------+ | +----------+
| L2 table | ... | L2 table |
+----------+ +----------+
,------' | '------.
+----------+ | +----------+
| Data | ... | Data |
+----------+ +----------+
A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
L1, L2, and data cluster offsets must be aligned to header.cluster_size. The following offsets have special meanings:
===L2 table offsets===
* 0 - unallocated. The L2 table is not yet allocated.
===Data cluster offsets===
* 0 - unallocated. The data cluster is not yet allocated.
* 1 - zero. The data cluster contents are all zeroes and no cluster is allocated.
Future format extensions may wish to store per-offset information. The least significant 12 bits of an offset are reserved for this purpose and must be set to zero. Image files with cluster_size > 2^12 will have more unused bits which should also be zeroed.
===Unallocated L2 tables and data clusters===
Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.
Writes to an unallocated area cause a new data clusters to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing file (or zeroes if no backing file) and the data being written.
===Zero data clusters===
Zero data clusters are a space-efficient way of storing zeroed regions of the image.
Reads to a zero data cluster produce zeroes. Note that the difference between an unallocated and a zero data cluster is that zero data clusters stop the reading of contents from the backing file.
Writes to a zero data cluster cause a new data cluster to be allocated. The new data cluster is populated with zeroes and the data being written.
===Logical offset translation===
Logical offsets are translated into cluster offsets as follows:
table_bits table_bits cluster_bits
<--------> <--------> <--------------->
+----------+----------+-----------------+
| L1 index | L2 index | byte offset |
+----------+----------+-----------------+
Structure of a logical offset
offset_mask = ~(cluster_size - 1) # mask for the image file byte offset
def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
l2_offset = l1_table[l1_index]
l2_table = load_table(l2_offset)
cluster_offset = l2_table[l2_index] & offset_mask
return cluster_offset + byte_offset
==Consistency checking==
This section is informational and included to provide background on the use of the QED_F_NEED_CHECK ''features'' bit.
The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.
Consistency check includes the following invariants:
# Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked if it has no references.
# Offsets must be within the image file size and must be ''cluster_size'' aligned.
# Table offsets must at least ''table_size'' * ''cluster_size'' bytes from the end of the image file so that there is space for the entire table.
The consistency check process starts by from ''l1_table_offset'' and scans all L2 tables. After the check completes with no other errors besides leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.

View file

@ -1,81 +0,0 @@
QEMU Standard VGA
=================
Exists in two variants, for isa and pci.
command line switches:
-vga std [ picks isa for -M isapc, otherwise pci ]
-device VGA [ pci variant ]
-device isa-vga [ isa variant ]
-device secondary-vga [ legacy-free pci variant ]
PCI spec
--------
Applies to the pci variant only for obvious reasons.
PCI ID: 1234:1111
PCI Region 0:
Framebuffer memory, 16 MB in size (by default).
Size is tunable via vga_mem_mb property.
PCI Region 1:
Reserved (so we have the option to make the framebuffer bar 64bit).
PCI Region 2:
MMIO bar, 4096 bytes in size (qemu 1.3+)
PCI ROM Region:
Holds the vgabios (qemu 0.14+).
The legacy-free variant has no ROM and has PCI_CLASS_DISPLAY_OTHER
instead of PCI_CLASS_DISPLAY_VGA.
IO ports used
-------------
Doesn't apply to the legacy-free pci variant, use the MMIO bar instead.
03c0 - 03df : standard vga ports
01ce : bochs vbe interface index port
01cf : bochs vbe interface data port (x86 only)
01d0 : bochs vbe interface data port
Memory regions used
-------------------
0xe0000000 : Framebuffer memory, isa variant only.
The pci variant used to mirror the framebuffer bar here, qemu 0.14+
stops doing that (except when in -M pc-$old compat mode).
MMIO area spec
--------------
Likewise applies to the pci variant only for obvious reasons.
0000 - 03ff : reserved, for possible virtio extension.
0400 - 041f : vga ioports (0x3c0 -> 0x3df), remapped 1:1.
word access is supported, bytes are written
in little endia order (aka index port first),
so indexed registers can be updated with a
single mmio write (and thus only one vmexit).
0500 - 0515 : bochs dispi interface registers, mapped flat
without index/data ports. Use (index << 1)
as offset for (16bit) register access.
0600 - 0607 : qemu extended registers. qemu 2.2+ only.
The pci revision is 2 (or greater) when
these registers are present. The registers
are 32bit.
0600 : qemu extended register region size, in bytes.
0604 : framebuffer endianness register.
- 0xbebebebe indicates big endian.
- 0x1e1e1e1e indicates little endian.

View file

@ -1,266 +0,0 @@
Vhost-user Protocol
===================
Copyright (c) 2014 Virtual Open Systems Sarl.
This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.
===================
This protocol is aiming to complement the ioctl interface used to control the
vhost implementation in the Linux kernel. It implements the control plane needed
to establish virtqueue sharing with a user space process on the same host. It
uses communication over a Unix domain socket to share file descriptors in the
ancillary data of the message.
The protocol defines 2 sides of the communication, master and slave. Master is
the application that shares its virtqueues, in our case QEMU. Slave is the
consumer of the virtqueues.
In the current implementation QEMU is the Master, and the Slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.
Master and slave can be either a client (i.e. connecting) or server (listening)
in the socket communication.
Message Specification
---------------------
Note that all numbers are in the machine native byte order. A vhost-user message
consists of 3 header fields and a payload:
------------------------------------
| request | flags | size | payload |
------------------------------------
* Request: 32-bit type of the request
* Flags: 32-bit bit field:
- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag - needs to be sent on each reply from the slave
* Size - 32-bit size of the payload
Depending on the request type, payload can be:
* A single 64-bit integer
-------
| u64 |
-------
u64: a 64-bit unsigned integer
* A vring state description
---------------
| index | num |
---------------
Index: a 32-bit index
Num: a 32-bit number
* A vring address description
--------------------------------------------------------------
| index | flags | size | descriptor | used | available | log |
--------------------------------------------------------------
Index: a 32-bit vring index
Flags: a 32-bit vring flags
Descriptor: a 64-bit user address of the vring descriptor table
Used: a 64-bit user address of the vring used ring
Available: a 64-bit user address of the vring available ring
Log: a 64-bit guest address for logging
* Memory regions description
---------------------------------------------------
| num regions | padding | region0 | ... | region7 |
---------------------------------------------------
Num regions: a 32-bit number of regions
Padding: 32-bit
A region is:
-----------------------------------------------------
| guest address | size | user address | mmap offset |
-----------------------------------------------------
Guest address: a 64-bit guest address of the region
Size: a 64-bit size
User address: a 64-bit user address
mmap offset: 64-bit offset where region starts in the mapped memory
In QEMU the vhost-user message is implemented with the following struct:
typedef struct VhostUserMsg {
VhostUserRequest request;
uint32_t flags;
uint32_t size;
union {
uint64_t u64;
struct vhost_vring_state state;
struct vhost_vring_addr addr;
VhostUserMemory memory;
};
} QEMU_PACKED VhostUserMsg;
Communication
-------------
The protocol for vhost-user is based on the existing implementation of vhost
for the Linux Kernel. Most messages that can be sent via the Unix domain socket
implementing vhost-user have an equivalent ioctl to the kernel implementation.
The communication consists of master sending message requests and slave sending
message replies. Most of the requests don't require replies. Here is a list of
the ones that do:
* VHOST_GET_FEATURES
* VHOST_GET_VRING_BASE
There are several messages that the master sends with file descriptors passed
in the ancillary data:
* VHOST_SET_MEM_TABLE
* VHOST_SET_LOG_FD
* VHOST_SET_VRING_KICK
* VHOST_SET_VRING_CALL
* VHOST_SET_VRING_ERR
If Master is unable to send the full message or receives a wrong reply it will
close the connection. An optional reconnection mechanism can be implemented.
Message types
-------------
* VHOST_USER_GET_FEATURES
Id: 1
Equivalent ioctl: VHOST_GET_FEATURES
Master payload: N/A
Slave payload: u64
Get from the underlying vhost implementation the features bitmask.
* VHOST_USER_SET_FEATURES
Id: 2
Ioctl: VHOST_SET_FEATURES
Master payload: u64
Enable features in the underlying vhost implementation using a bitmask.
* VHOST_USER_SET_OWNER
Id: 3
Equivalent ioctl: VHOST_SET_OWNER
Master payload: N/A
Issued when a new connection is established. It sets the current Master
as an owner of the session. This can be used on the Slave as a
"session start" flag.
* VHOST_USER_RESET_OWNER
Id: 4
Equivalent ioctl: VHOST_RESET_OWNER
Master payload: N/A
Issued when a new connection is about to be closed. The Master will no
longer own this connection (and will usually close it).
* VHOST_USER_SET_MEM_TABLE
Id: 5
Equivalent ioctl: VHOST_SET_MEM_TABLE
Master payload: memory regions description
Sets the memory map regions on the slave so it can translate the vring
addresses. In the ancillary data there is an array of file descriptors
for each memory mapped region. The size and ordering of the fds matches
the number and ordering of memory regions.
* VHOST_USER_SET_LOG_BASE
Id: 6
Equivalent ioctl: VHOST_SET_LOG_BASE
Master payload: u64
Sets the logging base address.
* VHOST_USER_SET_LOG_FD
Id: 7
Equivalent ioctl: VHOST_SET_LOG_FD
Master payload: N/A
Sets the logging file descriptor, which is passed as ancillary data.
* VHOST_USER_SET_VRING_NUM
Id: 8
Equivalent ioctl: VHOST_SET_VRING_NUM
Master payload: vring state description
Sets the number of vrings for this owner.
* VHOST_USER_SET_VRING_ADDR
Id: 9
Equivalent ioctl: VHOST_SET_VRING_ADDR
Master payload: vring address description
Slave payload: N/A
Sets the addresses of the different aspects of the vring.
* VHOST_USER_SET_VRING_BASE
Id: 10
Equivalent ioctl: VHOST_SET_VRING_BASE
Master payload: vring state description
Sets the base offset in the available vring.
* VHOST_USER_GET_VRING_BASE
Id: 11
Equivalent ioctl: VHOST_USER_GET_VRING_BASE
Master payload: vring state description
Slave payload: vring state description
Get the available vring base offset.
* VHOST_USER_SET_VRING_KICK
Id: 12
Equivalent ioctl: VHOST_SET_VRING_KICK
Master payload: u64
Set the event file descriptor for adding buffers to the vring. It
is passed in the ancillary data.
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data. This signals that polling should be used
instead of waiting for a kick.
* VHOST_USER_SET_VRING_CALL
Id: 13
Equivalent ioctl: VHOST_SET_VRING_CALL
Master payload: u64
Set the event file descriptor to signal when buffers are used. It
is passed in the ancillary data.
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data. This signals that polling will be used
instead of waiting for the call.
* VHOST_USER_SET_VRING_ERR
Id: 14
Equivalent ioctl: VHOST_SET_VRING_ERR
Master payload: u64
Set the event file descriptor to signal when error occurs. It
is passed in the ancillary data.
Bits (0-7) of the payload contain the vring index. Bit 8 is the
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data.

View file

@ -1,92 +0,0 @@
General Description
===================
This document describes VMWare PVSCSI device interface specification.
Created by Dmitry Fleytman (dmitry@daynix.com), Daynix Computing LTD.
Based on source code of PVSCSI Linux driver from kernel 3.0.4
PVSCSI Device Interface Overview
================================
The interface is based on memory area shared between hypervisor and VM.
Memory area is obtained by driver as device IO memory resource of
PVSCSI_MEM_SPACE_SIZE length.
The shared memory consists of registers area and rings area.
The registers area is used to raise hypervisor interrupts and issue device
commands. The rings area is used to transfer data descriptors and SCSI
commands from VM to hypervisor and to transfer messages produced by
hypervisor to VM. Data itself is transferred via virtual scatter-gather DMA.
PVSCSI Device Registers
=======================
The length of the registers area is 1 page (PVSCSI_MEM_SPACE_COMMAND_NUM_PAGES).
The structure of the registers area is described by the PVSCSIRegOffset enum.
There are registers to issue device command (with optional short data),
issue device interrupt, control interrupts masking.
PVSCSI Device Rings
===================
There are three rings in shared memory:
1. Request ring (struct PVSCSIRingReqDesc *req_ring)
- ring for OS to device requests
2. Completion ring (struct PVSCSIRingCmpDesc *cmp_ring)
- ring for device request completions
3. Message ring (struct PVSCSIRingMsgDesc *msg_ring)
- ring for messages from device.
This ring is optional and the guest might not configure it.
There is a control area (struct PVSCSIRingsState *rings_state) used to control
rings operation.
PVSCSI Device to Host Interrupts
================================
There are following interrupt types supported by PVSCSI device:
1. Completion interrupts (completion ring notifications):
PVSCSI_INTR_CMPL_0
PVSCSI_INTR_CMPL_1
2. Message interrupts (message ring notifications):
PVSCSI_INTR_MSG_0
PVSCSI_INTR_MSG_1
Interrupts are controlled via PVSCSI_REG_OFFSET_INTR_MASK register
Bit set means interrupt enabled, bit cleared - disabled
Interrupt modes supported are legacy, MSI and MSI-X
In case of legacy interrupts, register PVSCSI_REG_OFFSET_INTR_STATUS
is used to check which interrupt has arrived. Interrupts are
acknowledged when the corresponding bit is written to the interrupt
status register.
PVSCSI Device Operation Sequences
=================================
1. Startup sequence:
a. Issue PVSCSI_CMD_ADAPTER_RESET command;
aa. Windows driver reads interrupt status register here;
b. Issue PVSCSI_CMD_SETUP_MSG_RING command with no additional data,
check status and disable device messages if error returned;
(Omitted if device messages disabled by driver configuration)
c. Issue PVSCSI_CMD_SETUP_RINGS command, provide rings configuration
as struct PVSCSICmdDescSetupRings;
d. Issue PVSCSI_CMD_SETUP_MSG_RING command again, provide
rings configuration as struct PVSCSICmdDescSetupMsgRing;
e. Unmask completion and message (if device messages enabled) interrupts.
2. Shutdown sequences
a. Mask interrupts;
b. Flush request ring using PVSCSI_REG_OFFSET_KICK_NON_RW_IO;
c. Issue PVSCSI_CMD_ADAPTER_RESET command.
3. Send request
a. Fill next free request ring descriptor;
b. Issue PVSCSI_REG_OFFSET_KICK_RW_IO for R/W operations;
or PVSCSI_REG_OFFSET_KICK_NON_RW_IO for other operations.
4. Abort command
a. Issue PVSCSI_CMD_ABORT_CMD command;
5. Request completion processing
a. Upon completion interrupt arrival process completion
and message (if enabled) rings.

View file

@ -1,19 +0,0 @@
A Spice port channel is an arbitrary communication between the Spice
server host side and the client side.
Thanks to the associated reverse fully qualified domain name (fqdn),
a Spice client can handle the various ports appropriately.
The following fqdn names are reserved by the QEMU project:
org.qemu.monitor.hmp.0
QEMU human monitor
org.qemu.monitor.qmp.0:
QEMU control monitor
org.qemu.console.serial.0
QEMU virtual serial port
org.qemu.console.debug.0
QEMU debug console

View file

@ -1,349 +0,0 @@
= Tracing =
== Introduction ==
This document describes the tracing infrastructure in QEMU and how to use it
for debugging, profiling, and observing execution.
== Quickstart ==
1. Build with the 'simple' trace backend:
./configure --enable-trace-backends=simple
make
2. Create a file with the events you want to trace:
echo bdrv_aio_readv > /tmp/events
echo bdrv_aio_writev >> /tmp/events
3. Run the virtual machine to produce a trace file:
qemu -trace events=/tmp/events ... # your normal QEMU invocation
4. Pretty-print the binary trace file:
./scripts/simpletrace.py trace-events trace-* # Override * with QEMU <pid>
== Trace events ==
There is a set of static trace events declared in the "trace-events" source
file. Each trace event declaration names the event, its arguments, and the
format string which can be used for pretty-printing:
qemu_vmalloc(size_t size, void *ptr) "size %zu ptr %p"
qemu_vfree(void *ptr) "ptr %p"
The "trace-events" file is processed by the "tracetool" script during build to
generate code for the trace events. Trace events are invoked directly from
source code like this:
#include "trace.h" /* needed for trace event prototype */
void *qemu_vmalloc(size_t size)
{
void *ptr;
size_t align = QEMU_VMALLOC_ALIGN;
if (size < align) {
align = getpagesize();
}
ptr = qemu_memalign(align, size);
trace_qemu_vmalloc(size, ptr);
return ptr;
}
=== Declaring trace events ===
The "tracetool" script produces the trace.h header file which is included by
every source file that uses trace events. Since many source files include
trace.h, it uses a minimum of types and other header files included to keep the
namespace clean and compile times and dependencies down.
Trace events should use types as follows:
* Use stdint.h types for fixed-size types. Most offsets and guest memory
addresses are best represented with uint32_t or uint64_t. Use fixed-size
types over primitive types whose size may change depending on the host
(32-bit versus 64-bit) so trace events don't truncate values or break
the build.
* Use void * for pointers to structs or for arrays. The trace.h header
cannot include all user-defined struct declarations and it is therefore
necessary to use void * for pointers to structs.
* For everything else, use primitive scalar types (char, int, long) with the
appropriate signedness.
Format strings should reflect the types defined in the trace event. Take
special care to use PRId64 and PRIu64 for int64_t and uint64_t types,
respectively. This ensures portability between 32- and 64-bit platforms.
=== Hints for adding new trace events ===
1. Trace state changes in the code. Interesting points in the code usually
involve a state change like starting, stopping, allocating, freeing. State
changes are good trace events because they can be used to understand the
execution of the system.
2. Trace guest operations. Guest I/O accesses like reading device registers
are good trace events because they can be used to understand guest
interactions.
3. Use correlator fields so the context of an individual line of trace output
can be understood. For example, trace the pointer returned by malloc and
used as an argument to free. This way mallocs and frees can be matched up.
Trace events with no context are not very useful.
4. Name trace events after their function. If there are multiple trace events
in one function, append a unique distinguisher at the end of the name.
== Generic interface and monitor commands ==
You can programmatically query and control the state of trace events through a
backend-agnostic interface provided by the header "trace/control.h".
Note that some of the backends do not provide an implementation for some parts
of this interface, in which case QEMU will just print a warning (please refer to
header "trace/control.h" to see which routines are backend-dependent).
The state of events can also be queried and modified through monitor commands:
* info trace-events
View available trace events and their state. State 1 means enabled, state 0
means disabled.
* trace-event NAME on|off
Enable/disable a given trace event or a group of events (using wildcards).
The "-trace events=<file>" command line argument can be used to enable the
events listed in <file> from the very beginning of the program. This file must
contain one event name per line.
If a line in the "-trace events=<file>" file begins with a '-', the trace event
will be disabled instead of enabled. This is useful when a wildcard was used
to enable an entire family of events but one noisy event needs to be disabled.
Wildcard matching is supported in both the monitor command "trace-event" and the
events list file. That means you can enable/disable the events having a common
prefix in a batch. For example, virtio-blk trace events could be enabled using
the following monitor command:
trace-event virtio_blk_* on
== Trace backends ==
The "tracetool" script automates tedious trace event code generation and also
keeps the trace event declarations independent of the trace backend. The trace
events are not tightly coupled to a specific trace backend, such as LTTng or
SystemTap. Support for trace backends can be added by extending the "tracetool"
script.
The trace backends are chosen at configure time:
./configure --enable-trace-backends=simple
For a list of supported trace backends, try ./configure --help or see below.
If multiple backends are enabled, the trace is sent to them all.
The following subsections describe the supported trace backends.
=== Nop ===
The "nop" backend generates empty trace event functions so that the compiler
can optimize out trace events completely. This is the default and imposes no
performance penalty.
Note that regardless of the selected trace backend, events with the "disable"
property will be generated with the "nop" backend.
=== Stderr ===
The "stderr" backend sends trace events directly to standard error. This
effectively turns trace events into debug printfs.
This is the simplest backend and can be used together with existing code that
uses DPRINTF().
=== Simpletrace ===
The "simple" backend supports common use cases and comes as part of the QEMU
source tree. It may not be as powerful as platform-specific or third-party
trace backends but it is portable. This is the recommended trace backend
unless you have specific needs for more advanced backends.
The "simple" backend currently does not capture string arguments, it simply
records the char* pointer value instead of the string that is pointed to.
=== Ftrace ===
The "ftrace" backend writes trace data to ftrace marker. This effectively
sends trace events to ftrace ring buffer, and you can compare qemu trace
data and kernel(especially kvm.ko when using KVM) trace data.
if you use KVM, enable kvm events in ftrace:
# echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
After running qemu by root user, you can get the trace:
# cat /sys/kernel/debug/tracing/trace
Restriction: "ftrace" backend is restricted to Linux only.
==== Monitor commands ====
* trace-file on|off|flush|set <path>
Enable/disable/flush the trace file or set the trace file name.
==== Analyzing trace files ====
The "simple" backend produces binary trace files that can be formatted with the
simpletrace.py script. The script takes the "trace-events" file and the binary
trace:
./scripts/simpletrace.py trace-events trace-12345
You must ensure that the same "trace-events" file was used to build QEMU,
otherwise trace event declarations may have changed and output will not be
consistent.
=== LTTng Userspace Tracer ===
The "ust" backend uses the LTTng Userspace Tracer library. There are no
monitor commands built into QEMU, instead UST utilities should be used to list,
enable/disable, and dump traces.
Package lttng-tools is required for userspace tracing. You must ensure that the
current user belongs to the "tracing" group, or manually launch the
lttng-sessiond daemon for the current user prior to running any instance of
QEMU.
While running an instrumented QEMU, LTTng should be able to list all available
events:
lttng list -u
Create tracing session:
lttng create mysession
Enable events:
lttng enable-event qemu:g_malloc -u
Where the events can either be a comma-separated list of events, or "-a" to
enable all tracepoint events. Start and stop tracing as needed:
lttng start
lttng stop
View the trace:
lttng view
Destroy tracing session:
lttng destroy
Babeltrace can be used at any later time to view the trace:
babeltrace $HOME/lttng-traces/mysession-<date>-<time>
=== SystemTap ===
The "dtrace" backend uses DTrace sdt probes but has only been tested with
SystemTap. When SystemTap support is detected a .stp file with wrapper probes
is generated to make use in scripts more convenient. This step can also be
performed manually after a build in order to change the binary name in the .stp
probes:
scripts/tracetool --dtrace --stap \
--binary path/to/qemu-binary \
--target-type system \
--target-name x86_64 \
<trace-events >qemu.stp
== Trace event properties ==
Each event in the "trace-events" file can be prefixed with a space-separated
list of zero or more of the following event properties.
=== "disable" ===
If a specific trace event is going to be invoked a huge number of times, this
might have a noticeable performance impact even when the event is
programmatically disabled.
In this case you should declare such event with the "disable" property. This
will effectively disable the event at compile time (by using the "nop" backend),
thus having no performance impact at all on regular builds (i.e., unless you
edit the "trace-events" file).
In addition, there might be cases where relatively complex computations must be
performed to generate values that are only used as arguments for a trace
function. In these cases you can use the macro 'TRACE_${EVENT_NAME}_ENABLED' to
guard such computations and avoid its compilation when the event is disabled:
#include "trace.h" /* needed for trace event prototype */
void *qemu_vmalloc(size_t size)
{
void *ptr;
size_t align = QEMU_VMALLOC_ALIGN;
if (size < align) {
align = getpagesize();
}
ptr = qemu_memalign(align, size);
if (TRACE_QEMU_VMALLOC_ENABLED) { /* preprocessor macro */
void *complex;
/* some complex computations to produce the 'complex' value */
trace_qemu_vmalloc(size, ptr, complex);
}
return ptr;
}
You can check both if the event has been disabled and is dynamically enabled at
the same time using the 'trace_event_get_state' routine (see header
"trace/control.h" for more information).
=== "tcg" ===
Guest code generated by TCG can be traced by defining an event with the "tcg"
event property. Internally, this property generates two events:
"<eventname>_trans" to trace the event at translation time, and
"<eventname>_exec" to trace the event at execution time.
Instead of using these two events, you should instead use the function
"trace_<eventname>_tcg" during translation (TCG code generation). This function
will automatically call "trace_<eventname>_trans", and will generate the
necessary TCG code to call "trace_<eventname>_exec" during guest code execution.
Events with the "tcg" property can be declared in the "trace-events" file with a
mix of native and TCG types, and "trace_<eventname>_tcg" will gracefully forward
them to the "<eventname>_trans" and "<eventname>_exec" events. Since TCG values
are not known at translation time, these are ignored by the "<eventname>_trans"
event. Because of this, the entry in the "trace-events" file needs two printing
formats (separated by a comma):
tcg foo(uint8_t a1, TCGv_i32 a2) "a1=%d", "a1=%d a2=%d"
For example:
#include "trace-tcg.h"
void some_disassembly_func (...)
{
uint8_t a1 = ...;
TCGv_i32 a2 = ...;
trace_foo_tcg(a1, a2);
}
This will immediately call:
void trace_foo_trans(uint8_t a1);
and will generate the TCG code to call:
void trace_foo(uint8_t a1, uint32_t a2);

View file

@ -1,47 +0,0 @@
qemu usb storage emulation
--------------------------
QEMU has three devices for usb storage emulation.
Number one emulates the classic bulk-only transport protocol which is
used by 99% of the usb sticks on the market today and is called
"usb-storage". Usage (hooking up to xhci, other host controllers work
too):
qemu ${other_vm_args} \
-drive if=none,id=stick,file=/path/to/file.img \
-device nec-usb-xhci,id=xhci \
-device usb-storage,bus=xhci.0,drive=stick
Number two is the newer usb attached scsi transport. This one doesn't
automagically create a scsi disk, so you have to explicitly attach one
manually. Multiple logical units are supported. Here is an example
with tree logical units:
qemu ${other_vm_args} \
-drive if=none,id=uas-disk1,file=/path/to/file1.img \
-drive if=none,id=uas-disk2,file=/path/to/file2.img \
-drive if=none,id=uas-cdrom,media=cdrom,file=/path/to/image.iso \
-device nec-usb-xhci,id=xhci \
-device usb-uas,id=uas,bus=xhci.0 \
-device scsi-hd,bus=uas.0,scsi-id=0,lun=0,drive=uas-disk1 \
-device scsi-hd,bus=uas.0,scsi-id=0,lun=1,drive=uas-disk2 \
-device scsi-cd,bus=uas.0,scsi-id=0,lun=5,drive=uas-cdrom
Number three emulates the classic bulk-only transport protocol too.
It's called "usb-bot". It shares most code with "usb-storage", and
the guest will not be able to see the difference. The qemu command
line interface is simliar to usb-uas though, i.e. no automatic scsi
disk creation. It also features support for up to 16 LUNs. The LUN
numbers must be continuous, i.e. for three devices you must use 0+1+2.
The 0+1+5 numbering from the "usb-uas" example isn't going to work
with "usb-bot".
enjoy,
Gerd
--
Gerd Hoffmann <kraxel@redhat.com>

View file

@ -1,161 +0,0 @@
USB 2.0 Quick Start
===================
The QEMU EHCI Adapter can be used with and without companion
controllers. See below for the companion controller mode.
When not running in companion controller mode there are two completely
separate USB busses: One USB 1.1 bus driven by the UHCI controller and
one USB 2.0 bus driven by the EHCI controller. Devices must be
attached to the correct controller manually.
The '-usb' switch will make qemu create the UHCI controller as part of
the PIIX3 chipset. The USB 1.1 bus will carry the name "usb-bus.0".
You can use the standard -device switch to add a EHCI controller to
your virtual machine. It is strongly recommended to specify an ID for
the controller so the USB 2.0 bus gets a individual name, for example
'-device usb-ehci,id=ehci". This will give you a USB 2.0 bus named
"ehci.0".
I strongly recomment to also use -device to attach usb devices because
you can specify the bus they should be attached to this way. Here is
a complete example:
qemu -M pc ${otheroptions} \
-drive if=none,id=usbstick,file=/path/to/image \
-usb \
-device usb-ehci,id=ehci \
-device usb-tablet,bus=usb-bus.0 \
-device usb-storage,bus=ehci.0,drive=usbstick
This attaches a usb tablet to the UHCI adapter and a usb mass storage
device to the EHCI adapter.
Companion controller support
----------------------------
Companion controller support has been added recently. The operational
model described above with two completely separate busses still works
fine. Additionally the UHCI and OHCI controllers got the ability to
attach to a usb bus created by EHCI as companion controllers. This is
done by specifying the masterbus and firstport properties. masterbus
specifies the bus name the controller should attach to. firstport
specifies the first port the controller should attach to, which is
needed as usually one ehci controller with six ports has three uhci
companion controllers with two ports each.
There is a config file in docs which will do all this for you, just
try ...
qemu -readconfig docs/ich9-ehci-uhci.cfg
... then use "bus=ehci.0" to assign your usb devices to that bus.
xhci controller support
-----------------------
There is also xhci host controller support available. It got a lot
less testing than ehci and there are a bunch of known limitations, so
ehci may work better for you. On the other hand the xhci hardware
design is much more virtualization-friendly, thus xhci emulation uses
less resources (especially cpu). If you want to give xhci a try
use this to add the host controller ...
qemu -device nec-usb-xhci,id=xhci
... then use "bus=xhci.0" when assigning usb devices.
More USB tips & tricks
======================
Recently the usb pass through driver (also known as usb-host) and the
qemu usb subsystem gained a few capabilities which are available only
via qdev properties, i,e. when using '-device'.
physical port addressing
------------------------
First you can (for all usb devices) specify the physical port where
the device will show up in the guest. This can be done using the
"port" property. UHCI has two root ports (1,2). EHCI has four root
ports (1-4), the emulated (1.1) USB hub has eight ports.
Plugging a tablet into UHCI port 1 works like this:
-device usb-tablet,bus=usb-bus.0,port=1
Plugging a hub into UHCI port 2 works like this:
-device usb-hub,bus=usb-bus.0,port=2
Plugging a virtual usb stick into port 4 of the hub just plugged works
this way:
-device usb-storage,bus=usb-bus.0,port=2.4,drive=...
You can do basically the same in the monitor using the device_add
command. If you want to unplug devices too you should specify some
unique id which you can use to refer to the device ...
(qemu) device_add usb-tablet,bus=usb-bus.0,port=1,id=my-tablet
(qemu) device_del my-tablet
... when unplugging it with device_del.
USB pass through hints
----------------------
The usb-host driver has a bunch of properties to specify the device
which should be passed to the guest:
hostbus=<nr> -- Specifies the bus number the device must be attached
to.
hostaddr=<nr> -- Specifies the device address the device got
assigned by the guest os.
hostport=<str> -- Specifies the physical port the device is attached
to.
vendorid=<hexnr> -- Specifies the vendor ID of the device.
productid=<hexnr> -- Specifies the product ID of the device.
In theory you can combine all these properties as you like. In
practice only a few combinations are useful:
(1) vendorid+productid -- match for a specific device, pass it to
the guest when it shows up somewhere in the host.
(2) hostbus+hostport -- match for a specific physical port in the
host, any device which is plugged in there gets passed to the
guest.
(3) hostbus+hostaddr -- most useful for ad-hoc pass through as the
hostaddr isn't stable, the next time you plug in the device it
gets a new one ...
Note that USB 1.1 devices are handled by UHCI/OHCI and USB 2.0 by
EHCI. That means a device plugged into the very same physical port
may show up on different busses depending on the speed. The port I'm
using for testing is bus 1 + port 1 for 2.0 devices and bus 3 + port 1
for 1.1 devices. Passing through any device plugged into that port
and also assign them to the correct bus can be done this way:
qemu -M pc ${otheroptions} \
-usb \
-device usb-ehci,id=ehci \
-device usb-host,bus=usb-bus.0,hostbus=3,hostport=1 \
-device usb-host,bus=ehci.0,hostbus=1,hostport=1
enjoy,
Gerd
--
Gerd Hoffmann <kraxel@redhat.com>

View file

@ -1,105 +0,0 @@
virtio balloon memory statistics
================================
The virtio balloon driver supports guest memory statistics reporting. These
statistics are available to QEMU users as QOM (QEMU Object Model) device
properties via a polling mechanism.
Before querying the available stats, clients first have to enable polling.
This is done by writing a time interval value (in seconds) to the
guest-stats-polling-interval property. This value can be:
> 0 enables polling in the specified interval. If polling is already
enabled, the polling time interval is changed to the new value
0 disables polling. Previous polled statistics are still valid and
can be queried.
Once polling is enabled, the virtio-balloon device in QEMU will start
polling the guest's balloon driver for new stats in the specified time
interval.
To retrieve those stats, clients have to query the guest-stats property,
which will return a dictionary containing:
o A key named 'stats', containing all available stats. If the guest
doesn't support a particular stat, or if it couldn't be retrieved,
its value will be -1. Currently, the following stats are supported:
- stat-swap-in
- stat-swap-out
- stat-major-faults
- stat-minor-faults
- stat-free-memory
- stat-total-memory
o A key named last-update, which contains the last stats update
timestamp in seconds. Since this timestamp is generated by the host,
a buggy guest can't influence its value. The value is 0 if the guest
has not updated the stats (yet).
It's also important to note the following:
- Previously polled statistics remain available even if the polling is
later disabled
- As noted above, if a guest doesn't support a particular stat its value
will always be -1. However, it's also possible that a guest temporarily
couldn't update one or even all stats. If this happens, just wait for
the next update
- Polling can be enabled even if the guest doesn't have stats support
or the balloon driver wasn't loaded in the guest. If this is the case
and stats are queried, last-update will be 0.
- The polling timer is only re-armed when the guest responds to the
statistics request. This means that if a (buggy) guest doesn't ever
respond to the request the timer will never be re-armed, which has
the same effect as disabling polling
Here are a few examples. QEMU is started with '-balloon virtio', which
generates '/machine/peripheral-anon/device[1]' as the QOM path for the
balloon device.
Enable polling with 2 seconds interval:
{ "execute": "qom-set",
"arguments": { "path": "/machine/peripheral-anon/device[1]",
"property": "guest-stats-polling-interval", "value": 2 } }
{ "return": {} }
Change polling to 10 seconds:
{ "execute": "qom-set",
"arguments": { "path": "/machine/peripheral-anon/device[1]",
"property": "guest-stats-polling-interval", "value": 10 } }
{ "return": {} }
Get stats:
{ "execute": "qom-get",
"arguments": { "path": "/machine/peripheral-anon/device[1]",
"property": "guest-stats" } }
{
"return": {
"stats": {
"stat-swap-out": 0,
"stat-free-memory": 844943360,
"stat-minor-faults": 219028,
"stat-major-faults": 235,
"stat-total-memory": 1044406272,
"stat-swap-in": 0
},
"last-update": 1358529861
}
}
Disable polling:
{ "execute": "qom-set",
"arguments": { "path": "/machine/peripheral-anon/device[1]",
"property": "stats-polling-interval", "value": 0 } }
{ "return": {} }

View file

@ -1,50 +0,0 @@
VNC LED state Pseudo-encoding
=============================
Introduction
------------
This document describes the Pseudo-encoding of LED state for RFB which
is the protocol used in VNC as reference link below:
http://tigervnc.svn.sourceforge.net/viewvc/tigervnc/rfbproto/rfbproto.rst?content-type=text/plain
When accessing a guest by console through VNC, there might be mismatch
between the lock keys notification LED on the computer running the VNC
client session and the current status of the lock keys on the guest
machine.
To solve this problem it attempts to add LED state Pseudo-encoding
extension to VNC protocol to deal with setting LED state.
Pseudo-encoding
---------------
This Pseudo-encoding requested by client declares to server that it supports
LED state extensions to the protocol.
The Pseudo-encoding number for LED state defined as:
======= ===============================================================
Number Name
======= ===============================================================
-261 'LED state Pseudo-encoding'
======= ===============================================================
LED state Pseudo-encoding
--------------------------
The LED state Pseudo-encoding describes the encoding of LED state which
consists of 3 bits, from left to right each bit represents the Caps, Num,
and Scroll lock key respectively. '1' indicates that the LED should be
on and '0' should be off.
Some example encodings for it as following:
======= ===============================================================
Code Description
======= ===============================================================
100 CapsLock is on, NumLock and ScrollLock are off
010 NumLock is on, CapsLock and ScrollLock are off
111 CapsLock, NumLock and ScrollLock are on
======= ===============================================================

View file

@ -1,651 +0,0 @@
= How to write QMP commands using the QAPI framework =
This document is a step-by-step guide on how to write new QMP commands using
the QAPI framework. It also shows how to implement new style HMP commands.
This document doesn't discuss QMP protocol level details, nor does it dive
into the QAPI framework implementation.
For an in-depth introduction to the QAPI framework, please refer to
docs/qapi-code-gen.txt. For documentation about the QMP protocol, please
check the files in QMP/.
== Overview ==
Generally speaking, the following steps should be taken in order to write a
new QMP command.
1. Write the command's and type(s) specification in the QAPI schema file
(qapi-schema.json in the root source directory)
2. Write the QMP command itself, which is a regular C function. Preferably,
the command should be exported by some QEMU subsystem. But it can also be
added to the qmp.c file
3. At this point the command can be tested under the QMP protocol
4. Write the HMP command equivalent. This is not required and should only be
done if it does make sense to have the functionality in HMP. The HMP command
is implemented in terms of the QMP command
The following sections will demonstrate each of the steps above. We will start
very simple and get more complex as we progress.
=== Testing ===
For all the examples in the next sections, the test setup is the same and is
shown here.
First, QEMU should be started as:
# /path/to/your/source/qemu [...] \
-chardev socket,id=qmp,port=4444,host=localhost,server \
-mon chardev=qmp,mode=control,pretty=on
Then, in a different terminal:
$ telnet localhost 4444
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
{
"QMP": {
"version": {
"qemu": {
"micro": 50,
"minor": 15,
"major": 0
},
"package": ""
},
"capabilities": [
]
}
}
The above output is the QMP server saying you're connected. The server is
actually in capabilities negotiation mode. To enter in command mode type:
{ "execute": "qmp_capabilities" }
Then the server should respond:
{
"return": {
}
}
Which is QMP's way of saying "the latest command executed OK and didn't return
any data". Now you're ready to enter the QMP example commands as explained in
the following sections.
== Writing a command that doesn't return data ==
That's the most simple QMP command that can be written. Usually, this kind of
command carries some meaningful action in QEMU but here it will just print
"Hello, world" to the standard output.
Our command will be called "hello-world". It takes no arguments, nor does it
return any data.
The first step is to add the following line to the bottom of the
qapi-schema.json file:
{ 'command': 'hello-world' }
The "command" keyword defines a new QMP command. It's an JSON object. All
schema entries are JSON objects. The line above will instruct the QAPI to
generate any prototypes and the necessary code to marshal and unmarshal
protocol data.
The next step is to write the "hello-world" implementation. As explained
earlier, it's preferable for commands to live in QEMU subsystems. But
"hello-world" doesn't pertain to any, so we put its implementation in qmp.c:
void qmp_hello_world(Error **errp)
{
printf("Hello, world!\n");
}
There are a few things to be noticed:
1. QMP command implementation functions must be prefixed with "qmp_"
2. qmp_hello_world() returns void, this is in accordance with the fact that the
command doesn't return any data
3. It takes an "Error **" argument. This is required. Later we will see how to
return errors and take additional arguments. The Error argument should not
be touched if the command doesn't return errors
4. We won't add the function's prototype. That's automatically done by the QAPI
5. Printing to the terminal is discouraged for QMP commands, we do it here
because it's the easiest way to demonstrate a QMP command
Now a little hack is needed. As we're still using the old QMP server we need
to add the new command to its internal dispatch table. This step won't be
required in the near future. Open the qmp-commands.hx file and add the
following in the botton:
{
.name = "hello-world",
.args_type = "",
.mhandler.cmd_new = qmp_marshal_input_hello_world,
},
You're done. Now build qemu, run it as suggested in the "Testing" section,
and then type the following QMP command:
{ "execute": "hello-world" }
Then check the terminal running qemu and look for the "Hello, world" string. If
you don't see it then something went wrong.
=== Arguments ===
Let's add an argument called "message" to our "hello-world" command. The new
argument will contain the string to be printed to stdout. It's an optional
argument, if it's not present we print our default "Hello, World" string.
The first change we have to do is to modify the command specification in the
schema file to the following:
{ 'command': 'hello-world', 'data': { '*message': 'str' } }
Notice the new 'data' member in the schema. It's an JSON object whose each
element is an argument to the command in question. Also notice the asterisk,
it's used to mark the argument optional (that means that you shouldn't use it
for mandatory arguments). Finally, 'str' is the argument's type, which
stands for "string". The QAPI also supports integers, booleans, enumerations
and user defined types.
Now, let's update our C implementation in qmp.c:
void qmp_hello_world(bool has_message, const char *message, Error **errp)
{
if (has_message) {
printf("%s\n", message);
} else {
printf("Hello, world\n");
}
}
There are two important details to be noticed:
1. All optional arguments are accompanied by a 'has_' boolean, which is set
if the optional argument is present or false otherwise
2. The C implementation signature must follow the schema's argument ordering,
which is defined by the "data" member
The last step is to update the qmp-commands.hx file:
{
.name = "hello-world",
.args_type = "message:s?",
.mhandler.cmd_new = qmp_marshal_input_hello_world,
},
Notice that the "args_type" member got our "message" argument. The character
"s" stands for "string" and "?" means it's optional. This too must be ordered
according to the C implementation and schema file. You can look for more
examples in the qmp-commands.hx file if you need to define more arguments.
Again, this step won't be required in the future.
Time to test our new version of the "hello-world" command. Build qemu, run it as
described in the "Testing" section and then send two commands:
{ "execute": "hello-world" }
{
"return": {
}
}
{ "execute": "hello-world", "arguments": { "message": "We love qemu" } }
{
"return": {
}
}
You should see "Hello, world" and "we love qemu" in the terminal running qemu,
if you don't see these strings, then something went wrong.
=== Errors ===
QMP commands should use the error interface exported by the error.h header
file. Basically, errors are set by calling the error_set() function.
Let's say we don't accept the string "message" to contain the word "love". If
it does contain it, we want the "hello-world" command to return an error:
void qmp_hello_world(bool has_message, const char *message, Error **errp)
{
if (has_message) {
if (strstr(message, "love")) {
error_set(errp, ERROR_CLASS_GENERIC_ERROR,
"the word 'love' is not allowed");
return;
}
printf("%s\n", message);
} else {
printf("Hello, world\n");
}
}
The first argument to the error_set() function is the Error pointer to pointer,
which is passed to all QMP functions. The second argument is a ErrorClass
value, which should be ERROR_CLASS_GENERIC_ERROR most of the time (more
details about error classes are given below). The third argument is a human
description of the error, this is a free-form printf-like string.
Let's test the example above. Build qemu, run it as defined in the "Testing"
section, and then issue the following command:
{ "execute": "hello-world", "arguments": { "message": "all you need is love" } }
The QMP server's response should be:
{
"error": {
"class": "GenericError",
"desc": "the word 'love' is not allowed"
}
}
As a general rule, all QMP errors should use ERROR_CLASS_GENERIC_ERROR. There
are two exceptions to this rule:
1. A non-generic ErrorClass value exists* for the failure you want to report
(eg. DeviceNotFound)
2. Management applications have to take special action on the failure you
want to report, hence you have to add a new ErrorClass value so that they
can check for it
If the failure you want to report doesn't fall in one of the two cases above,
just report ERROR_CLASS_GENERIC_ERROR.
* All existing ErrorClass values are defined in the qapi-schema.json file
=== Command Documentation ===
There's only one step missing to make "hello-world"'s implementation complete,
and that's its documentation in the schema file.
This is very important. No QMP command will be accepted in QEMU without proper
documentation.
There are many examples of such documentation in the schema file already, but
here goes "hello-world"'s new entry for the qapi-schema.json file:
##
# @hello-world
#
# Print a client provided string to the standard output stream.
#
# @message: #optional string to be printed
#
# Returns: Nothing on success.
#
# Notes: if @message is not provided, the "Hello, world" string will
# be printed instead
#
# Since: <next qemu stable release, eg. 1.0>
##
{ 'command': 'hello-world', 'data': { '*message': 'str' } }
Please, note that the "Returns" clause is optional if a command doesn't return
any data nor any errors.
=== Implementing the HMP command ===
Now that the QMP command is in place, we can also make it available in the human
monitor (HMP).
With the introduction of the QAPI, HMP commands make QMP calls. Most of the
time HMP commands are simple wrappers. All HMP commands implementation exist in
the hmp.c file.
Here's the implementation of the "hello-world" HMP command:
void hmp_hello_world(Monitor *mon, const QDict *qdict)
{
const char *message = qdict_get_try_str(qdict, "message");
Error *err = NULL;
qmp_hello_world(!!message, message, &err);
if (err) {
monitor_printf(mon, "%s\n", error_get_pretty(err));
error_free(err);
return;
}
}
Also, you have to add the function's prototype to the hmp.h file.
There are three important points to be noticed:
1. The "mon" and "qdict" arguments are mandatory for all HMP functions. The
former is the monitor object. The latter is how the monitor passes
arguments entered by the user to the command implementation
2. hmp_hello_world() performs error checking. In this example we just print
the error description to the user, but we could do more, like taking
different actions depending on the error qmp_hello_world() returns
3. The "err" variable must be initialized to NULL before performing the
QMP call
There's one last step to actually make the command available to monitor users,
we should add it to the hmp-commands.hx file:
{
.name = "hello-world",
.args_type = "message:s?",
.params = "hello-world [message]",
.help = "Print message to the standard output",
.mhandler.cmd = hmp_hello_world,
},
STEXI
@item hello_world @var{message}
@findex hello_world
Print message to the standard output
ETEXI
To test this you have to open a user monitor and issue the "hello-world"
command. It might be instructive to check the command's documentation with
HMP's "help" command.
Please, check the "-monitor" command-line option to know how to open a user
monitor.
== Writing a command that returns data ==
A QMP command is capable of returning any data the QAPI supports like integers,
strings, booleans, enumerations and user defined types.
In this section we will focus on user defined types. Please, check the QAPI
documentation for information about the other types.
=== User Defined Types ===
FIXME This example needs to be redone after commit 6d32717
For this example we will write the query-alarm-clock command, which returns
information about QEMU's timer alarm. For more information about it, please
check the "-clock" command-line option.
We want to return two pieces of information. The first one is the alarm clock's
name. The second one is when the next alarm will fire. The former information is
returned as a string, the latter is an integer in nanoseconds (which is not
very useful in practice, as the timer has probably already fired when the
information reaches the client).
The best way to return that data is to create a new QAPI type, as shown below:
##
# @QemuAlarmClock
#
# QEMU alarm clock information.
#
# @clock-name: The alarm clock method's name.
#
# @next-deadline: #optional The time (in nanoseconds) the next alarm will fire.
#
# Since: 1.0
##
{ 'type': 'QemuAlarmClock',
'data': { 'clock-name': 'str', '*next-deadline': 'int' } }
The "type" keyword defines a new QAPI type. Its "data" member contains the
type's members. In this example our members are the "clock-name" and the
"next-deadline" one, which is optional.
Now let's define the query-alarm-clock command:
##
# @query-alarm-clock
#
# Return information about QEMU's alarm clock.
#
# Returns a @QemuAlarmClock instance describing the alarm clock method
# being currently used by QEMU (this is usually set by the '-clock'
# command-line option).
#
# Since: 1.0
##
{ 'command': 'query-alarm-clock', 'returns': 'QemuAlarmClock' }
Notice the "returns" keyword. As its name suggests, it's used to define the
data returned by a command.
It's time to implement the qmp_query_alarm_clock() function, you can put it
in the qemu-timer.c file:
QemuAlarmClock *qmp_query_alarm_clock(Error **errp)
{
QemuAlarmClock *clock;
int64_t deadline;
clock = g_malloc0(sizeof(*clock));
deadline = qemu_next_alarm_deadline();
if (deadline > 0) {
clock->has_next_deadline = true;
clock->next_deadline = deadline;
}
clock->clock_name = g_strdup(alarm_timer->name);
return clock;
}
There are a number of things to be noticed:
1. The QemuAlarmClock type is automatically generated by the QAPI framework,
its members correspond to the type's specification in the schema file
2. As specified in the schema file, the function returns a QemuAlarmClock
instance and takes no arguments (besides the "errp" one, which is mandatory
for all QMP functions)
3. The "clock" variable (which will point to our QAPI type instance) is
allocated by the regular g_malloc0() function. Note that we chose to
initialize the memory to zero. This is recommended for all QAPI types, as
it helps avoiding bad surprises (specially with booleans)
4. Remember that "next_deadline" is optional? All optional members have a
'has_TYPE_NAME' member that should be properly set by the implementation,
as shown above
5. Even static strings, such as "alarm_timer->name", should be dynamically
allocated by the implementation. This is so because the QAPI also generates
a function to free its types and it cannot distinguish between dynamically
or statically allocated strings
6. You have to include the "qmp-commands.h" header file in qemu-timer.c,
otherwise qemu won't build
The last step is to add the correspoding entry in the qmp-commands.hx file:
{
.name = "query-alarm-clock",
.args_type = "",
.mhandler.cmd_new = qmp_marshal_input_query_alarm_clock,
},
Time to test the new command. Build qemu, run it as described in the "Testing"
section and try this:
{ "execute": "query-alarm-clock" }
{
"return": {
"next-deadline": 2368219,
"clock-name": "dynticks"
}
}
==== The HMP command ====
Here's the HMP counterpart of the query-alarm-clock command:
void hmp_info_alarm_clock(Monitor *mon)
{
QemuAlarmClock *clock;
Error *err = NULL;
clock = qmp_query_alarm_clock(&err);
if (err) {
monitor_printf(mon, "Could not query alarm clock information\n");
error_free(err);
return;
}
monitor_printf(mon, "Alarm clock method in use: '%s'\n", clock->clock_name);
if (clock->has_next_deadline) {
monitor_printf(mon, "Next alarm will fire in %" PRId64 " nanoseconds\n",
clock->next_deadline);
}
qapi_free_QemuAlarmClock(clock);
}
It's important to notice that hmp_info_alarm_clock() calls
qapi_free_QemuAlarmClock() to free the data returned by qmp_query_alarm_clock().
For user defined types, the QAPI will generate a qapi_free_QAPI_TYPE_NAME()
function and that's what you have to use to free the types you define and
qapi_free_QAPI_TYPE_NAMEList() for list types (explained in the next section).
If the QMP call returns a string, then you should g_free() to free it.
Also note that hmp_info_alarm_clock() performs error handling. That's not
strictly required if you're sure the QMP function doesn't return errors, but
it's good practice to always check for errors.
Another important detail is that HMP's "info" commands don't go into the
hmp-commands.hx. Instead, they go into the info_cmds[] table, which is defined
in the monitor.c file. The entry for the "info alarmclock" follows:
{
.name = "alarmclock",
.args_type = "",
.params = "",
.help = "show information about the alarm clock",
.mhandler.info = hmp_info_alarm_clock,
},
To test this, run qemu and type "info alarmclock" in the user monitor.
=== Returning Lists ===
For this example, we're going to return all available methods for the timer
alarm, which is pretty much what the command-line option "-clock ?" does,
except that we're also going to inform which method is in use.
This first step is to define a new type:
##
# @TimerAlarmMethod
#
# Timer alarm method information.
#
# @method-name: The method's name.
#
# @current: true if this alarm method is currently in use, false otherwise
#
# Since: 1.0
##
{ 'type': 'TimerAlarmMethod',
'data': { 'method-name': 'str', 'current': 'bool' } }
The command will be called "query-alarm-methods", here is its schema
specification:
##
# @query-alarm-methods
#
# Returns information about available alarm methods.
#
# Returns: a list of @TimerAlarmMethod for each method
#
# Since: 1.0
##
{ 'command': 'query-alarm-methods', 'returns': ['TimerAlarmMethod'] }
Notice the syntax for returning lists "'returns': ['TimerAlarmMethod']", this
should be read as "returns a list of TimerAlarmMethod instances".
The C implementation follows:
TimerAlarmMethodList *qmp_query_alarm_methods(Error **errp)
{
TimerAlarmMethodList *method_list = NULL;
const struct qemu_alarm_timer *p;
bool current = true;
for (p = alarm_timers; p->name; p++) {
TimerAlarmMethodList *info = g_malloc0(sizeof(*info));
info->value = g_malloc0(sizeof(*info->value));
info->value->method_name = g_strdup(p->name);
info->value->current = current;
current = false;
info->next = method_list;
method_list = info;
}
return method_list;
}
The most important difference from the previous examples is the
TimerAlarmMethodList type, which is automatically generated by the QAPI from
the TimerAlarmMethod type.
Each list node is represented by a TimerAlarmMethodList instance. We have to
allocate it, and that's done inside the for loop: the "info" pointer points to
an allocated node. We also have to allocate the node's contents, which is
stored in its "value" member. In our example, the "value" member is a pointer
to an TimerAlarmMethod instance.
Notice that the "current" variable is used as "true" only in the first
interation of the loop. That's because the alarm timer method in use is the
first element of the alarm_timers array. Also notice that QAPI lists are handled
by hand and we return the head of the list.
To test this you have to add the corresponding qmp-commands.hx entry:
{
.name = "query-alarm-methods",
.args_type = "",
.mhandler.cmd_new = qmp_marshal_input_query_alarm_methods,
},
Now Build qemu, run it as explained in the "Testing" section and try our new
command:
{ "execute": "query-alarm-methods" }
{
"return": [
{
"current": false,
"method-name": "unix"
},
{
"current": true,
"method-name": "dynticks"
}
]
}
The HMP counterpart is a bit more complex than previous examples because it
has to traverse the list, it's shown below for reference:
void hmp_info_alarm_methods(Monitor *mon)
{
TimerAlarmMethodList *method_list, *method;
Error *err = NULL;
method_list = qmp_query_alarm_methods(&err);
if (err) {
monitor_printf(mon, "Could not query alarm methods\n");
error_free(err);
return;
}
for (method = method_list; method; method = method->next) {
monitor_printf(mon, "%c %s\n", method->value->current ? '*' : ' ',
method->value->method_name);
}
qapi_free_TimerAlarmMethodList(method_list);
}

View file

@ -1,128 +0,0 @@
XBZRLE (Xor Based Zero Run Length Encoding)
===========================================
Using XBZRLE (Xor Based Zero Run Length Encoding) allows for the reduction
of VM downtime and the total live-migration time of Virtual machines.
It is particularly useful for virtual machines running memory write intensive
workloads that are typical of large enterprise applications such as SAP ERP
Systems, and generally speaking for any application that uses a sparse memory
update pattern.
Instead of sending the changed guest memory page this solution will send a
compressed version of the updates, thus reducing the amount of data sent during
live migration.
In order to be able to calculate the update, the previous memory pages need to
be stored on the source. Those pages are stored in a dedicated cache
(hash table) and are accessed by their address.
The larger the cache size the better the chances are that the page has already
been stored in the cache.
A small cache size will result in high cache miss rate.
Cache size can be changed before and during migration.
Format
=======
The compression format performs a XOR between the previous and current content
of the page, where zero represents an unchanged value.
The page data delta is represented by zero and non zero runs.
A zero run is represented by its length (in bytes).
A non zero run is represented by its length (in bytes) and the new data.
The run length is encoded using ULEB128 (http://en.wikipedia.org/wiki/LEB128)
There can be more than one valid encoding, the sender may send a longer encoding
for the benefit of reducing computation cost.
page = zrun nzrun
| zrun nzrun page
zrun = length
nzrun = length byte...
length = uleb128 encoded integer
On the sender side XBZRLE is used as a compact delta encoding of page updates,
retrieving the old page content from the cache (default size of 512 MB). The
receiving side uses the existing page's content and XBZRLE to decode the new
page's content.
This work was originally based on research results published
VEE 2011: Evaluation of Delta Compression Techniques for Efficient Live
Migration of Large Virtual Machines by Benoit, Svard, Tordsson and Elmroth.
Additionally the delta encoder XBRLE was improved further using the XBZRLE
instead.
XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads making it
ideal for in-line, real-time encoding such as is needed for live-migration.
Example
old buffer:
1001 zeros
05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 68 00 00 6b 00 6d
3074 zeros
new buffer:
1001 zeros
01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 68 00 00 67 00 69
3074 zeros
encoded buffer:
encoded length 24
e9 07 0f 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 03 01 67 01 01 69
Usage
======================
1. Verify the destination QEMU version is able to decode the new format.
{qemu} info migrate_capabilities
{qemu} xbzrle: off , ...
2. Activate xbzrle on both source and destination:
{qemu} migrate_set_capability xbzrle on
3. Set the XBZRLE cache size - the cache size is in MBytes and should be a
power of 2. The cache default value is 64MBytes. (on source only)
{qemu} migrate_set_cache_size 256m
4. Start outgoing migration
{qemu} migrate -d tcp:destination.host:4444
{qemu} info migrate
capabilities: xbzrle: on
Migration status: active
transferred ram: A kbytes
remaining ram: B kbytes
total ram: C kbytes
total time: D milliseconds
duplicate: E pages
normal: F pages
normal bytes: G kbytes
cache size: H bytes
xbzrle transferred: I kbytes
xbzrle pages: J pages
xbzrle cache miss: K
xbzrle overflow : L
xbzrle cache-miss: the number of cache misses to date - high cache-miss rate
indicates that the cache size is set too low.
xbzrle overflow: the number of overflows in the decoding which where the delta
could not be compressed. This can happen if the changes in the pages are too
large or there are many short changes; for example, changing every second byte
(half a page).
Testing: Testing indicated that live migration with XBZRLE was completed in 110
seconds, whereas without it would not be able to complete.
A simple synthetic memory r/w load generator:
.. include <stdlib.h>
.. include <stdio.h>
.. int main()
.. {
.. char *buf = (char *) calloc(4096, 4096);
.. while (1) {
.. int i;
.. for (i = 0; i < 4096 * 4; i++) {
.. buf[i * 4096 / 4]++;
.. }
.. printf(".");
.. }
.. }

View file

@ -1,34 +0,0 @@
= Save Devices =
QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations. Saving the state just does
that, saves the state for each device that the guest is running.
These operations are normally used with migration (see migration.txt),
however it is also possible to save the state of all devices to file,
without saving the RAM or the block devices of the VM.
This operation is called "xen-save-devices-state" (see
QMP/qmp-commands.txt)
The binary format used in the file is the following:
-------------------------------------------
32 bit big endian: QEMU_VM_FILE_MAGIC
32 bit big endian: QEMU_VM_FILE_VERSION
for_each_device
{
8 bit: QEMU_VM_SECTION_FULL
32 bit big endian: section_id
8 bit: idstr (ID string) length
string: idstr (ID string)
32 bit big endian: instance_id
32 bit big endian: version_id
buffer: device specific data
}
8 bit: QEMU_VM_EOF