According to the documentation in the kvmtool GitHub repository, kvmtool plays a role similar to QEMU's. It is a user-mode virtual machine tool that runs on the host OS and hosts KVM guests. It is a pure virtualization tool: a guest OS runs on it without modification. However, because KVM depends on CPU hardware virtualization, kvmtool, like qemu-kvm, only supports guests built for the same architecture as the host.
kvmtool is only 5 KLOC of code. It is a clean, written-from-scratch, lightweight virtualization tool, and that light weight makes it very approachable for anyone who wants to learn virtualization. It is implemented as a KVM host tool that can boot Linux images without a BIOS or other related dependencies. Below, we set up a kvmtool environment on Ubuntu 22.04 and run another Linux guest OS in a virtual machine.
The host system used in this experiment is Ubuntu 22.04; see the figure below for details:
Download kvmtool:
$ git clone https://github.com/kvmtool/kvmtool.git
Download busybox:
$ wget https://busybox.net/downloads/busybox-1.32.0.tar.bz2
Download the Linux kernel:
$ axel -a -n 80 https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.15.18.tar.gz
When choosing versions, it is enough to pick a tool release and source releases from roughly the same period; the exact versions matter little.
The kvmtool version used in this experiment is commit e17d182ad3f797f01947fc234d95c96c050c534b. Building it is straightforward: enter the kvmtool directory and run make:
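The build step above can be sketched as a small shell function. The commit hash is the one quoted in the text. Checking it out before building is my assumption about how to reproduce the same version, and build_kvmtool is a hypothetical helper name. The -j option only speeds the build up; a plain make is what the article describes.

```shell
# Commit of kvmtool used in this experiment (quoted from the text above).
KVMTOOL_COMMIT=e17d182ad3f797f01947fc234d95c96c050c534b

# Build lkvm from an existing clone; nothing runs until the function is called.
# $1: path to the kvmtool checkout (defaults to ./kvmtool).
build_kvmtool() {
    (
        cd "${1:-kvmtool}" || exit 1
        # Pin the tree to the article's commit (assumption: you want that exact version).
        git checkout "$KVMTOOL_COMMIT" || exit 1
        # A plain "make" is enough; -j only parallelises the build.
        make -j"$(nproc)"
    )
}
```

A typical call would be build_kvmtool ~/kvm/kvmtool. The subshell keeps the caller's working directory unchanged whether or not the build succeeds.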
The build produces the executable lkvm and also creates vm, a hard link to lkvm. The two are identical.
Compile the Linux kernel
Compiling the kernel is straightforward; refer to the blog.
There are three points to note here:
Fix the compilation errors caused by missing .pem files; there are two of them.
Only the bzImage target needs to be built; no modules are required.
The default menuconfig settings are enough; the KVM- and VIRTIO-related options are already enabled.
Finally, the bzImage file is generated:
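The three notes above might translate into the following sketch. It rests on an assumption the article does not confirm: that the two .pem errors come from CONFIG_SYSTEM_TRUSTED_KEYS and CONFIG_SYSTEM_REVOCATION_KEYS pointing at Canonical certificate files, which is the usual situation when .config is derived from an Ubuntu host configuration. If your errors name other options, adjust accordingly. build_bzimage is a hypothetical helper name.

```shell
# Build only bzImage from an unpacked, already configured kernel tree.
# Nothing runs until the function is called.
# $1: path to the kernel source (defaults to ./linux-5.15.18).
build_bzimage() {
    (
        cd "${1:-linux-5.15.18}" || exit 1
        # The article relies on the menuconfig defaults, so a .config must exist.
        [ -f .config ] || { echo "no .config: run 'make menuconfig' first" >&2; exit 1; }
        # Assumption: these two options reference the missing .pem files.
        scripts/config --set-str SYSTEM_TRUSTED_KEYS "" || exit 1
        scripts/config --set-str SYSTEM_REVOCATION_KEYS "" || exit 1
        # Let kconfig settle any options that depend on the two edits.
        make olddefconfig || exit 1
        # Only the bzImage target is needed; modules are not built.
        make -j"$(nproc)" bzImage
    )
}
```

On x86 the result lands in arch/x86/boot/bzImage, which is the path passed to lkvm with -k later in this article.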
Build a root file system based on busybox and create its directory structure; refer to the blog:
Note that after following the steps in the blog, you need to rename the linuxrc file in the top-level directory to init, because the kernel runs /init when it boots from an initramfs (see "Run /init as init process" in the boot log below).
Then pack the rootfs directory into a cpio archive.
$ find . | cpio -o --format=newc > root_fs.cpio
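The rename and the packing step can be combined in one helper. This is a sketch rather than the article's exact procedure: pack_rootfs is a hypothetical name, and writing the archive outside the directory being packed is my variation, chosen so that the archive cannot end up listing itself. The article writes root_fs.cpio inside _install, which also boots.

```shell
# Rename linuxrc to init and pack a rootfs directory as a newc cpio archive.
# $1: rootfs directory (for example busybox's _install)
# $2: output archive, given as an ABSOLUTE path outside $1
pack_rootfs() {
    (
        cd "$1" || exit 1
        # The kernel runs /init from an initramfs, so linuxrc must be renamed.
        if [ -e linuxrc ] || [ -L linuxrc ]; then
            mv linuxrc init || exit 1
        fi
        # Same command as above, but the archive lands outside the packed tree.
        find . | cpio -o --format=newc > "$2"
    )
}
```

For example, pack_rootfs "$PWD/busybox-1.32.0/_install" "$PWD/root_fs.cpio". If you write the archive somewhere other than _install, point the -i option of lkvm at that location.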
After completion, the directory structure is as follows:
After the above three steps are completed, you can start running.
Run the virtual machine
Before running, confirm that the /dev/kvm device node exists on the host.
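A quick pre-flight check could look like the sketch below. check_kvm and its three output strings are hypothetical. The vmx and svm names are the CPU flags Linux reports for Intel VT-x and AMD-V. The two parameters exist only so that the check can be pointed at other files; by default it reads /proc/cpuinfo and looks for /dev/kvm.

```shell
# Report whether this host looks able to run KVM guests.
# $1: cpuinfo file (default /proc/cpuinfo)
# $2: KVM device node (default /dev/kvm)
# Prints one of: ok, no-hw-virt, no-kvm-node.
check_kvm() {
    cpuinfo=${1:-/proc/cpuinfo}
    node=${2:-/dev/kvm}
    # vmx = Intel VT-x, svm = AMD-V; -w matches them as whole words only.
    if ! grep -Eqw 'vmx|svm' "$cpuinfo"; then
        echo "no-hw-virt"
        return 1
    fi
    # The flag is present but the kvm module has not created its device node.
    if [ ! -e "$node" ]; then
        echo "no-kvm-node"
        return 1
    fi
    echo "ok"
}
```

Running check_kvm with no arguments on a suitable host prints ok. A host whose CPU does not expose VT-x or AMD-V, such as a virtual machine without nested virtualization, prints no-hw-virt.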
Start the virtual machine with the following command:
$ sudo ./lkvm run -k ../linux-5.15.18/arch/x86/boot/bzImage -i ../busybox-1.32.0/_install/root_fs.cpio
zlcao@zlcao-RedmiBook-14:~/kvm/kvmtool$ sudo ./lkvm run -k ../linux-5.15.18/arch/x86/boot/bzImage -i ../busybox-1.32.0/_install/root_fs.cpio # lkvm run -k ../linux-5.15.18/arch/x86/boot/bzImage -m 704 -c 8 --name guest-100110 [ 0.000000] Linux version 5.15.18 (zlcao@zlcao-RedmiBook-14) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #1 SMP Fri Jan 27 12:27:51 CST 2023 [ 0.000000] Command line: noapic noacpi pci=conf1 reboot=k panic=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 earlyprintk=serial i8042.noaux=1 console=ttyS0 root=/dev/vda rw [ 0.000000] KERNEL supported cpus: [ 0.000000] Intel GenuineIntel [ 0.000000] AMD AuthenticAMD [ 0.000000] Hygon HygonGenuine [ 0.000000] Centaur CentaurHauls [ 0.000000] zhaoxin Shanghai [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers' [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers' [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers' [ 0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers' [ 0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR' [ 0.000000] x86/fpu: xstate_offset: 576, xstate_sizes: 256 [ 0.000000] x86/fpu: xstate_offset: 832, xstate_sizes: 64 [ 0.000000] x86/fpu: xstate_offset: 896, xstate_sizes: 64 [ 0.000000] x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format. [ 0.000000] signal: max sigframe size: 2032 [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable [ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000ffffe] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000002bffffff] usable [ 0.000000] printk: bootconsole [earlyser0] enabled [ 0.000000] ERROR: earlyprintk= earlyser already used [ 0.000000] NX (Execute Disable) protection: active [ 0.000000] DMI not present or invalid. 
[ 0.000000] Hypervisor detected: KVM [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00 [ 0.000000] kvm-clock: cpu 0, msr 11c01001, primary cpu clock [ 0.000004] kvm-clock: using sched offset of 198180346 cycles [ 0.000522] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns [ 0.002007] tsc: Detected 1992.002 MHz processor [ 0.002444] last_pfn = 0x2c000 max_arch_pfn = 0x400000000 [ 0.002986] Disabled [ 0.003182] x86/PAT: MTRRs disabled, skipping PAT initialization too. [ 0.003765] CPU MTRRs all blank - virtualized system. [ 0.004236] x86/PAT: Configuration [0-7]: WB WT UC- UC WB WT UC- UC Memory KASLR using RDRAND RDTSC... [ 0.005590] found SMP MP-table at [mem 0x000f03b0-0x000f03bf] [ 0.006456] Using GB pages for direct mapping [ 0.007160] RAMDISK: [mem 0x2bd00000-0x2bf83fff] [ 0.007640] ACPI: Early table checksum verification disabled [ 0.008311] ACPI BIOS Error (bug): A valid RSDP was not found (20210730/tbxfroot-210) [ 0.009234] No NUMA configuration found [ 0.009526] Faking a node at [mem 0x0000000000000000-0x000000002bffffff] [ 0.010001] NODE_DATA(0) allocated [mem 0x2bfd6000-0x2bffffff] [ 0.010937] Zone ranges: [ 0.011122] DMA [mem 0x0000000000001000-0x0000000000ffffff] [ 0.011581] DMA32 [mem 0x0000000001000000-0x000000002bffffff] [ 0.012074] Normal empty [ 0.012351] Device empty [ 0.012626] Movable zone start for each node [ 0.012971] Early memory node ranges [ 0.013292] node 0: [mem 0x0000000000001000-0x000000000009efff] [ 0.013732] node 0: [mem 0x0000000000100000-0x000000002bffffff] [ 0.014192] Initmem setup node 0 [mem 0x0000000000001000-0x000000002bffffff] [ 0.014710] On node 0, zone DMA: 1 pages in unavailable ranges [ 0.014878] On node 0, zone DMA: 97 pages in unavailable ranges [ 0.022910] On node 0, zone DMA32: 16384 pages in unavailable ranges [ 0.023633] Intel MultiProcessor Specification v1.4 [ 0.024453] MPTABLE: OEM ID: KVMCPU00 [ 0.024719] MPTABLE: Product ID: 0.1 [ 0.025000] MPTABLE: 
APIC at: 0xFEE00000 [ 0.025279] Processor #0 (Bootup-CPU) [ 0.025527] Processor #1 [ 0.025698] Processor #2 [ 0.025861] Processor #3 [ 0.026025] Processor #4 [ 0.026191] Processor #5 [ 0.026356] Processor #6 [ 0.026521] Processor #7 [ 0.026715] IOAPIC: apic_id 9, version 17, address 0xfec00000, GSI 0-23 [ 0.027163] Processors: 8 [ 0.027344] smpboot: Allowing 8 CPUs, 0 hotplug CPUs [ 0.027735] kvm-guest: KVM setup pv remote TLB flush [ 0.028059] kvm-guest: setup PV sched yield [ 0.028372] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff] [ 0.028859] PM: hibernation: Registered nosave memory: [mem 0x0009f000-0x0009ffff] [ 0.029349] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000effff] [ 0.029843] PM: hibernation: Registered nosave memory: [mem 0x000f0000-0x000fefff] [ 0.030330] PM: hibernation: Registered nosave memory: [mem 0x000ff000-0x000fffff] [ 0.030820] [mem 0x2c000000-0xffffffff] available for PCI devices [ 0.031217] Booting paravirtualized kernel on KVM [ 0.031546] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns [ 0.032234] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1 [ 0.034042] percpu: Embedded 61 pages/cpu s212992 r8192 d28672 u262144 [ 0.034524] kvm-guest: setup async PF for cpu 0 [ 0.034866] kvm-guest: stealtime: cpu 0, msr 2ae33080 [ 0.035203] kvm-guest: PV spinlocks enabled [ 0.035483] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes, linear) [ 0.035994] Built 1 zonelists, mobility grouping on. Total pages: 177152 [ 0.036454] Policy zone: DMA32 [ 0.036658] Kernel command line: noapic noacpi pci=conf1 reboot=k panic=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 earlyprintk=serial i8042.noaux=1 console=ttyS0 root=/dev/vda rw [ 0.037994] Unknown kernel command line parameters "noacpi", will be passed to user space. 
[ 0.039146] Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear) [ 0.039968] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear) [ 0.040621] mem auto-init: stack:off, heap alloc:on, heap free:off [ 0.045493] Memory: 657968K/720504K available (16393K kernel code, 4387K rwdata, 10492K rodata, 2932K init, 4816K bss, 62276K reserved, 0K cma-reserved) [ 0.046448] random: get_random_u64 called from __kmem_cache_create+0x2f/0x520 with crng_init=0 [ 0.046702] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1 [ 0.047702] ftrace: allocating 47928 entries in 188 pages [ 0.064484] ftrace: allocated 188 pages with 5 groups [ 0.065149] rcu: Hierarchical RCU implementation. [ 0.065448] rcu: RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=8. [ 0.065873] Rude variant of Tasks RCU enabled. [ 0.066157] Tracing variant of Tasks RCU enabled. [ 0.066456] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies. [ 0.066930] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8 [ 0.070850] NR_IRQS: 524544, nr_irqs: 488, preallocated irqs: 16 [ 0.071549] random: crng done (trusting CPU's manufacturer) [ 0.071979] Console: colour *CGA 80x25 [ 0.072283] printk: console [ttyS0] enabled [ 0.072283] printk: console [ttyS0] enabled [ 0.072969] printk: bootconsole [earlyser0] disabled [ 0.072969] printk: bootconsole [earlyser0] disabled [ 0.073921] APIC: Switch to symmetric I/O mode setup [ 0.074351] Not enabling interrupt remapping due to skipped IO-APIC setup [ 0.075319] kvm-guest: setup PV IPIs [ 0.075970] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x396d566cf43, max_idle_ns: 881590760263 ns [ 0.076947] Calibrating delay loop (skipped) preset value.. 3984.00 BogoMIPS (lpj=7968008) [ 0.077665] pid_max: default: 32768 minimum: 301 [ 0.081003] LSM: Security Framework initializing [ 0.081417] landlock: Up and running. [ 0.081733] Yama: becoming mindful. 
[ 0.082087] AppArmor: AppArmor initialized [ 0.082481] Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear) [ 0.083131] Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear) Poking KASLR using RDRAND RDTSC... [ 0.085044] x86/cpu: User Mode Instruction Prevention (UMIP) activated [ 0.085971] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8 [ 0.086434] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4 [ 0.086991] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization [ 0.088961] Spectre V2 : Mitigation: Full generic retpoline [ 0.089424] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch [ 0.090121] Spectre V2 : Enabling Restricted Speculation for firmware calls [ 0.090713] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier [ 0.091429] Spectre V2 : User space: Mitigation: STIBP via seccomp and prctl [ 0.092018] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp [ 0.092950] SRBDS: Unknown: Dependent on hypervisor status [ 0.093435] MDS: Mitigation: Clear CPU buffers [ 0.101335] Freeing SMP alternatives memory: 40K [ 0.318017] smpboot: CPU0: Intel 06/8e (family: 0x6, model: 0x8e, stepping: 0xb) [ 0.319105] Performance Events: Skylake events, 32-deep LBR, full-width counters, Intel PMU driver. [ 0.321782] ... version: 2 [ 0.322127] ... bit width: 48 [ 0.322481] ... generic registers: 4 [ 0.322819] ... value mask: 0000ffffffffffff [ 0.323267] ... max period: 00007fffffffffff [ 0.323719] ... fixed-purpose events: 3 [ 0.324941] ... event mask: 000000070000000f [ 0.325598] rcu: Hierarchical SRCU implementation. [ 0.327121] smp: Bringing up secondary CPUs ... [ 0.327742] x86: Booting SMP configuration: [ 0.328094] .... 
node #0, CPUs: #1 [ 0.009568] kvm-clock: cpu 1, msr 11c01041, secondary cpu clock [ 0.329211] kvm-guest: setup async PF for cpu 1 [ 0.329667] kvm-guest: stealtime: cpu 1, msr 2ae73080 [ 0.330021] #2 [ 0.009568] kvm-clock: cpu 2, msr 11c01081, secondary cpu clock [ 0.009568] [Firmware Bug]: CPU2: APIC id mismatch. Firmware: 2 APIC: 7 [ 0.331227] kvm-guest: setup async PF for cpu 2 [ 0.331227] kvm-guest: stealtime: cpu 2, msr 2aeb3080 [ 0.333172] #3 [ 0.009568] kvm-clock: cpu 3, msr 11c010c1, secondary cpu clock [ 0.009568] [Firmware Bug]: CPU3: APIC id mismatch. Firmware: 3 APIC: 7 [ 0.334905] kvm-guest: setup async PF for cpu 3 [ 0.334905] kvm-guest: stealtime: cpu 3, msr 2aef3080 [ 0.334905] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details. [ 0.337190] #4 [ 0.009568] kvm-clock: cpu 4, msr 11c01101, secondary cpu clock [ 0.009568] [Firmware Bug]: CPU4: APIC id mismatch. Firmware: 4 APIC: 1 [ 0.339458] kvm-guest: setup async PF for cpu 4 [ 0.339458] kvm-guest: stealtime: cpu 4, msr 2af33080 [ 0.341165] #5 [ 0.009568] kvm-clock: cpu 5, msr 11c01141, secondary cpu clock [ 0.009568] [Firmware Bug]: CPU5: APIC id mismatch. Firmware: 5 APIC: 0 [ 0.343159] kvm-guest: setup async PF for cpu 5 [ 0.343159] kvm-guest: stealtime: cpu 5, msr 2af73080 [ 0.345078] #6 [ 0.009568] kvm-clock: cpu 6, msr 11c01181, secondary cpu clock [ 0.009568] [Firmware Bug]: CPU6: APIC id mismatch. Firmware: 6 APIC: 7 [ 0.346579] kvm-guest: setup async PF for cpu 6 [ 0.346579] kvm-guest: stealtime: cpu 6, msr 2afb3080 [ 0.346579] #7 [ 0.009568] kvm-clock: cpu 7, msr 11c011c1, secondary cpu clock [ 0.009568] [Firmware Bug]: CPU7: APIC id mismatch. 
Firmware: 7 APIC: 6 [ 0.349375] kvm-guest: setup async PF for cpu 7 [ 0.349375] kvm-guest: stealtime: cpu 7, msr 2aff3080 [ 0.349687] smp: Brought up 1 node, 8 CPUs [ 0.349687] smpboot: Max logical packages: 1 [ 0.349897] smpboot: Total of 8 processors activated (31872.03 BogoMIPS) [ 0.353085] devtmpfs: initialized [ 0.353355] x86/mm: Memory block size: 128MB [ 0.354192] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns [ 0.354192] futex hash table entries: 2048 (order: 5, 131072 bytes, linear) [ 0.354845] pinctrl core: initialized pinctrl subsystem [ 0.357228] PM: RTC time: 05:49:54, date: 2023-01-27 [ 0.358851] NET: Registered PF_NETLINK/PF_ROUTE protocol family [ 0.359701] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations [ 0.361013] DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations [ 0.361921] DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations [ 0.362746] audit: initializing netlink subsys (disabled) [ 0.363325] audit: type=2000 audit(1674798594.637:1): state=initialized audit_enabled=0 res=1 [ 0.363325] thermal_sys: Registered thermal governor 'fair_share' [ 0.363325] thermal_sys: Registered thermal governor 'bang_bang' [ 0.363325] thermal_sys: Registered thermal governor 'step_wise' [ 0.364971] thermal_sys: Registered thermal governor 'user_space' [ 0.365682] thermal_sys: Registered thermal governor 'power_allocator' [ 0.366337] EISA bus registered [ 0.367271] cpuidle: using governor ladder [ 0.367616] cpuidle: using governor menu [ 0.369047] PCI: Using configuration type 1 for base access [ 0.371011] Kprobes globally optimized [ 0.371378] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages [ 0.371378] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages [ 0.373053] ACPI: Interpreter disabled. 
[ 0.373350] iommu: Default domain type: Translated [ 0.373350] iommu: DMA domain TLB invalidation policy: lazy mode [ 0.376980] vgaarb: loaded [ 0.377344] SCSI subsystem initialized [ 0.377600] usbcore: registered new interface driver usbfs [ 0.377600] usbcore: registered new interface driver hub [ 0.377741] usbcore: registered new device driver usb [ 0.378091] pps_core: LinuxPPS API ver. 1 registered [ 0.378415] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <email@example.com> [ 0.379028] PTP clock support registered [ 0.379316] EDAC MC: Ver: 3.0.0 [ 0.381070] NetLabel: Initializing [ 0.381298] NetLabel: domain hash size = 128 [ 0.381582] NetLabel: protocols = UNLABELED CIPSOv4 CALIPSO [ 0.381967] NetLabel: unlabeled traffic allowed by default [ 0.382356] PCI: Probing PCI hardware [ 0.382356] PCI host bridge to bus 0000:00 [ 0.382356] pci_bus 0000:00: root bus resource [io 0x0000-0xffff] [ 0.382356] pci_bus 0000:00: root bus resource [mem 0x00000000-0x7fffffffff] [ 0.382388] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff] [ 0.383074] pci 0000:00:00.0: [1af4:1041] type 00 class 0x020000 [ 0.384986] pci 0000:00:00.0: reg 0x10: [io 0x6200-0x62ff] [ 0.385384] pci 0000:00:00.0: reg 0x14: [mem 0xd2000000-0xd20000ff] [ 0.385820] pci 0000:00:00.0: reg 0x18: [mem 0xd2000400-0xd20007ff] [ 0.394166] pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 00 [ 0.394690] clocksource: Switched to clocksource kvm-clock [ 0.407575] VFS: Disk quotas dquot_6.6.0 [ 0.407909] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes) [ 0.408491] AppArmor: AppArmor Filesystem Enabled [ 0.408831] pnp: PnP ACPI: disabled [ 0.410916] NET: Registered PF_INET protocol family [ 0.411442] IP idents hash table entries: 16384 (order: 5, 131072 bytes, linear) [ 0.412701] tcp_listen_portaddr_hash hash table entries: 512 (order: 1, 8192 bytes, linear) [ 0.413465] TCP established hash table entries: 8192 (order: 4, 65536 bytes, linear) [ 
0.414181] TCP bind hash table entries: 8192 (order: 5, 131072 bytes, linear) [ 0.414825] TCP: Hash tables configured (established 8192 bind 8192) [ 0.415558] MPTCP token hash table entries: 1024 (order: 2, 24576 bytes, linear) [ 0.416173] UDP hash table entries: 512 (order: 2, 16384 bytes, linear) [ 0.416776] UDP-Lite hash table entries: 512 (order: 2, 16384 bytes, linear) [ 0.417414] NET: Registered PF_UNIX/PF_LOCAL protocol family [ 0.417903] NET: Registered PF_XDP protocol family [ 0.418322] pci_bus 0000:00: resource 4 [io 0x0000-0xffff] [ 0.418794] pci_bus 0000:00: resource 5 [mem 0x00000000-0x7fffffffff] [ 0.419406] PCI: CLS 0 bytes, default 64 [ 0.419810] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x396d566cf43, max_idle_ns: 881590760263 ns [ 0.419933] Trying to unpack rootfs image as initramfs... [ 0.421289] clocksource: Switched to clocksource tsc [ 0.421757] platform rtc_cmos: registered platform RTC device (no PNP device found) [ 0.423248] Initialise system trusted keyrings [ 0.423671] Key type blacklist registered [ 0.424313] workingset: timestamp_bits=36 max_order=18 bucket_order=0 [ 0.426758] zbud: loaded [ 0.427453] squashfs: version 4.0 (2009/01/31) Phillip Lougher [ 0.428192] fuse: init (API version 7.34) [ 0.428890] integrity: Platform Keyring initialized [ 0.430492] Freeing initrd memory: 2576K [ 0.435013] Key type asymmetric registered [ 0.435289] Asymmetric key parser 'x509' registered [ 0.435621] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 243) [ 0.436190] io scheduler mq-deadline registered [ 0.436884] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4 [ 0.438064] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled [ 0.459372] serial8250: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a U6_16550A [ 0.480907] serial8250: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a U6_16550A [ 0.502620] serial8250: ttyS2 at I/O 0x3e8 (irq = 4, base_baud = 115200) is a U6_16550A [ 0.505001] Linux 
agpgart interface v0.103 [ 0.508374] loop: module loaded [ 0.509013] tun: Universal TUN/TAP device driver, 1.6 [ 0.509497] PPP generic driver version 2.4.2 [ 0.509993] VFIO - User Level meta-driver version: 0.3 [ 0.510593] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver [ 0.511181] ehci-pci: EHCI PCI platform driver [ 0.511689] ehci-platform: EHCI generic platform driver [ 0.512245] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver [ 0.512857] ohci-pci: OHCI PCI platform driver [ 0.513377] ohci-platform: OHCI generic platform driver [ 0.513947] uhci_hcd: USB Universal Host Controller Interface driver [ 0.514683] i8042: PNP detection disabled [ 0.515301] serio: i8042 KBD port at 0x60,0x64 irq 1 [ 0.516030] mousedev: PS/2 mouse device common for all mice [ 0.516659] input: AT Raw Set 2 keyboard as /devices/platform/i8042/serio0/input/input0 [ 0.517669] rtc_cmos rtc_cmos: only 24-hr supported [ 0.518179] i2c_dev: i2c /dev entries driver [ 0.518713] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log. 
[ 0.520248] device-mapper: uevent: version 1.0.3 [ 0.521055] device-mapper: ioctl: 4.45.0-ioctl (2021-03-22) initialised: firstname.lastname@example.org [ 0.522019] platform eisa.0: Probing EISA bus 0 [ 0.522587] eisa 00:00: EISA: Mainboard @@@0000 detected [ 0.523175] eisa 00:01: EISA: slot 1: @@@0000 detected (disabled) [ 0.523834] eisa 00:02: EISA: slot 2: @@@0000 detected (disabled) [ 0.524514] eisa 00:03: EISA: slot 3: @@@0000 detected (disabled) [ 0.525209] eisa 00:04: EISA: slot 4: @@@0000 detected (disabled) [ 0.525882] eisa 00:05: EISA: slot 5: @@@0000 detected (disabled) [ 0.526553] eisa 00:06: EISA: slot 6: @@@0000 detected (disabled) [ 0.527228] eisa 00:07: EISA: slot 7: @@@0000 detected (disabled) [ 0.527902] eisa 00:08: EISA: slot 8: @@@0000 detected (disabled) [ 0.528538] platform eisa.0: EISA: Detected 8 cards [ 0.529059] intel_pstate: CPU model not supported [ 0.529779] ledtrig-cpu: registered to indicate activity on CPUs [ 0.530507] intel_pmc_core intel_pmc_core.0: initialized [ 0.531105] drop_monitor: Initializing network drop monitor service [ 0.531923] NET: Registered PF_INET6 protocol family [ 0.535712] Segment Routing with IPv6 [ 0.536127] In-situ OAM (IOAM) with IPv6 [ 0.536550] NET: Registered PF_PACKET protocol family [ 0.537127] Key type dns_resolver registered [ 0.538631] IPI shorthand broadcast: enabled [ 0.539012] sched_clock: Marking stable (532988952, 5568579)->(558785633, -20228102) [ 0.540219] registered taskstats version 1 [ 0.540813] Loading compiled-in X.509 certificates [ 0.542057] Loaded X.509 cert 'Build time autogenerated kernel key: 25cc8cb7907826729975261abe82eb726e9a7e0c' [ 0.544528] zswap: loaded using pool lzo/zbud [ 0.545873] Key type ._fscrypt registered [ 0.546290] Key type .fscrypt registered [ 0.546703] Key type fscrypt-provisioning registered [ 0.549064] Key type encrypted registered [ 0.549676] AppArmor: AppArmor sha1 policy hashing enabled [ 0.550410] ima: No TPM chip found, activating TPM-bypass! 
[ 0.550989] Loading compiled-in module X.509 certificates [ 0.551999] Loaded X.509 cert 'Build time autogenerated kernel key: 25cc8cb7907826729975261abe82eb726e9a7e0c' [ 0.552950] ima: Allocated hash algorithm: sha1 [ 0.553710] ima: No architecture policies found [ 0.554232] evm: Initialising EVM extended attributes: [ 0.554765] evm: security.selinux [ 0.555142] evm: security.SMACK64 [ 0.555494] evm: security.SMACK64EXEC [ 0.555881] evm: security.SMACK64TRANSMUTE [ 0.556308] evm: security.SMACK64MMAP [ 0.556692] evm: security.apparmor [ 0.557060] evm: security.ima [ 0.557367] evm: security.capability [ 0.557750] evm: HMAC attrs: 0x1 [ 0.558401] PM: Magic number: 7:314:821 [ 0.558969] RAS: Correctable Errors collector initialized. [ 0.561277] Freeing unused decrypted memory: 2036K [ 0.562714] Freeing unused kernel image (initmem) memory: 2932K [ 0.589485] Write protecting the kernel read-only data: 30720k [ 0.597456] Freeing unused kernel image (text/rodata gap) memory: 2036K [ 0.604078] Freeing unused kernel image (rodata/data gap) memory: 1796K [ 0.659967] x86/mm: Checked W+X mappings: passed, no W+X pages found. [ 0.660538] Run /init as init process Please press Enter to activate this console. / #
Execute top in the virtual machine
The test platform has 8 cores, and by default kvmtool sets the number of VCPUs to the number of host cores, so the picture above shows 8 active CPUs. This matches the -c 8 in the lkvm line echoed at the start of the boot log.
As can be seen from the code, each VCPU corresponds to one thread of the lkvm process on the host. The --cpus option sets the number of VCPUs explicitly, here to 32 on the 8-core host (the upper bound is whatever maximum the host KVM reports):
$ sudo ./lkvm run -k ../linux-5.15.18/arch/x86/boot/bzImage -i ../busybox-1.32.0/_install/root_fs.cpio --cpus=32 --name zilong
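The one-thread-per-VCPU claim can be checked from the host side by counting the threads of the running lkvm process. The sketch below uses only the standard /proc/<pid>/task directory. thread_count is a hypothetical helper, and the expectation stated afterwards is mine, not a figure from the article.

```shell
# Print the number of threads of a process by counting /proc/<pid>/task entries.
# $1: process ID.
thread_count() {
    ls "/proc/$1/task" | wc -l
}
```

With a single guest running, thread_count "$(pidof lkvm)" should print a number at least as large as the VCPU count. I would expect it to be somewhat larger, since lkvm also starts threads that are not VCPUs.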
The first argument after lkvm selects the subcommand to execute. For example, lkvm run starts a virtual machine, and its entry function is kvm_cmd_run:
kvm_cmd_run calls kvm_cmd_run_work to continue launching the virtual machine, creating one pthread per VCPU to run the guest OS.
While the virtual machine is running, a cpuid instruction executed by the guest (for example to obtain its CPU number) causes the VCPU to exit to the host, which emulates the instruction:
Since each VCPU is bound to one thread of the host virtual machine process, each VCPU thread must register the CPUID data it represents with the host KVM driver when the virtual machine is initialized. When the guest OS later exits from non-root mode into root mode on a cpuid instruction, the host KVM driver uses that data to emulate CPUID. Each VCPU thread therefore sets up its CPUID next.
Execute in sequence: kvm_cpu_thread->kvm_cpu__start->kvm_cpu__reset_vcpu->kvm_cpu__setup_cpuid.
kvmtool builds its I/O virtualization on the trap mechanism of the kernel KVM module: a region is registered as an IOTRAP, and when the guest OS accesses that region, the VCPU exits non-root mode and enters the host. The core function is:
When a trap occurs, the GUEST OS exits to the HOST:
Execute in sequence:
kvm_cpu__emulate_io->kvm__emulate_io->mmio->mmio_fn(vcpu, port, data, size, is_write, mmio->ptr);
Finally, the callback mmio_fn registered through kvm__register_iotrap is executed to perform the device-specific I/O handling. This is somewhat similar to QEMU TCG, except that TCG inserts helpers during software translation to achieve the trap, whereas KVM relies on hardware support for it; the callback flow itself is very similar.
This completes the test. Later articles will dissect the kvmtool code step by step to deepen the understanding of how virtualization is implemented.
1. The host must support hardware-accelerated CPU virtualization: VT-x on Intel processors, AMD-V on AMD processors. Without it, the following error is reported at run time. lscpu shows the cause: the platform I originally used was a VMware virtual machine, and its CPU flags do not include vmx.
~/Workspace$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           4
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz
Stepping:            7
CPU MHz:             2793.437
BogoMIPS:            5586.87
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            22528K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
~/Workspace$
~/Workspace$ ls -l /dev/kvm
ls: cannot access '/dev/kvm': No such file or directory
~/Workspace$
2. The guest OS architecture and the host OS architecture must be the same. Although I knew in advance that KVM only virtualizes guests with the same ISA as the virtualization-capable host CPU, I overlooked this at the start of the experiment. I tried to boot with the ARM zImage and file system from the blog mentioned above, and execution hung. Only later, when I thought of QEMU, did I realize the mistake.
3. The second point also taught me a detail: a bzImage file exists only on x86. The ARM architecture accepts the make bzImage command as well, but what it actually builds is a zImage; no bzImage is produced.
KVM is commonly classified as a Type II hypervisor, so QEMU-KVM and kvmtool, which are both built on KVM, are Type II implementations as well. kvmtool and QEMU are very similar. The overall architecture is shown in the following figure:
Because kvmtool is a lightweight KVM virtual machine implementation, studying how it boots a kernel from scratch is a good way to understand the principles of virtualization in depth, and it helps when moving on to other modules such as virtio and I/O virtualization.