Learning notes on "Do It Yourself Docker" (2)
1 Preface
My graduation project is related to cloud native technology, so I have recently been learning about Docker. The book "Do It Yourself Docker" covers the low-level knowledge behind Docker and provides many experiments and code demos, which makes it a good book for getting started with cloud native. This post mainly records my experiments, along with some notes and impressions.
The experimental environment used in this post:
- Ubuntu 20.04 running in VMware Workstation 16
- Linux kernel 5.10.x
- Go 1.17.1
2 Basic Technology
2.1 Linux Namespace
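Linux Namespace is a kernel feature that isolates a series of system resources, such as PIDs (process IDs), user IDs and the network.
The book covers six types of namespaces: UTS, IPC, PID, Mount, User and Network. Newer kernels, including the 5.10 kernel used here, also provide cgroup and time namespaces.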
The Namespace API mainly consists of the following system calls:
- clone() creates a new process. Its flag parameters determine which types of new namespaces are created; the new process and its child processes belong to these namespaces.
- unshare() moves the calling process out of a namespace and into a newly created one.
- setns() adds the calling process to an existing namespace.
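The experiments in the following sections all combine clone() flags in the same way. As a rough sketch (the `nsFlags` table and the `combineFlags` helper are hypothetical names introduced here, not part of the book's code), the mapping from namespace type to flag could be written as:

```go
package main

import (
	"fmt"
	"syscall"
)

// nsFlags maps short namespace names to the clone(2) flags that request
// a new namespace of that type. The names are illustrative only.
var nsFlags = map[string]uintptr{
	"mnt":  syscall.CLONE_NEWNS,
	"uts":  syscall.CLONE_NEWUTS,
	"ipc":  syscall.CLONE_NEWIPC,
	"pid":  syscall.CLONE_NEWPID,
	"net":  syscall.CLONE_NEWNET,
	"user": syscall.CLONE_NEWUSER,
}

// combineFlags (a hypothetical helper) ORs together the flags for the
// requested namespace types, ready to be used as SysProcAttr.Cloneflags.
func combineFlags(names ...string) (uintptr, error) {
	var flags uintptr
	for _, n := range names {
		f, ok := nsFlags[n]
		if !ok {
			return 0, fmt.Errorf("unknown namespace type %q", n)
		}
		flags |= f
	}
	return flags, nil
}

func main() {
	flags, err := combineFlags("uts", "ipc", "pid")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Cloneflags = %#x\n", flags)
}
```

The result can be assigned directly to `Cloneflags` in `syscall.SysProcAttr`, which is a `uintptr` field.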
UTS Namespace
UTS Namespace mainly isolates the hostname and the domain name.
Create main.go in the /src directory for the experiment:

    vim main.go

Write the following:

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // Command to run as the initial process of the newly forked process
        cmd := exec.Command("sh")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            // Go wraps clone(); setting Cloneflags is all that is needed
            Cloneflags: syscall.CLONE_NEWUTS,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }

Save and exit, then run the program to enter the new namespace:

    go run main.go
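One way to check that the new shell really lives in a different UTS namespace is to compare the `/proc/<pid>/ns/uts` link inside the shell with the one on the host: different inode numbers mean different namespaces. The following is a minimal sketch of reading those links from Go; `parseNsLink` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a /proc/<pid>/ns/* link target such as
// "uts:[4026531838]" into the namespace type and its inode number.
func parseNsLink(target string) (string, uint64, error) {
	parts := strings.SplitN(target, ":", 2)
	if len(parts) != 2 || !strings.HasPrefix(parts[1], "[") || !strings.HasSuffix(parts[1], "]") {
		return "", 0, fmt.Errorf("unexpected link target %q", target)
	}
	inode, err := strconv.ParseUint(strings.Trim(parts[1], "[]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return parts[0], inode, nil
}

func main() {
	// Print the namespaces of the current process.
	for _, name := range []string{"uts", "ipc", "pid", "mnt", "user", "net"} {
		target, err := os.Readlink("/proc/self/ns/" + name)
		if err != nil {
			fmt.Println("readlink:", err)
			continue
		}
		kind, inode, err := parseNsLink(target)
		if err != nil {
			fmt.Println("parse:", err)
			continue
		}
		fmt.Printf("%-4s namespace inode %d\n", kind, inode)
	}
}
```

Running it once on the host and once from inside the namespace shell should show a different inode only for the namespace types that were requested in Cloneflags.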
IPC Namespace
IPC Namespace isolates System V IPC objects and POSIX message queues. Each IPC Namespace has its own System V IPC objects and its own POSIX message queues.

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            // Add the IPC namespace here
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }
PID Namespace
PID Namespace isolates process IDs. The same process can have different PIDs in different PID namespaces.
For example, inside a Docker container, ps -ef shows that the init process running in the foreground has PID 1, whereas ps -ef on the host shows the same process with a different PID. This is what the PID namespace does.
[Figure: before modification]
Modify the code:

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            // Add the PID namespace here
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC |
                syscall.CLONE_NEWPID,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }

[Figure: after modification, inside the namespace]
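The "different PIDs in different PID namespaces" relationship can also be read from the kernel directly: since Linux 4.1, `/proc/<pid>/status` contains an `NSpid:` line listing the process ID in each nested PID namespace. A small sketch of parsing it follows; `parseNSpid` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNSpid extracts the "NSpid:" line from the contents of a
// /proc/<pid>/status file. The leftmost value is the PID as seen from the
// PID namespace of the procfs mount, followed by the PIDs in successively
// nested inner namespaces.
func parseNSpid(status string) ([]int, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "NSpid:") {
			continue
		}
		var pids []int
		for _, f := range strings.Fields(strings.TrimPrefix(line, "NSpid:")) {
			pid, err := strconv.Atoi(f)
			if err != nil {
				return nil, err
			}
			pids = append(pids, pid)
		}
		return pids, nil
	}
	return nil, fmt.Errorf("no NSpid line found")
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("read status:", err)
		return
	}
	pids, err := parseNSpid(string(data))
	if err != nil {
		fmt.Println("parse status:", err)
		return
	}
	fmt.Println("PID in each namespace, outermost first:", pids)
}
```

Reading the status file of the namespace shell from the host (through the host's /proc) would be expected to show two values, the host PID followed by the PID inside the new namespace.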
Mount Namespace
In the previous demo, top and ps cannot yet be used to inspect the processes in the namespace, because they read their information from /proc.
On Linux, the /proc directory is a file system, namely the proc file system. Unlike ordinary file systems, /proc is a pseudo (virtual) file system that holds a set of special files describing the current running state of the kernel. Through these files, users can view information about the system hardware and the running processes, and can even change the kernel's running state by modifying some of them. Because of this special nature, the files in /proc are often called virtual files.
Modify the code to add the Mount namespace:

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            // Add the Mount namespace here
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC |
                syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
    }
[Figure: before mounting /proc]
Mount /proc and list the processes:

    mount -t proc proc /proc
    ps -ef
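The same mount can be issued from Go instead of the shell. The sketch below is an assumption-laden illustration rather than the book's code: `mountProc` and `procMountFlags` are hypothetical names, the restrictive flags are a common choice rather than a requirement, and the step that marks the mount tree private is a precaution for hosts where / is a shared mount, so that the new /proc does not propagate back to the host.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// procMountFlags restricts the new /proc mount: no executables,
// no set-user-ID binaries, no device files.
const procMountFlags = syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_NODEV

// mountProc mounts a fresh proc file system at target. It is meant to be
// called by a process that already runs inside new PID and Mount namespaces.
func mountProc(target string) error {
	// Assumption: make all mounts private first, so the mount below stays
	// inside this Mount namespace.
	if err := syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("make / private: %w", err)
	}
	return syscall.Mount("proc", target, "proc", uintptr(procMountFlags), "")
}

func main() {
	// Guard: only mount when asked explicitly, to avoid touching the host
	// by accident.
	if len(os.Args) < 2 || os.Args[1] != "mount" {
		fmt.Println(`run with the argument "mount" inside the new namespaces`)
		return
	}
	if err := mountProc("/proc"); err != nil {
		fmt.Println("mount /proc:", err)
		return
	}
	fmt.Println("mounted /proc")
}
```

This needs root privileges inside the namespace, just like the mount command above.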
User Namespace
User Namespace mainly isolates user IDs and group IDs. In other words, the user ID and group ID of a process can be different inside and outside a User namespace.
A common use is to create a User namespace as a non-root user on the host and then map that user to root inside the User namespace.
Such a process has root permissions inside the User namespace, but not outside it.
Note: CentOS disables User namespaces by default, and on newer kernels the code below may fail to run. Check your kernel version and adjust the code if necessary.

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            // Add the User namespace here
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC |
                syscall.CLONE_NEWPID | syscall.CLONE_NEWNS |
                syscall.CLONE_NEWUSER,
        }
        cmd.SysProcAttr.Credential = &syscall.Credential{Uid: uint32(1), Gid: uint32(1)}
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
        os.Exit(-1)
    }
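If the version above fails on a newer kernel, one commonly used alternative is to drop the Credential field and describe the ID mapping explicitly through the UidMappings and GidMappings fields of `syscall.SysProcAttr`. The following is a sketch under that assumption, not the book's code; `userNsAttr` is a hypothetical helper name, and whether unprivileged User namespaces are allowed at all depends on the distribution's settings.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// userNsAttr builds process attributes that create a new User namespace
// and map one host UID/GID to root (ID 0) inside it.
func userNsAttr(hostUID, hostGID int) *syscall.SysProcAttr {
	return &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER,
		UidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: hostUID, Size: 1},
		},
		GidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: hostGID, Size: 1},
		},
	}
}

func main() {
	// Run "id" instead of an interactive shell so the program terminates
	// on its own.
	cmd := exec.Command("id")
	cmd.SysProcAttr = userNsAttr(os.Getuid(), os.Getgid())
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("could not run in a new User namespace:", err)
	}
}
```

Where the kernel permits it, `id` should report uid 0 inside the namespace even though the program was started by an ordinary user on the host.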
Network Namespace
Network Namespace isolates the network stack, including network devices, IP addresses and ports. It allows each container to have its own independent (virtual) network devices, and applications in a container can bind to ports in their own namespace, so ports in different namespaces do not conflict with each other. Once a network bridge is set up on the host, containers can easily communicate with each other, and applications in different containers can use the same port.

    package main

    import (
        "log"
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("sh")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            // Add the Network namespace here
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWIPC |
                syscall.CLONE_NEWPID | syscall.CLONE_NEWNS |
                syscall.CLONE_NEWUSER | syscall.CLONE_NEWNET,
        }
        cmd.SysProcAttr.Credential = &syscall.Credential{Uid: uint32(1), Gid: uint32(1)}
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        if err := cmd.Run(); err != nil {
            log.Fatal(err)
        }
        os.Exit(-1)
    }
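[Figure: network devices on the host]
[Figure: network devices inside the namespace]
You can see that the host has network devices such as lo, eth0 and eth1, while no active network devices are shown inside the namespace (a new Network namespace starts with only a loopback device, which is down). This demonstrates the network isolation between the Network namespace and the host.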
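The same comparison can be made from Go with the standard `net` package. This is only a sketch; `describe` is a hypothetical helper introduced for the example.

```go
package main

import (
	"fmt"
	"net"
)

// describe renders one interface as "name (up)" or "name (down)".
func describe(name string, up bool) string {
	state := "down"
	if up {
		state = "up"
	}
	return fmt.Sprintf("%s (%s)", name, state)
}

func main() {
	// net.Interfaces lists the devices visible in the Network namespace
	// of the calling process.
	ifaces, err := net.Interfaces()
	if err != nil {
		fmt.Println("list interfaces:", err)
		return
	}
	for _, ifc := range ifaces {
		fmt.Println(describe(ifc.Name, ifc.Flags&net.FlagUp != 0))
	}
}
```

Run on the host, it lists all host devices; run inside a fresh Network namespace, it would be expected to list only the loopback device in the down state.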
2.2 Cgroups
Linux Cgroups (Control Groups) provide the ability to limit, control and account for the resources used by a group of processes and their future child processes, including CPU, memory, storage and network.
Three components in Cgroups
- cgroup: a mechanism for managing processes in groups. A cgroup contains a set of processes and is associated with a set of subsystems.
- subsystem: a set of resource control modules, for example for limiting memory and CPU.
- hierarchy: arranges a set of cgroups into a tree structure, through which cgroups can inherit from their parents.
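How the three components relate for a concrete process can be seen in `/proc/<pid>/cgroup`, where each line has the form `hierarchy-ID:subsystem-list:cgroup-path`. A minimal parsing sketch follows; `cgroupEntry` and `parseCgroupLine` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cgroupEntry is one line of /proc/<pid>/cgroup.
type cgroupEntry struct {
	HierarchyID int      // ID of the hierarchy (0 for cgroup v2)
	Subsystems  []string // subsystems attached to that hierarchy
	Path        string   // cgroup of the process within the hierarchy
}

// parseCgroupLine parses "hierarchy-ID:subsystem-list:cgroup-path".
func parseCgroupLine(line string) (cgroupEntry, error) {
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return cgroupEntry{}, fmt.Errorf("unexpected line %q", line)
	}
	id, err := strconv.Atoi(parts[0])
	if err != nil {
		return cgroupEntry{}, err
	}
	var subs []string
	if parts[1] != "" {
		subs = strings.Split(parts[1], ",")
	}
	return cgroupEntry{HierarchyID: id, Subsystems: subs, Path: parts[2]}, nil
}

func main() {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Println("read cgroup file:", err)
		return
	}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		e, err := parseCgroupLine(line)
		if err != nil {
			fmt.Println("parse:", err)
			continue
		}
		fmt.Printf("hierarchy %d, subsystems %v, cgroup %s\n", e.HierarchyID, e.Subsystems, e.Path)
	}
}
```

Each output line shows one hierarchy, the subsystems attached to it, and the cgroup in that hierarchy to which the process belongs.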
Using Go to limit container resources through cgroups:

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "path"
        "strconv"
        "syscall"
    )

    // Mount point of the hierarchy with the memory subsystem attached (cgroup v1)
    const cgroupMemoryHierarchyMount = "/sys/fs/cgroup/memory"

    func main() {
        if os.Args[0] == "/proc/self/exe" {
            // Container process
            fmt.Printf("current pid %d\n", syscall.Getpid())
            cmd := exec.Command("sh", "-c", `stress --vm-bytes 200m --vm-keep -m 1`)
            cmd.Stdin = os.Stdin
            cmd.Stdout = os.Stdout
            cmd.Stderr = os.Stderr
            if err := cmd.Run(); err != nil {
                fmt.Println(err)
                os.Exit(1)
            }
            // Do not fall through and re-execute ourselves again
            return
        }

        cmd := exec.Command("/proc/self/exe")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
        }
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        if err := cmd.Start(); err != nil {
            fmt.Println("ERROR", err)
            os.Exit(1)
        }

        // PID of the forked process as seen from the outer namespace
        fmt.Printf("%v\n", cmd.Process.Pid)
        cgroupDir := path.Join(cgroupMemoryHierarchyMount, "testmemorylimit")
        // Create a cgroup on the hierarchy that has the memory subsystem attached by default
        os.Mkdir(cgroupDir, 0755)
        // Add the container process to this cgroup
        os.WriteFile(path.Join(cgroupDir, "tasks"), []byte(strconv.Itoa(cmd.Process.Pid)), 0644)
        // Limit the memory usage of the processes in this cgroup
        os.WriteFile(path.Join(cgroupDir, "memory.limit_in_bytes"), []byte("100m"), 0644)
        cmd.Process.Wait()
    }
By configuring the Cgroups virtual file system, we limit the memory usage of the stress process in the container to 100 MB.
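The value "100m" written to memory.limit_in_bytes uses a size suffix; the cgroup v1 memory controller accepts k, m and g (in either case) as powers of 1024. To check the limit that was actually applied, the file can be read back and compared with the expected byte count. The sketch below assumes the same cgroup path as the program above; `parseSize` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseSize converts a value such as "100m" into bytes, treating the
// suffixes k, m and g (either case) as powers of 1024.
func parseSize(s string) (int64, error) {
	if s == "" {
		return 0, fmt.Errorf("empty size")
	}
	mult := int64(1)
	switch s[len(s)-1] {
	case 'k', 'K':
		mult = 1 << 10
	case 'm', 'M':
		mult = 1 << 20
	case 'g', 'G':
		mult = 1 << 30
	}
	if mult != 1 {
		s = s[:len(s)-1]
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, err
	}
	return n * mult, nil
}

func main() {
	want, err := parseSize("100m")
	if err != nil {
		fmt.Println("parse size:", err)
		return
	}
	fmt.Println("expected limit in bytes:", want)

	// Assumed path: the cgroup created by the program above.
	data, err := os.ReadFile("/sys/fs/cgroup/memory/testmemorylimit/memory.limit_in_bytes")
	if err != nil {
		fmt.Println("cgroup not available:", err)
		return
	}
	fmt.Println("limit reported by the kernel:", strings.TrimSpace(string(data)))
}
```

If the cgroup does not exist (for example on a cgroup v2 only system), the program simply reports that and exits.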
2.3 Union File System
The key points are:
- Copy-on-write (hereinafter CoW), also known as implicit sharing, is a resource management technique for efficiently copying modifiable resources. A resource is shared until it is modified: the first write creates a copy, the original is left untouched, and all subsequent modifications operate on that copy. In other words, the resource is copied only when it is modified for the first time.
- Docker's layered image file system uses the AUFS mechanism. AUFS offers fast container startup and efficient use of storage and memory. Images form a tree structure in which an upper image builds on the image below it.
- Docker uses CoW to implement image layers and reduce disk usage. With CoW, even if only a small part of a file is changed, the whole file must be copied. This is not a big concern, however: for each container, a file in an image layer is copied at most once, and subsequent changes are made to the copy in the container layer.
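The copy-once behaviour described above can be illustrated with a toy in-memory model. This is only a sketch of the idea, not how AUFS is implemented; `layeredFS` and its methods are hypothetical names.

```go
package main

import "fmt"

// layeredFS is a toy model of a read-only image layer with a writable
// container layer on top.
type layeredFS struct {
	lower  map[string][]byte // image layer, never modified
	upper  map[string][]byte // container layer, receives the copies
	copies int               // number of copy-up operations performed
}

func newLayeredFS(image map[string][]byte) *layeredFS {
	return &layeredFS{lower: image, upper: map[string][]byte{}}
}

// read returns the container-layer version of a file if there is one,
// otherwise the version from the image layer.
func (l *layeredFS) read(name string) ([]byte, bool) {
	if data, ok := l.upper[name]; ok {
		return data, true
	}
	data, ok := l.lower[name]
	return data, ok
}

// appendTo modifies a file. The first modification copies the whole file
// from the image layer into the container layer; later ones reuse the copy.
func (l *layeredFS) appendTo(name string, extra []byte) {
	if _, ok := l.upper[name]; !ok {
		if src, inImage := l.lower[name]; inImage {
			l.upper[name] = append([]byte(nil), src...) // copy-up
			l.copies++
		} else {
			l.upper[name] = nil // new file, nothing to copy
		}
	}
	l.upper[name] = append(l.upper[name], extra...)
}

func main() {
	l := newLayeredFS(map[string][]byte{"a.txt": []byte("hello")})
	l.appendTo("a.txt", []byte(" world"))
	l.appendTo("a.txt", []byte("!"))
	data, _ := l.read("a.txt")
	fmt.Printf("container sees %q, image still has %q, copies: %d\n",
		data, l.lower["a.txt"], l.copies)
}
```

Two writes to the same file trigger only one copy, and the image layer keeps its original content, so other containers based on the same image are unaffected.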