In our internal system there is a TCP proxy service. All of a user's network-related requests, whether to the external network or to certain services on the internal network, must pass through this service. On the one hand it implements billing for external network access; on the other hand it restricts applications' internal network access through a whitelist mechanism.
As business volume grew, we found that the load on the machines providing this service was gradually increasing, and at peak traffic clients often could not connect. The service is stateless, so it can easily be scaled horizontally. While adding machines, we also tried to analyze the bottleneck of the program itself to see whether its processing capacity could be improved. Through analysis and optimization, we did manage to raise its processing capacity to a certain extent.
First, the following flame graph was generated on the live machine using the perf tool:
[Flame graph]
What looks strange at first sight in this flame graph is why __accept_nocancel, i.e. the accept call, is so frequent. The first thing that comes to mind is the thundering herd effect. However, the thundering herd problem of accept was already solved in the kernel as of Linux 2.6.18, and the CentOS 6.5 kernel version we use online is 2.6.32, so in theory there should be no such problem.
So I went to the live machine and ran strace to look at the program's system calls:
```
epoll_wait(7, {{EPOLLIN, {u32=9484480, u64=9484480}}}, 1024, 500) = 1
accept(6, 0x7fff90830890, [128]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7, {{EPOLLIN, {u32=9484480, u64=9484480}}}, 1024, 500) = 1
accept(6, 0x7fff90830890, [128]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7, {{EPOLLIN, {u32=9484480, u64=9484480}}}, 1024, 500) = 1
accept(6, 0x7fff90830890, [128]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7, {{EPOLLIN, {u32=9484480, u64=9484480}}}, 1024, 500) = 1
accept(6, 0x7fff90830890, [128]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7, {{EPOLLIN, {u32=9484480, u64=9484480}}}, 1024, 500) = 1
accept(6, 0x7fff90830890, [128]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7, {{EPOLLIN, {u32=9484480, u64=9484480}}}, 1024, 500) = 1
accept(6, 0x7fff90830890, [128]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7, {{EPOLLIN, {u32=9484480, u64=9484480}}}, 1024, 500) = 1
accept(6, 0x7fff90830890, [128]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7, {{EPOLLIN, {u32=9484480, u64=9484480}}}, 1024, 500) = 1
accept(6, 0x7fff90830890, [128]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7, {{EPOLLIN, {u32=9484480, u64=9484480}}}, 1024, 500) = 1
accept(6, 0x7fff90830890, [128]) = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(7, {{EPOLLIN, {u32=9484480, u64=9484480}}}, 1024, 500) = 1
accept(6, 0x7fff90830890, [128]) = -1 EAGAIN (Resource temporarily unavailable)
```
Most of the trace is epoll_wait returning and the program attempting accept, with accept returning EAGAIN, which explains the many accept calls in the flame graph. After searching further, it turns out that the thundering herd is not handled by the kernel in this epoll programming model: when an event occurs on the listening socket, the kernel wakes up every process blocked in epoll_wait on it, which causes this problem.
Let's first analyze the program. Its current workflow is:
1. The master process starts, binds and listens on a series of fds;
2. It forks n worker processes;
3. Each worker process takes the fds inherited from the master, calls epoll_create to create its own epoll instance, and registers events for the listen fds;
4. When a new connection arrives, the worker accepts it directly and then enters the proxy logic;
5. Repeat step 4.
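A minimal sketch of this workflow in C is shown below. It is only an illustration under stated assumptions: the port number, the fixed worker count, the non-blocking listen fd, and names such as `run_worker` are all illustrative, and error handling and the real proxy logic are omitted.

```c
/* Minimal sketch of the original workflow: the master binds/listens, forks
 * workers, and every worker adds the same inherited listen fd to its own
 * epoll instance. Port 8080 and the fixed worker count are illustrative. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

static void run_worker(int listen_fd) {
    int epfd = epoll_create(1024);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[1024];
    for (;;) {
        /* Every worker sleeps in epoll_wait on the same listen fd, so a
         * single new connection wakes all of them up. */
        int n = epoll_wait(epfd, events, 1024, 500);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd != listen_fd)
                continue;
            int conn = accept(listen_fd, NULL, NULL);
            if (conn < 0) {
                /* EAGAIN: another worker was woken too and accepted it first */
                continue;
            }
            /* ... hand conn over to the proxy logic ... */
            close(conn);
        }
    }
}

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(listen_fd, F_SETFL, O_NONBLOCK);     /* accept returns EAGAIN instead of blocking */
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 128);

    for (int i = 0; i < 24; i++)               /* one worker per core */
        if (fork() == 0) { run_worker(listen_fd); _exit(0); }
    for (;;) pause();                          /* master just waits */
}
```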
Because the online machines have 24 cores, the service forks 24 worker processes in production. When the number of worker processes is small this phenomenon is not very noticeable, but with more workers the extra cost of the thundering herd can no longer be ignored.
nginx's solution to this problem is to create a global accept lock. Only the worker process that obtains the lock registers the listen fd event and performs the accept. When certain conditions are met, that worker gives up the lock and stops listening for listen fd events; another worker then obtains the lock and takes over listening for listen fd events.
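nginx's real accept_mutex implementation is considerably more elaborate; the following is only a conceptual sketch of the idea, using a process-shared pthread mutex as the global lock. All names here are illustrative, not nginx's code.

```c
/* Conceptual sketch of an nginx-style accept lock (not nginx's actual code):
 * only the worker currently holding a process-shared lock keeps the listen
 * fd registered in its epoll set, so only that worker is woken by new
 * connections. Helper names are illustrative. */
#include <pthread.h>
#include <sys/epoll.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Master creates the lock in anonymous shared memory before forking. */
pthread_mutex_t *make_accept_lock(void) {
    pthread_mutex_t *lock = mmap(NULL, sizeof(*lock), PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(lock, &attr);
    return lock;
}

/* Worker event loop: register the listen fd only while holding the lock. */
void worker_loop(int epfd, int listen_fd, pthread_mutex_t *lock) {
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    struct epoll_event events[1024];

    for (;;) {
        int have_lock = (pthread_mutex_trylock(lock) == 0);
        if (have_lock)
            epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        int n = epoll_wait(epfd, events, 1024, 500);
        for (int i = 0; i < n; i++) {
            if (have_lock && events[i].data.fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                if (conn >= 0) {
                    /* ... hand conn over to the proxy logic ... */
                    close(conn);
                }
            }
            /* ... handle events on already-accepted connection fds ... */
        }

        if (have_lock) {                  /* give other workers a chance */
            epoll_ctl(epfd, EPOLL_CTL_DEL, listen_fd, NULL);
            pthread_mutex_unlock(lock);
        }
    }
}
```

In nginx itself a worker also declines to compete for the lock when it already holds many connections, which additionally balances load across workers.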
nginx 1.9.1 added support for a new feature, reuseport. On Linux kernel 3.9 and later, the SO_REUSEPORT option can be enabled so that the operating system itself provides isolation similar to what the accept lock achieved, avoiding the thundering herd while also making better use of multiple cores and improving system performance.
Because an accept lock like nginx's is relatively complex to implement, and it happens that on our online system, CentOS 6.5, SO_REUSEPORT has been backported by Red Hat, meaning the option can be enabled even on the 2.6.32 kernel of CentOS 6.5, I modified the proxy code and tried turning on SO_REUSEPORT:
1. The master process starts;
2. It forks n worker processes;
3. Each worker process binds and listens on its own series of fds, setting the SO_REUSEPORT option before bind, then calls epoll_create to create its own epoll instance and registers events for the listen fds;
4. When a new connection arrives, the worker accepts it directly and then enters the proxy logic;
5. Repeat step 4.
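The sketch below illustrates the per-worker listening setup under this modified workflow. The port and the fallback definition of SO_REUSEPORT are assumptions (older glibc headers on CentOS 6 may not define the macro even though the backported kernel supports it).

```c
/* Sketch of the modified per-worker setup: every worker creates its own
 * socket, enables SO_REUSEPORT before bind, and binds/listens on the same
 * address and port; the kernel then spreads new connections across workers.
 * The port and the fallback macro value are illustrative assumptions. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15   /* value on x86 Linux; old glibc headers may lack it */
#endif

int worker_listen_fd(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    /* Must be set before bind(); every worker binds the same addr:port. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(port),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);
    return fd;   /* the worker then adds this fd to its own epoll instance */
}
```

Each worker now registers its own listen fd in its own epoll instance, so there is no longer a shared listen fd on which the kernel has to wake everyone.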
After the modification I ran strace again: when a new connection arrives, only one worker process is woken up to accept it, which greatly improves efficiency. Take another look at the flame graph:
[Flame graph]
You can see that the accept calls are no longer visible; instead, recv, send and connect dominate, which is normal for a TCP proxy whose main cost is on the network. There are still quite a lot of epoll_wait calls, so there is still room for optimization.
In a simple test with 500 threads and 50,000 requests, on the same machine and configuration, the overall time dropped from about 14.38 seconds to about 10.27 seconds, i.e. (14.38 − 10.27) / 14.38 ≈ 29%, a performance improvement of roughly 30%.
PS: the Linux 4.5 kernel introduced the EPOLLEXCLUSIVE option, which can also solve epoll's thundering herd problem.
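For completeness, here is a minimal sketch of that option, assuming the workers go back to sharing a single listen fd; only the epoll_ctl registration changes.

```c
/* Sketch of EPOLLEXCLUSIVE (Linux >= 4.5): workers share one listen fd but
 * register it with the exclusive flag, so the kernel wakes at most one of
 * the waiters per incoming connection instead of all of them. */
#include <sys/epoll.h>

int add_listen_fd_exclusive(int epfd, int listen_fd) {
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;   /* flag is only valid with EPOLL_CTL_ADD */
    ev.data.fd = listen_fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
}
```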