The road of embedded software development -- an analysis of linux do_page_fault

Linux code version: linux 4.0

Guide: if you do Linux programming, sooner or later you cannot get around memory management. When I first came into contact with Linux I read about the copy-on-write mechanism and was curious how it was implemented. Later, working with dpdk, I learned that hugepages are used to reduce TLB misses and improve performance, and that malloc returns an address first while the physical memory is not yet allocated. As working experience grows, this knowledge can no longer stay at the level of concepts and tunable interfaces; it has to go down into the Linux kernel code. Let's start from the arm64 code.

1, MMU related knowledge

From our first contact with Linux we know that the CPU issues virtual addresses, which the MMU translates into physical addresses using the page tables. If there is no corresponding page table entry, an exception is raised; accessing NULL, for example, triggers one. In user mode the application is killed; in kernel mode it is more serious and the system crashes. To sum up, the MMU has two main functions:

1. Address translation (according to the page tables)

2. Permission checks (several bits in the page table entry indicate readable, writable and executable)

Therefore, both page-table errors and access-permission errors enter the exception path. It sounds as if exceptions only happen when something is wrong, but Linux uses this feature to implement perfectly normal functionality, such as copy-on-write.
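
For example, here is a minimal user-space sketch (not from the kernel sources) that dereferences NULL: the MMU finds no valid page table entry, the fault cannot be resolved, and the process receives SIGSEGV; the si_addr field of siginfo carries the faulting address, just as the handlers below fill it in.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Print the faulting address the kernel put into siginfo, then exit.
 * (printf is not strictly async-signal-safe, but is fine for a demo.) */
static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)ctx;
	printf("caught signal %d, faulting address: %p\n", sig, info->si_addr);
	_exit(1);               /* returning would just re-trigger the fault */
}

int main(void)
{
	struct sigaction sa;
	memset(&sa, 0, sizeof(sa));
	sa.sa_flags = SA_SIGINFO;
	sa.sa_sigaction = segv_handler;
	sigaction(SIGSEGV, &sa, NULL);

	volatile int *p = NULL;
	*p = 42;                /* no page table entry for address 0 -> fault -> SIGSEGV */
	return 0;
}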

The exception entry is written in assembly, which then calls do_mem_abort:

asmlinkage void __exception do_mem_abort(unsigned long addr, unsigned int esr,
					 struct pt_regs *regs)
{
	const struct fault_info *inf = fault_info + (esr & 63);
	struct siginfo info;

	if (!inf->fn(addr, esr, regs))
		return;

	pr_alert("Unhandled fault: %s (0x%08x) at 0x%016lx\n",
		 inf->name, esr, addr);

	info.si_signo = inf->sig;
	info.si_errno = 0;
	info.si_code  = inf->code;
	info.si_addr  = (void __user *)addr;
	arm64_notify_die("", regs, &info, esr);
}

As can be seen from the above code, three parameters are passed in:

1. addr: the faulting address that was accessed

2. esr: the exception syndrome value, i.e. the fault type

3. regs: the CPU register values saved when the exception was taken

It can also be seen from the code that the handler function is looked up from fault_info using the fault type (the low 6 bits of esr). If the handler does not resolve the fault, an "Unhandled fault" is reported and a signal is delivered.
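
The dispatch pattern itself is simple: the low 6 bits of esr index a 64-entry table of handlers. Below is a stripped-down, purely illustrative sketch of that pattern; the names and table contents here are invented for illustration and are not the kernel's real definitions.

#include <stdio.h>

struct fault_handler {
	int (*fn)(unsigned long addr, unsigned int esr);  /* returns 0 if handled */
	const char *name;
};

static int handled_ok(unsigned long addr, unsigned int esr)
{
	printf("handled fault at 0x%lx (esr=0x%x)\n", addr, esr);
	return 0;
}

static const struct fault_handler table[64] = {
	[4]  = { handled_ok, "level 0 translation fault" },
	[15] = { handled_ok, "level 3 permission fault"  },
	/* every other entry is { NULL, NULL }: treated like do_bad below */
};

static void dispatch(unsigned long addr, unsigned int esr)
{
	const struct fault_handler *h = &table[esr & 63];  /* low 6 bits pick the entry */

	if (h->fn && h->fn(addr, esr) == 0)
		return;                                    /* resolved */
	printf("Unhandled fault (esr=0x%x) at 0x%lx\n", esr, addr);
}

int main(void)
{
	dispatch(0x1000, 4);    /* has a handler */
	dispatch(0x2000, 7);    /* no handler -> "Unhandled fault" */
	return 0;
}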

Let's look at fault_info:

static const struct fault_info {
	int	(*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
	int	sig;
	int	code;
	const char *name;
} fault_info[] = {
	{ do_bad,		SIGBUS,  0,		"ttbr address size fault"	},
	{ do_bad,		SIGBUS,  0,		"level 1 address size fault"	},
	{ do_bad,		SIGBUS,  0,		"level 2 address size fault"	},
	{ do_bad,		SIGBUS,  0,		"level 3 address size fault"	},
	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 0 translation fault"	},
	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 1 translation fault"	},
	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 2 translation fault"	},
	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 3 translation fault"	},
	{ do_bad,		SIGBUS,  0,		"unknown 8"			},
	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 1 access flag fault"	},
	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 2 access flag fault"	},
	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 3 access flag fault"	},
	{ do_bad,		SIGBUS,  0,		"unknown 12"			},
	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 1 permission fault"	},
	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 2 permission fault"	},
	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 3 permission fault"	},
	{ do_bad,		SIGBUS,  0,		"synchronous external abort"	},
	{ do_bad,		SIGBUS,  0,		"unknown 17"			},
	{ do_bad,		SIGBUS,  0,		"unknown 18"			},
	{ do_bad,		SIGBUS,  0,		"unknown 19"			},
	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
	{ do_bad,		SIGBUS,  0,		"synchronous abort (translation table walk)" },
	{ do_bad,		SIGBUS,  0,		"synchronous parity error"	},
	{ do_bad,		SIGBUS,  0,		"unknown 25"			},
	{ do_bad,		SIGBUS,  0,		"unknown 26"			},
	{ do_bad,		SIGBUS,  0,		"unknown 27"			},
	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
	{ do_bad,		SIGBUS,  0,		"synchronous parity error (translation table walk)" },
	{ do_bad,		SIGBUS,  0,		"unknown 32"			},
	{ do_bad,		SIGBUS,  BUS_ADRALN,	"alignment fault"		},
	{ do_bad,		SIGBUS,  0,		"unknown 34"			},
	{ do_bad,		SIGBUS,  0,		"unknown 35"			},
	{ do_bad,		SIGBUS,  0,		"unknown 36"			},
	{ do_bad,		SIGBUS,  0,		"unknown 37"			},
	{ do_bad,		SIGBUS,  0,		"unknown 38"			},
	{ do_bad,		SIGBUS,  0,		"unknown 39"			},
	{ do_bad,		SIGBUS,  0,		"unknown 40"			},
	{ do_bad,		SIGBUS,  0,		"unknown 41"			},
	{ do_bad,		SIGBUS,  0,		"unknown 42"			},
	{ do_bad,		SIGBUS,  0,		"unknown 43"			},
	{ do_bad,		SIGBUS,  0,		"unknown 44"			},
	{ do_bad,		SIGBUS,  0,		"unknown 45"			},
	{ do_bad,		SIGBUS,  0,		"unknown 46"			},
	{ do_bad,		SIGBUS,  0,		"unknown 47"			},
	{ do_bad,		SIGBUS,  0,		"TLB conflict abort"		},
	{ do_bad,		SIGBUS,  0,		"unknown 49"			},
	{ do_bad,		SIGBUS,  0,		"unknown 50"			},
	{ do_bad,		SIGBUS,  0,		"unknown 51"			},
	{ do_bad,		SIGBUS,  0,		"implementation fault (lockdown abort)" },
	{ do_bad,		SIGBUS,  0,		"implementation fault (unsupported exclusive)" },
	{ do_bad,		SIGBUS,  0,		"unknown 54"			},
	{ do_bad,		SIGBUS,  0,		"unknown 55"			},
	{ do_bad,		SIGBUS,  0,		"unknown 56"			},
	{ do_bad,		SIGBUS,  0,		"unknown 57"			},
	{ do_bad,		SIGBUS,  0,		"unknown 58" 			},
	{ do_bad,		SIGBUS,  0,		"unknown 59"			},
	{ do_bad,		SIGBUS,  0,		"unknown 60"			},
	{ do_bad,		SIGBUS,  0,		"section domain fault"		},
	{ do_bad,		SIGBUS,  0,		"page domain fault"		},
	{ do_bad,		SIGBUS,  0,		"unknown 63"			},
};

In fact there are only three handler functions: do_translation_fault, do_page_fault and do_bad.

To sum up:

1. Translation faults call do_translation_fault (which may in turn call do_page_fault)

2. Access flag and permission faults call do_page_fault

3. Other faults call do_bad (which does nothing and returns non-zero, so do_mem_abort reports an "Unhandled fault")

2, do_translation_fault function

static int __kprobes do_translation_fault(unsigned long addr,
					  unsigned int esr,
					  struct pt_regs *regs)
{
    /*User space address*/
	if (addr < TASK_SIZE)
		return do_page_fault(addr, esr, regs);
    /*Kernel address or illegal address*/
	do_bad_area(addr, esr, regs);
	{
    	struct task_struct *tsk = current;
    	struct mm_struct *mm = tsk->active_mm;

    	/*
    	 * If we are in kernel mode at this point, we have no context to
    	 * handle this fault with.
    	 */
    	/*Judge whether it is an exception triggered in user mode*/
    	if (user_mode(regs))
            /*The user-space program accessed an address with no mapping: deliver a SIGSEGV signal*/
    		__do_user_fault(tsk, addr, esr, SIGSEGV, SEGV_MAPERR, regs);  
    	else
    		__do_kernel_fault(mm, addr, esr, regs);
        {
        	/*
        	 * Are we prepared to handle this kernel fault?
        	 * We are almost certainly not prepared to handle instruction faults.
        	 */
        	/*Check the exception table and try a fixup (the case where a uaccess routine was passed a bad user-space address)*/
        	if (!is_el1_instruction_abort(esr) && fixup_exception(regs))
        		return;

        	/*
        	 * No handler, we'll have to terminate things with extreme prejudice.
        	 */
        	/*Kernel exception, print exception information*/
        	bust_spinlocks(1);
        	pr_alert("Unable to handle kernel %s at virtual address %08lx\n",
        		 (addr < PAGE_SIZE) ? "NULL pointer dereference" :
        		 "paging request", addr);

        	show_pte(mm, addr);
        	die("Oops", regs, esr);
        	bust_spinlocks(0);
        	do_exit(SIGKILL);
        }
    }
	return 0;
}

It can be seen from the above:

1. If the faulting address is a user-space address, do_page_fault is called directly.

2. If the faulting address is a kernel address (or otherwise invalid) and the fault was triggered from user mode, the program is killed with a SIGSEGV signal. If it was triggered from kernel mode, a fixup is attempted; if that fails, a kernel exception is reported, i.e. the kernel dies (Oops).

Supplement: fixup covers the cases defined in uaccess.h, such as copy_from_user. Their instruction addresses are recorded in the exception table; when a fault occurs, the kernel searches the exception table to see whether the faulting instruction is one of them and, if so, tries a fixup. See Documentation/x86/exception-tables.txt in the linux 4.4 source tree. The excerpt is as follows:

When a process runs in kernel mode, it often has to access user
mode memory whose address has been passed by an untrusted program.
To protect itself the kernel has to verify this address.

In older versions of Linux this was done with the
int verify_area(int type, const void * addr, unsigned long size)
function (which has since been replaced by access_ok()).

This function verified that the memory area starting at address
'addr' and of size 'size' was accessible for the operation specified
in type (read or write). To do this, verify_read had to look up the
virtual memory area (vma) that contained the address addr. In the
normal case (correctly working program), this test was successful.
It only failed for a few buggy programs. In some kernel profiling
tests, this normally unneeded verification used up a considerable
amount of time.

To overcome this situation, Linus decided to let the virtual memory
hardware present in every Linux-capable CPU handle this test.

How does this work?

Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the
page fault handler

void do_page_fault(struct pt_regs *regs, unsigned long error_code)

in arch/x86/mm/fault.c. The parameters on the stack are set up by
the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack, error_code
contains a reason code for the exception.

do_page_fault first obtains the unaccessible address from the CPU
control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred, because the page
was not swapped in, write protected or something similar. However,
we are interested in the other case: the address is not valid, there
is no vma that contains this address. In this case, the kernel jumps
to the bad_area label.

There it uses the address of the instruction that caused the exception
(i.e. regs->eip) to find an address where the execution can continue
(fixup). If this search is successful, the fault handler modifies the
return address (again regs->eip) and returns. The execution will
continue at the address in fixup.

Where does fixup point to?

Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros.
I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected
the get_user call in drivers/char/sysrq.c for a detailed examination.

In kernel mode the kernel often has to deal with user-space addresses that may have been passed in by unreliable applications, so older kernels used verify_area to check that such an address was legitimate. In the vast majority of normal cases the check passes, so walking the vma list every time is a waste of time. The kernel therefore improved on verify_area: it hands the check over to the MMU hardware, records the instruction addresses of the uaccess.h cases in the exception table, and when a fault occurs it searches the exception table to recognize such a case and attempt a fixup.
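
The effect is easy to observe from user space: pass an invalid buffer to a system call and the kernel's uaccess copy faults, the exception-table fixup turns that into a clean error return, and the call simply reports EFAULT instead of Oopsing. A small demo (it assumes a Linux system with /dev/zero available):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/zero", O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	/* A deliberately invalid user-space buffer. */
	char *bad = (char *)0x1;

	/* Inside the kernel, copying to 'bad' faults; the exception-table
	 * fixup makes the copy fail gracefully and read() returns -1 with
	 * EFAULT instead of crashing the kernel. */
	ssize_t n = read(fd, bad, 16);
	printf("read() returned %zd, errno=%d (%s)\n", n, errno, strerror(errno));

	close(fd);
	return 0;
}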

3, do_page_fault function

Looking at it broadly, do_page_fault handles two kinds of exceptions: missing-page (translation) faults and access-permission faults. In practice it is more complicated and can roughly be divided into the following situations (the summary may be incomplete; a small user-space demo of the copy-on-write case follows the list):

Missing-page faults:

1. Anonymous page fault

2. File-mapped page fault

3. Swap-in fault (the page was swapped out to swap space)

4. Stack expansion

5. Illegal address

Permission faults:

1. Copy-on-write

2. Illegal address
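
Before walking through the code, here is a small user-space demo of the copy-on-write case (illustrative only): the child's first write to a page it shares with the parent raises a write-permission fault, which is exactly what ends up in do_wp_page, and afterwards parent and child each see their own copy.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	/* One page shared copy-on-write between parent and child after fork(). */
	int *value = malloc(sizeof(*value));
	*value = 1;

	pid_t pid = fork();
	if (pid == 0) {
		/* Child: the first write to the shared page triggers a
		 * write-permission fault; the kernel copies the page in
		 * do_wp_page and only then lets the write proceed. */
		*value = 2;
		printf("child : value = %d\n", *value);
		_exit(0);
	}

	waitpid(pid, NULL, 0);
	/* Parent still sees its own copy, untouched by the child's write. */
	printf("parent: value = %d\n", *value);
	free(value);
	return 0;
}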

static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
				   struct pt_regs *regs)
{
	struct task_struct *tsk;
	struct mm_struct *mm;
	int fault, sig, code;
	unsigned long vm_flags = VM_READ | VM_WRITE | VM_EXEC;
	unsigned int mm_flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
    /*If kprobes is enabled and the fault was triggered from kernel mode on a probed instruction, return after kprobe handling*/
	if (notify_page_fault(regs, esr))
		return 0;
    /*Get current task*/
	tsk = current;
	mm  = tsk->mm;

	/* Enable interrupts if they were enabled in the parent context. */
    /*If interrupts were enabled before the exception, re-enable them here, so the processing below can itself be interrupted*/
	if (interrupts_enabled(regs))
		local_irq_enable();

	/*
	 * If we're in an interrupt or have no user context, we must not take
	 * the fault.
	 */
	 /*mm is NULL for a kernel thread; the other case is that page-fault handling is disabled (atomic context)*/
	if (faulthandler_disabled() || !mm)
		goto no_context;
    /*Triggered by user status*/
	if (user_mode(regs))
		mm_flags |= FAULT_FLAG_USER;
    /*Triggered by executable permissions*/
	if (is_el0_instruction_abort(esr)) {
		vm_flags = VM_EXEC;
        /*Triggered by writable permissions*/
	} else if ((esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM)) {
		vm_flags = VM_WRITE;
		mm_flags |= FAULT_FLAG_WRITE;
	}
    /*A permission fault on a user-space address taken in kernel mode: the sub-cases below are kernel bugs and call die() directly*/
	if (addr < USER_DS && is_permission_fault(esr, regs)) {
		/* regs->orig_addr_limit may be 0 if we entered from EL0 */
        /*The upper bound of the process address is KERNEL_DS */
		if (regs->orig_addr_limit == KERNEL_DS)
			die("Accessing user space memory with fs=KERNEL_DS", regs, esr);
        /*Attempt to execute user mode instruction*/
		if (is_el1_instruction_abort(esr))
			die("Attempting to execute userspace memory", regs, esr);
        /*The exception table cannot be queried, indicating that the kernel is accessing an illegal user space address*/
		if (!search_exception_tables(regs->pc))
			die("Accessing user space memory outside uaccess.h routines", regs, esr);
	}

	/*
	 * As per x86, we may deadlock here. However, since the kernel only
	 * validly references user space from well defined areas of the code,
	 * we can bug out early if this is from code which shouldn't.
	 */
	 /*Trying to get semaphore of mm*/
	if (!down_read_trylock(&mm->mmap_sem)) {
        /*The exception is not triggered in user mode, and the exception table cannot be searched. Go to the kernel error handling process*/
		if (!user_mode(regs) && !search_exception_tables(regs->pc))
			goto no_context;
retry:
		down_read(&mm->mmap_sem);
	} else {
		/*
		 * The above down_read_trylock() might have succeeded in which
		 * case, we'll have missed the might_sleep() from down_read().
		 */
		
		might_sleep();
#ifdef CONFIG_DEBUG_VM
		if (!user_mode(regs) && !search_exception_tables(regs->pc))
			goto no_context;
#endif
	}
    /*Kernel-mode errors have been filtered out above (they go to no_context); reaching here means the fault was generated by a user-space program*/
	fault = __do_page_fault(mm, addr, mm_flags, vm_flags, tsk);
            {
            	struct vm_area_struct *vma;
            	int fault;
                /*Find the user virtual address space where the address is located*/
            	vma = find_vma(mm, addr);
            	fault = VM_FAULT_BADMAP;
                /*Not found, indicating that it is an abnormal address*/
            	if (unlikely(!vma))
            		goto out;
                /*Address below the vma start: possibly a stack that needs expanding (see check_stack)*/
            	if (unlikely(vma->vm_start > addr))
            		goto check_stack;

            	/*
            	 * Ok, we have a good vm_area for this memory access, so we can handle
            	 * it.
            	 */
            good_area:
            	/*
            	 * Check that the permissions on the VMA allow for the fault which
            	 * occurred. If we encountered a write or exec fault, we must have
            	 * appropriate permissions, otherwise we allow any permission.
            	 */
            	 /*Permission check: the vma does not allow this access, so return VM_FAULT_BADACCESS; the caller will kill the process with SIGSEGV*/
            	if (!(vma->vm_flags & vm_flags)) {
            		fault = VM_FAULT_BADACCESS;
            		goto out;
            	}

            	return handle_mm_fault(mm, vma, addr & PAGE_MASK, mm_flags);
                       {
                        	int ret;

                        	__set_current_state(TASK_RUNNING);

                        	count_vm_event(PGFAULT);
                        	mem_cgroup_count_vm_event(mm, PGFAULT);

                        	/* do counter updates before entering really critical section. */
                        	check_sync_rss_stat(current);

                        	/*
                        	 * Enable the memcg OOM handling for faults triggered in user
                        	 * space.  Kernel faults are handled more gracefully.
                        	 */
                        	 /*Triggered by user status, enabling oom*/
                        	if (flags & FAULT_FLAG_USER)
                        		mem_cgroup_oom_enable();

                        	ret = __handle_mm_fault(mm, vma, address, flags);
                            /*Close oom after processing*/
                        	if (flags & FAULT_FLAG_USER) {
                        		mem_cgroup_oom_disable();
                                        /*
                                         * The task may have entered a memcg OOM situation but
                                         * if the allocation error was handled gracefully (no
                                         * VM_FAULT_OOM), there is no need to kill anything.
                                         * Just clean up the OOM state peacefully.
                                         */
                                        if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
                                                mem_cgroup_oom_synchronize(false);
                        	}

                        	return ret;
                        }

            check_stack:
            	if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
            		goto good_area;
            out:
            	return fault;
            }

	/*
	 * If we need to retry but a fatal signal is pending, handle the
	 * signal first. We do not need to release the mmap_sem because it
	 * would already be released in __lock_page_or_retry in mm/filemap.c.
	 */
	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
		if (!user_mode(regs))
			goto no_context;
		return 0;
	}

	/*
	 * Major/minor page fault accounting is only done on the initial
	 * attempt. If we go through a retry, it is extremely likely that the
	 * page will be found in page cache at that point.
	 */

	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
	if (mm_flags & FAULT_FLAG_ALLOW_RETRY) {
		if (fault & VM_FAULT_MAJOR) {
			tsk->maj_flt++;
			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
				      addr);
		} else {
			tsk->min_flt++;
			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
				      addr);
		}
		if (fault & VM_FAULT_RETRY) {
			/*
			 * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk of
			 * starvation.
			 */
			mm_flags &= ~FAULT_FLAG_ALLOW_RETRY;
			mm_flags |= FAULT_FLAG_TRIED;
			goto retry;
		}
	}

	up_read(&mm->mmap_sem);

	/*
	 * Handle the "normal" case first - VM_FAULT_MAJOR / VM_FAULT_MINOR
	 */
	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
			      VM_FAULT_BADACCESS))))
		return 0;

	/*
	 * If we are in kernel mode at this point, we have no context to
	 * handle this fault with.
	 */
	if (!user_mode(regs))
		goto no_context;

	if (fault & VM_FAULT_OOM) {
		/*
		 * We ran out of memory, call the OOM killer, and return to
		 * userspace (which will retry the fault, or kill us if we got
		 * oom-killed).
		 */
		pagefault_out_of_memory();
		return 0;
	}

	if (fault & VM_FAULT_SIGBUS) {
		/*
		 * We had some memory, but were unable to successfully fix up
		 * this page fault.
		 */
		sig = SIGBUS;
		code = BUS_ADRERR;
	} else {
		/*
		 * Something tried to access memory that isn't in our memory
		 * map.
		 */
		sig = SIGSEGV;
		code = fault == VM_FAULT_BADACCESS ?
			SEGV_ACCERR : SEGV_MAPERR;
	}

	__do_user_fault(tsk, addr, esr, sig, code, regs);
	return 0;

no_context:
	__do_kernel_fault(mm, addr, esr, regs);
	return 0;
}

In the above flow, kernel-mode faults are filtered out first:

1. Illegal address: the kernel Oopses (crashes)

2. Faults in routines such as copy_to_user/copy_from_user (the cases defined in uaccess.h): the exception table is searched and a fixup is attempted; if the user-space address is illegal, the user program is killed and a SIGSEGV signal is generated

The remaining faults were triggered by user-space programs and are handled by __do_page_fault:

1. Look up the vma. If none is found the address is illegal; if one is found, the access permissions are checked as well. Both an illegal address and a permission mismatch kill the process and generate a SIGSEGV signal

2. If the address falls just below a stack vma, the user stack is expanded by calling expand_stack

3. What remains is a legal address with the right permissions, which is handed to handle_mm_fault (the demo after this list counts such faults from user space)
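
These anonymous-page faults can be counted from user space with getrusage. A minimal sketch (the exact minor-fault count may vary slightly with other activity in the process):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
	const size_t len = 64 * 4096;   /* 64 pages, assuming 4 KiB pages */
	struct rusage before, after;

	/* Anonymous mapping: no physical pages are allocated yet. */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) { perror("mmap"); return 1; }

	getrusage(RUSAGE_SELF, &before);
	memset(buf, 0, len);            /* each first touch goes through do_anonymous_page */
	getrusage(RUSAGE_SELF, &after);

	printf("minor faults while touching %zu pages: %ld\n",
	       len / 4096, after.ru_minflt - before.ru_minflt);

	munmap(buf, len);
	return 0;
}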

handle_mm_fault:

handle_mm_fault(mm, vma, addr & PAGE_MASK, mm_flags);
                       {
                        	int ret;

                        	__set_current_state(TASK_RUNNING);

                        	count_vm_event(PGFAULT);
                        	mem_cgroup_count_vm_event(mm, PGFAULT);

                        	/* do counter updates before entering really critical section. */
                        	check_sync_rss_stat(current);

                        	/*
                        	 * Enable the memcg OOM handling for faults triggered in user
                        	 * space.  Kernel faults are handled more gracefully.
                        	 */
                        	 /*Triggered by user status, enabling oom*/
                        	if (flags & FAULT_FLAG_USER)
                        		mem_cgroup_oom_enable();

                        	ret = __handle_mm_fault(mm, vma, address, flags);
                                  {
                                    	pgd_t *pgd;
                                    	pud_t *pud;
                                    	pmd_t *pmd;
                                    	pte_t *pte;
                                        /*hugetlbfs (huge page) handling, not covered here*/
                                    	if (unlikely(is_vm_hugetlb_page(vma)))
                                    		return hugetlb_fault(mm, vma, address, flags);
                                       /*Walk/allocate the pgd, pud, pmd and pte entries. The pgd always exists; pud, pmd and pte are created if missing*/
                                    	pgd = pgd_offset(mm, address);
                                    	pud = pud_alloc(mm, pgd, address);
                                    	if (!pud)
                                    		return VM_FAULT_OOM;
                                    	pmd = pmd_alloc(mm, pud, address);
                                    	if (!pmd)
                                    		return VM_FAULT_OOM;
                                        /*Transparent huge page handling, not covered here*/
                                    	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
                                    		int ret = create_huge_pmd(mm, vma, address, pmd, flags);
                                    		if (!(ret & VM_FAULT_FALLBACK))
                                    			return ret;
                                    	} else {
                                    		pmd_t orig_pmd = *pmd;
                                    		int ret;

                                    		barrier();
                                    		if (pmd_trans_huge(orig_pmd)) {
                                    			unsigned int dirty = flags & FAULT_FLAG_WRITE;

                                    			/*
                                    			 * If the pmd is splitting, return and retry the
                                    			 * the fault.  Alternative: wait until the split
                                    			 * is done, and goto retry.
                                    			 */
                                    			if (pmd_trans_splitting(orig_pmd))
                                    				return 0;

                                    			if (pmd_protnone(orig_pmd))
                                    				return do_huge_pmd_numa_page(mm, vma, address,
                                    							     orig_pmd, pmd);

                                    			if (dirty && !pmd_write(orig_pmd)) {
                                    				ret = wp_huge_pmd(mm, vma, address, pmd,
                                    							orig_pmd, flags);
                                    				if (!(ret & VM_FAULT_FALLBACK))
                                    					return ret;
                                    			} else {
                                    				huge_pmd_set_accessed(mm, vma, address, pmd,
                                    						      orig_pmd, dirty);
                                    				return 0;
                                    			}
                                    		}
                                    	}

                                    	/*
                                    	 * Use __pte_alloc instead of pte_alloc_map, because we can't
                                    	 * run pte_offset_map on the pmd, if an huge pmd could
                                    	 * materialize from under us from a different thread.
                                    	 */
                                    	if (unlikely(pmd_none(*pmd)) &&
                                    	    unlikely(__pte_alloc(mm, vma, pmd, address)))
                                    		return VM_FAULT_OOM;
                                    	/*
                                    	 * If a huge pmd materialized under us just retry later.  Use
                                    	 * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
                                    	 * didn't become pmd_trans_huge under us and then back to pmd_none, as
                                    	 * a result of MADV_DONTNEED running immediately after a huge pmd fault
                                    	 * in a different thread of this mm, in turn leading to a misleading
                                    	 * pmd_trans_huge() retval.  All we have to ensure is that it is a
                                    	 * regular pmd that we can walk with pte_offset_map() and we can do that
                                    	 * through an atomic read in C, which is what pmd_trans_unstable()
                                    	 * provides.
                                    	 */
                                    	if (unlikely(pmd_trans_unstable(pmd)))
                                    		return 0;
                                    	/*
                                    	 * A regular pmd is established and it can't morph into a huge pmd
                                    	 * from under us anymore at this point because we hold the mmap_sem
                                    	 * read mode and khugepaged takes it in write mode. So now it's
                                    	 * safe to run pte_offset_map().
                                    	 */
                                    	pte = pte_offset_map(pmd, address);

                                    	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
                                               {
                                                	pte_t entry;
                                                	spinlock_t *ptl;

                                                	/*
                                                	 * some architectures can have larger ptes than wordsize,
                                                	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
                                                	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
                                                	 * The code below just needs a consistent view for the ifs and
                                                	 * we later double check anyway with the ptl lock held. So here
                                                	 * a barrier will do.
                                                	 */
                                                	entry = *pte;
                                                	barrier();
                                                    /*The page is not present in memory*/
                                                	if (!pte_present(entry)) {
                                                        /*The pte is empty: this address has never been populated*/
                                                		if (pte_none(entry)) {
                                                            /*Check whether vma->vm_ops is set: if not, this is an anonymous mapping; if set, it is a file-backed mapping*/
                                                			if (vma_is_anonymous(vma))
                                                				return do_anonymous_page(mm, vma, address,
                                                							 pte, pmd, flags);
                                                			else
                                                                /*Handle a fault on a file-backed mapping*/
                                                				return do_fault(mm, vma, address, pte, pmd,
                                                						flags, entry);
                                                                       {
                                                                        	pgoff_t pgoff = (((address & PAGE_MASK)
                                                                        			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

                                                                        	pte_unmap(page_table);
                                                                        	/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
                                                                        	if (!vma->vm_ops->fault)
                                                                        		return VM_FAULT_SIGBUS;
                                                                            /*Read fault on a file page*/
                                                                        	if (!(flags & FAULT_FLAG_WRITE))
                                                                        		return do_read_fault(mm, vma, address, pmd, pgoff, flags,
                                                                        				orig_pte);
                                                                            /*Write private file page*/
                                                                        	if (!(vma->vm_flags & VM_SHARED))
                                                                        		return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
                                                                        				orig_pte);
                                                                            /*Write shared file page*/
                                                                        	return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
                                                                        }
                                                		}
                                                        /*Swap-in handling: the accessed page has been swapped out to the swap area*/
                                                		return do_swap_page(mm, vma, address,
                                                					pte, pmd, flags, entry);
                                                	}

                                                	if (pte_protnone(entry))
                                                		return do_numa_page(mm, vma, address, entry, pte, pmd);

                                                	ptl = pte_lockptr(mm, pmd);
                                                	spin_lock(ptl);
                                                	if (unlikely(!pte_same(*pte, entry)))
                                                		goto unlock;
                                                    /*Handle copy on write mechanism*/
                                                	if (flags & FAULT_FLAG_WRITE) {
                                                		if (!pte_write(entry))
                                                			return do_wp_page(mm, vma, address,
                                                					pte, pmd, ptl, entry);
                                                		entry = pte_mkdirty(entry);
                                                	}
                                                	entry = pte_mkyoung(entry);
                                                	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
                                                		update_mmu_cache(vma, address, pte);
                                                	} else {
                                                		/*
                                                		 * This is needed only for protection faults but the arch code
                                                		 * is not yet telling us if this is a protection fault or not.
                                                		 * This still avoids useless tlb flushes for .text page faults
                                                		 * with threads.
                                                		 */
                                                		if (flags & FAULT_FLAG_WRITE)
                                                			flush_tlb_fix_spurious_fault(vma, address);
                                                	}
                                                unlock:
                                                	pte_unmap_unlock(pte, ptl);
                                                	return 0;
                                                }
                                    }
                            /*Close oom after processing*/
                        	if (flags & FAULT_FLAG_USER) {
                        		mem_cgroup_oom_disable();
                                        /*
                                         * The task may have entered a memcg OOM situation but
                                         * if the allocation error was handled gracefully (no
                                         * VM_FAULT_OOM), there is no need to kill anything.
                                         * Just clean up the OOM state peacefully.
                                         */
                                        if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
                                                mem_cgroup_oom_synchronize(false);
                        	}

                        	return ret;
                        }

After entering handle_mm_fault, the processing goes as follows:

1. First find the pgd (which must exist), then the pud, pmd and pte (created if they do not exist)

2. If the page is not present and has never been accessed, an anonymous page (vma->vm_ops not set) is handled by do_anonymous_page, while a file-backed mapping is handled by do_fault (a read fault on a file page, a write fault on a private file mapping, or a write fault on a shared file mapping)

3. If the page is not present but has been accessed before, it has been swapped out to swap space; do_swap_page swaps it back in

4. A write-permission fault, i.e. the copy-on-write mechanism, is handled by do_wp_page

From the above we can see that Linux makes full use of MMU exceptions, and the fault-handling path is quite long. Every fault costs CPU time; that is no problem for ordinary applications, but scenarios chasing extreme performance still need a certain understanding of it. If memory has been swapped out to swap space it has to be swapped back in, which wastes time, so adding memory may be necessary to buy back performance. And since malloc does not actually hand out physical memory, you can force the physical pages to be allocated right after allocation, for example with memset, so that this whole fault path is avoided at the moment the memory is actually used.
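
Whether a virtual page really has physical memory behind it can be checked from user space through /proc/self/pagemap, where bit 63 of each 64-bit entry is the "present" bit. Below is a small sketch of the pre-touch idea (it assumes a Linux system with /proc mounted; the page size is queried with sysconf):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return 1 if the page containing 'addr' is backed by physical memory. */
static int page_present(int pagemap_fd, void *addr)
{
	long page = sysconf(_SC_PAGESIZE);
	uint64_t entry = 0;
	off_t off = ((uintptr_t)addr / page) * sizeof(entry);

	if (pread(pagemap_fd, &entry, sizeof(entry), off) != sizeof(entry))
		return -1;
	return (entry >> 63) & 1;       /* bit 63: page present in RAM */
}

int main(void)
{
	size_t len = 4 * 1024 * 1024;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	if (buf == MAP_FAILED || fd < 0) { perror("setup"); return 1; }

	printf("after mmap   : present = %d\n", page_present(fd, buf));

	memset(buf, 0, len);            /* pre-touch: force physical allocation now */

	printf("after memset : present = %d\n", page_present(fd, buf));

	close(fd);
	munmap(buf, len);
	return 0;
}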

At the same time, we saw that the kernel checks the addresses that user-space programs pass into it, so there is no way to "trick the kernel" into handing out data illegally.

 
