Preface
Regular-expression matching is a basic skill for searching and traversing data. In Python we often need to pick out specific characters, especially when a web crawler has to extract data from a messy web page after decoding it, or when data pulled from a database or a JSON file needs a second pass of processing. The re library is the tool for these jobs, so learning it and regular expressions well is a required course for Python engineers. You may not be fluent in regular expression syntax, but you should know the functions that the re library provides; this will shorten the time you need to finish a project and make troubleshooting easier. This article introduces the functions of Python's regular expression library, re, in detail.
1, Regular expression
1. Introduction
A regular expression (often abbreviated as regex, regexp, or RE in code) is a concept from computer science. It is a text pattern made up of ordinary characters (for example, the letters a to z) and special characters (called "metacharacters"). A regular expression describes one or more strings to match when searching text: a single pattern string describes and matches a whole set of strings that follow a certain syntactic rule. Regular expressions are usually used to find and replace text that conforms to a pattern (rule).
Many programming languages support string operations with regular expressions. Perl, for example, has a powerful regular expression engine built in. The concept was popularized by Unix tools such as sed and grep, and it is now widely available in Scala, PHP, C#, Java, C++, Objective-C, Perl, Swift, VBScript, JavaScript, Ruby, Python, and so on. The singular abbreviations are regexp and regex; the plurals are regexps, regexes, and regexen.
2. Concept
A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "pattern string", and this pattern string expresses a filtering logic for strings.
3. Purpose
Given a regular expression and another string, we can achieve the following goals:
- Determine whether the given string conforms to the filtering logic of the regular expression (this is called a "match");
- Extract the specific part we want from the string.
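The two goals above can be sketched with the standard re module. The sample text and the date pattern below are made up for illustration and are not part of the original article.

```python
import re

# Hypothetical sample text, used only for this illustration
text = "Order 1024 shipped on 2023-05-17"
date_pattern = r"\d{4}-\d{2}-\d{2}"

# Goal 1: does the string contain something that matches the pattern?
has_date = re.search(date_pattern, text) is not None

# Goal 2: extract the part we want from the string
found = re.search(date_pattern, text)
date_text = found.group() if found else None

print(has_date)
print(date_text)
```

Checking the result against None before calling group() avoids an AttributeError when nothing matches.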
4. Features
The characteristics of regular expressions are:
- Very flexible, logical and functional;
- They achieve complex control over strings quickly and in a very simple way;
- For newcomers they are rather obscure and hard to read.
2, Re Library
The re library is part of Python's standard library, so it can be used without installation:
import re
By default, quantifiers in the re library are greedy, that is, they match the longest possible substring.
Module common functions:
1.re.match()
The basic grammatical form is:
re.match(pattern,string,flags=0)
re.match() tries to match the regular expression at the beginning of the string and returns a Match object. If the beginning does not match, it returns None even if a match exists later in the string. This is the difference from search().
Parameter Description:
- pattern: the regular expression, written as a string or a raw string
- string: the string to be matched
- flags: control flags that modify how the regular expression is applied
The optional flags are:
Flag: explanation
- re.I / re.IGNORECASE: ignore case
- re.M / re.MULTILINE: multi-line mode, changes the behavior of '^' and '$'
- re.S / re.DOTALL: dot-all mode, changes the behavior of '.' so that it matches any character, including a newline
- re.L / re.LOCALE: make the predefined character classes \w \W \b \B \s \S depend on the current locale (only allowed with bytes patterns in Python 3)
- re.U / re.UNICODE: make the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (already the default for str patterns in Python 3)
- re.X / re.VERBOSE: verbose mode; the regular expression can span several lines, whitespace is ignored, and comments can be added
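As a rough illustration of the flags listed above, the sketch below applies four of them to a small made-up string. The sample text and variable names are assumptions chosen for the demonstration.

```python
import re

# Hypothetical two-line sample text
text = "first line\nSecond LINE"

# re.I: case-insensitive matching
ignore_case = re.findall(r"line", text, flags=re.I)

# re.M: '^' also matches right after each newline
line_starts = re.findall(r"^\w+", text, flags=re.M)

# re.S: '.' also matches the newline character
without_dotall = re.search(r"first.*LINE", text)
with_dotall = re.search(r"first.*LINE", text, flags=re.S)

# re.X: whitespace and comments inside the pattern are ignored
verbose = re.compile(r"""
    (\w+)   # a word
    \s+     # one or more whitespace characters
    (\w+)   # another word
""", re.X)
words = verbose.match(text).groups()

print(ignore_case, line_starts, without_dotall, words)
```

Several flags can be combined with the | operator, for example re.I | re.M.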
We leave complex matching for Chapter 3. Here we only show briefly how the function is used and what it returns:
strings='Fanstuck wants to leave alone'
print(re.match('Fanstuck',strings))
#out: <re.Match object; span=(0, 8), match='Fanstuck'>
Here span is the (start, end) position of the matched text in the string. If the pattern does not match at the very beginning, the result is None:
strings='Fanstuck wants to leave alone'
print(re.match('anstuck',strings))
#out: None
2.fullmatch()
The basic syntax format is:
fullmatch(pattern, string, flags=0)
Parameter Description:
- pattern: the regular expression, written as a string or a raw string
- string: the string to be matched
- flags: control flags that modify how the regular expression is applied
fullmatch() tries to apply the pattern to the whole string. It returns a Match object if the entire string matches, and None otherwise.
Usage:
strings='Fanstuck wants to leave alone'
print(re.fullmatch('Fanstuck wants to leave alone',strings))
#out: <re.Match object; span=(0, 29), match='Fanstuck wants to leave alone'>
The pattern must account for the whole of strings, every character up to its full length; otherwise the result is None:
strings='Fanstuck wants to leave alone'
print(re.fullmatch('Fanstuck wants to leave alon',strings))
#out: None
3.search()
The basic syntax format is:
search(pattern, string, flags=0)
Parameter Description:
- pattern: the regular expression, written as a string or a raw string
- string: the string to be matched
- flags: control flags that modify how the regular expression is applied
search() scans the string for the first location where the pattern matches and returns the corresponding Match object, or None if no match is found. Unlike match(), it looks through the entire string until a match is found.
strings='Fanstuck wants to leave alone'
print(re.search('alone',strings))
#out: <re.Match object; span=(24, 29), match='alone'>
strings='Fanstuck wants to leave alone'
print(re.search('die',strings))
#out: None
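To put the three functions side by side, here is a small sketch that runs match(), search(), and fullmatch() against the same string. The helper function describe is a hypothetical name introduced only for this example.

```python
import re

strings = 'Fanstuck wants to leave alone'

def describe(result):
    # Hypothetical helper: return the matched text, or None when there is no match
    return result.group() if result else None

at_start = describe(re.match('wants', strings))            # must match at index 0
anywhere = describe(re.search('wants', strings))           # first match anywhere
whole = describe(re.fullmatch('wants', strings))           # must cover the whole string
whole_ok = describe(re.fullmatch(r'Fanstuck.*', strings))  # pattern covers everything

print(at_start, anywhere, whole, whole_ok)
```

Only search() finds 'wants' here, because the word is neither at the start of the string nor the whole string.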
4.sub()
The basic syntax format is:
sub(pattern, repl, string, count=0, flags=0)
Parameter Description:
- pattern: the regular expression, written as a string or a raw string
- repl: the replacement for each match of the pattern
- string: the string to be matched
- count: the maximum number of replacements. The default, 0, replaces all matches
- flags: control flags that modify how the regular expression is applied
strings='Fanstuck wants to leave alone alonely'
print(re.sub('leave','die',strings))
#out:Fanstuck wants to die alone alonely
strings='Fanstuck wants to leave alone alonely'
print(re.sub('alone','sad',strings))
#out:Fanstuck wants to leave sad sadly
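Beyond plain string replacement, sub() also accepts a count, a group reference in repl, or a function as repl. The sketch below shows these three variants; the 'hello world' input and the function name shout are assumptions made up for the example.

```python
import re

strings = 'Fanstuck wants to leave alone alonely'

# count=1 replaces only the first occurrence
first_only = re.sub('alone', 'sad', strings, count=1)

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

# repl may also be a function that receives the Match object
def shout(match):
    return match.group().upper()

upper = re.sub('alone', shout, strings)

print(first_only)
print(swapped)
print(upper)
```

A function as repl is useful when the replacement depends on what was matched.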
5.subn()
The basic syntax format is:
subn(pattern, repl, string, count=0, flags=0)
Parameter Description:
- pattern: the regular expression, written as a string or a raw string
- repl: the replacement for each match of the pattern
- string: the string to be matched
- count: the maximum number of replacements. The default, 0, replaces all matches
- flags: control flags that modify how the regular expression is applied
Compared with sub(), subn() additionally returns the number of replacements made, as a tuple (new_string, number_of_replacements):
strings='Fanstuck wants to leave alone alonely'
print(re.subn('alone','sad',strings))
#out: ('Fanstuck wants to leave sad sadly', 2)
This is convenient when the result is later stored in a dict or a pandas structure, because you do not have to count the matches separately.
6.findall()
The basic syntax format is:
re.findall(pattern, string, flags=0) or, for a compiled pattern, pattern.findall(string, pos, endpos)
Parameter Description:
- pattern: the regular expression, written as a string or a raw string
- string: the string to be matched
- flags: control flags that modify how the regular expression is applied
- pos: optional, only available on a compiled pattern; the index in the string where the search starts. The default is 0.
- endpos: optional, only available on a compiled pattern; the index in the string where the search stops. The default is the length of the string.
findall() finds every non-overlapping match of the pattern in the string and returns them as a list.
strings='Fanstuck wants to leave alone alonely'
print(re.findall('alone',strings))
#out:['alone', 'alone']
strings='Fanstuck wants to leave alone alonely'
print(re.findall('alonely',strings))
#out:['alonely']
strings='Fanstuck wants to leave alone alonely'
pattern=re.compile('a')
print(pattern.findall(strings,0,30))
#out:['a', 'a', 'a', 'a']
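One detail of findall() that often surprises people is that its return value changes when the pattern contains groups. The following sketch, using a made-up key=value string, shows the three cases.

```python
import re

# Hypothetical sample text
text = 'a=1, b=22, c=333'

no_group = re.findall(r'\w=\d+', text)        # no groups: whole matches
one_group = re.findall(r'\w=(\d+)', text)     # one group: only the group text
two_groups = re.findall(r'(\w)=(\d+)', text)  # several groups: tuples of groups

print(no_group)
print(one_group)
print(two_groups)
```

If you need the whole match while still grouping, a non-capturing group (?:...) keeps findall() returning full matches.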
7.finditer()
The basic syntax format is:
finditer(pattern, string, flags=0)
Parameter Description:
- pattern: the regular expression, written as a string or a raw string
- string: the string to be matched
- flags: control flags that modify how the regular expression is applied
finditer() finds every match of the pattern in the string and returns them as an iterator of Match objects.
strings='Fanstuck wants to leave alone alonely'
result=re.finditer('alone',strings)
for i in result:
    print(i)
#out:<re.Match object; span=(24, 29), match='alone'>
#out:<re.Match object; span=(30, 35), match='alone'>
8.compile()
The basic syntax format is:
compile(pattern, flags=0)
- pattern: the regular expression, written as a string or a raw string
- flags: control flags that modify how the regular expression is applied
compile() compiles a regular expression into a Pattern object, whose methods such as match() and search() can then be called.
strings='Fanstuck wants to leave alone alonely'
pattern=re.compile('to')
pattern.search(strings)
#out:<re.Match object; span=(15, 17), match='to'>
strings='Fanstuck wants to leave alone alonely'
pattern=re.compile('to')
object_search=pattern.search(strings)
object_search.group()
#out:'to'
object_search.start()
#out:15
object_search.end()
#out:17
object_search.span()
#out:(15, 17)
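A typical reason to call compile() is to reuse one pattern across many inputs. Here is a minimal sketch of that idea; the list of lines and the pattern are assumptions chosen for illustration.

```python
import re

# Compile once, then reuse the Pattern object for every line
word_pattern = re.compile(r'\w+ly')

# Hypothetical input lines
lines = ['alone alonely', 'quickly done', 'nothing here']
hits = [word_pattern.findall(line) for line in lines]

print(hits)
```

Besides avoiding repeated pattern lookups, a named Pattern object also makes the intent of the code easier to read.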
9.split()
The basic syntax format is:
re.split(pattern, string, maxsplit=0, flags=0)
Parameter Description:
- pattern: the regular expression, written as a string or a raw string
- string: the string to be matched
- maxsplit: the maximum number of splits; maxsplit=1 splits once. The default, 0, means no limit
- flags: control flags that modify how the regular expression is applied
split() splits the string at every substring matched by pattern. If capturing parentheses are used in the pattern, the text matched by the groups is also included in the returned list. maxsplit is the maximum number of splits to perform.
strings='Fanstuck wants to leave alone alonely'
re.split(r' ', strings)
#out:['Fanstuck', 'wants', 'to', 'leave', 'alone', 'alonely']
strings='Fanstuck wants to leave alone alonely'
re.split(r' ', strings,maxsplit=2)
#out:['Fanstuck', 'wants', 'to leave alone alonely']
strings='Fanstuck wants to leave alone alonely'
re.split(r'( )', strings,maxsplit=2)
#out:['Fanstuck', ' ', 'wants', ' ', 'to leave alone alonely']
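Where re.split() goes beyond str.split() is in splitting on several different delimiters at once. A small sketch, with a made-up input string:

```python
import re

# Hypothetical text that mixes commas, semicolons, and runs of spaces
text = 'a, b;c  d'

# Split on any run of commas, semicolons, or whitespace
parts = re.split(r'[,;\s]+', text)

print(parts)
```

The + after the character class treats a run such as ", " as a single delimiter, so no empty strings appear in the result.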
10.Match object and Pattern object
When re.match() or re.search() matches successfully, it returns a Match object that holds information about the match. You can use the attributes and methods of the Match object to read this information. A Pattern object is produced by re.compile(); its methods serve the same purposes as the module-level functions.
strings='Fanstuck wants to leave alone alonely'
pattern=re.compile('to')
object_search=pattern.search(strings)
object_search.string
#out:'Fanstuck wants to leave alone alonely'
object_search.re
#out:re.compile('to')
object_search.pos
#out:0 (where matching starts)
object_search.endpos
#out:37 (where matching ends)
object_search.lastindex
#out:None
object_search.lastgroup
#out:None
object_search.groupdict()
#out:{}
object_search.group()
#out:'to'
object_search.start()
#out:15
object_search.end()
#out:17
object_search.span()
#out:(15, 17)
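The attributes lastindex, lastgroup, and groupdict() are empty in the example above because the pattern has no groups. The sketch below uses a pattern with two named groups (the names first and second are assumptions for this example) to show what they contain otherwise.

```python
import re

# Two named groups separated by a single space
pattern = re.compile(r'(?P<first>\w+) (?P<second>\w+)')
m = pattern.search('Fanstuck wants to leave')

whole = m.group(0)             # the entire match
first = m.group('first')       # a group looked up by name
all_groups = m.groups()        # every group as a tuple
named = m.groupdict()          # named groups as a dict
last_index = m.lastindex       # index of the last matched group
last_name = m.lastgroup        # name of the last matched group
second_span = m.span('second') # (start, end) of one group

print(whole, first, all_groups, named, last_index, last_name, second_span)
```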
3, Regular expression syntax matching
Regular expression describes a pattern of string matching, which can be used to check whether a string contains a certain substring, replace the matched substring, or extract the substring that meets a certain condition from a string.
Later, we will use the functions in the re library for character matching. Here we can take a look at the syntax through an example:
import re
a = "abbbbbccccd"
b = re.findall('ab+c+d',a)
print(b)
['abbbbbccccd']
The method of constructing regular expressions is the same as that of creating mathematical expressions. That is, small expressions can be combined with a variety of metacharacters and operators to create larger expressions. The components of a regular expression can be a single character, a character set, a character range, a choice between characters, or any combination of all these components.
Regular expressions are text patterns composed of ordinary characters (such as characters a to z) and special characters (called "metacharacters"). A pattern describes one or more strings to match when searching for text. Regular expressions are used as a template to match a character pattern with the searched string.
1. Ordinary characters
Ordinary characters include all printable and non-printable characters that are not explicitly specified as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols.
1.1 alone
An ordinary string. The examples above mostly use ordinary strings. Here we use the findall() function, which demonstrates the result more clearly:
strings='Fanstuck wants to leave alone alonely'
print(re.findall('alone',strings))
['alone', 'alone']
1.2 [alone]
Match any single character listed in [...]:
strings='Fanstuck wants to leave alone alonely'
print(re.findall('[alone]',strings))
['a', 'n', 'a', 'n', 'o', 'l', 'e', 'a', 'e', 'a', 'l', 'o', 'n', 'e', 'a', 'l', 'o', 'n', 'e', 'l']
1.3 [^alone]
Match any single character that is not listed in [^...]:
strings='Fanstuck wants to leave alone alonely'
print(re.findall('[^alone]',strings))
['F', 's', 't', 'u', 'c', 'k', ' ', 'w', 't', 's', ' ', 't', ' ', 'v', ' ', ' ', 'y']
1.4 [A-Z]
[A-Z] is a range that matches all uppercase letters; [a-z] matches all lowercase letters. The example below negates the range A-F, so it returns every character except the leading 'F':
strings='Fanstuck wants to leave alone alonely'
print(re.findall('[^A-F]',strings))
['a', 'n', 's', 't', 'u', 'c', 'k', ' ', 'w', 'a', 'n', 't', 's', ' ', 't', 'o', ' ', 'l', 'e', 'a', 'v', 'e', ' ', 'a', 'l', 'o', 'n', 'e', ' ', 'a', 'l', 'o', 'n', 'e', 'l', 'y']
1.5 .
Match any single character except the newline character \n, equivalent to [^\n].
strings='Fanstuck wants to leave alone alonely'
print(re.findall('.',strings))
['F', 'a', 'n', 's', 't', 'u', 'c', 'k', ' ', 'w', 'a', 'n', 't', 's', ' ', 't', 'o', ' ', 'l', 'e', 'a', 'v', 'e', ' ', 'a', 'l', 'o', 'n', 'e', ' ', 'a', 'l', 'o', 'n', 'e', 'l', 'y']
1.6 [\s\S]
Match every character. \s matches any whitespace character, including line breaks; \S matches any non-whitespace character. Together, [\s\S] matches everything, line breaks included.
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'[\s\S]',strings))
['F', 'a', 'n', 's', 't', 'u', 'c', 'k', ' ', 'w', 'a', 'n', 't', 's', ' ', 't', 'o', ' ', 'l', 'e', 'a', 'v', 'e', ' ', 'a', 'l', 'o', 'n', 'e', ' ', 'a', 'l', 'o', 'n', 'e', 'l', 'y']
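1.7 \w
Match letters, digits, and underscores. Equivalent to [A-Za-z0-9_] (for str patterns in Python 3, \w also matches Unicode word characters unless re.ASCII is set).
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'\w',strings))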
['F', 'a', 'n', 's', 't', 'u', 'c', 'k', 'w', 'a', 'n', 't', 's', 't', 'o', 'l', 'e', 'a', 'v', 'e', 'a', 'l', 'o', 'n', 'e', 'a', 'l', 'o', 'n', 'e', 'l', 'y']
2. Non-printing characters
Non-printing characters can also be part of regular expressions. The subsections below list the escape sequences that represent non-printing characters:
2.1 \cx
Match the control character indicated by x. For example, \cM matches Control-M, the carriage return. The value of x must be in A-Z or a-z; otherwise c is treated as a literal 'c' character. Note that this escape comes from other regex flavors such as JavaScript. Python's re module does not support \cx, so use \r, \n, \t or a \xhh escape instead.
2.2 \f
Match a form feed. Equivalent to \x0c and \cL.
strings='Fanstuck wants to leave alone alonely'
print(re.findall('\f',strings))
2.3 \n
Match a newline character. Equivalent to \x0a and \cJ.
strings='Fanstuck\nwants to leave alone alonely'
print(re.findall('\n',strings))
print(strings)
['\n']
Fanstuck
wants to leave alone alonely
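2.4 \r
Match a carriage return. Equivalent to \x0d and \cM.
strings='\rFanstuck\rwants to leave\ralone alonely'
print(re.findall('\r',strings))
print(strings)
['\r', '\r', '\r']
alone alonelye
On a terminal, each \r moves the cursor back to the start of the line, so later text overwrites earlier text; only the final 'e' of 'leave' remains visible after 'alone alonely'.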
2.5 \s
Match any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v]. Note that Unicode regular expressions also match the full-width space character.
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'\s',strings))
[' ', ' ', ' ', ' ', ' ']
2.6 \S
Match any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'\S',strings))
['F', 'a', 'n', 's', 't', 'u', 'c', 'k', 'w', 'a', 'n', 't', 's', 't', 'o', 'l', 'e', 'a', 'v', 'e', 'a', 'l', 'o', 'n', 'e', 'a', 'l', 'o', 'n', 'e', 'l', 'y']
2.7 \t
Match a tab. Equivalent to \x09 and \cI.
strings='Fanstuck wants to leave alone alonely'
print(re.findall('\t',strings))
The string contains no tab, so the result is an empty list.
2.8 \v
Match a vertical tab. Equivalent to \x0b and \cK.
3. Special characters
Special characters are characters with a special meaning, such as the * in runoo*b, which means that the preceding character may repeat zero or more times. If you want to find a literal * in the string, you need to escape it by putting a backslash in front of it: runo\*ob matches the string runo*ob.
Many metacharacters require special treatment when you want to match them literally. To match these special characters, you must first "escape" them, that is, put the backslash character \ in front of them. The subsections below list the special characters in regular expressions:
3.1 $
Match the end of the input string. If the re.M (MULTILINE) flag is set, $ also matches the position before each '\n'. To match the $ character itself, use \$.
strings='Fanstuck wants to leave alone alonely'
print(re.findall(' alone alonely$',strings))
[' alone alonely']
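The effect of the multi-line flag on $ can be seen with a string that contains a newline. This is a minimal sketch; the two-line sample text is an assumption made for the example.

```python
import re

# Hypothetical two-line text; 'alone' ends both lines
text = 'one alone\ntwo alone'

# Default: $ matches only at the end of the string
default_starts = [m.start() for m in re.finditer(r'alone$', text)]

# With re.M: $ also matches just before each newline
multiline_starts = [m.start() for m in re.finditer(r'alone$', text, re.M)]

print(default_starts)
print(multiline_starts)
```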
3.2 ( )
Mark the beginning and end of a subexpression (a group). The text matched by a subexpression is captured for later use. To match the parentheses themselves, use \( and \).
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'(Fan\w{2,3}ck)',strings))
['Fanstuck']
3.3 *
Match the preceding subexpression zero or more times. To match the * character, use \*.
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'(Fan\w*ck)',strings))
['Fanstuck']
3.4 +
Match the preceding subexpression one or more times. To match the + character, use \+.
strings='Fanstuck wants to leave alone alonely'
print(re.findall('(alone)+',strings))
['alone', 'alone']
3.5 .
Match any single character except the newline character. To match the . character itself, use \.
strings='Fanstuck wants to leave alone alonely'
print(re.findall('Fa.s.u.k',strings))
['Fanstuck']
3.6 [
Mark the beginning of a bracket expression. To match [, use \[.
3.7 ?
Match the preceding subexpression zero or one time, or mark a quantifier as non-greedy. To match the ? character, use \?.
Here we should pay attention to greedy mode and non-greedy mode.
Greedy mode: match as much as possible. This is the default when a quantifier follows \w, as in \w*:
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'F\w*',strings))
['Fanstuck']
Non-greedy mode: match as little as possible. It is written by adding ? after the quantifier, for example \w+?:
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'F\w+?',strings))
['Fa']
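A common place where the greedy and non-greedy modes give visibly different results is text with repeated delimiters. The sketch below uses a made-up HTML-like string to illustrate this.

```python
import re

# Hypothetical HTML-like sample
html = '<b>bold</b><i>italic</i>'

# Greedy: .* runs to the last '>' in the string
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the first '>' it can
lazy = re.findall(r'<.*?>', html)

print(greedy)
print(lazy)
```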
3.8 \
Mark the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', while '\n' matches a line break. The sequence '\\' matches '\', and '\(' matches '('.
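When the text to search for comes from a variable, escaping each metacharacter by hand is error-prone. The standard library function re.escape() adds the backslashes for you. A short sketch, with a made-up sample string:

```python
import re

# Hypothetical sample containing a literal '*'
text = 'runo*ob and runooob'

# Unescaped: '*' is a quantifier applied to the preceding 'o'
as_quantifier = re.findall(r'runo*b', text)

# Escaped by hand: '\*' matches a literal asterisk
by_hand = re.findall(r'runo\*ob', text)

# Escaped automatically with re.escape()
automatic = re.findall(re.escape('runo*ob'), text)

print(as_quantifier)
print(by_hand)
print(automatic)
```

Writing patterns as raw strings (r'...') keeps Python itself from interpreting the backslashes before the re module sees them.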
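3.9 ^
Match the start of the input string. Inside a bracket expression, a leading ^ negates the character set instead. To match the ^ character itself, use \^.
strings='alone alonely'
print(re.findall('^alone',strings))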
['alone']
3.10 {
Mark the beginning of a qualifier expression. To match {, use \{.
3.11 |
Indicate a choice between two alternatives. To match |, use \|.
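The | operator has no example in this section, so here is a minimal sketch. The second input string and the alternatives chosen are assumptions for illustration.

```python
import re

strings = 'Fanstuck wants to leave alone alonely'

# Either alternative may match; results come in order of appearance
either = re.findall(r'leave|wants', strings)

# A non-capturing group limits the alternation to part of the pattern
limited = re.findall(r'w(?:ants|ill)', 'wants will walk')

print(either)
print(limited)
```

Without the group, r'wants|ill' would mean "wants" or "ill", because | has the lowest precedence of all regex operators.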
4. Qualifier
Qualifiers (also called quantifiers) specify how many times a given component of a regular expression must appear for a match to succeed. There are six kinds: *, +, ?, {n}, {n,}, and {n,m}.
Qualifiers of regular expressions are:
4.1 *
Shown above.
4.2 +
Shown above.
4.3 ?
Shown above.
4.4 {n}
n is a non-negative integer. Match exactly n times. For example, 'o{2}' cannot match the 'o' in "Bob", but it matches the two o's in "food".
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'Fan\w{2}uck',strings))
['Fanstuck']
4.5 {n,}
n is a non-negative integer. Match at least n times. This qualifier is greedy.
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'Fan\w{1,}uck',strings))
['Fanstuck']
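4.6 {n,m}
Both m and n are non-negative integers with n <= m. Match at least n times and at most m times. This qualifier is greedy.
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'^(\w{2,8}\s*\w{2,8})+',strings))
['Fanstuck wants']
In the next pattern, note that everything between [ and ] forms a character class, so the parentheses, braces, and + inside the brackets are literal characters rather than a group with qualifiers:
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'^Fanstuck[(\w{2,8}\s*)+]+ly',strings))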
['Fanstuck wants to leave alone alonely']
The * and + qualifiers are greedy: they match as much text as possible. Adding a ? after them makes the match non-greedy, or minimal.
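The brace qualifiers can be compared side by side on a string of repeated letters. This is a sketch with a made-up input; the \b anchors are used here only to force whole-word matches.

```python
import re

# Hypothetical sample: runs of 1, 2, 3 and 4 letters 'o'
text = 'o oo ooo oooo'

exactly_two = re.findall(r'\bo{2}\b', text)     # exactly 2
at_least_two = re.findall(r'\bo{2,}\b', text)   # 2 or more
two_to_three = re.findall(r'\bo{2,3}\b', text)  # between 2 and 3

# A trailing ? makes the qualifier non-greedy: take as few as allowed
lazy_pairs = re.findall(r'o{2,3}?', 'oooo')

print(exactly_two, at_least_two, two_to_three, lazy_pairs)
```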
5. Locator
5.1 ^
Demonstrated above.
5.2 $
Demonstrated above.
5.3 \b
Match a word boundary, that is, the position between a word character and a non-word character such as a space.
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'alone\b',strings))
['alone']
5.4 \B
In contrast to '\b', \B matches only at positions that are not word boundaries.
strings='Fanstuck wants to leave alone alonely'
print(re.findall(r'alone\Bly',strings))
['alonely']
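The difference between \b and \B is easiest to see from the positions of the matches. The sketch below also shows why these patterns should be written as raw strings: in an ordinary Python string literal, '\b' is the backspace character, not a word boundary.

```python
import re

strings = 'Fanstuck wants to leave alone alonely'

# \b on both sides: only the standalone word 'alone'
whole_word = [m.span() for m in re.finditer(r'\balone\b', strings)]

# \B after 'alone': only where more word characters follow
inside_word = [m.span() for m in re.finditer(r'alone\B', strings)]

# Without the r prefix, '\b' is a backspace character and nothing matches
no_raw = re.findall('\balone\b', strings)

print(whole_word)
print(inside_word)
print(no_raw)
```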
If you spot a mistake, please leave a comment; corrections are welcome. That is all for this article. If you have any questions, feel free to leave a message for discussion. See you next time.