jieba, Python library for Chinese word segmentation

Chinese word segmentation, generally speaking, means dividing a sentence or paragraph into words, idioms and single characters according to certain rules (algorithms).
Chinese word segmentation is a front-end technology for many applications, such as search engines, machine translation, part-of-speech tagging and similarity analysis: the text is first segmented, and the segmentation results are then searched, translated or compared.
In Python, the best-known Chinese word segmentation library is jieba. Naming a segmentation library "jieba", which means "to stutter" in Chinese, is a vivid choice with a programmer's sense of humor.

1, The best Python Chinese word segmentation component

"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

This is the slogan of the jieba project. Open jieba's GitHub or PyPI page and you will see it in the introduction; it reflects the vision and goal of the jieba development team. Today jieba is widely regarded as the best Python Chinese word segmentation library.
As of April 2022, when this article was written, jieba had 28.3K stars on GitHub, and the number keeps growing rapidly, which is enough to show how popular jieba is.
In addition to the Python version, jieba has been ported to more than a dozen languages, including C++, Java and iOS, covering platforms from PC to mobile. Credit goes to jieba's maintainers for this; perhaps one day jieba will be the best Chinese word segmentation component in every language.

2, How to use jieba
Step1. Install jieba

pip install jieba

jieba is a third-party library, so it must be installed before use; you can install it directly with pip. jieba is compatible with both Python 2 and Python 3, and the install command is the same for both. If the installation is slow, add the -i parameter to specify a mirror source.
Step2. Call jieba for word segmentation

import jieba

# In the original (Chinese) post, test_content is a Chinese sentence that
# chains several idioms together; the string is shown here in translation.
test_content = "The thunder can't hide one's ears, the bell rings, and jingle benevolence doesn't make the world full of love"
cut_res = jieba.cut(test_content, cut_all=True)
print(list(cut_res))

Output:

["Thunder", "Lightning fast", "Lightning can't cover your ears", "Inferior", "Cover your ears", "plug one's ears while stealing a bell",
 "son", "Jingle", "jingle", "not pass on to others what one is called upon to do", "Not let", "world", "full", "love",
 "of", "power"]

Using jieba is very simple: import the jieba library, call the cut() method and pass in the content to be segmented, and the segmentation result is returned. The result is an iterable generator, which can be traversed directly or converted to a list before printing.
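The one-shot nature of the returned generator is easy to demonstrate with a stand-in function (fake_cut below is a hypothetical substitute for jieba.cut(), used only so the generator behavior is visible without the library):

```python
def fake_cut(sentence):
    """Stand-in for jieba.cut(): yields tokens lazily, as a generator."""
    for token in sentence.split():
        yield token

res = fake_cut("hello wide world")
print(list(res))   # first traversal consumes the generator: ['hello', 'wide', 'world']
print(list(res))   # a second traversal yields nothing: []
```

This is why you should convert the result with list() (or use lcut(), described below) if you need to iterate over it more than once.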

3, Four patterns of jieba word segmentation
jieba word segmentation supports four word segmentation modes:
1. Precise mode: tries to cut the sentence as accurately as possible; suitable for text analysis.

cut_res = jieba.cut(test_content, cut_all=False)
print('[Precise mode]: ', list(cut_res))
cut_res = jieba.cut(test_content, cut_all=False, HMM=False)
print('[Precise mode]: ', list(cut_res))

Output:

[Precise mode]:  ["Lightning fast", "plug one's ears while stealing a bell", "Ring", "Sting", "not pass on to others what one is called upon to do",
 "world", "full", "Love potential"]
[Precise mode]:  ["Lightning fast", "plug one's ears while stealing a bell", "son", "ring", "Sting", "not pass on to others what one is called upon to do",
 "world", "full", "love", "of", "power"]

Precise mode is the most commonly used segmentation mode, and its results contain no redundant words.
The HMM parameter is True by default, enabling automatic new-word recognition based on the hidden Markov model. In the example above, with HMM=True, "Ring" and "Love potential" are recognized as new words; with HMM=False, the characters that would form them cannot be joined into words and are split into single characters.
2. Full mode: scans out every word in the sentence that can form a word. It is very fast, but cannot resolve ambiguity.

cut_res = jieba.cut(test_content, cut_all=True)
print('[Full mode]: ', list(cut_res))

Output:

[Full mode]:  ["Thunder", "Lightning fast", "Lightning can't cover your ears", "Inferior", "Cover your ears", "plug one's ears while stealing a bell",
 "son", "Jingle", "jingle", "not pass on to others what one is called upon to do", "Not let", "world", "full", "love", "of", "power"]

Full mode starts from the first character of the content and, treating each character in turn as the start of a word, returns every possible word. Characters are reused across words, so the same character may appear in several results.
The cut_all parameter is False by default, i.e. full mode is off; setting cut_all=True enables full-mode segmentation.
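To make the idea concrete, here is a minimal pure-Python sketch of full-mode-style enumeration over a toy vocabulary (an assumption for illustration; jieba's real dictionary, trie and DAG machinery are far more involved):

```python
def full_mode(text, dictionary, max_len=6):
    """Emit every dictionary word that starts at each position of text."""
    words = []
    for i in range(len(text)):
        # Try every candidate span starting at position i, up to max_len chars.
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
    return words

# Toy vocabulary; overlapping words are reused, just as full mode reuses characters.
vocab = {"light", "lightning", "ning", "fast"}
print(full_mode("lightningfast", vocab, max_len=9))
# → ['light', 'lightning', 'ning', 'fast']
```

Note how "light", "lightning" and "ning" all appear even though they overlap: that is the "all possible words, ambiguity unresolved" behavior of full mode.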
3. Search engine mode: on top of precise mode, re-segments long words to improve recall; suitable for search-engine segmentation.

cut_res = jieba.cut_for_search(test_content)
print('[Search engine mode]: ', list(cut_res))

Output:

[Search engine mode]:  ["Thunder", "Inferior", "Lightning fast", "Cover your ears", "plug one's ears while stealing a bell", "Er Xiang",
 "Sting", "Not let", "not pass on to others what one is called upon to do", "world", "full", "Love potential"]

Building on precise mode, search engine mode re-segments the long words of the precise result in the style of full mode, so that searches can match more results.
4. paddle mode: uses the PaddlePaddle deep learning framework and a trained sequence-labeling (bidirectional GRU) network model to segment words; it also supports part-of-speech tagging.
To use paddle mode, first install paddlepaddle-tiny with pip install paddlepaddle-tiny==1.6.1. Paddle mode requires jieba v0.40 or above; to upgrade an older version, run pip install jieba --upgrade.
That is the official description, but the paddlepaddle-tiny package is currently unavailable; interested readers can check the PaddlePaddle official website for alternatives.
Paddle mode is rarely used in practice, so knowing the first three modes is enough.
5. Summary
The cut() method has four parameters: sentence receives the content to be segmented; cut_all sets whether to use full mode; HMM sets whether to use the HMM model to recognize new words; use_paddle sets whether to use paddle mode.
cut_for_search() has two parameters, sentence and HMM.
Both cut() and cut_for_search() return generators. To get a list directly, use the corresponding lcut() and lcut_for_search(), which take exactly the same arguments.

4, Custom word segmentation dictionary
jieba can only return a word in its segmentation results if the word matches its dictionary, so some words must be added by the user before they can be recognized.
1. Add custom words to the dictionary

jieba.add_word('jingle bells')
jieba.add_word('Make the world full of love')
jieba.add_word('with lightning speed in a whirlwind drive')
lcut_res = jieba.lcut(test_content, cut_all=True, HMM=False)
print('[Add custom words]: ', lcut_res)
[Add custom words]:  ["Thunder", "Lightning fast", "Lightning can't cover your ears", "Inferior", "Cover your ears", "plug one's ears while stealing a bell",
 "jingle bells", "Jingle", "jingle", "not pass on to others what one is called upon to do", "Not let", "Make the world full of love", "world",
 "full", "love", "of", "power"]

add_word() takes three arguments: the word to add, its word frequency and its part of speech; the last two can be omitted.
After custom words are added, any that match the text will appear in the segmentation result. A custom word that has no contiguous match in the sentence being segmented will not show up in the result.
2. Add the specified file as a word segmentation dictionary
A custom dictionary file should use the same format as the default dictionary dict.txt: one word per line, each line holding up to three space-separated fields in fixed order: the word, its frequency (optional) and its part of speech (optional). If file_name is a path, or a file opened in binary mode, the file must be UTF-8 encoded.
This article uses a custom mydict.txt file as follows (shown here in translation; in the original post the entries are Chinese words, so each word is a single space-free token):

Lightning can't hide your ears 3 a
Hide one's ears and steal a bell 3 a
Jingle 3 a
Do not yield 3 a
Make the world full of love 3 n
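Such a dictionary file can also be generated from code rather than by hand. A sketch using placeholder entries (hypothetical words; a real jieba user dictionary would contain Chinese words, each a single space-free token):

```python
import os
import tempfile

# Hypothetical entries: (word, frequency, part of speech).
entries = [
    ("dingdong", 3, "n"),
    ("loveworld", 3, "n"),
]

# Write one entry per line, fields space-separated, UTF-8 encoded --
# the layout jieba.load_userdict() expects.
path = os.path.join(tempfile.mkdtemp(), "mydict.txt")
with open(path, "w", encoding="utf-8") as f:
    for word, freq, pos in entries:
        f.write(f"{word} {freq} {pos}\n")

# jieba.load_userdict(path) would now load this file.
# Reading it back shows the three-field-per-line structure:
with open(path, encoding="utf-8") as f:
    parsed = [line.split() for line in f]
print(parsed)  # → [['dingdong', '3', 'n'], ['loveworld', '3', 'n']]
```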

To set the file encoding to UTF-8 in PyCharm, go to File > Settings > Editor > File Encodings and set both Global Encoding and Project Encoding to UTF-8.
Then call load_userdict() to load the custom dictionary:

jieba.load_userdict('mydict.txt')
lcut_res = jieba.lcut(test_content, cut_all=True, HMM=False)
print('[Use custom dictionary]: ', lcut_res)
[Use custom dictionary]:  ["Thunder", "Lightning fast", "Lightning can't cover your ears", "Inferior", "Cover your ears", "plug one's ears while stealing a bell",
 "jingle bells", "Jingle", "jingle", "not pass on to others what one is called upon to do", "Not let", "Make the world full of love", "world",
 "full", "love", "of", "power"]

With a custom dictionary loaded, segmentation uses both jieba's default dictionary and the custom dictionary. Loading a dictionary has the same effect as adding the words individually; the difference is that words are added in batch, without calling add_word() repeatedly.
3. Delete words from the dictionary

jieba.del_word('Not let')
lcut_res = jieba.lcut(test_content, cut_all=True, HMM=False)
print('[Delete words]: ', lcut_res)
[Delete words]:  ["Thunder", "Lightning fast", "Lightning can't cover your ears", "Cover your ears", "plug one's ears while stealing a bell", "son",
 "Jingle", "jingle", "not pass on to others what one is called upon to do", "world", "full", "love", "of", "power"]

Deleted words are typically modal particles, logical connectives and the like: words that carry no practical meaning for text analysis and only add noise.
After a word is deleted it no longer appears in the results as a whole word; its individual characters, however, can still appear as single-character tokens.
4. Adjust the word frequency of words
Adjusting a word's frequency changes how likely it is to be split out in the result, so the segmentation can be steered toward what you expect. There are two cases: splitting a long word in the result into several words, and merging several words in the result into one.

lcut_res = jieba.lcut(test_content, cut_all=False, HMM=False)
print('[Before setting]: ', lcut_res)
jieba.suggest_freq('Make the world full of love', True)
lcut_res = jieba.lcut(test_content, cut_all=False, HMM=False)
print('[After setting]: ', lcut_res)

Output:

[Before setting]:  ["Lightning fast", "plug one's ears while stealing a bell", "son", "ring", "Sting", "not pass on to others what one is called upon to do", "world", "full", "love", "of", "power"]
[After setting]:  ["Lightning fast", "plug one's ears while stealing a bell", "son", "Jingle", "kernel", "no", "Make the world full of love", "of", "power"]

suggest_freq() takes two parameters. segment is the segmentation fragment: pass a tuple of pieces to split a word apart, or a single string to treat a phrase as one word. If tune is True, the word's frequency is actually adjusted.
Note: the automatically calculated word frequency may not take effect when HMM new-word discovery is enabled.

5, Keyword extraction
Keyword extraction uses jieba's analyse module, which offers two methods based on two different algorithms.
1. Keyword extraction based on TF-IDF algorithm

from jieba import analyse

key_word = analyse.extract_tags(test_content, topK=3)
print('[key_word]: ', list(key_word))
key_word = analyse.extract_tags(test_content, topK=3, withWeight=True)
print('[key_word]: ', list(key_word))
Output:

[key_word]:  ['Lightning fast', 'Ring', 'Love potential']
[key_word]:  [('Lightning fast', 1.7078239289857142), ('Ring', 1.7078239289857142), ('Love potential', 1.7078239289857142)]

extract_tags() has four parameters: sentence is the text to extract keywords from; topK is the number of highest-weight keywords to return (default 20); withWeight sets whether to return weights, yielding a list of (word, weight) pairs when True (default False); allowPOS keeps only words with the specified parts of speech (default empty, i.e. no filtering).
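The core of the TF-IDF weighting behind extract_tags() can be sketched in a few lines of pure Python. This is only an illustration of the formula (term frequency times inverse document frequency), not jieba's actual implementation, which uses a prebuilt IDF table; the tokens and corpus below are made up:

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, topK=3):
    """Rank words in doc_tokens by TF-IDF against a small corpus (list of token lists)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in corpus if word in d)        # document frequency
        idf = math.log((n_docs + 1) / (df + 1)) + 1     # smoothed inverse document frequency
        scores[word] = (count / len(doc_tokens)) * idf  # term frequency * IDF
    return sorted(scores, key=scores.get, reverse=True)[:topK]

doc = ["bell", "rings", "bell", "love"]
corpus = [doc, ["love", "world"], ["world", "rings"]]
print(tfidf_keywords(doc, corpus, topK=2))  # "bell" ranks first: frequent here, rare elsewhere
```

Words that are frequent in the document but rare in the corpus score highest, which is exactly why such words make good keywords.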
2. Keyword extraction based on TextRank algorithm

key_word = analyse.textrank(test_content, topK=3)
print('[key_word]: ', list(key_word))
allow = ['ns', 'n', 'vn', 'v', 'a', 'm', 'c']
key_word = analyse.textrank(test_content, topK=3, allowPOS=allow)
print('[key_word]: ', list(key_word))
Output:

[key_word]:  ['Ring', 'world']
[key_word]:  ['full', 'Ring', 'world']

The textrank() method is used much like extract_tags(). Note that allowPOS defaults to ('ns', 'n', 'vn', 'v'), so only words of those four parts of speech are kept unless you set it yourself; the other parameters are the same as in extract_tags().
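The gist of TextRank can also be sketched in pure Python: build a co-occurrence graph over the tokens and run PageRank-style power iteration over it. This toy version (unweighted edges, adjacent-token window, made-up tokens) is only an illustration, not jieba's actual implementation:

```python
def textrank(tokens, window=2, d=0.85, iters=20):
    """Score words by PageRank over a co-occurrence graph (toy sketch)."""
    # Build undirected co-occurrence edges within a sliding window.
    graph = {}
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            u, v = w, tokens[j]
            if u == v:
                continue
            graph.setdefault(u, set()).add(v)
            graph.setdefault(v, set()).add(u)
    # Power iteration of the PageRank update rule.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        new = {}
        for w in graph:
            new[w] = (1 - d) + d * sum(score[u] / len(graph[u]) for u in graph[w])
        score = new
    return sorted(score, key=score.get, reverse=True)

tokens = ["bell", "rings", "bell", "world", "love", "world"]
print(textrank(tokens))  # well-connected words ("bell", "world") rank highest
```

Unlike TF-IDF, TextRank needs no background corpus: a word's importance comes entirely from how centrally it sits in the document's own co-occurrence graph.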

6, Part of speech tagging
Part-of-speech tagging uses jieba's posseg module to tag the part of speech of each word after segmentation, using a tag set compatible with ICTCLAS.

from jieba import posseg

pos_word = posseg.lcut(test_content)
print(pos_word)

Output:

[pair('Lightning fast', 'i'), pair("plug one's ears while stealing a bell", 'i'), pair('Ring', 'n'),
 pair('Sting', 'v'), pair('not pass on to others what one is called upon to do', 'i'), pair('world', 'n'),
 pair('full', 'a'), pair('love', 'v'), pair('of', 'u'), pair('power', 'ng')]

posseg.lcut() has two parameters, sentence and HMM.
The part-of-speech tags are listed in the following table:

n    common noun          f    locative noun        s    place noun            t    time
nr   person name          ns   place name           nt   organization name     nw   work title
nz   other proper noun    v    common verb          vd   verb-adverb           vn   noun-verb
a    adjective            ad   adjective-adverb     an   noun-adjective        d    adverb
c    conjunction          u    auxiliary word       xc   other function word   w    punctuation
PER  person name          LOC  place name           ORG  organization name     TIME time

7, Returning the start and end position of each word in the original text
To get each word's start and end position in the original text, use jieba's tokenize() method:

res = jieba.tokenize(test_content)
for r in res:
    if len(r[0]) > 3:
        print('word:{}\t start:{}\t end:{}'.format(*r))
    elif len(r[0]) > 1:
        print('word:{}\t\t start:{}\t end:{}'.format(*r))
    else:
        print('word:{}\t\t\t start:{}\t end:{}'.format(*r))

Output:
word:Lightning fast	 start:0	 end:4
word:plug one 's ears while stealing a bell	 start:4	 end:8
word:Ring		 start:8	 end:10
word:Sting			 start:10	 end:11
word:not pass on to others what one is called upon to do	 start:11	 end:15
word:world		 start:15	 end:17
word:full		 start:17	 end:19
word:Love potential		 start:19	 end:22

The tokenize() method has three parameters: unicode_sentence is the content to segment (note that only unicode strings are accepted); mode selects the segmentation mode: pass mode='search' for search engine mode; HMM defaults to True.
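The (word, start, end) triples that tokenize() yields can be reproduced for any token list with a simple running pointer over the original text. A sketch (not jieba's internals; the tokens below are placeholders):

```python
def tokenize(tokens, text):
    """Yield (word, start, end) for each token, scanning text left to right."""
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # locate the token at or after the current position
        end = start + len(tok)
        yield tok, start, end
        pos = end                     # resume scanning after this token

text = "jiebasegmentswords"
tokens = ["jieba", "segments", "words"]
for word, start, end in tokenize(tokens, text):
    print(f"word:{word}\t start:{start}\t end:{end}")
```

Advancing pos to each token's end guarantees that repeated tokens map to successive occurrences rather than matching the same position twice.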
That covers jieba's common features. For more usage, see the GitHub page in the reference below.

Reference documents: https://github.com/fxsjy/jieba

Posted by LemonInflux on Sat, 16 Apr 2022 00:52:09 +0930