Tesseract OCR: extracting Chinese text from pictures and converting it to Excel (with Python code)

1. Background description:

This problem comes up often in daily work: a form or text message is forwarded to a work group as a screenshot, and a large amount of data and text has to be extracted from the picture, then stored and statistically processed as an Excel table.

2. Process description:

1. Identify the information in the picture (text and data)
Use pytesseract to recognize the text (English, Chinese) and data in the picture and convert them into strings.
2. Extract key information as required
Use regular expressions to extract the useful key information (text and data), such as date, place, telephone number and quantity.
3. Organize it into a data frame and save it to Excel
Sort out the extracted information, merge it into a data table, and write it to an Excel file.
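
Step 2 of the process can be sketched as follows. This is only a minimal illustration: the sample text, the field names and the patterns in PATTERNS are assumptions made up for the example, and real OCR output will need patterns adapted to its layout.

```python
import re

# Hypothetical OCR output; the field formats are assumptions for illustration only
sample = "Date: 2022-04-16\nTel: 13800138000\nQuantity: 25"

# One regular expression per piece of key information
PATTERNS = {
    "date": r"\d{4}-\d{2}-\d{2}",
    "phone": r"\b1\d{10}\b",
    "quantity": r"Quantity:\s*(\d+)",
}

def extract_fields(text):
    """Return a dict holding the first match of each pattern found in text."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            # Use the capture group when the pattern defines one, else the whole match
            fields[name] = match.group(1) if match.groups() else match.group(0)
    return fields

print(extract_fields(sample))
```

Fields whose pattern finds nothing are simply left out of the result, so a missing key signals that the OCR text needs a manual check.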

3. Preparation of test environment:

  1. Install the pytesseract library and the Pillow library
    pytesseract library: pip install pytesseract – used to recognize the text and data in pictures and convert them into strings
    Pillow library: pip install Pillow – used to open the input picture file as an image

  2. Download the Tesseract OCR application and install it
    Note: Please record the installation path; it is needed later to change the system configuration and to save the data files for Chinese recognition
    Tesseract OCR download address: https://digi.bib.uni-mannheim.de/tesseract/

    My Tesseract OCR installation path is: C:\Program Files (x86)\Tesseract-OCR

  3. Download the Chinese language data files for Chinese recognition and copy them to the "tessdata" folder in Tesseract OCR:
    The path of the tessdata folder is: C:\Program Files (x86)\Tesseract-OCR\tessdata

    Download address of the Chinese language data files: https://tesseract-ocr.github.io/tessdoc/Data-Files

    chi_sim – simplified Chinese; chi_tra – traditional Chinese

  4. Configure system variables (2 steps)
    Step 1: modify Path in the environment variables: add C:\Program Files (x86)\Tesseract-OCR

    Step 2: because Chinese recognition is required, the system variable TESSDATA_PREFIX needs to be configured
    Set the TESSDATA_PREFIX variable to: C:\Program Files (x86)\Tesseract-OCR\tessdata

  5. Modify the relevant file path (2 steps)
    Step 1: find the pytesseract.py file under the path: Anaconda -> Lib -> site-packages -> pytesseract

    Step 2: modify the file:
    Change tesseract_cmd = 'tesseract.exe' to: tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'

    Warm reminder: Please complete the above steps before testing, otherwise the system will always report errors!
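
The configuration in steps 3 to 5 can also be checked or applied from a script. The sketch below is one possible approach, not part of the original setup: the helper names are hypothetical, the environment changes only affect the current Python process, and assigning pytesseract.pytesseract.tesseract_cmd in your own script is an alternative to editing the installed pytesseract.py file.

```python
import os
import shutil
from pathlib import Path

def installed_languages(tessdata_dir):
    """List the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))

def configure_tesseract_env(install_dir, env=os.environ):
    """Mirror step 4 for the current process: extend PATH and set TESSDATA_PREFIX."""
    env["TESSDATA_PREFIX"] = os.path.join(install_dir, "tessdata")
    entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if install_dir not in entries:
        env["PATH"] = os.pathsep.join([install_dir] + entries)
    return env

def find_tesseract(candidates):
    """Return the first existing file in candidates, else whatever is on PATH (or None)."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return shutil.which("tesseract")

def configure_pytesseract(candidates):
    """Point pytesseract at the executable without editing the library file."""
    import pytesseract  # imported here so the helpers above work without it
    cmd = find_tesseract(candidates)
    if cmd is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

For the installation described above, one would call configure_pytesseract([r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"]) and then check that "chi_sim" appears in installed_languages(r"C:\Program Files (x86)\Tesseract-OCR\tessdata") before running the demo.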

4. Sample demo (Python code)

The test pictures to be imported are:

4.1 Identify the information in the picture (text and data)

# 4.1 Import the picture file, open it as an image, and recognize the text in it
import pandas as pd
import pytesseract
from PIL import Image

# Set the paths of the input picture and the output table file
infile = r"D:\1.png"
outfile = r"D:\result.xlsx"
# Open the input picture as an image
image = Image.open(infile)
# lang="chi_sim" means the simplified Chinese data is used to recognize Chinese text in the picture
result = pytesseract.image_to_string(image, lang='chi_sim')
# Print the recognized text so it can be compared with the picture
print(result)

Summary: the results above show that the recognition accuracy for Chinese characters, English letters and numbers is good, with two errors: "Ba body 1008" was recognized, while the correct information is "powder Y1008". In addition, there are spaces between Chinese characters, between Chinese characters and English letters, and between text and data. Therefore, spaces should not be used as the delimiter when splitting the information in the following steps.
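
One optional way to deal with the extra spaces is to remove only the gaps that sit between two Chinese characters and leave all other spaces alone. The sketch below is an assumption about what might help, not a step from the original workflow; the function name is hypothetical and the character range covers only the basic CJK Unified Ideographs block.

```python
import re

# Basic CJK Unified Ideographs block (an assumption; extension blocks are not covered)
_CJK = r"[\u4e00-\u9fff]"
# Spaces or tabs that have a Chinese character on both sides
_GAP = re.compile(rf"(?<={_CJK})[ \t]+(?={_CJK})")

def collapse_cjk_spaces(text):
    """Remove spaces between adjacent Chinese characters; keep all other spacing."""
    return _GAP.sub("", text)

# Hypothetical OCR line: product name, product number, quantity
print(collapse_cjk_spaces("粉 底 液 Y1008 12"))
```

Line breaks and the spaces next to letters or digits are untouched, so the space in front of the quantity is still available for splitting later.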

4.2 Extract key information as required

Requirement: extract the product name, number and corresponding quantity

# 4.2 Extract the product name, number and corresponding quantity in the picture
import re

# Create function StringToList: convert the string (str) read from the picture into a list
def StringToList(result):
    result = result.strip()
    # Because there are spaces between Chinese characters, between Chinese characters and English letters,
    # and between text and numbers, the space character " " cannot be used as the delimiter.
    # Instead, each newline is replaced with "," and "," marks the end of each line of data
    pattern = re.compile("\n")
    line = pattern.sub(",", result)
    line += ","
    str_li = []
    str0 = ""
    for ch in line:
        if ch == ",":
            # End of a line: store the accumulated text and start a new one
            str_li.append(str0)
            str0 = ""
        else:
            str0 += ch
    return str_li

str_list = StringToList(result)
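
For comparison, the same line splitting can be written with the built-in str.splitlines(). This is a shorter alternative sketch rather than the function used in this article; the name text_to_lines is hypothetical. Unlike StringToList, it does not split on commas that appear inside the recognized text, and it drops empty lines right away.

```python
def text_to_lines(text):
    """Split OCR text into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Hypothetical OCR output with a blank line and stray spaces
print(text_to_lines("  a b 1\n\n c d 2 \n"))
```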

4.3 Organize it into a data frame and save it to Excel

Because the elements in the list contain many spaces, spaces cannot be used directly to split all the fields. However, there is always a space between the text and the quantity, so the quantity is split off starting from the right end of each element.

# 4.3 Extract text information and data and organize them into a DataFrame
# Define the information extraction and conversion function
def ListToDataframe(str_list):
    code, qty = [], []
    # There is a space between the text and the quantity, so the number is split off from the right of the field
    for s in str_list:
        if s == "": continue
        s = s[::-1]
        # Because there are many spaces in the field, set maxsplit=1 in split()
        # so that only the first space of the reversed string (the last one in the original) is used
        s = s.split(" ", maxsplit=1)
        if len(s) < 2: continue  # skip lines that contain no space to split on
        ser2, ser1 = s[0], s[1]
        ser1 = ser1[::-1]
        ser2 = ser2[::-1]
        # Collect the product text and its quantity
        code.append(ser1)
        qty.append(ser2)

    df = pd.DataFrame({"Product number": code, "quantity": qty})
    return df

dataframe = ListToDataframe(str_list)
# Save the data table to the Excel file set in 4.1 (writing .xlsx requires the openpyxl package)
dataframe.to_excel(outfile, index=False)
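
The split from the right can also be expressed with str.rsplit(), which avoids reversing the strings. The following is a minimal sketch under the assumption that a valid line ends with a space followed by a whole-number quantity; the function name parse_rows and the sample lines are hypothetical.

```python
def parse_rows(lines):
    """Split each line at its last space into (text, quantity).

    Lines whose last field is not a whole number are returned separately
    so that they can be checked by hand.
    """
    rows, rejected = [], []
    for line in lines:
        parts = line.strip().rsplit(" ", 1)
        if len(parts) == 2 and parts[1].isdecimal():
            rows.append((parts[0].strip(), int(parts[1])))
        elif line.strip():
            rejected.append(line)
    return rows, rejected

# Hypothetical list entries as produced in 4.2
rows, rejected = parse_rows(["粉底 Y1008 12", "no-quantity", "", "口红 A20 x"])
print(rows)
print(rejected)
```

The rows list can be passed to pd.DataFrame(rows, columns=["Product number", "quantity"]), and the rejected list gives a starting point for the manual check.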

Check Excel table:

Summary: the data was successfully extracted and converted into Excel. Unfortunately, there were some errors in the data conversion process, which need to be checked and corrected manually.

5. Conclusion:

The pytesseract library is used to extract information and data. Repeated tests show that the extraction results for the same picture are not always the same, and that the accuracy of information extraction is related to the clarity, size, fill color and other properties of the picture. It is also limited by the quality of the chi_sim data file used for recognition.
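
Since accuracy depends on clarity, size and fill color, one thing that may be worth trying is to preprocess the picture with Pillow before recognition. The sketch below is an untested suggestion, not something measured in this article: the function name is hypothetical, and the scale factor and threshold are assumed values that would need tuning for each kind of picture.

```python
from PIL import Image, ImageOps

def preprocess_for_ocr(image, scale=2, threshold=160):
    """Convert to grayscale, enlarge, and binarize a picture before OCR."""
    # Drop color so that colored cell fills become plain shades of gray
    gray = ImageOps.grayscale(image)
    # Enlarge small screenshots so the characters cover more pixels
    enlarged = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Map every pixel to pure white or pure black around the threshold
    return enlarged.point(lambda p: 255 if p > threshold else 0)
```

The returned image can be passed to pytesseract.image_to_string in place of the original one, for example preprocess_for_ocr(Image.open(infile)).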

Tags: Python

Posted by jmantra on Sat, 16 Apr 2022 10:57:59 +0930