Six common ways to read and write excel and other data files in python

Let's sort out how python can read data files.

1. python built-in methods (read, readline, readlines)

  • read(): read the entire file at one time. The read(size) method is recommended. The larger the size, the longer the running time
  • readline(): read one line at a time. Used when there is not enough memory. Generally, it is not used very much
  • readlines(): read the contents of the whole file at one time and return to the list by line, which is convenient for us to traverse

2. Built in module (csv)

python has built-in csv module for reading and writing csv files. csv is a comma delimited file, which is one of the most common data storage formats in data science.
csv module can easily read and write various volume data. Of course, large amount of data needs code level optimization.

  • csv module reads files
# Read csv file
import csv  
with open('test.csv','r') as myFile:  
    lines=csv.reader(myFile)  
    for line in lines:  
        print (line)  
  • csv module write file
'''
No one answers any questions? Xiaobian created a Python exchange of learning QQ Group: 531509025
 Look for like-minded partners to help each other,There are also good videos and tutorials in the group PDF e-book!
'''
import csv  
with open('test.csv','w+') as myFile:      
    myWriter=csv.writer(myFile)  
    # Writerow write line by line
    myWriter.writerow([7,8,9])  
    myWriter.writerow([8,'h','f'])  
    # writerow multiline write
    myList=[[1,2,3],[4,5,6]]  
    myWriter.writerows(myList)  

3. Use numpy Library (loadtext, load, fromfile)

  • Loadtext method

Loadtext is used to read text files (including txt, csv, etc.) and gz or bz2 format compressed file, on the premise that each line of file data must have the same number of values.

import numpy as np
# The dtype parameter in loadtext() is set to float by default
# This is set to str string for easy display
np.loadtxt('test.csv',dtype=str)
# out: array(['1,2,3', '4,5,6', '7,8,9'], dtype='<U5')
  • load method

load is used to read numpy npy, .npz or pickled persistent files.

import numpy as np
# Mr. Cheng npy file
np.save('test.npy', np.array([[1, 2, 3], [4, 5, 6]]))
# Load npy files using load
np.load('test.npy')
'''
out:array([[1, 2, 3],
       [4, 5, 6]])
'''
  • fromfile method

The fromfile method can read simple text data or binary data. The data comes from the binary data saved by the tofile method. When reading data, you need to specify the element type and modify the shape of the array appropriately.

'''
No one answers any questions? Xiaobian created a Python exchange of learning QQ Group: 531509025
 Look for like-minded partners to help each other,There are also good videos and tutorials in the group PDF e-book!
'''
import numpy as np
x = np.arange(9).reshape(3,3)
x.tofile('test.bin')
np.fromfile('test.bin',dtype=np.int)
# out:array([0, 1, 2, 3, 4, 5, 6, 7, 8])

4. Use pandas Library (read_csv, read_excel, etc.)

pandas is one of the most commonly used analysis libraries for data processing. It can read data files in various formats and generally output dataframe format.

Such as: txt, csv, excel, json, clipboard, database, html, hdf, parquet, pickled file, sas, stata, etc

  • read_csv method

read_csv method is used to read csv format files and output dataframe format.

import pandas as pd
pd.read_csv('test.csv')
  • read_excel method

Read excel files, including xlsx, xls and xlsm formats

import pandas as pd
pd.read_excel('test.xlsx')

read_table method

Control the reading of any parameter (sep) of the file through the separator

  • read_json method

Read json format file

df = pd.DataFrame([['a', 'b'], ['c', 'd']],index=['row 1', 'row 2'],columns=['col 1', 'col 2'])
j = df.to_json(orient='split')
pd.read_json(j,orient='split')
  • read_html method

Read html table

  • read_clipboard method

Read clipboard contents

  • read_pickle method

Read plckled persistent file

  • read_sql method

Read the database data, connect the database, and then pass in the sql statement

  • read_dhf method

Read hdf5 files, suitable for reading large files

  • read_parquet method

Read parquet file

  • read_sas method

Read sas file

  • read_stata method

Read stata file

  • read_gbq method

Reading google bigquery data

5. Read and write excel files (xlrd, xlwt, openpyxl, etc.)

python has many libraries for reading and writing excel files. In addition to pandas mentioned earlier, there are xlrd, xlwt, openpyxl, xlwings and so on.

Main modules:

  • xlrd Library: read data from excel and support xls and xlsx
  • xlwt Library: modify excel. xlsx format cannot be modified
  • xlutils Library: modify an existing file in xlw and xlrd
  • openpyxl: it mainly reads and edits excel in xlsx format
  • xlwings: read, write and modify xlsx, xls and xlsm format files
  • xlsxwriter: used to generate excel tables, insert data, insert icons and other table operations. Reading is not supported
  • Microsoft Excel API: pywin32 needs to be installed to communicate directly with Excel process. You can do anything you can do in Excel, but it is slow

6. Operate the database (pymysql, cx#u Oracle, etc.)

python supports the interaction of almost all databases. After connecting to the database, you can use sql statements to add, delete, modify and query.

Main modules:

  • pymysql: used to interact with mysql database
  • sqlalchemy: used to interact with mysql database
  • cx_Oracle: used to interact with oracle database
  • sqlite3: built-in library, which is used to interact with sqlite database
  • pymssql: used to interact with sql server database
  • pymongo: used to interact with mongodb non relational database
  • redis and pyredis: used to interact with non relational databases

Tags: Python

Posted by johntp on Tue, 08 Mar 2022 15:06:19 +1030