Let's sort out how python can read data files.
1. python built-in methods (read, readline, readlines)
- read(): read the entire file at one time. The read(size) method is recommended. The larger the size, the longer the running time
- readline(): read one line at a time. Used when there is not enough memory. Generally, it is not used very much
- readlines(): read the contents of the whole file at one time and return to the list by line, which is convenient for us to traverse
2. Built in module (csv)
python has built-in csv module for reading and writing csv files. csv is a comma delimited file, which is one of the most common data storage formats in data science.
csv module can easily read and write various volume data. Of course, large amount of data needs code level optimization.
- csv module reads files
# Read csv file import csv with open('test.csv','r') as myFile: lines=csv.reader(myFile) for line in lines: print (line)
- csv module write file
''' No one answers any questions? Xiaobian created a Python exchange of learning QQ Group: 531509025 Look for like-minded partners to help each other,There are also good videos and tutorials in the group PDF e-book! ''' import csv with open('test.csv','w+') as myFile: myWriter=csv.writer(myFile) # Writerow write line by line myWriter.writerow([7,8,9]) myWriter.writerow([8,'h','f']) # writerow multiline write myList=[[1,2,3],[4,5,6]] myWriter.writerows(myList)
3. Use numpy Library (loadtext, load, fromfile)
- Loadtext method
Loadtext is used to read text files (including txt, csv, etc.) and gz or bz2 format compressed file, on the premise that each line of file data must have the same number of values.
import numpy as np # The dtype parameter in loadtext() is set to float by default # This is set to str string for easy display np.loadtxt('test.csv',dtype=str) # out: array(['1,2,3', '4,5,6', '7,8,9'], dtype='<U5')
- load method
load is used to read numpy npy, .npz or pickled persistent files.
import numpy as np # Mr. Cheng npy file np.save('test.npy', np.array([[1, 2, 3], [4, 5, 6]])) # Load npy files using load np.load('test.npy') ''' out:array([[1, 2, 3], [4, 5, 6]]) '''
- fromfile method
The fromfile method can read simple text data or binary data. The data comes from the binary data saved by the tofile method. When reading data, you need to specify the element type and modify the shape of the array appropriately.
''' No one answers any questions? Xiaobian created a Python exchange of learning QQ Group: 531509025 Look for like-minded partners to help each other,There are also good videos and tutorials in the group PDF e-book! ''' import numpy as np x = np.arange(9).reshape(3,3) x.tofile('test.bin') np.fromfile('test.bin',dtype=np.int) # out:array([0, 1, 2, 3, 4, 5, 6, 7, 8])
4. Use pandas Library (read_csv, read_excel, etc.)
pandas is one of the most commonly used analysis libraries for data processing. It can read data files in various formats and generally output dataframe format.
Such as: txt, csv, excel, json, clipboard, database, html, hdf, parquet, pickled file, sas, stata, etc
- read_csv method
read_csv method is used to read csv format files and output dataframe format.
import pandas as pd pd.read_csv('test.csv')
- read_excel method
Read excel files, including xlsx, xls and xlsm formats
import pandas as pd pd.read_excel('test.xlsx')
read_table method
Control the reading of any parameter (sep) of the file through the separator
- read_json method
Read json format file
df = pd.DataFrame([['a', 'b'], ['c', 'd']],index=['row 1', 'row 2'],columns=['col 1', 'col 2']) j = df.to_json(orient='split') pd.read_json(j,orient='split')
- read_html method
Read html table
- read_clipboard method
Read clipboard contents
- read_pickle method
Read plckled persistent file
- read_sql method
Read the database data, connect the database, and then pass in the sql statement
- read_dhf method
Read hdf5 files, suitable for reading large files
- read_parquet method
Read parquet file
- read_sas method
Read sas file
- read_stata method
Read stata file
- read_gbq method
Reading google bigquery data
5. Read and write excel files (xlrd, xlwt, openpyxl, etc.)
python has many libraries for reading and writing excel files. In addition to pandas mentioned earlier, there are xlrd, xlwt, openpyxl, xlwings and so on.
Main modules:
- xlrd Library: read data from excel and support xls and xlsx
- xlwt Library: modify excel. xlsx format cannot be modified
- xlutils Library: modify an existing file in xlw and xlrd
- openpyxl: it mainly reads and edits excel in xlsx format
- xlwings: read, write and modify xlsx, xls and xlsm format files
- xlsxwriter: used to generate excel tables, insert data, insert icons and other table operations. Reading is not supported
- Microsoft Excel API: pywin32 needs to be installed to communicate directly with Excel process. You can do anything you can do in Excel, but it is slow
6. Operate the database (pymysql, cx#u Oracle, etc.)
python supports the interaction of almost all databases. After connecting to the database, you can use sql statements to add, delete, modify and query.
Main modules:
- pymysql: used to interact with mysql database
- sqlalchemy: used to interact with mysql database
- cx_Oracle: used to interact with oracle database
- sqlite3: built-in library, which is used to interact with sqlite database
- pymssql: used to interact with sql server database
- pymongo: used to interact with mongodb non relational database
- redis and pyredis: used to interact with non relational databases