In fact, the articles on merging data sets with pandas library are not few, but I always feel that they are not perfect.
Especially for beginners of Python, so I'm going to sort out and summarize today. There are many contents, which are recommended to collect, like praise support, and provide a technical exchange group at the end of the article.
The general structure of this paper is
-
Brief introduction to concat() method
-
A brief introduction to the append() method
-
A brief introduction to the merge() method
-
A brief introduction to the join() method
-
Introduction to merging multiple row indexes
-
Rename column names after table merging
-
A brief introduction to the combine() method
-
combine_ A brief introduction to the first () method
Brief introduction to Concat() method
Before we begin the introduction of concat() method, let's take a look at a simple example
df1 = pd.DataFrame( { "A": ["A0", "A1", "A2", "A3"], "B": ["B0", "B1", "B2", "B3"], "C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"], }, index=[0, 1, 2, 3], ) df2 = pd.DataFrame( { "A": ["A4", "A5", "A6", "A7"], "B": ["B4", "B5", "B6", "B7"], "C": ["C4", "C5", "C6", "C7"], "D": ["D4", "D5", "D6", "D7"], }, index=[4, 5, 6, 7], ) df3 = pd.DataFrame( { "A": ["A8", "A9", "A10", "A11"], "B": ["B8", "B9", "B10", "B11"], "C": ["C8", "C9", "C10", "C11"], "D": ["D8", "D9", "D10", "D11"], }, index=[8, 9, 10, 11], )
Let's take a look at the effect of using the concat() method
frames = [df1, df2, df3] result = pd.concat(frames) result
output
A B C D 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 A3 B3 C3 D3 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 D6 7 A7 B7 C7 D7 8 A8 B8 C8 D8 9 A9 B9 C9 D9 10 A10 B10 C10 D10 11 A11 B11 C11 D11
The approximate merging direction is to merge according to the direction perpendicular to the axis, as shown in the following figure
The following small series will introduce the functions of various parameters in the concat() method in detail
pd.concat( objs, axis=0, join="outer", ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True, )
-
objs: the data set that needs to be merged, which can be of Series type or DataFrame type
-
axis: it can be understood as the direction of merging. The default is 0
-
join: it can be understood as merging. There are two ways: Union or intersection. The default is union
-
ignore_index: ignore the index. The default is False
-
keys: used for multiple indexes in row direction
You may be confused. What is multiple index? Look at the following example
result = pd.concat(frames, keys=["x", "y", "z"]) result
output
In this way, we can obtain the data of each part through the elements of "x", "y" and "z", for example
result.log["x"]
output
A B C D 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2 3 A3 B3 C3 D3
In addition, the keys parameter can also be used on column indexes
s3 = pd.Series([0, 1, 2, 3], name="foo") s4 = pd.Series([0, 1, 2, 3]) s5 = pd.Series([0, 1, 4, 5]) pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])
output
red blue yellow 0 0 0 0 1 1 1 1 2 2 2 4 3 3 3 5
The column name becomes an element in the keys list
For the join parameter, the default is to merge the two tables in the way of outer, that is, union
df4 = pd.DataFrame( { "B": ["B2", "B3", "B6", "B7"], "D": ["D2", "D3", "D6", "D7"], "F": ["F2", "F3", "F6", "F7"], }, index=[2, 3, 6, 7], ) result = pd.concat([df1, df4], axis=1)
output
When we set the join parameter to inner, that is, the intersection method to merge, the results will be different
result = pd.concat([df1, df4], axis=1, join="inner")
output
Next, let's take a look at ignore_ The function of the index parameter, which can re integrate the row index
result = pd.concat([df1, df4], ignore_index=True, sort=False)
output
For one table in DataFrame format and the other in Series format, concat() method can also combine the two,
s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X") result = pd.concat([df1, s1], axis=1)
output
If you're adding ignore_ If you use the index parameter, see how it works
result = pd.concat([df1, s1], axis=1, ignore_index=True)
output
A brief introduction to the append() method
The append() method is a simple generalization of the concat() method above. Let's take a look at a simple example
result = df1.append(df2) result
output
Of course, multiple DataFrame tables can also be placed in the append() method. The code is as follows
result = df1.append([df2, df3])
output
Similar to the concat() method above, the append() method also has ignore_index parameter,
result = df1.append(df4, ignore_index=True, sort=False)
output
Similarly, we can add several rows of data to the DataFrame table through the append() method
s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"]) result = df1.append(s2, ignore_index=True)
output
Introduction to Merge() method
These parameters are available in the merge() method
pd.merge( left, right, how="inner", on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=("_x", "_y"), copy=True, indicator=False, validate=None, )
-
left/right: that is, the two tables to be merged
-
on: the common column name of the two tables to be merged
-
left_on/right_on: the field corresponding to the merging of two tables
-
how: there are four merge methods: left, right, outer and inner. The default is inner
-
Suffixes: suffixes added after duplicate column names when merging two tables
-
left_index: if True, connect two data sets according to the index of the left table
-
right_index: if True, connect two data sets according to the index of the right table
Let's start with a simple example
left = pd.DataFrame( { "key": ["K0", "K1", "K2", "K3"], "A": ["A0", "A1", "A2", "A3"], "B": ["B0", "B1", "B2", "B3"], } ) right = pd.DataFrame( { "key": ["K0", "K1", "K2", "K3"], "C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"], } ) result = pd.merge(left, right, on="key") result
output
In the process of merge(), there are three types of merges: one-to-one, many to one, and many to many. The "one-to-one" type, that is, the merge() method, will find the same column in the two tables, such as the "key" above, and automatically sort with this column as the key. It should be noted that the positions of elements in the common column can be inconsistent.
Let's take a look at the "many to one" consolidation type. For example, the following two tables have a common column "group", and the "group" in the first table has two identical values,
df1: employee group hire_date 0 Bob Accounting 2008 1 Jake Engineering 2002 2 Mike Engineering 2005 3 Linda HR 2010 df2: group supervisor 0 Accounting Cathey 1 Engineering Dylan 2 HR James
Then let's merge
pd.merge(df_1, df_2)
output
employee group hire_date supervisor 0 Bob Accounting 2008 Cathey 1 Jake Engineering 2002 Dylan 2 Mike Engineering 2005 Dylan 3 Linda HR 2010 James
Finally, the "many to many" consolidation type can be understood as that there are duplicate values in the common columns of the two tables, such as
df3: employee group 0 Bob Accounting 1 Jake Engineering 2 Lisa Engineering 3 Sue HR df4: group skills 0 Accounting math 1 Accounting spreadsheets 2 Engineering coding 3 Engineering linux 4 HR spreadsheets 5 HR organization
Then, after we merge, let's see the results
df = pd.merge(df3, df4) print(df)
output
employee group skills 0 Bob Accounting math 1 Bob Accounting programming 2 Jake Engineering linux 3 Jake Engineering python 4 Lisa Engineering linux 5 Lisa Engineering python 6 Sue HR java 7 Sue HR c++
There are four ways to merge the parameters how, including "left", "right", "inner" and "outer", which represent
-
inner: that is, intersection. When using the merge() method, the merging method of intersection is adopted by default
-
outer: it can be understood as the merging method of union
-
left/right: merge Union in one direction
Let's first look at the merging of union sets in the "left" direction
result = pd.merge(left, right, how="left", on=["key1", "key2"]) result
output
Let's take another look at the merging of union sets in the "right" direction
result = pd.merge(left, right, how="right", on=["key1", "key2"]) result
output
Merge in "outer" mode
result = pd.merge(left, right, how="outer", on=["key1", "key2"]) result
output
Merge in "inner" mode
result = pd.merge(left, right, how="inner", on=["key1", "key2"]) result
output
A brief introduction to the join() method
The join() method is used to merge two tables with different column indexes. Let's take a simple example first
left = pd.DataFrame( {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"] ) right = pd.DataFrame( {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"] ) result = left.join(right)
output
In the join() method, there is also a parameter how to define the merging method, which is similar to the merge() method. I won't repeat it here
When multiple row index encounters join() method
When you encounter a table in which the row index is multiple row index, for example
left = pd.DataFrame( {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=pd.Index(["K0", "K1", "K2"], name="key"), ) index = pd.MultiIndex.from_tuples( [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"], ) right = pd.DataFrame( {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index, ) result = left.join(right, how="inner")
output
So what if the two tables to be merged are multiple row indexes?
leftindex = pd.MultiIndex.from_product( [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"] ) left = pd.DataFrame({"v1": range(12)}, index=leftindex)
output
v1 abc xy num a x 1 0 2 1 y 1 2 2 3 b x 1 4 2 5 y 1 6 2 7 c x 1 8 2 9 y 1 10 2 11
The second table is as follows
rightindex = pd.MultiIndex.from_product( [list("abc"), list("xy")], names=["abc", "xy"] ) right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)
output
v2 abc xy a x 100 y 200 b x 300 y 400 c x 500 y 600
Merge the above two tables
left.join(right, on=["abc", "xy"], how="inner")
output
v1 v2 abc xy num a x 1 0 100 2 1 100 y 1 2 200 2 3 200 b x 1 4 300 2 5 300 y 1 6 400 2 7 400 c x 1 8 500 2 9 500 y 1 10 600 2 11 600
Renaming of column names
If the column names of two tables are the same, their column names will be renamed after merging, such as
left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]}) right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]}) result = pd.merge(left, right, on="k")
output
Here we have to mention the suffixes parameter, which is used to rename a column, for example
result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))
output
combine_ A brief introduction to the first () method
If one of the two tables to be merged has a null value, you can use combine_first() method,
df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]}) df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) df1.combine_first(df2)
output
A B 0 1.0 3.0 1 0.0 4.0
The null value in the table will be replaced by the non null value in another table
A brief introduction to the combine() method
The combine() method combines the two tables according to the column direction, but the difference is that a third-party function or method needs to be passed in. Let's take a simple example
df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]}) df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]}) take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2
We define a simple method to extract the value with small sum in the process of merging
df1.combine(df2, take_smaller)
output
A B 0 0 3 1 0 3
If there are null values in the table, the combine() method also has fill_value to handle
df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]}) df2 = pd.DataFrame({'A': [2, 2], 'B': [3 3]}) df1.combine(df2, take_smaller, fill_value=-5)
output
A B 0 0 -5.0 1 0 4.0
Recommended articles
- Haha, I can't extricate myself. Recently, I helped the team build a cool data Kanban (with source code)
- The large data visualization screen is really not a thing. Take these 30 exquisite templates away
- 2022 machine learning video tutorials and Python books are enough
- Goodbye, English version. Here comes the Chinese version of Python data science quick look-up table
- Select 17 knowledge and thinking maps, and this time we should clarify statistics
- Here comes the Mandarin Course of machine learning (2022), Mr. Li Hongyi
Technical exchange
Welcome to collect and support!
At present, a technical exchange group has been opened, with more than 2000 group friends. The best way to add notes is: source + Interest direction, which is convenient to find like-minded friends
- Method ① send the following pictures to wechat, long press to identify, and the background replies: add group;
- Mode ②. Add micro signal: dkl88191, remarks: from CSDN
- WeChat search official account: Python learning and data mining, background reply: add group