In depth inventory: the most detailed summary of Pandas consolidated dataset operation in the whole network

In fact, the articles on merging data sets with pandas library are not few, but I always feel that they are not perfect.

Especially for beginners of Python, so I'm going to sort out and summarize today. There are many contents, which are recommended to collect, like praise support, and provide a technical exchange group at the end of the article.

The general structure of this paper is

  • Brief introduction to concat() method

  • A brief introduction to the append() method

  • A brief introduction to the merge() method

  • A brief introduction to the join() method

  • Introduction to merging multiple row indexes

  • Rename column names after table merging

  • A brief introduction to the combine() method

  • combine_ A brief introduction to the first () method

Brief introduction to Concat() method

Before we begin the introduction of concat() method, let's take a look at a simple example

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)
df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)

Let's take a look at the effect of using the concat() method

frames = [df1, df2, df3]
result = pd.concat(frames)
result

output

      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11

The approximate merging direction is to merge according to the direction perpendicular to the axis, as shown in the following figure

The following small series will introduce the functions of various parameters in the concat() method in detail

pd.concat(
    objs,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    copy=True,
)
  • objs: the data set that needs to be merged, which can be of Series type or DataFrame type

  • axis: it can be understood as the direction of merging. The default is 0

  • join: it can be understood as merging. There are two ways: Union or intersection. The default is union

  • ignore_index: ignore the index. The default is False

  • keys: used for multiple indexes in row direction

You may be confused. What is multiple index? Look at the following example

result = pd.concat(frames, keys=["x", "y", "z"])
result

output

In this way, we can obtain the data of each part through the elements of "x", "y" and "z", for example

result.log["x"]

output

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

In addition, the keys parameter can also be used on column indexes

s3 = pd.Series([0, 1, 2, 3], name="foo")
s4 = pd.Series([0, 1, 2, 3])
s5 = pd.Series([0, 1, 4, 5])
pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])

output

   red  blue  yellow
0    0     0       0
1    1     1       1
2    2     2       4
3    3     3       5

The column name becomes an element in the keys list

For the join parameter, the default is to merge the two tables in the way of outer, that is, union

df4 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)
result = pd.concat([df1, df4], axis=1)

output

When we set the join parameter to inner, that is, the intersection method to merge, the results will be different

result = pd.concat([df1, df4], axis=1, join="inner")

output

Next, let's take a look at ignore_ The function of the index parameter, which can re integrate the row index

result = pd.concat([df1, df4], ignore_index=True, sort=False)

output

For one table in DataFrame format and the other in Series format, concat() method can also combine the two,

s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")
result = pd.concat([df1, s1], axis=1)

output

If you're adding ignore_ If you use the index parameter, see how it works

result = pd.concat([df1, s1], axis=1, ignore_index=True)

output

A brief introduction to the append() method

The append() method is a simple generalization of the concat() method above. Let's take a look at a simple example

result = df1.append(df2)
result

output

Of course, multiple DataFrame tables can also be placed in the append() method. The code is as follows

result = df1.append([df2, df3])

output

Similar to the concat() method above, the append() method also has ignore_index parameter,

result = df1.append(df4, ignore_index=True, sort=False)

output

Similarly, we can add several rows of data to the DataFrame table through the append() method

s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])
result = df1.append(s2, ignore_index=True)

output

Introduction to Merge() method

These parameters are available in the merge() method

pd.merge(
    left,
    right,
    how="inner",
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=True,
    suffixes=("_x", "_y"),
    copy=True,
    indicator=False,
    validate=None,
)
  • left/right: that is, the two tables to be merged

  • on: the common column name of the two tables to be merged

  • left_on/right_on: the field corresponding to the merging of two tables

  • how: there are four merge methods: left, right, outer and inner. The default is inner

  • Suffixes: suffixes added after duplicate column names when merging two tables

  • left_index: if True, connect two data sets according to the index of the left table

  • right_index: if True, connect two data sets according to the index of the right table

Let's start with a simple example

left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
result = pd.merge(left, right, on="key")
result

output

In the process of merge(), there are three types of merges: one-to-one, many to one, and many to many. The "one-to-one" type, that is, the merge() method, will find the same column in the two tables, such as the "key" above, and automatically sort with this column as the key. It should be noted that the positions of elements in the common column can be inconsistent.

Let's take a look at the "many to one" consolidation type. For example, the following two tables have a common column "group", and the "group" in the first table has two identical values,

df1:

employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2002
2     Mike  Engineering       2005
3     Linda          HR       2010

df2:

       group supervisor
0   Accounting      Cathey
1  Engineering      Dylan
2           HR      James

Then let's merge

pd.merge(df_1, df_2)

output

  employee        group  hire_date supervisor
0      Bob   Accounting       2008     Cathey
1     Jake  Engineering       2002      Dylan
2     Mike  Engineering       2005      Dylan
3    Linda           HR       2010      James

Finally, the "many to many" consolidation type can be understood as that there are duplicate values in the common columns of the two tables, such as

df3:

employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR

df4: 

       group        skills
0   Accounting          math
1   Accounting  spreadsheets
2  Engineering        coding
3  Engineering         linux
4           HR  spreadsheets
5           HR  organization

Then, after we merge, let's see the results

df = pd.merge(df3, df4)
print(df)

output

  employee        group       skills
0      Bob   Accounting         math
1      Bob   Accounting  programming
2     Jake  Engineering        linux
3     Jake  Engineering       python
4     Lisa  Engineering        linux
5     Lisa  Engineering       python
6      Sue           HR         java
7      Sue           HR          c++

There are four ways to merge the parameters how, including "left", "right", "inner" and "outer", which represent

  • inner: that is, intersection. When using the merge() method, the merging method of intersection is adopted by default

  • outer: it can be understood as the merging method of union

  • left/right: merge Union in one direction

Let's first look at the merging of union sets in the "left" direction

result = pd.merge(left, right, how="left", on=["key1", "key2"])
result

output

Let's take another look at the merging of union sets in the "right" direction

result = pd.merge(left, right, how="right", on=["key1", "key2"])
result

output

Merge in "outer" mode

result = pd.merge(left, right, how="outer", on=["key1", "key2"])
result

output

Merge in "inner" mode

result = pd.merge(left, right, how="inner", on=["key1", "key2"])
result

output

A brief introduction to the join() method

The join() method is used to merge two tables with different column indexes. Let's take a simple example first

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
)
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
)

result = left.join(right)

output

In the join() method, there is also a parameter how to define the merging method, which is similar to the merge() method. I won't repeat it here

When multiple row index encounters join() method

When you encounter a table in which the row index is multiple row index, for example

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
    index=pd.Index(["K0", "K1", "K2"], name="key"),
)
index = pd.MultiIndex.from_tuples(
    [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
    names=["key", "Y"],
)
right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]},
    index=index,
)
result = left.join(right, how="inner")

output

So what if the two tables to be merged are multiple row indexes?

leftindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
)
left = pd.DataFrame({"v1": range(12)}, index=leftindex)

output

            v1
abc xy num    
a   x  1     0
       2     1
    y  1     2
       2     3
b   x  1     4
       2     5
    y  1     6
       2     7
c   x  1     8
       2     9
    y  1    10
       2    11

The second table is as follows

rightindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy")], names=["abc", "xy"]
)
right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)

output

         v2
abc xy     
a   x   100
    y   200
b   x   300
    y   400
c   x   500
    y   600

Merge the above two tables

left.join(right, on=["abc", "xy"], how="inner")

output

            v1   v2
abc xy num         
a   x  1     0  100
       2     1  100
    y  1     2  200
       2     3  200
b   x  1     4  300
       2     5  300
    y  1     6  400
       2     7  400
c   x  1     8  500
       2     9  500
    y  1    10  600
       2    11  600

Renaming of column names

If the column names of two tables are the same, their column names will be renamed after merging, such as

left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})
right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})
result = pd.merge(left, right, on="k")

output

Here we have to mention the suffixes parameter, which is used to rename a column, for example

result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))

output

combine_ A brief introduction to the first () method

If one of the two tables to be merged has a null value, you can use combine_first() method,

df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1.combine_first(df2)

output

     A    B
0  1.0  3.0
1  0.0  4.0

The null value in the table will be replaced by the non null value in another table

A brief introduction to the combine() method

The combine() method combines the two tables according to the column direction, but the difference is that a third-party function or method needs to be passed in. Let's take a simple example

df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2

We define a simple method to extract the value with small sum in the process of merging

df1.combine(df2, take_smaller)

output

   A  B
0  0  3
1  0  3

If there are null values in the table, the combine() method also has fill_value to handle

df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [2, 2], 'B': [3 3]})
df1.combine(df2, take_smaller, fill_value=-5)

output

   A    B
0  0 -5.0
1  0  4.0

Recommended articles

Technical exchange

Welcome to collect and support!

At present, a technical exchange group has been opened, with more than 2000 group friends. The best way to add notes is: source + Interest direction, which is convenient to find like-minded friends

  • Method ① send the following pictures to wechat, long press to identify, and the background replies: add group;
  • Mode ②. Add micro signal: dkl88191, remarks: from CSDN
  • WeChat search official account: Python learning and data mining, background reply: add group

Tags: Python

Posted by gsb on Sat, 16 Apr 2022 22:21:19 +0930