[Python] GDP analysis and forecast

✨ Blogger wangzirui32
💖 If you enjoy this article, feel free to like, bookmark and follow~~
👏 My 162nd original work
👉 This article was first published in CSDN. It is forbidden to reprint without permission

Hello, I'm wangzirui32. Today, let's learn how to analyze and forecast GDP with Python. Let's get started!

1. Data source

The data are CSV files of quarterly GDP for 2010-2021, taken from the official data of the National Bureau of Statistics:

Select each year (2010-2021) and download its CSV file, naming the files 2010.csv, 2011.csv and so on (year + .csv). The result is as follows:

2. Cleaning the data files

The data file for 2010 looks like this:

As you can see, the real data starts on the third line and stops before the notes at the end of the file, so the header and footer lines have to be stripped (the code below keeps lines[2:-5], dropping the first two and the last five lines). The files also need to be converted from gbk to UTF-8 (the source files are gbk-encoded, and UTF-8 is required). Put the data files into the datafiles folder, then create the Python file collate_data.py in the parent directory of that folder and write the following code:

import os
import codecs

for i in os.listdir("datafiles"):
    path = "datafiles/{}".format(i)

    try:
        # Convert the data file from gbk to utf-8
        with codecs.open(path, "rb", "gbk") as f:
            content = f.read()
        with codecs.open(path, "wb", "utf-8") as f:
            f.write(content)
    except UnicodeDecodeError:
        # The file is not valid gbk (probably already utf-8), so leave it as it is
        pass

    # Remove the introductory lines and the trailing notes:
    # read the lines first, then write back only the data rows
    with codecs.open(path, "rb", "utf-8") as f:
        new_content = f.readlines()[2:-5]
    with codecs.open(path, "wb", "utf-8") as f:
        f.writelines(new_content)

Run this script and the data files will be cleaned. Run it only once: a second run would strip further lines from files that are already clean.
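The same cleaning step can be wrapped in a small function, which makes it easy to try on a throwaway file before touching the real data. This is only a minimal sketch, not the script above: the function name clean_file, its head/tail parameters and the demo file contents are all assumptions made for illustration, and it uses the built-in open instead of codecs.

```python
import os
import tempfile


def clean_file(path, head=2, tail=5, src_encoding="gbk"):
    """Re-encode one data file as UTF-8 and drop its header/footer lines.

    head and tail are hypothetical parameter names: the number of lines
    to drop from the start and from the end of the file.
    """
    # Read the whole file using the source encoding (gbk in this article)
    with open(path, "r", encoding=src_encoding) as f:
        lines = f.readlines()
    # Keep only the data rows; max() guards against very short files
    end = max(head, len(lines) - tail)
    kept = lines[head:end]
    # Write the remaining rows back as UTF-8
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(kept)
    return len(kept)


# Demo on a made-up file: 2 header lines, 2 data rows, 5 footer lines
demo_text = "header 1\nheader 2\n指标,Q4,Q3,Q2,Q1\nGDP,4,3,2,1\n" + "note\n" * 5
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "2010.csv")
    with open(demo_path, "w", encoding="gbk") as f:
        f.write(demo_text)
    rows_kept = clean_file(demo_path)
    with open(demo_path, "r", encoding="utf-8") as f:
        cleaned_text = f.read()

print(rows_kept)
print(cleaned_text)
```

Passing the line counts as parameters keeps the slice easy to adjust if a downloaded file turns out to have a different number of footer lines.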

3. analysis

Here we read the data with pandas and draw charts with matplotlib for the analysis. The libraries are installed with:

pip install pandas matplotlib

3.1 Line chart

The code is as follows:

import pandas
import matplotlib
import matplotlib.pyplot as plt
import os

# Extract data
gdp_Q1 = []        # GDP in the first quarter
gdp_Q2 = []        # GDP in the second quarter
gdp_Q3 = []        # GDP in the third quarter
gdp_Q4 = []        # GDP in the fourth quarter
gdp_all_year = []  # Annual GDP
years = []         # All years

# Fix garbled Chinese characters in the chart
matplotlib.rcParams['font.family'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

# Read the data in each file, in year order (os.listdir does not guarantee any order)
for filename in sorted(os.listdir("datafiles")):
    year = filename.split(".")[0]
    path = "datafiles/{}".format(filename)
    df = pandas.read_csv(path)
    df.columns = ['index', 'Fourth quarter', 'Third quarter', 'Second quarter', 'First quarter']

    # Row 0: GDP of each quarter
    gdp_Q1.append(df['First quarter'][0])
    gdp_Q2.append(df['Second quarter'][0])
    gdp_Q3.append(df['Third quarter'][0])
    gdp_Q4.append(df['Fourth quarter'][0])
    # Row 1 of the fourth-quarter column is used as the annual total
    gdp_all_year.append(df['Fourth quarter'][1])

    years.append(year)

# Draw the chart
Q1_line, = plt.plot(years, gdp_Q1, color="blue")
Q2_line, = plt.plot(years, gdp_Q2, color="pink")
Q3_line, = plt.plot(years, gdp_Q3, color="green")
Q4_line, = plt.plot(years, gdp_Q4, color="orange")
all_year_line, = plt.plot(years, gdp_all_year, color="red")

plt.title("2010-2021 GDP analysis and forecast")
plt.xlabel("Year")
plt.ylabel("GDP (100 million yuan)")
plt.xticks(years)

plt.legend([Q1_line, Q2_line, Q3_line, Q4_line, all_year_line],
           ['First quarter', 'Second quarter', 'Third quarter', 'Fourth quarter', 'Annual total'],
           loc='upper right')

plt.show()

The result is as follows:

3.2 Bar chart

The code is as follows:

import pandas
import matplotlib
import matplotlib.pyplot as plt
import os

# Extract data
gdp_all_year = []  # Annual gdp
years = []         # All years

# Fix garbled Chinese characters in the chart
matplotlib.rcParams['font.family'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

# Read the data in each file, in year order (os.listdir does not guarantee any order)
for filename in sorted(os.listdir("datafiles")):
    year = filename.split(".")[0]
    path = "datafiles/{}".format(filename)
    df = pandas.read_csv(path)
    df.columns = ['index', 'Fourth quarter', 'Third quarter', 'Second quarter', 'First quarter']

    # Row 1 of the fourth-quarter column is used as the annual total
    gdp_all_year.append(df['Fourth quarter'][1])
    years.append(year)

# Draw statistical chart
plt.bar(years, gdp_all_year, width=0.5, label="numbers")
all_year_line, = plt.plot(years, gdp_all_year, color="red")
plt.title("2010-2021 GDP analysis", loc="center")
plt.xlabel("Year", fontsize=14)
plt.ylabel("GDP (100 million yuan)", fontsize=14)

plt.show()

The result is as follows:

The chart shows that GDP growth was hit in 2020 by the epidemic, but the overall trend across the period is upward.
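To put numbers on the trend seen in the chart, the year-over-year change can be computed from the list of annual totals. The sketch below is only an illustration: the function name year_over_year_growth is made up here, and the sample values are hypothetical, not the figures from the data files.

```python
def year_over_year_growth(values):
    """Return the percentage change between each pair of consecutive values."""
    return [(curr - prev) / prev * 100 for prev, curr in zip(values, values[1:])]


# Hypothetical annual totals, for illustration only (not the article's data)
sample_years = ["2018", "2019", "2020", "2021"]
sample_gdp = [100.0, 110.0, 99.0, 118.8]

growth = year_over_year_growth(sample_gdp)
# The first year has no predecessor, so the rates start at the second year
for year, rate in zip(sample_years[1:], growth):
    print("{}: {:+.1f}%".format(year, rate))
```

With the real data, the same call would be year_over_year_growth(gdp_all_year), provided the years were read in ascending order.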

4. Fitting a linear regression equation

Next, we use the scikit-learn (sklearn) machine learning library to fit a linear regression equation. It is installed with:

pip install scikit-learn

4.1 Using the first quarter as the input

The idea is as follows: we use sklearn to fit a linear regression equation with first-quarter GDP as the only input variable and annual GDP as the target, which gives us a prediction equation. The code is as follows:

import pandas
import matplotlib
import matplotlib.pyplot as plt
import os
from sklearn import linear_model

# Extract data
gdp_Q1 = []        # gdp in the first quarter
gdp_Q2 = []        # gdp in the second quarter
gdp_all_year = []  # Annual gdp
years = []         # All years

# Fix garbled Chinese characters in the chart
matplotlib.rcParams['font.family'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

# Read the data in each file, in year order (os.listdir does not guarantee any order)
for filename in sorted(os.listdir("datafiles")):
    year = filename.split(".")[0]
    path = "datafiles/{}".format(filename)
    df = pandas.read_csv(path)
    df.columns = ['index', 'Fourth quarter', 'Third quarter', 'Second quarter', 'First quarter']

    # Row 0: GDP of each quarter
    gdp_Q1.append(df['First quarter'][0])
    gdp_Q2.append(df['Second quarter'][0])
    # Row 1 of the fourth-quarter column is used as the annual total
    gdp_all_year.append(df['Fourth quarter'][1])
    years.append(year)

# Draw the chart
Q1_line, = plt.plot(years, gdp_Q1, color="blue")
Q2_line, = plt.plot(years, gdp_Q2, color="pink")
all_year_line, = plt.plot(years, gdp_all_year, color="red")

plt.title("2010-2021 GDP analysis and forecast")
plt.xlabel("Year")
plt.ylabel("GDP (100 million yuan)")
plt.xticks(years)

# Fit the equation
# Create a linear regression model
model = linear_model.LinearRegression()

# fit() expects one row of features per sample, so zip() wraps each Q1 value in a 1-tuple
model.fit(list(zip(gdp_Q1)), gdp_all_year)
# Get the coefficient
coef = model.coef_
# Get the intercept
intercept = model.intercept_
# Build the equation as a string
equation = "y = x*{} + {}".format(coef[0], intercept)
print("Linear regression equation:", equation)
# Evaluate the equation for every year
forecast_value = [i*coef[0]+intercept for i in gdp_Q1]
# Draw the fitted values as a line
forecast_line, = plt.plot(years, forecast_value, color="green")

plt.legend([Q1_line, Q2_line, all_year_line, forecast_line],
           ['First quarter', 'Second quarter', 'Annual total', 'Fitted equation'],
           loc='upper right')

plt.show()

The result is as follows:

It can be seen that the equation roughly fits the curve. However, because first-quarter GDP fell in 2020, the annual value predicted for that year is far from the actual one. What should we do?
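With a single input, the line that LinearRegression fits by default is the ordinary least-squares line, which has a simple closed form. The sketch below works it out in plain Python and lists the residuals (actual minus fitted), which is one way to see which year the equation misses most. The function name fit_line is hypothetical and the numbers are made up for illustration; they are not the GDP data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values: the last "year" has an unusually low first quarter
q1 = [10.0, 12.0, 14.0, 11.0]
annual = [44.0, 52.0, 60.0, 58.0]

slope, intercept = fit_line(q1, annual)
# Residual = actual value minus the value given by the fitted line
residuals = [y - (slope * x + intercept) for x, y in zip(q1, annual)]

print("y = x*{} + {}".format(slope, intercept))
for x, r in zip(q1, residuals):
    print("Q1 = {}: residual {:+.3f}".format(x, r))
```

In this toy data the largest residual belongs to the last sample, the one whose first quarter is out of line with the rest, which mirrors what happens to 2020 in the chart.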

4.2 Using the first and second quarters as inputs

We can use both Q1 and Q2 as inputs. The code is as follows:

import pandas
import matplotlib
import matplotlib.pyplot as plt
import os
from sklearn import linear_model

# Extract data
gdp_Q1 = []        # gdp in the first quarter
gdp_Q2 = []        # gdp in the second quarter
gdp_all_year = []  # Annual gdp
years = []         # All years

# Fix garbled Chinese characters in the chart
matplotlib.rcParams['font.family'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

# Read the data in each file, in year order (os.listdir does not guarantee any order)
for filename in sorted(os.listdir("datafiles")):
    year = filename.split(".")[0]
    path = "datafiles/{}".format(filename)
    df = pandas.read_csv(path)
    df.columns = ['index', 'Fourth quarter', 'Third quarter', 'Second quarter', 'First quarter']

    # Row 0: GDP of each quarter
    gdp_Q1.append(df['First quarter'][0])
    gdp_Q2.append(df['Second quarter'][0])
    # Row 1 of the fourth-quarter column is used as the annual total
    gdp_all_year.append(df['Fourth quarter'][1])
    years.append(year)

# Draw the chart
Q1_line, = plt.plot(years, gdp_Q1, color="blue")
Q2_line, = plt.plot(years, gdp_Q2, color="pink")
all_year_line, = plt.plot(years, gdp_all_year, color="red")

plt.title("2010-2021 GDP analysis and forecast")
plt.xlabel("Year")
plt.ylabel("GDP (100 million yuan)")
plt.xticks(years)

# Fit the equation with two features per sample: (Q1, Q2)
model = linear_model.LinearRegression()
model.fit(list(zip(gdp_Q1, gdp_Q2)), gdp_all_year)
coef = model.coef_
intercept = model.intercept_
equation = "y = x1*{} + x2*{} + {}".format(coef[0], coef[1], intercept)

print("Linear regression equation:", equation)

# Evaluate the equation for every year and draw the fitted values
forecast_value = [q1*coef[0] + q2*coef[1] + intercept for q1, q2 in zip(gdp_Q1, gdp_Q2)]
forecast_line, = plt.plot(years, forecast_value, color="green")

plt.legend([Q1_line, Q2_line, all_year_line, forecast_line],
           ['First quarter', 'Second quarter', 'Annual total', 'Fitted equation'],
           loc='upper right')

plt.show()

The result is as follows:

As you can see, this equation fits the historical data very well, so it can be used as a prediction equation.
The linear regression equation is:

y = x1*0.20405068090604006 + x2*3.8656156020304238 + 9671.424027125235

In other words:

Annual GDP = first-quarter GDP*0.20405068090604006 + second-quarter GDP*3.8656156020304238 + 9671.424027125235
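To use the equation for a forecast, plug a year's first- and second-quarter GDP into it. A minimal sketch follows, with the coefficients copied from the equation above. The function name predict_annual_gdp is hypothetical, and the input values are placeholders chosen only to show the call, not official figures.

```python
# Coefficients copied from the linear regression equation printed above
COEF_Q1 = 0.20405068090604006
COEF_Q2 = 3.8656156020304238
INTERCEPT = 9671.424027125235


def predict_annual_gdp(q1_gdp, q2_gdp):
    """Estimate annual GDP (100 million yuan) from Q1 and Q2 GDP."""
    return q1_gdp * COEF_Q1 + q2_gdp * COEF_Q2 + INTERCEPT


# Placeholder inputs for illustration only (not official statistics)
example_forecast = predict_annual_gdp(250000.0, 280000.0)
print(round(example_forecast, 2))
```

If the fitted model object from section 4.2 is still in memory, model.predict([[q1_gdp, q2_gdp]]) performs the same calculation without copying the coefficients by hand.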

This completes the whole process of analysis and prediction.

🎉🎉🎉 Well, that's all for today's lesson. I'm wangzirui32. If you enjoyed it, feel free to bookmark and follow. See you next time!

Tags: Python Machine Learning matplotlib programming language

Posted by flash gordon on Fri, 01 Jul 2022 02:11:09 +0930