✨ Blogger wangzirui32
💖 If you enjoy this post, feel free to like, bookmark, and follow~~
👏 My 162nd original work
👉 This article was first published on CSDN. Reprinting without permission is prohibited.
Hello, I'm wangzirui32. Today, let's learn how to analyze and forecast GDP. Let's get started!
1. Data source
CSV files of GDP data covering 2010-2021, taken from the official data of the National Bureau of Statistics:
Select the time range (2010-2021) and download one CSV file per year, named 2010.csv, 2011.csv, and so on (year + .csv). The result looks like this:
2. Cleaning the data files
The data file for 2010 looks like this:
As you can see, the actual data starts on the third line and ends before the footer notes at the bottom of the file (the script below drops the first two and the last five lines). We need to extract that data and convert the encoding from GBK (the encoding of the source files) to UTF-8 (the encoding we need). Put the data files into a folder named datafiles, create the Python file collate_data.py in its parent directory, and write the following code:
import os
import codecs

for filename in os.listdir("datafiles"):
    path = "datafiles/{}".format(filename)
    try:
        # Convert the data file from GBK to UTF-8
        with codecs.open(path, "rb", "gbk") as f:
            content = f.read()
        with codecs.open(path, "wb", "utf-8") as f:
            f.write(content)
    except UnicodeDecodeError:
        # Not valid GBK, so the file was probably converted already
        pass

    # Read the converted file first, dropping the intro lines (first 2)
    # and the footer notes (last 5), then write the remaining lines back
    with codecs.open(path, "rb", "utf-8") as f:
        new_content = f.readlines()[2:-5]
    with codecs.open(path, "wb", "utf-8") as f:
        f.writelines(new_content)
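Run this script once and the data files are cleaned. Do not run it a second time on the same files, because each run trims another two lines from the top and five from the bottom.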
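The same conversion and trimming can also be sketched as a pure function, which makes it easy to try on an in-memory sample before touching real files. The function name clean_gbk_csv and the sample bytes are illustrative assumptions, not part of the original script; the default bounds mirror the [2:-5] slice used above.

```python
def clean_gbk_csv(raw: bytes, head: int = 2, tail: int = 5) -> str:
    """Decode GBK bytes and drop `head` leading and `tail` trailing lines."""
    lines = raw.decode("gbk").splitlines(keepends=True)
    # len(lines) - tail also behaves correctly when tail is 0
    return "".join(lines[head:len(lines) - tail])


# Hypothetical sample: 2 intro lines, 2 data lines, 5 footer lines.
sample = "\n".join(
    ["intro 1", "intro 2", "指标,Q4,Q3", "GDP,1,2"] + ["note"] * 5
).encode("gbk")

cleaned = clean_gbk_csv(sample)
print(cleaned)
```

Because the function returns a new string instead of overwriting a file, calling it twice on the same input cannot trim the data a second time.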
3. Analysis
Here we read the data with pandas and draw charts with matplotlib for the analysis. Install both libraries with:
pip install pandas matplotlib
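The scripts in this section rely on a particular layout of each cleaned file. The sketch below illustrates that layout as implied by the indexing used in the code: a header row, then a row of single-quarter values, then a row of cumulative values, with the columns ordered from the fourth quarter down to the first. This reading is an assumption, the numbers are made-up placeholders rather than real GDP figures, and the standard csv module stands in for pandas only to keep the example dependency-free.

```python
import csv
import io

# Placeholder content; the real files come from the cleaning step.
sample = (
    "indicator,Q4,Q3,Q2,Q1\n"
    "GDP (current quarter),40,30,20,10\n"
    "GDP (cumulative),100,60,30,10\n"
)

rows = list(csv.reader(io.StringIO(sample)))
header, quarter_row, cumulative_row = rows

# The last column holds the first quarter.
q1_value = float(quarter_row[4])
# The cumulative value at the fourth quarter is used as the annual total.
annual_total = float(cumulative_row[1])

print(q1_value, annual_total)
```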
3.1 Line chart
The code is as follows:
import os

import matplotlib
import matplotlib.pyplot as plt
import pandas

# Extracted data
gdp_Q1 = []  # GDP in the first quarter
gdp_Q2 = []  # GDP in the second quarter
gdp_Q3 = []  # GDP in the third quarter
gdp_Q4 = []  # GDP in the fourth quarter
gdp_all_year = []  # Annual GDP
years = []  # All years

# Use a font that can render Chinese characters
matplotlib.rcParams['font.family'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

# Read the data in each file; sorting keeps the years in order
for filename in sorted(os.listdir("datafiles")):
    year = filename.split(".")[0]
    path = "datafiles/{}".format(filename)
    df = pandas.read_csv(path)
    df.columns = ['index', 'Fourth quarter', 'Third quarter',
                  'Second quarter', 'First quarter']
    gdp_Q1.append(df['First quarter'][0])
    gdp_Q2.append(df['Second quarter'][0])
    gdp_Q3.append(df['Third quarter'][0])
    gdp_Q4.append(df['Fourth quarter'][0])
    # Row 1 of the fourth-quarter column is used as the annual total
    gdp_all_year.append(df['Fourth quarter'][1])
    years.append(year)

# Draw the chart
Q1_line, = plt.plot(years, gdp_Q1, color="blue")
Q2_line, = plt.plot(years, gdp_Q2, color="pink")
Q3_line, = plt.plot(years, gdp_Q3, color="green")
Q4_line, = plt.plot(years, gdp_Q4, color="orange")
all_year_line, = plt.plot(years, gdp_all_year, color="red")

plt.title("2010-2021 GDP analysis and forecast")
plt.xlabel("Year")
plt.ylabel("GDP (100 million yuan)")
plt.xticks(years)
plt.legend([Q1_line, Q2_line, Q3_line, Q4_line, all_year_line],
           ['First quarter', 'Second quarter', 'Third quarter',
            'Fourth quarter', 'Annual total'],
           loc='upper right')
plt.show()
The result looks like this:
3.2 Bar chart
The code is as follows:
import os

import matplotlib
import matplotlib.pyplot as plt
import pandas

# Extracted data
gdp_all_year = []  # Annual GDP
years = []  # All years

# Use a font that can render Chinese characters
matplotlib.rcParams['font.family'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

# Read the data in each file; sorting keeps the years in order
for filename in sorted(os.listdir("datafiles")):
    year = filename.split(".")[0]
    path = "datafiles/{}".format(filename)
    df = pandas.read_csv(path)
    df.columns = ['index', 'Fourth quarter', 'Third quarter',
                  'Second quarter', 'First quarter']
    # Row 1 of the fourth-quarter column is used as the annual total
    gdp_all_year.append(df['Fourth quarter'][1])
    years.append(year)

# Draw the chart: bars for the annual totals, with a trend line on top
plt.bar(years, gdp_all_year, width=0.5, label="Annual total")
plt.plot(years, gdp_all_year, color="red")

plt.title("2010-2021 GDP analysis", loc="center")
plt.xlabel("Year", fontsize=14)
plt.ylabel("GDP (100 million yuan)", fontsize=14)
plt.show()
The result looks like this:
The chart shows that GDP was hit by the epidemic in 2020, but the overall trend over these years is upward.
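To put a number on the trend, the year-over-year growth rate can be computed from the annual totals. The helper name yoy_growth and the input values are hypothetical placeholders; in practice the list would be gdp_all_year from the script above, ordered by year.

```python
def yoy_growth(values):
    """Return the percentage change between consecutive values."""
    return [
        (curr - prev) / prev * 100
        for prev, curr in zip(values, values[1:])
    ]


# Placeholder totals, not real GDP figures.
annual_totals = [100.0, 110.0, 121.0, 123.42]
growth = yoy_growth(annual_totals)
print([round(g, 1) for g in growth])
```

A year with a much smaller percentage than its neighbours stands out immediately in this list, even when the bar chart still looks like a steady climb.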
4. Fitting a linear regression equation
Next, we use the scikit-learn (sklearn) machine learning library to fit a linear regression equation. Install it with:
pip install scikit-learn
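For intuition about what the fit does, a one-variable least-squares line can be computed by hand: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the two means. This is a minimal sketch with placeholder data, and fit_line is an illustrative name; sklearn's LinearRegression solves the same ordinary least-squares problem for any number of features.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance, both without the 1/n factor (it cancels)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder points that lie exactly on y = 4x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 9.0, 13.0, 17.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```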
4.1 Using the first quarter as the feature
The idea is as follows: we use sklearn to fit a linear regression with first-quarter GDP as the only input and the annual total as the target, which gives us a prediction equation. The code is as follows:
import os

import matplotlib
import matplotlib.pyplot as plt
import pandas
from sklearn import linear_model

# Extracted data
gdp_Q1 = []  # GDP in the first quarter
gdp_Q2 = []  # GDP in the second quarter
gdp_all_year = []  # Annual GDP
years = []  # All years

# Use a font that can render Chinese characters
matplotlib.rcParams['font.family'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

# Read the data in each file; sorting keeps the years in order
for filename in sorted(os.listdir("datafiles")):
    year = filename.split(".")[0]
    path = "datafiles/{}".format(filename)
    df = pandas.read_csv(path)
    df.columns = ['index', 'Fourth quarter', 'Third quarter',
                  'Second quarter', 'First quarter']
    gdp_Q1.append(df['First quarter'][0])
    gdp_Q2.append(df['Second quarter'][0])
    gdp_all_year.append(df['Fourth quarter'][1])
    years.append(year)

# Draw the chart
Q1_line, = plt.plot(years, gdp_Q1, color="blue")
Q2_line, = plt.plot(years, gdp_Q2, color="pink")
all_year_line, = plt.plot(years, gdp_all_year, color="red")

plt.title("2010-2021 GDP analysis and forecast")
plt.xlabel("Year")
plt.ylabel("GDP (100 million yuan)")
plt.xticks(years)

# Fit the equation
# Create a linear regression model
model = linear_model.LinearRegression()
# sklearn expects 2-D features, so zip wraps each value in a 1-tuple
model.fit(list(zip(gdp_Q1)), gdp_all_year)
# Get the coefficient
coef = model.coef_
# Get the intercept
intercept = model.intercept_
# Build the equation
equation = "y = x*{} + {}".format(coef[0], intercept)
print("Linear regression equation:", equation)

# Evaluate the equation for every year
forecast_value = [i * coef[0] + intercept for i in gdp_Q1]
# Draw the fitted line
forecast_line, = plt.plot(years, forecast_value, color="green")

plt.legend([Q1_line, Q2_line, all_year_line, forecast_line],
           ['First quarter', 'Second quarter', 'Annual total',
            'Fitted equation'],
           loc='upper right')
plt.show()
The result looks like this:
The fitted line follows the curve reasonably well. However, because first-quarter GDP dropped in 2020, the predicted annual value for that year is far from the actual one. What can we do about that?
4.2 Using the first and second quarters as features
We can use both Q1 and Q2 as inputs. The code is as follows:
import os

import matplotlib
import matplotlib.pyplot as plt
import pandas
from sklearn import linear_model

# Extracted data
gdp_Q1 = []  # GDP in the first quarter
gdp_Q2 = []  # GDP in the second quarter
gdp_all_year = []  # Annual GDP
years = []  # All years

# Use a font that can render Chinese characters
matplotlib.rcParams['font.family'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False

# Read the data in each file; sorting keeps the years in order
for filename in sorted(os.listdir("datafiles")):
    year = filename.split(".")[0]
    path = "datafiles/{}".format(filename)
    df = pandas.read_csv(path)
    df.columns = ['index', 'Fourth quarter', 'Third quarter',
                  'Second quarter', 'First quarter']
    gdp_Q1.append(df['First quarter'][0])
    gdp_Q2.append(df['Second quarter'][0])
    gdp_all_year.append(df['Fourth quarter'][1])
    years.append(year)

# Draw the chart
Q1_line, = plt.plot(years, gdp_Q1, color="blue")
Q2_line, = plt.plot(years, gdp_Q2, color="pink")
all_year_line, = plt.plot(years, gdp_all_year, color="red")

plt.title("2010-2021 GDP analysis and forecast")
plt.xlabel("Year")
plt.ylabel("GDP (100 million yuan)")
plt.xticks(years)

# Fit the equation with two features per year: (Q1, Q2)
features = list(zip(gdp_Q1, gdp_Q2))
model = linear_model.LinearRegression()
model.fit(features, gdp_all_year)
coef = model.coef_
intercept = model.intercept_
equation = "y = x1*{} + x2*{} + {}".format(coef[0], coef[1], intercept)
print("Linear regression equation:", equation)

# Calculate the forecast for every year
forecast_value = [q1 * coef[0] + q2 * coef[1] + intercept
                  for q1, q2 in features]
forecast_line, = plt.plot(years, forecast_value, color="green")

plt.legend([Q1_line, Q2_line, all_year_line, forecast_line],
           ['First quarter', 'Second quarter', 'Annual total',
            'Fitted equation'],
           loc='upper right')
plt.show()
The result looks like this:
This equation fits the data very well, so we can use it as the prediction equation.
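How good the fit is can be made concrete with the coefficient of determination, R², which compares the squared prediction errors with the spread of the actual values (1.0 means a perfect fit). This is a minimal sketch with placeholder lists and an illustrative function name; in the script above the inputs would be gdp_all_year and forecast_value. sklearn models also offer a score method that returns the same quantity.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Placeholder values, not real GDP figures.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 31.0, 39.0]
score = r_squared(actual, predicted)
print(round(score, 4))
```

Computing this number for the one-feature fit and the two-feature fit gives a direct comparison instead of judging the charts by eye.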
The linear regression equation is:
y = x1*0.20405068090604006 + x2*3.8656156020304238 + 9671.424027125235
Equivalent to:
Annual GDP = first-quarter GDP * 0.20405068090604006 + second-quarter GDP * 3.8656156020304238 + 9671.424027125235
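Applying the equation is then a single calculation. The sketch below wraps it in a function: the coefficients are the ones printed above, while the function name and the example inputs are hypothetical. The inputs must use the same unit as the training data (100 million yuan).

```python
# Coefficients taken from the fitted equation in this article.
COEF_Q1 = 0.20405068090604006
COEF_Q2 = 3.8656156020304238
INTERCEPT = 9671.424027125235


def predict_annual_gdp(q1, q2):
    """Estimate annual GDP from first- and second-quarter GDP."""
    return q1 * COEF_Q1 + q2 * COEF_Q2 + INTERCEPT


# Placeholder inputs, not real GDP figures.
estimate = predict_annual_gdp(200000.0, 250000.0)
print(round(estimate, 2))
```

Since the equation was fitted on only twelve years of data, estimates for values far outside the 2010-2021 range should be treated with caution.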
This completes the whole process of analysis and prediction.
🎉🎉🎉 Well, that's all for today. I'm wangzirui32. If you enjoyed this post, feel free to bookmark and follow. See you next time!