4-02-3 Matplotlib scatter plot, diverging bar chart and pie chart

Next, we turn to data visualization, which uses charts to express the characteristics of data. A good chart should have the following characteristics:

  • It provides accurate, relevant information without distorting the facts.
  • Its design is simple, so the reader can take in the information without much effort.
  • Its aesthetics support the information rather than obscure it.
  • It does not carry too much information or an overly complex structure.

Charts can be divided into the following categories:

  • Correlation: used to display the relationship between different variables. Common charts include the scatter plot, bubble plot with encircling, scatter plot with line of best fit, jittering with stripplot, counts plot, marginal histogram, marginal boxplot, correlogram, pairwise plot, etc.
  • Deviation: used to show the differences between different variables. Common charts include diverging bars, diverging texts, diverging dot plot, diverging lollipop chart with markers, area chart, etc.
  • Ranking: used to express the order of things or items. Common charts include the ordered bar chart, lollipop chart, dot plot, slope chart and dumbbell plot.
  • Distribution: used to draw the distributions found in probability and statistics, including histogram for continuous variable, histogram for categorical variable, density plot, density curves with histogram, joy plot, distributed dot plot, box plot, dot + box plot, violin plot, population pyramid, categorical plots, etc.
  • Composition: used to represent the proportions of the components of a whole, including waffle chart, pie chart, treemap, bar chart, etc.
  • Change: used to highlight changes over time or space, including time series plot, time series with peaks and troughs annotated, autocorrelation plot, cross correlation plot, time series decomposition plot, multiple time series, plotting with different scales using secondary y-axis, time series with error bands, stacked area chart, area chart unstacked, calendar heat map, seasonal plot, etc.
  • Groups: used to distinguish different groups, including dendrogram, cluster plot, Andrews curve and parallel coordinates.

Correlation scatter plot
The following example uses the population data for the Midwest states of the United States (midwest.csv). First, group the data by state, giving 16 categories in total, and assign each category a different color. Then display the summary information of the midwest data set, which has 437 entries and 28 fields. Finally, draw a scatter plot of the percentage of the population below the poverty line (percbelowpoverty) against the percentage that is college educated (percollege). Most of the points are concentrated in the lower left corner. As the proportion of college-educated people increases, the proportion below the poverty line tends to be smaller, which suggests a negative correlation between college education and poverty.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Load the data set
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest.csv")
# Display the data set information (437 entries and 28 fields)
midwest.info()
# Assign a color to each state
states = np.unique(midwest['state'])
colors = [plt.cm.tab10(i/float(len(states)-1)) for i in range(len(states))]
# Specify the canvas size
plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')
# Draw the scatter points of each state
for i, state in enumerate(states):
    plt.scatter('percollege', 'percbelowpoverty', data=midwest.loc[midwest.state==state, :], s=20, color=colors[i], label=str(state))
# Specify the axis limits and the X/Y labels
plt.gca().set(xlim=(0, 60), ylim=(0, 60), xlabel='Percentage with a college degree', ylabel='Percentage below the poverty line')
plt.title("Scatter plot of college education versus the poverty rate in the Midwestern states of the United States", fontsize=22)
# Show the legend and the figure
plt.legend(fontsize=12)
plt.show()

Figure 4-2-5 Scatter plot of the college-educated percentage versus the percentage below the poverty line
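The downward trend read off the scatter plot can also be checked numerically with a Pearson correlation coefficient. The sketch below is only an illustration: the helper `pearson_r` and the six pairs of percentages are hypothetical and are not taken from midwest.csv.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    # Sum of products of deviations from the mean (unnormalized covariance)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical county-level percentages, NOT values from midwest.csv
percollege = [12.0, 15.5, 18.0, 22.5, 30.0, 41.0]
percbelowpoverty = [21.0, 17.5, 14.0, 12.5, 9.0, 6.5]

r = pearson_r(percollege, percbelowpoverty)
# A negative r means poverty tends to fall as college education rises
print(f"Pearson r = {r:.2f}")
```

With the real data loaded into a pandas DataFrame, the same coefficient should be available as midwest['percollege'].corr(midwest['percbelowpoverty']), since Series.corr computes the Pearson correlation by default.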

Deviation diverging bars
If you want to see how items vary on a single metric and visualize the order and size of that variation, the diverging bar chart is a good tool. It helps to quickly distinguish the performance of groups in the data, is intuitive, and conveys the key point immediately. The following example uses the mtcars data set. The data were extracted from the 1974 Motor Trend US magazine and cover the fuel consumption of 32 cars along with aspects of their design and performance; the file loaded here has 14 columns. The metric observed is miles per gallon (mpg). The code standardizes mpg to a z-score, so cars below the average are drawn in red and cars above it in green. The chart shows that the Ferrari Dino is near the middle, the Toyota Corolla has the best fuel economy, and the Lincoln Continental consumes the most fuel.

Description of the mtcars data set

mpg      Miles per gallon
cyl      Number of cylinders
disp     Displacement (cubic inches)
hp       Gross horsepower
drat     Rear axle ratio
wt       Weight (1000 lbs)
qsec     Quarter-mile time (1/4 mile, about 400 meters)
vs       Engine type (0 = V-shaped, 1 = straight)
am       Transmission (0 = automatic, 1 = manual)
gear     Number of forward gears
carb     Number of carburetors
cars     Car model
carname  Car model

import pandas as pd
import matplotlib.pyplot as plt
# Load the data set and display its information
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
df.info()
# Standardize mpg to a z-score: negative is below average, positive is above
x = df['mpg']
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if z < 0 else 'green' for z in df['mpg_z']]
# Sort by z-score, then reset the index so the row positions follow the sorted order
df.sort_values('mpg_z', inplace=True)
df.reset_index(drop=True, inplace=True)
# Draw plot
plt.figure(figsize=(14,10), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df['mpg_z'], colors=df['colors'].tolist(), alpha=0.4, linewidth=5)
# Decorations
plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')
plt.yticks(df.index, df['cars'], fontsize=12)
plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
The output results are as follows:
RangeIndex: 32 entries, 0 to 31
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   mpg      32 non-null     float64
 1   cyl      32 non-null     int64
 2   disp     32 non-null     float64
 3   hp       32 non-null     int64
 4   drat     32 non-null     float64
 5   wt       32 non-null     float64
 6   qsec     32 non-null     float64
 7   vs       32 non-null     int64
 8   am       32 non-null     int64
 9   gear     32 non-null     int64
 10  carb     32 non-null     int64
 11  fast     32 non-null     int64
 12  cars     32 non-null     object
 13  carname  32 non-null     object
dtypes: float64(5), int64(7), object(2)
memory usage: 3.6+ KB

Figure 4-2-6 Diverging bar chart of automobile fuel consumption performance
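The standardization behind the diverging bars can be traced by hand on a few values. The following is a minimal sketch, assuming a three-car subset with the mpg values commonly listed for these models in mtcars; the helper `bar_color` is hypothetical. Because the real chart standardizes all 32 cars, the z-scores here differ from those in the figure, but the sign-to-color rule and the sort order work the same way.

```python
from statistics import mean, stdev

# Illustrative three-car subset (assumed mpg values); the real chart uses all 32 cars
mpg = {
    "Toyota Corolla": 33.9,
    "Ferrari Dino": 19.7,
    "Lincoln Continental": 10.4,
}

values = list(mpg.values())
avg = mean(values)
sd = stdev(values)  # sample standard deviation, the same default as pandas Series.std()


def bar_color(z):
    """Red for below-average cars, green otherwise (same rule as the chart code)."""
    return "red" if z < 0 else "green"


# Standardize each value, then sort ascending as the chart does before drawing
rows = sorted(((name, (v - avg) / sd) for name, v in mpg.items()), key=lambda r: r[1])

for name, z in rows:
    print(f"{name:<20} z = {z:+.2f}  {bar_color(z)}")
```

Sorting before drawing is what makes the bars run from the most fuel-hungry car to the most economical one.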

Composition pie chart
A pie chart is a classic way to display the composition of a whole and is widely used. If you use a pie chart, it is strongly recommended to label each slice clearly with its percentage or count. Grouping the mtcars data set by the number of cylinders shows that 14 vehicles have 8 cylinders, accounting for 43.8% of all vehicles.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
# Prepare Data: count the vehicles for each number of cylinders
df = df_raw.groupby('cyl').size().reset_index(name='counts')
# Draw Plot
fig, ax = plt.subplots(figsize=(12, 7), subplot_kw=dict(aspect="equal"), dpi=80)
data = df['counts']
categories = df['cyl']
explode = [0,0.1,0]
# Format each slice label as "percentage (count)"
def func(pct, allvals):
    # Round instead of truncating so floating-point error cannot drop a count by one
    absolute = int(round(pct/100.*np.sum(allvals)))
    return "{:.1f}% ({:d})".format(pct, absolute)
wedges, texts, autotexts = ax.pie(data,
                                  autopct=lambda pct: func(pct, data),
                                  explode=explode)
# Decoration
ax.legend(wedges, categories, title="Cylinders", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(autotexts, size=10, weight=700)
ax.set_title("Pie chart of automobile cylinder number")
plt.show()

Figure 4-2-7 Pie chart of the proportion of vehicles by number of cylinders
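The percentage and count printed on each slice can be reproduced without drawing anything. The sketch below assumes the usual mtcars cylinder counts: the text states only the 14 eight-cylinder cars out of 32, so the counts of 11 and 7 for four and six cylinders are an assumption. The helper `slice_label` is hypothetical and mirrors the label format used in the pie chart code.

```python
# Vehicles per cylinder count; 14 of 32 comes from the text, 11 and 7 are assumed
counts = {4: 11, 6: 7, 8: 14}
total = sum(counts.values())


def slice_label(pct, total):
    """Format a slice as 'percentage (count)', recovering the count from the percentage."""
    absolute = round(pct / 100 * total)
    return "{:.1f}% ({:d})".format(pct, absolute)


labels = {cyl: slice_label(100 * n / total, total) for cyl, n in counts.items()}

for cyl, text in labels.items():
    print(f"{cyl} cylinders: {text}")
# The 8-cylinder slice comes out as "43.8% (14)", matching the figure quoted in the text
```

Rounding matters here because the count is recovered from a floating-point percentage, and plain truncation can land one below the true value.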

Tags: Python

Posted by Darklink on Mon, 18 Apr 2022 11:13:29 +0930