Next, we discuss data visualization, which expresses the characteristics of data through charts. A good chart has the following characteristics:
- It provides accurate, relevant information without distorting the facts.
- Its design is simple, so the reader can extract the information without much effort.
- Its aesthetics support the information rather than obscure it.
- It avoids excessive information and overly complex structure.
Charts can be divided into the following categories:
- Correlation: shows the relationship between different variables. Examples include scatter plot, bubble plot with encircling, scatter plot with line of best fit, jittering with stripplot, counts plot, marginal histogram, marginal boxplot, correlogram, and pairwise plot.
- Deviation: shows the differences between different variables. Common examples include diverging bars, diverging texts, diverging dot plot, diverging lollipop chart with markers, and area chart.
- Ranking: expresses the order of things or items. Common examples are ordered bar chart, lollipop chart, dot plot, slope chart, and dumbbell plot.
- Distribution: shows the distribution of a variable, as in probability and statistics. Examples include histogram for continuous variable, histogram for categorical variable, density plot, density curves with histogram, joy plot, distributed dot plot, box plot, dot + box plot, violin plot, population pyramid, and categorical plots.
- Composition: shows how a whole is divided into components. Examples include waffle chart, pie chart, treemap, and bar chart.
- Change: highlights changes over time or space. Examples include time series plot, time series with peaks and troughs annotated, autocorrelation plot, cross correlation plot, time series decomposition plot, multiple time series, plotting with different scales using a secondary y-axis, time series with error bands, stacked area chart, unstacked area chart, calendar heat map, and seasonal plot.
- Groups: distinguishes different groups. Examples include dendrogram, cluster plot, Andrews curve, and parallel coordinates.
Correlation scatter plot
The following example uses population data for the Midwestern states of the United States (midwest.csv). First, display the information of the midwest data set, which has 437 records and 28 fields. Then group the records by state, give each state a different color, and draw a scatter plot of the percentage with a college education (`percollege`) against the percentage below the poverty line (`percbelowpoverty`). Most of the points are concentrated in the lower left corner. As the proportion with a college education increases, the proportion below the poverty line tends to be smaller, which suggests a negative correlation between college education and poverty.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest.csv")

# Display data set information: 437 data entries and 28 fields
midwest.info()

# Assign a color to each state
states = np.unique(midwest['state'])
colors = [plt.cm.tab10(i / float(len(states) - 1)) for i in range(len(states))]

# Specify canvas size
plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')

# Draw one scatter series per state
for i, state in enumerate(states):
    plt.scatter('percollege', 'percbelowpoverty',
                data=midwest.loc[midwest.state == state, :],
                s=20, color=colors[i], label=str(state))

# Specify axis limits and X/Y labels
plt.gca().set(xlim=(0, 60), ylim=(0, 60),
              xlabel='Proportion of college degree',
              ylabel='Proportion below the poverty line')

plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title("Scatter plot of the ratio of college education to the ratio below the poverty line in the Midwestern states of the United States", fontsize=22)
plt.legend(fontsize=12)
plt.show()
```
Figure 4-2-5 Scatter plot of college education level versus proportion below the poverty line
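The relationship read off the scatter plot can also be expressed as a single number. The following is a minimal sketch of the Pearson correlation coefficient in plain Python. The helper name `pearson_r` and the sample values are made up for illustration and are not taken from midwest.csv; with pandas, `Series.corr` should give the same kind of result on the real columns.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative values only (assumption): NOT real rows from midwest.csv
college_pct = [10.0, 15.0, 20.0, 30.0, 40.0]
poverty_pct = [25.0, 18.0, 15.0, 10.0, 6.0]

r = pearson_r(college_pct, poverty_pct)
print(f"r = {r:.2f}")  # a value below 0 means the two proportions move in opposite directions
```

A coefficient close to -1 indicates a strong negative linear relationship, a value near 0 indicates little linear relationship, and a value close to +1 indicates a strong positive one.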
Deviation diverging bars
If you want to see how items vary on a single indicator, and to visualize the order and size of that variation, diverging bars are a good tool. They make it easy to distinguish how groups in the data perform, are intuitive, and convey the key points immediately. The following example uses the mtcars data set. The data were extracted from the 1974 Motor Trend US magazine and describe the fuel consumption, design, and performance of 32 car models; the CSV file used here has 14 fields. The indicator observed is miles per gallon (mpg), standardized so that bars extend to the left of zero for below-average cars and to the right for above-average cars. The fuel economy of the Ferrari Dino is near the middle, the Toyota Corolla performs best, and the Lincoln Continental consumes the most fuel.
Description of mtcars dataset
Field | Description |
---|---|
mpg | Miles per gallon |
cyl | Number of cylinders |
disp | Displacement (cubic inches) |
hp | Gross horsepower |
drat | Rear axle ratio |
wt | Weight (1000 lbs) |
qsec | Quarter-mile (about 400 meters) time, in seconds |
vs | Engine type (0 = V-shaped, 1 = straight) |
am | Transmission (0 = automatic, 1 = manual) |
gear | Number of forward gears |
carb | Number of carburetors |
cars | Car model |
carname | Car model |
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
df.info()

# Standardize mpg to a z-score and color each bar by its sign
x = df['mpg']
df['mpg_z'] = (x - x.mean()) / x.std()
df['colors'] = ['red' if z < 0 else 'green' for z in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)

# Draw plot
plt.figure(figsize=(14, 10), dpi=80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=5)

# Decorations
plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Bars of Car Mileage', fontdict={'size': 20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
```

The output results are as follows:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   mpg      32 non-null     float64
 1   cyl      32 non-null     int64
 2   disp     32 non-null     float64
 3   hp       32 non-null     int64
 4   drat     32 non-null     float64
 5   wt       32 non-null     float64
 6   qsec     32 non-null     float64
 7   vs       32 non-null     int64
 8   am       32 non-null     int64
 9   gear     32 non-null     int64
 10  carb     32 non-null     int64
 11  fast     32 non-null     int64
 12  cars     32 non-null     object
 13  carname  32 non-null     object
dtypes: float64(5), int64(7), object(2)
memory usage: 3.6+ KB
```
Figure 4-2-6 Diverging bar chart of automobile fuel consumption performance
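The bars diverge from zero because mpg is first standardized to a z-score: each value minus the mean, divided by the standard deviation. The sketch below works through that arithmetic in plain Python. The helper `z_scores` and the sample mpg values are hypothetical and chosen only for illustration; `statistics.stdev` is the sample standard deviation, which matches the default used by pandas `Series.std()`.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample std."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - m) / s for v in values]

# Illustrative values only (assumption): NOT the real mtcars mpg column
mpg = [10.0, 15.0, 20.0, 25.0, 30.0]

z = z_scores(mpg)
# Same coloring rule as the chart: below average is red, otherwise green
colors = ['red' if v < 0 else 'green' for v in z]

for value, score, color in zip(mpg, z, colors):
    print(f"mpg={value:5.1f}  z={score:+.2f}  {color}")
```

Because the z-scores always average to zero, roughly half of the bars point left and half point right, which is what gives the chart its diverging shape.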
Composition pie chart
A pie chart is a classic and widely used way to display the composition of a whole. If you use a pie chart, it is strongly recommended to label each slice clearly with its percentage or count. Grouping the mtcars data set by number of cylinders shows that 14 vehicles have 8 cylinders, accounting for 43.8% of all vehicles.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")

# Prepare data: number of cars for each cylinder count
df = df_raw.groupby('cyl').size().reset_index(name='counts')

# Draw plot
fig, ax = plt.subplots(figsize=(12, 7), subplot_kw=dict(aspect="equal"), dpi=80)

data = df['counts']
categories = df['cyl']
explode = [0, 0.1, 0]  # pull the second slice out slightly

def func(pct, allvals):
    # Convert the slice percentage back to an absolute count
    absolute = int(round(pct / 100. * np.sum(allvals)))
    return "{:.1f}% ({:d})".format(pct, absolute)

wedges, texts, autotexts = ax.pie(data,
                                  autopct=lambda pct: func(pct, data),
                                  textprops=dict(color="w"),
                                  colors=plt.cm.Dark2.colors,
                                  startangle=140,
                                  explode=explode)

# Decoration
ax.legend(wedges, categories, title="Cylinders", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
plt.setp(autotexts, size=10, weight=700)
ax.set_title("Pie chart of automobile cylinder number")
plt.show()
```
Figure 4-2-7 Pie chart of the proportion of vehicles by number of cylinders
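The percentage on each slice is simply the group count divided by the total. The sketch below reproduces that arithmetic in plain Python. The helper `category_shares` is hypothetical. The count of 14 eight-cylinder cars comes from the text above, while the counts of 11 four-cylinder and 7 six-cylinder cars are assumed from the standard mtcars data set and should be checked against the CSV.

```python
from collections import Counter

def category_shares(labels):
    """Return {category: (count, percentage of total)} for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: (n, 100.0 * n / total) for k, n in sorted(counts.items())}

# 14 eight-cylinder cars is stated in the text; 11 and 7 are assumed
# from the standard mtcars data set (32 cars in total)
cyl = [4] * 11 + [6] * 7 + [8] * 14

shares = category_shares(cyl)
for k, (n, pct) in shares.items():
    print(f"{k} cylinders: {pct:.1f}% ({n})")
```

Printing both the percentage and the count follows the recommendation above to label every slice explicitly, so the reader does not have to estimate proportions from angles.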