Test score analysis
📖 background
Your best friend is an administrator at a big school. The school requires every student to take end-of-year math, reading and writing exams.
Since you recently learned about data manipulation and visualization, I suggest you help your friend analyze the scoring results. School principals want to know if test prep courses are helpful. She also wanted to explore the effect of parental education level on test scores.
💾 data
This file has the following fields:
- "gender" - male / female
- "race/ethnicity" - one of 5 combinations of race/ethnicity
- "parent_education_level" - highest education level of either parent
- "lunch" - whether the student receives free/reduced or standard lunch
- "test_prep_course" - whether the student took the test preparation course
- "math" - exam score in math
- "reading" - exam score in reading
- "writing" - exam score in writing
💪 challenge
Create a report to answer the principal's questions. include:
- What is the average reading score for students with/without test prep classes?
- What is the average score for students at different parental education levels?
- Compare mean scores for students with and without test prep classes at different levels of parental education.
- The principal wants to know whether children who do well in one subject also do well in other subjects. See correlations between scores.
exploratory data analysis
import pandas as pd import matplotlib.pyplot as plt import seaborn as sns import numpy as np df = pd.read_csv('exams.csv') df.head()
gender | race/ethnicity | parent_education_level | lunch | test_prep_course | math | reading | writing | |
---|---|---|---|---|---|---|---|---|
0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
4 | male | group C | some college | standard | none | 76 | 78 | 75 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gender 1000 non-null object 1 race/ethnicity 1000 non-null object 2 parent_education_level 1000 non-null object 3 lunch 1000 non-null object 4 test_prep_course 1000 non-null object 5 math 1000 non-null int64 6 reading 1000 non-null int64 7 writing 1000 non-null int64 dtypes: int64(3), object(5) memory usage: 62.6+ KB
Q1: What is the average reading score of students with/without test preparation courses?
df.groupby('test_prep_course')['reading'].mean()
test_prep_course completed 73.893855 none 66.534268 Name: reading, dtype: float64
df.groupby('test_prep_course')['reading'].mean().plot(kind='bar') plt.ylabel('score') plt.title('the average reading scores for students with/without the test preparation course');
with_test_prep = df[df['test_prep_course'] == 'completed'] without_test_prep = df[df['test_prep_course'] == 'none'] fig, ax = plt.subplots(1, 3, figsize=(15,6), sharey=True) cols = ['math', 'reading', 'writing'] for i, col in enumerate(cols): sns.kdeplot(with_test_prep[col], ax=ax[i], label=str(col) + 'with test prep') sns.kdeplot(without_test_prep[col], linestyle='--', ax=ax[i], label=str(col) + 'without test prep') ax[i].legend() plt.suptitle('KDE Plots of Exam Scores With and Without Preparation', fontsize = 20, fontweight = 'bold');
- Students who took test prep classes improved their scores in all subjects.
- The average reading score for students who took the test prep course was about 74, compared to 66.5 for those who didn't
Q2: What is the average score of students under different parental education levels?
df.groupby('parent_education_level')[['math', 'reading', 'writing']].mean().style.background_gradient(cmap='RdYlGn_r')
 | math | reading | writing |
---|---|---|---|
parent_education_level | Â | Â | Â |
associate's degree | 67.882883 | 70.927928 | 69.896396 |
bachelor's degree | 69.389831 | 73.000000 | 73.381356 |
high school | 62.137755 | 64.704082 | 62.448980 |
master's degree | 69.745763 | 75.372881 | 75.677966 |
some college | 67.128319 | 69.460177 | 68.840708 |
some high school | 63.497207 | 66.938547 | 64.888268 |
# x = df.groupby('parent_education_level')[['math', 'reading', 'writing']].mean() # x.style.apply(lambda m: ["background: red" if i == m.argmax() else '' for i,_ in enumerate(m)])
df.groupby('parent_education_level')[['math', 'reading', 'writing']].mean().plot(kind='bar') plt.legend(loc='upper left', bbox_to_anchor=(1, 1));
avg_scores = df.groupby('parent_education_level')[['math', 'reading', 'writing']].mean() fig, ax = plt.subplots() sns.pointplot(data = avg_scores, x = avg_scores.index, y = 'math',label='Math') sns.pointplot(data = avg_scores, x = avg_scores.index, y = 'reading', label='Reading',color='r') sns.pointplot(data = avg_scores, x = avg_scores.index, y = 'writing', label='Writing',color='g') ax.legend(handles=ax.lines[::len(avg_scores)+1], labels=["Math","Reading","Writing"]) # ax.set_xticklabels([t.get_text().split("T")[0] for t in ax.get_xticklabels()]) # plt.gcf().autofmt_xdate() plt.xticks(rotation=45);
- With the exception of high school, in general, the more educated the parents, the higher the academic performance of their children.
- Parents with master's degrees had the highest average scores in each subject.
Q3: Compare the mean scores of students with/without taking test prep classes at different levels of parental education.
cols = ['math', 'reading', 'writing'] fig, axes = plt.subplots(3,1, figsize=(10, 6), sharex=True, gridspec_kw={'hspace': 0.5}) for i, col in enumerate(cols): sns.boxplot(x='parent_education_level', y=col, hue='test_prep_course', data=df, ax=axes[i]) axes[i].set_title(col.capitalize() + ' Scores') #axes[i].set_xlabel('Parent Education Level') axes[i].set_ylabel(col.capitalize() + ' Score') axes[i].set_xticklabels(axes[i].get_xticklabels(), rotation=30) axes[i].legend(title='Test Preparation Course', loc='upper left', bbox_to_anchor=(1, 1)) #plt.xlabel('Parent Education Level') fig.suptitle('Test Scores by Parent Education Level and Test Preparation Course', y=1.05); #fig.tight_layout()
- Overall, those who completed the test prep course performed better than those who did not.
- The educational level of the parents is some college, and there are more outliers in the performance of their children, especially those who have not participated in the test preparation courses.
new_index = ['some high school', 'high school', 'some college', 'associate\'s degree', 'bachelor\'s degree', 'master\'s degree'] with_test_prep_mean_scores = with_test_prep.groupby('parent_education_level')[['math', 'reading', 'writing']].mean().reindex(new_index) with_test_prep_mean_scores = with_test_prep_mean_scores.unstack().reset_index().rename(columns={'level_0': 'test', 0:'score'}) with_test_prep_mean_scores['education_level'] = with_test_prep_mean_scores['parent_education_level'] + '_' + with_test_prep_mean_scores['test'] with_test_prep_mean_scores.set_index('education_level', inplace=True) without_test_prep_mean_scores = without_test_prep.groupby('parent_education_level')[['math', 'reading', 'writing']].mean().reindex(new_index) without_test_prep_mean_scores = without_test_prep_mean_scores.unstack().reset_index().rename(columns={'level_0': 'test', 0:'score'}) without_test_prep_mean_scores['education_level'] = without_test_prep_mean_scores['parent_education_level'] + '_' + without_test_prep_mean_scores['test'] without_test_prep_mean_scores.set_index('education_level', inplace=True)
plt.figure(figsize=(15,8)) sns.set(style="darkgrid") y_range = np.arange(1, len(with_test_prep_mean_scores.index) + 1) with_test_prep_mean_scores['change'] = round(with_test_prep_mean_scores['score'] - without_test_prep_mean_scores['score'], 2) plt.hlines(y=y_range, xmin=without_test_prep_mean_scores['score'], xmax=with_test_prep_mean_scores['score'], color='#d9d9d9', lw=10) plt.scatter(without_test_prep_mean_scores['score'], y_range, color='#0096d7', s=300, label='No Test Preparation', zorder=2) plt.scatter(with_test_prep_mean_scores['score'], y_range, color='#003953', s=300 , label='With Test Preparation', zorder=2) for (_, row), y in zip(with_test_prep_mean_scores.iterrows(), y_range): plt.annotate(f"+{row['change']}", (row['score'] + 1 , y - 0.25), size = 16) plt.annotate(f"{round(row['score'])}", (row['score']-0.25 , y - 0.17), size = 11.5, color = 'white') for (_, row), y in zip(without_test_prep_mean_scores.iterrows(), y_range): plt.annotate(f"{round(row['score'])}", (row['score']-0.25 , y - 0.18), size = 11.5, color = 'white') plt.axhline(y=6.55,linestyle='--', color = 'lightgray', lw=2) plt.axhline(y=12.45, linestyle='--', color = 'lightgray', lw=2) plt.legend(ncol=2, bbox_to_anchor=(1., 1.01), loc="lower right", frameon=False) plt.yticks(y_range, with_test_prep_mean_scores.index) plt.title("Average Test Scores With and Without \nExam Preparation And Their Difference On Parental Education Levels", loc='left', fontsize = 15) # plt.rc('ytick', labelsize=16) # plt.rc('xtick', labelsize=16) plt.xlim(50, 90) # plt.gcf().subplots_adjust(left=0.35) # plt.tight_layout() # plt.show()
(50.0, 90.0)
- After the preparation course, the test scores have been improved to a certain extent
- The improvement in writing is the largest, while the improvement in mathematics is relatively small, especially the mathematics of the corresponding master's degree has only increased by 1.29
- The overall trend still tells us that the higher the degree, the higher the score
- For assocaite and some college s, their increase is very high, they are both top two, this test preparation course seems to help them more
Q4: See correlations between scores.
corr_matrix = df[['math', 'reading', 'writing']].corr() mask = np.array(corr_matrix) mask[np.tril_indices_from(mask)] = False sns.heatmap(corr_matrix, cmap='coolwarm', mask=mask, square=True, annot=True) plt.title('Correlation Heatmap of Test Scores');
- Test scores are positively correlated, meaning that if a person has a high score on one test, it is very likely that the person has high scores on the other tests as well.
- There is a strong positive correlation between subjects, especially between reading and writing, with a correlation coefficient as high as 0.95.
Summarize:
- The education level of parents has a great influence on the achievement of children. In general, the higher the level of education, the better the child's grades.
- There is a strong positive correlation between grades in each subject.
- Taking a test preparation course is a good choice, and your grades will generally improve relatively.
The principal's task is basically completed, and then do some other explorations:
- Distribution of Scores
- The relationship between race and parental education level
- Relationship between lunch and score
- Effect of lunch and education level on scores
fig, ax = plt.subplots(1, 3, figsize=(15,6), sharey=True) cols = ['math', 'reading', 'writing'] color = ['#bc50f5', '#C20078', '#ea9e0e'] for i, col in enumerate(cols): sns.histplot(df[col], bins=25, color=color[i], ax=ax[i], kde=True) ax[i].axvline(x=df[col].mean(), label= cols[i] + ' score average', linestyle='--', color='darkblue') ax[i].legend() plt.suptitle('Distribution Plot of Exam Scores', fontsize=20, fontweight='bold');
- The distributions are all skewed slightly to the left, meaning that most students get high marks
- The math distribution has more and lower scores than reading and writing
df['race/ethnicity'].value_counts()
group C 319 group D 262 group B 190 group E 140 group A 89 Name: race/ethnicity, dtype: int64
race_df = round(pd.crosstab(df['parent_education_level'], df['race/ethnicity'], normalize = 'index'), 3) def highlight_top2(s): max_value = s.max() second_max = s.nlargest(2).iloc[-1] return [ 'background: red' if val == max_value else 'background: lightgreen' if val == second_max else '' for val in s ] new_index = ['some high school', 'high school', 'some college', 'associate\'s degree', 'bachelor\'s degree', 'master\'s degree'] race_df.reindex(new_index).T.style.apply(highlight_top2)
- From the above analysis, race C has the largest number of people, followed by race D, and race A is the least
- It can be seen that under different education levels, the top two of almost every diploma are contributed by C and D, and 71% of the master's degree comes from C and D
- So many masters are contributed by C and D. Does it mean that they are very smart? Let’s see how much of each race goes to each level of education
race_df2 = round(pd.crosstab(df['parent_education_level'], df['race/ethnicity'], normalize = 'columns'), 3) race_df2.reindex(new_index).T.style.apply(highlight_top2)
race_df2.reindex(new_index).T.plot(kind='barh', stacked=True, figsize=(10, 6)) plt.legend(loc='upper left', bbox_to_anchor=(1, 1));
- C and D races are indeed the top two with the highest proportion of master's degrees, 8.8% of C and 6% of D went to master's degree
- From the perspective of the length of the graph, the proportion of B going to the degree above the bachelor's degree is the least, and the others are almost the same
ps: Students whose family income is between 130% and 185% of the federal poverty level are eligible for reduced price meals, while children whose family income is below 130% of the poverty level are eligible for fully subsidized or "free" meals
df.lunch.value_counts()
standard 645 free/reduced 355 Name: lunch, dtype: int64
with_test = df[df['test_prep_course'] == 'completed'] without_test = df[df['test_prep_course'] == 'none'] with_test = with_test.groupby('lunch')[['math', 'reading', 'writing']].mean().unstack().reset_index().rename(columns={'level_0': 'test', 0: 'score'}) with_test['level'] = with_test['lunch'] + '_' + with_test['test'] with_test.set_index('level', inplace=True) without_test = without_test.groupby('lunch')[['math', 'reading', 'writing']].mean().unstack().reset_index().rename(columns={'level_0': 'test', 0: 'score'}) without_test['level'] = without_test['lunch'] + '_' + without_test['test'] without_test.set_index('level', inplace=True)
with_test
test | lunch | score | |
---|---|---|---|
level | |||
free/reduced_math | math | free/reduced | 63.045802 |
standard_math | math | standard | 73.533040 |
free/reduced_reading | reading | free/reduced | 69.870229 |
standard_reading | reading | standard | 76.215859 |
free/reduced_writing | writing | free/reduced | 70.351145 |
standard_writing | writing | standard | 76.766520 |
plt.figure(figsize=(12,8)) y_range = np.arange(1, len(with_test)+1) with_test['change'] = round(with_test['score'] - without_test['score'], 2) plt.hlines(y=y_range, xmin=with_test['score'], xmax=without_test['score'], color='#d9d9d9', lw=10) plt.scatter(with_test['score'], y_range, color='#0096d7', s=300, label='Test Preparation', zorder=3) plt.scatter(without_test['score'], y_range, color='#003953', s=300, label='No Test Preparation', zorder=3) for (_, row), y in zip(with_test.iterrows(), y_range): plt.annotate(f"+{row['change']}", (row['score'] + 1, y-0.08), size = 16) plt.annotate(f"{round(row['score'])}", (row['score']-0.35, y-0.06),size=11.5,color='white') for (_, row), y in zip(without_test.iterrows(), y_range): plt.annotate(f"{round(row['score'])}", (row['score']-0.35, y-0.06),size=11.5, color='white') for l in np.arange(2.5, 5, 2): plt.axhline(y=l,linestyle='--', color = 'lightgray', lw=2) plt.legend(ncol=2, bbox_to_anchor=(1., 1.01), loc="lower right", frameon=False) plt.yticks(y_range, with_test.index) plt.title("Average Test Scores With and Without \nExam Preparation And Their Difference On Lunch", loc='left', fontsize = 15) plt.xlim(50, 90)
(50.0, 90.0)
- Significantly improved grades after preparatory courses
- Students with standard lunch have better grades than students with half-price free lunch. Perhaps this is the inclination of educational resources. People with good families can get more and better educational channels
- In terms of test preparation courses, half-price and free lunch students improved their grades in all subjects more than students with standard lunch
with_test = df[df['test_prep_course'] == 'completed'] without_test = df[df['test_prep_course'] == 'none'] with_test = with_test.groupby('gender')[['math', 'reading', 'writing']].mean().unstack().reset_index().rename(columns={'level_0': 'test', 0: 'score'}) with_test['level'] = with_test['gender'] + '_' + with_test['test'] with_test.set_index('level', inplace=True) without_test = without_test.groupby('gender')[['math', 'reading', 'writing']].mean().unstack().reset_index().rename(columns={'level_0': 'test', 0: 'score'}) without_test['level'] = without_test['gender'] + '_' + without_test['test'] without_test.set_index('level', inplace=True)
sns.set(style="white") plt.figure(figsize=(12,8)) y_range = np.arange(1, len(with_test)+1) with_test['change'] = round(with_test['score'] - without_test['score'], 2) plt.hlines(y=y_range, xmin=with_test['score'], xmax=without_test['score'], color='#d9d9d9', lw=10) plt.scatter(with_test['score'], y_range, color='#0096d7', s=300, label='Test Preparation', zorder=3) plt.scatter(without_test['score'], y_range, color='#003953', s=300, label='No Test Preparation', zorder=3) for (_, row), y in zip(with_test.iterrows(), y_range): plt.annotate(f"+{row['change']}", (row['score'] + 1, y-0.08), size = 16) plt.annotate(f"{round(row['score'])}", (row['score']-0.35, y-0.06),size=11.5,color='white') for (_, row), y in zip(without_test.iterrows(), y_range): plt.annotate(f"{round(row['score'])}", (row['score']-0.35, y-0.06),size=11.5, color='white') for l in np.arange(2.5, 5, 2): plt.axhline(y=l,linestyle='--', color = 'lightgray', lw=2) plt.legend(ncol=2, bbox_to_anchor=(1., 1.01), loc="lower right", frameon=False) plt.yticks(y_range, with_test.index) plt.title("Average Test Scores With and Without \nExam Preparation And Their Difference On Gender", loc='left', fontsize = 15) plt.xlim(50, 90)
(50.0, 90.0)
- Boys do better than girls in math, but far worse in reading and writing
- Exam preparation courses have slightly improved boys’ grades in all subjects slightly more than girls’
df.parent_education_level = pd.Categorical(df.parent_education_level, categories=["some high school","high school","some college","associate's degree", "bachelor's degree", "master's degree"], ordered=True) with_test_prep_all = df[df['test_prep_course'] == 'completed'] without_test_prep_all = df[df['test_prep_course'] == 'none'] with_test_prep_all = with_test_prep_all.groupby(['parent_education_level', 'lunch', 'gender'])[['math', 'reading', 'writing']].mean().reset_index() without_test_prep_all = without_test_prep_all.groupby(['parent_education_level', 'lunch', 'gender'])[['math', 'reading', 'writing']].mean().reset_index() with_test_prep_all['mean_score'] = round((with_test_prep_all['math'] + with_test_prep_all['reading'] + with_test_prep_all['writing'])/3, 2) without_test_prep_all['mean_score'] = round((without_test_prep_all['math'] + without_test_prep_all['reading'] + without_test_prep_all['writing'])/3, 2) with_test_prep_all['label'] = with_test_prep_all['parent_education_level'].astype('str') + '_' + \ with_test_prep_all['lunch'].astype('str') + '_' + \ with_test_prep_all['gender'].astype('str') without_test_prep_all['label'] = without_test_prep_all['parent_education_level'].astype('str') + '_' + \ without_test_prep_all['lunch'].astype('str') +'_' + \ without_test_prep_all['gender'].astype('str') with_test_prep_all = with_test_prep_all.set_index('label') without_test_prep_all = without_test_prep_all.set_index('label') plt.figure(figsize=(15,10)) sns.set(style="white") y_range = np.arange(1, len(with_test_prep_all.index) + 1) with_test_prep_all['change'] = round(with_test_prep_all['mean_score'] - without_test_prep_all['mean_score'], 2) plt.hlines(y=y_range, xmin=without_test_prep_all['mean_score'], xmax=with_test_prep_all['mean_score'], color='#d9d9d9', lw=10) plt.scatter(without_test_prep_all['mean_score'], y_range, color='#0096d7', s=300, label='No Test Preparation', zorder=3) plt.scatter(with_test_prep_all['mean_score'], y_range, color='#003953', s=300 , label='With Test Preparation', zorder=3) for (_, row), y in zip(with_test_prep_all.iterrows(), y_range): plt.annotate(f"+{row['change']}", (row['mean_score'] + 1 , y - 0.25), size = 12) plt.annotate(f"{round(row['mean_score'])}", (row['mean_score']-0.45 , y - 0.17), size = 11, color = 'white') for (_, row), y in zip(without_test_prep_all.iterrows(), y_range): plt.annotate(f"{round(row['mean_score'])}", (row['mean_score']-0.45 , y - 0.18), size = 11, color = 'white') line = np.arange(4.5, 23, 4) for l in line: plt.axhline(y=l,linestyle='--', color = 'lightgray', lw=2) plt.legend(ncol=2, bbox_to_anchor=(1., 1.01), loc="lower right", frameon=False) plt.yticks(y_range, with_test_prep_all.index) plt.title("Average Test Scores With and Without \nExam Preparation And Their Difference On Parental Education Levels", loc='left', fontsize = 15) plt.rc('ytick', labelsize=16) plt.rc('xtick', labelsize=16) plt.xlim(30, 100) plt.gcf().subplots_adjust(left=0.35) plt.tight_layout() plt.show()
- In the past, except for mathematics, boys’ grades were all beaten by girls, but looking at the breakdown, there is a difference. Except for the master’s degree, boys who did not take the test preparation course had lower scores than girls, while those who took the test preparation course , except for the high school, the score is higher than that of the girls. In the end, although only the master's score is higher than the girls, the scores have a unique flavor in it. It seems that boys with standard meals have benefited from this test preparation course, which has largely narrowed the previous gap.
- Generally speaking, the grades of students with half-price meals and free meals are lower than those of students with standard meals at all levels of parental education, but associate girls have improved their scores by a huge margin, from the lowest score to that of boys with standard meals, an overall increase of 15.3, which is Get the most from this course.
- The most effective test preparation courses are the high school standard meal boys. Except for the high school, the scores are higher than the girls. Why is this, it is worth exploring the reasons.
- Generally speaking, half-price and free girls can get more benefits from the test preparation courses and get a lot of points. Each education level exceeds the corresponding boys, but it is still not as good as the corresponding standard meal girls. Can this side show that the gap between the rich and the poor has a big impact?