Do household factors significantly impact educational achievement?

Amogh Giri, Ian Yeh, Rivke Weingarten

Introduction

Differences in educational outcomes from state to state have long been a divisive issue in America. One question that follows from this divide is which factors within a household might affect the educational achievement of students. Given the wide disparity in the average level of education from state to state, or even county to county, an in-depth analysis of what may be causing this gap is worthwhile. In this project, the goal is to analyze household factors and determine whether they significantly impact the educational achievement of students in America.

Data Source

The dataset that we used for this project is from https://www.openintro.org/data/?data=county_complete. It covers a wide range of county-level variables, from which we take the columns for median household income, the percentage of households that own a computer, the average number of people per household, and the percentages of people who are high school graduates or hold a bachelor's degree. We analyze these variables to determine whether household factors impact the educational achievement of students.

This dataset was originally prepared by the USDA, Economic Research Service, and the most recent data was gathered by the following sources:

  1. Unemployment - Bureau of Labor Statistics - LAUS data - https://www.bls.gov/lau.

  2. Median Household Income - Census Bureau - Small Area Income and Poverty Estimates (SAIPE) data.

  3. Census Bureau.

  4. 2012-16 American Community Survey 5-yr average.

Libraries Used

In this tutorial, we used the following libraries in Python:

  1. Pandas --> Store and organize larger data (https://pandas.pydata.org)

  2. Matplotlib --> Visualize data in various forms (https://matplotlib.org)

  3. NumPy --> Work with numerical arrays (https://numpy.org)

  4. Scikit-Learn --> Utilize machine learning tools (https://scikit-learn.org/stable/)

  5. Statsmodels --> Utilize statistical models (https://www.statsmodels.org/stable/index.html)

Data Collection

The first part of the project is to collect the data from the aforementioned source. First, we'll read the data from the CSV file into a dataframe using the Pandas function read_csv(). The next step is to clean up the data and create a new dataframe that includes only the specific factors we will be analyzing in this project, instead of keeping an unnecessarily large dataframe with a lot of unneeded information. We will be looking specifically at data from 2017, as it was the most recent year for which the dataset had values for all variables.

The data in this table represents:

  1. fips: FIPS codes that represent geographic areas

  2. state: The state that the county is in

  3. name: The county within the state

  4. median_household_income_2017: This represents the median household income in 2017

  5. computer_2017: This represents the percentage of households that own a computer in 2017

  6. persons_per_household_2017: This represents the average number of people per household in 2017

  7. hs_grad_2017: This represents the percentage of people who are high school graduates in 2017

  8. bachelors_2017: This represents the percentage of people who have a bachelor's degree in 2017
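The column selection described above can be sketched roughly as follows. This is a minimal illustration, not the original notebook code: the helper name select_columns, the local file name in the comment, the extra column unneeded_column, and all row values are made up so the sketch runs without the download.

```python
import pandas as pd

# Columns named in the list above.
COLUMNS = [
    "fips",
    "state",
    "name",
    "median_household_income_2017",
    "computer_2017",
    "persons_per_household_2017",
    "hs_grad_2017",
    "bachelors_2017",
]


def select_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy holding only the columns analyzed in this tutorial."""
    return raw[COLUMNS].copy()


# With the real file this would be something like:
#     raw = pd.read_csv("county_complete.csv")  # file name is an assumption
# The rows below are made-up placeholder values, not real county data.
raw = pd.DataFrame({
    "fips": [1, 2],
    "state": ["Maryland", "Maryland"],
    "name": ["County A", "County B"],
    "unneeded_column": [123, 456],  # hypothetical extra column to be dropped
    "median_household_income_2017": [50000.0, 60000.0],
    "computer_2017": [80.0, 90.0],
    "persons_per_household_2017": [2.5, 2.7],
    "hs_grad_2017": [85.0, 90.0],
    "bachelors_2017": [25.0, 40.0],
})

df = select_columns(raw)
print(df.head())
```

Taking a copy keeps later cleaning steps from modifying the full dataframe that was read from disk.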

Now, let's go ahead and replace any missing values in the five important columns of the dataset (median household income, access to a computer, persons per household, high school graduation rate, and bachelor's degree graduation rate) with the average of the respective variable across the entire dataset. These are the most important columns in our dataset: the first three are our household factors, and the last two measure educational achievement.

The five False values show that we have successfully replaced all missing values in those five columns.
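One way to do this mean imputation and the follow-up check with Pandas is sketched below. The values are made up for illustration, and the original notebook may have written this step differently.

```python
import numpy as np
import pandas as pd

KEY_COLUMNS = [
    "median_household_income_2017",
    "computer_2017",
    "persons_per_household_2017",
    "hs_grad_2017",
    "bachelors_2017",
]

# Made-up values with a few gaps, standing in for the real county data.
df = pd.DataFrame({
    "median_household_income_2017": [50000.0, np.nan, 70000.0],
    "computer_2017": [80.0, 90.0, np.nan],
    "persons_per_household_2017": [2.5, 2.7, 2.9],
    "hs_grad_2017": [np.nan, 88.0, 92.0],
    "bachelors_2017": [20.0, np.nan, 40.0],
})

# Replace each missing value with the mean of its own column.
df[KEY_COLUMNS] = df[KEY_COLUMNS].fillna(df[KEY_COLUMNS].mean())

# One boolean per column; False means no missing values remain.
has_missing = df[KEY_COLUMNS].isnull().any()
print(has_missing)
```

Passing the Series of column means to fillna() fills each column with its own mean, so no loop over the columns is needed.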

We define educational achievement in terms of the high school graduation rate and the bachelor's degree graduation rate. In order to visualize the relationship between household factors and educational achievement, we decided to combine these two graduation rates into one variable. This was done by taking the average of the two graduation rates and adding it to the dataframe as a new column, which represents the combination of the high school and bachelor's degree graduation rates. As a result, the relationships are far easier to visualize, since we now have only one y (dependent) variable.
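The averaging step might look like the sketch below. The column name combined_grad_2017 is a hypothetical choice for illustration, and the two rows are made-up values.

```python
import pandas as pd

# Made-up graduation rates (percent) for two counties.
df = pd.DataFrame({
    "hs_grad_2017": [85.0, 90.0],
    "bachelors_2017": [25.0, 40.0],
})

# "combined_grad_2017" is a hypothetical name for the new column:
# the row-wise average of the two graduation rates.
df["combined_grad_2017"] = (df["hs_grad_2017"] + df["bachelors_2017"]) / 2
print(df)
```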

We now have all the cleaned data we need to analyze and visualize the relationships between the household factors and educational achievement.

Data Analysis & Visualization

The next step is to begin analyzing and visualizing the data. We want to see whether scatter plots reveal any relationship between each of the three household factors and educational achievement. To start off, we take the three household factors (median household income, percentage of households with a computer, and people per household) and plot each against the combined high school and bachelor's degree graduation rate to try to spot a correlation between the chosen factors and educational achievement.
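A rough sketch of these three scatter plots with Matplotlib follows. The data values, the axis labels, and the combined_grad_2017 column name are illustrative assumptions rather than the original notebook's code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for the cleaned dataframe.
df = pd.DataFrame({
    "median_household_income_2017": [42000, 55000, 61000, 78000, 95000],
    "computer_2017": [72.0, 80.0, 84.0, 90.0, 94.0],
    "persons_per_household_2017": [2.4, 2.9, 2.5, 2.7, 2.6],
    "combined_grad_2017": [48.0, 53.0, 55.0, 61.0, 66.0],  # hypothetical column
})

# Column name -> x-axis label for each household factor.
FACTORS = {
    "median_household_income_2017": "Median household income ($)",
    "computer_2017": "Households with a computer (%)",
    "persons_per_household_2017": "Persons per household",
}

# One scatter plot per household factor, side by side.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (column, label) in zip(axes, FACTORS.items()):
    ax.scatter(df[column], df["combined_grad_2017"])
    ax.set_xlabel(label)
    ax.set_ylabel("Combined graduation rate (%)")
    ax.set_title(f"{label} vs. combined graduation rate")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. In a script, calling fig.savefig() with a file name writes it to disk.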

There is a clear positive, linear relationship between educational achievement and both median household income and access to a computer. The same cannot necessarily be said for persons per household. To examine these relationships in more depth, we look specifically at the state of Maryland to see whether the nationwide trend also appears within a single state. We repeat the process used for the last three graphs, this time plotting the data for each county in Maryland and adding a fitted linear regression line. The regression line in each plot gives us a better idea of the linear relationship in the data.
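The Maryland filtering and regression line could be sketched as below for one factor. Here np.polyfit is swapped in as a simple way to fit a straight line, and the original notebook may have used a different method. All values and the combined_grad_2017 column name are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up rows; only the Maryland ones are used below.
df = pd.DataFrame({
    "state": ["Maryland", "Maryland", "Maryland", "Maryland", "Virginia"],
    "median_household_income_2017": [45000.0, 62000.0, 80000.0, 101000.0, 70000.0],
    "combined_grad_2017": [50.0, 55.0, 59.0, 66.0, 57.0],  # hypothetical column
})

maryland = df[df["state"] == "Maryland"]
x = maryland["median_household_income_2017"].to_numpy()
y = maryland["combined_grad_2017"].to_numpy()

# Least-squares straight line: y is approximated by slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

fig, ax = plt.subplots()
ax.scatter(x, y)
x_line = np.sort(x)
ax.plot(x_line, slope * x_line + intercept, color="red")
ax.set_xlabel("Median household income ($)")
ax.set_ylabel("Combined graduation rate (%)")
ax.set_title("Maryland counties")
```

The same steps can be repeated for computer_2017 and persons_per_household_2017 by changing the x column.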

Given that these plots now contain regression lines, we can clearly see a well-fitted, positive, linear relationship between educational achievement and both median household income and access to a computer. In contrast, the persons per household plot has many points that lie far from its regression line, suggesting that its linear relationship is far weaker than those of the other household factors. Hence, the trends seen in the earlier nationwide plots also appear in the Maryland county plots. This suggests that some household factors, such as median household income and access to a computer, may significantly impact educational achievement in a linear fashion, while persons per household may not.

Hypothesis Testing & Machine Learning

In order to further analyze the existence and strength of the linear relationships for each of the three household factors vs. educational achievement, we can use hypothesis testing and machine learning. The goal for hypothesis testing is to determine the strength of the fit of each linear regression relationship by using statsmodels. We use ordinary least squares regression, which is one of the simplest and most common methods for estimating unknown parameters in a linear regression model. After performing hypothesis testing on each household factor, we'll be able to determine if there is a linear relationship between each household factor and educational achievement.

R-squared measures how closely the data follow the fitted regression line. It is the proportion of the variance in the dependent variable that the regression explains, on a scale from 0.0 to 1.0, where 1.0 means that the data perfectly fit the regression line. This value is a good indicator of whether a strong linear relationship exists between each of the household factors and educational achievement. The R-squared value for the median household income relationship is 0.706, for the access to a computer relationship is 0.711, and for the persons per household relationship is 0.197. This means that the median household income and access to a computer relationships were fairly strong, while the persons per household relationship was far weaker.
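To make the definition concrete, R-squared for a simple linear fit can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a small made-up dataset and np.polyfit for the line.

```python
import numpy as np

# Small made-up dataset that is close to, but not exactly on, a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

ss_res = np.sum((y - predicted) ** 2)   # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

For a straight-line fit with an intercept, this value equals the square of the correlation coefficient between x and y.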

From the hypothesis testing, we have sufficient evidence to reject the null hypotheses that median household income and access to a computer have no relationship with combined graduation rates, given that their p-values are less than our significance level of 0.05. However, we cannot reject the null hypothesis of no relationship between persons per household and combined graduation rates, given that one of the p-values is greater than the significance level of 0.05.

To further validate our results, we can use machine learning, a family of algorithms that learn patterns from data and improve as they are given more of it. Linear regression is one such algorithm: it takes in two sets of data, the predictors and the actual values, minimizes the loss between the predicted and actual values, and produces an equation that lets us predict values for new data. We can create and fit linear regression models and compare their R-squared values to the ones seen in hypothesis testing. We will now create, fit, and test machine learning models for each household factor against the combined graduation rates using linear regression.
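A minimal sketch of creating, fitting, and testing one such model with Scikit-Learn is shown below. The data is randomly generated, and the 80/20 train/test split and random seeds are assumptions, since the split used in the original analysis is not stated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: made-up computer-ownership percentages with a linear trend.
rng = np.random.default_rng(42)
computer = rng.uniform(60, 95, size=200)
combined = 10 + 0.6 * computer + rng.normal(0, 3, size=200)

# scikit-learn expects a 2-D feature array: one row per sample.
X = computer.reshape(-1, 1)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, combined, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# score() returns R-squared on the data it is given.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(train_r2, test_r2)
```

Scoring on held-out rows shows how well the fitted line generalizes to data the model did not see during fitting, which is why a testing R-squared can come out lower than the one from a fit on all of the data.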

After creating, fitting, and testing our machine learning models for each of the three household factors, we see results similar to the ones from hypothesis testing. On the test data, the R-squared value for the median household income relationship is 0.655, for the access to a computer relationship is 0.670, and for the persons per household relationship is 0.155. These R-squared values are fairly similar to the ones from hypothesis testing and support the same overall conclusion: there is a significant relationship between median household income and educational achievement and between access to a computer and educational achievement, while there is not enough evidence to say the same for persons per household and educational achievement.

Conclusion

The goal of this tutorial was to analyze the data trends of household factors that may impact educational success. After tidying up the data to remove unneeded information, we settled on three factors that were worth further analysis to see how strongly they impacted academic achievement: median household income, the percentage of households that own a computer, and the average number of people per household. These data columns were of particular interest because technology and money often play an important part in getting ahead in school. Without access to technology, it becomes exceedingly difficult to complete assignments or do other research for school.

Upon plotting the factors, we began to see the same general trends in the Maryland counties plots and the US states plots. The regression lines demonstrated a linear relationship between student success and both median household income and the percentage of households that own a computer. The hypothesis testing results continued to support this claim. However, persons per household did not hold up once put to hypothesis testing and machine learning. Overall, the results demonstrated that household income and the availability of a computer in particular impact a student's ability to learn, and that there is a correlation between home situations and academic achievement. There are likely more household factors that greatly impact the level of academic achievement someone can reach, and they deserve continued research in the hope of improving the overall quality of education in America. In conclusion, some, but not all, household factors do significantly impact educational achievement, which may be partly explained by the rising cost of living and the growing use of technology in the educational system.

If you wish to learn more about data science and create a project like this in the future, the following resources explain what data science encompasses and describe the specific algorithms used in this tutorial:

  1. Data Science --> https://www.thinkful.com/blog/what-is-data-science/

  2. Linear Regression --> https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

  3. Ordinary Least Squares --> https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html

We hope you learned a lot and enjoyed this tutorial on utilizing the data science pipeline to make data meaningful!