Author: David Charles
Date: 07/01/2021
We investigate COVID-19 vaccination rates across New York City and the factors that influence New Yorkers to get vaccinated. Data was collected from several sources, cleaned, and merged into one dataset for the analysis. We examine how vaccination rates relate to median household income, educational attainment, and COVID-19 death rates and counts, and we attempt to predict how many people in a neighborhood will get vaccinated.
COVID-19 changed life as we know it all over the world, and one of the places hit hardest was NYC. For months we watched NYC's death count rise, and it caused a lot of anxiety for New Yorkers. Fast forward to 2021 and we have two major vaccines, Pfizer and Moderna. However, since the rollout began in December 2020, many of the communities that were impacted the most by the pandemic still have low vaccination rates. We will use Python to analyze data collected from sources such as NYC Health, the Citizens' Committee for Children, and the United States Census Bureau.
Our tool of choice for the analysis is Python. First, we import the COVID-19 vaccination data.
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
covid_vacc_zipcode = pd.read_csv('people/coverage-by-modzcta-allages.csv')
covid_vacc_zipcode.head()
| | DATE | NEIGHBORHOOD_NAME | BOROUGH | MODZCTA | Label | AGE_GROUP | POP_DENOMINATOR | COUNT_PARTIALLY_CUMULATIVE | COUNT_FULLY_CUMULATIVE | COUNT_1PLUS_CUMULATIVE | PERC_PARTIALLY | PERC_FULLY | PERC_1PLUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-20 | Chelsea/NoMad/West Chelsea | Manhattan | 10001 | 10001, 10118 | All ages | 27613.09 | 3017 | 21749 | 24766 | 10.93 | 78.76 | 89.69 |
| 1 | 2021-06-20 | Chinatown/Lower East Side | Manhattan | 10002 | 10002 | All ages | 75322.71 | 5571 | 48728 | 54299 | 7.40 | 64.69 | 72.09 |
| 2 | 2021-06-20 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 10003 | All ages | 53977.81 | 3295 | 33687 | 36982 | 6.10 | 62.41 | 68.51 |
| 3 | 2021-06-20 | Financial District | Manhattan | 10004 | 10004 | All ages | 2972.12 | 390 | 2880 | 3270 | 13.12 | 96.90 | 110.02 |
| 4 | 2021-06-20 | Financial District | Manhattan | 10005 | 10005 | All ages | 8757.23 | 907 | 6202 | 7109 | 10.36 | 70.82 | 81.18 |
To plot this data on a map in Tableau, we use pgeocode to look up latitude and longitude for each zip code in the MODZCTA column.
import pgeocode
nomi = pgeocode.Nominatim('US')
# get latitude and longitude from zip code
covid_vacc_zipcode[['LATITUDE', 'LONGITUDE']] = covid_vacc_zipcode['MODZCTA'].apply(lambda x: nomi.query_postal_code(str(x))[['latitude', 'longitude']])
covid_vacc_zipcode.head()
| | DATE | NEIGHBORHOOD_NAME | BOROUGH | MODZCTA | Label | AGE_GROUP | POP_DENOMINATOR | COUNT_PARTIALLY_CUMULATIVE | COUNT_FULLY_CUMULATIVE | COUNT_1PLUS_CUMULATIVE | PERC_PARTIALLY | PERC_FULLY | PERC_1PLUS | LATITUDE | LONGITUDE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-20 | Chelsea/NoMad/West Chelsea | Manhattan | 10001 | 10001, 10118 | All ages | 27613.09 | 3017 | 21749 | 24766 | 10.93 | 78.76 | 89.69 | 40.7484 | -73.9967 |
| 1 | 2021-06-20 | Chinatown/Lower East Side | Manhattan | 10002 | 10002 | All ages | 75322.71 | 5571 | 48728 | 54299 | 7.40 | 64.69 | 72.09 | 40.7152 | -73.9877 |
| 2 | 2021-06-20 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 10003 | All ages | 53977.81 | 3295 | 33687 | 36982 | 6.10 | 62.41 | 68.51 | 40.7313 | -73.9892 |
| 3 | 2021-06-20 | Financial District | Manhattan | 10004 | 10004 | All ages | 2972.12 | 390 | 2880 | 3270 | 13.12 | 96.90 | 110.02 | 40.7143 | -74.0060 |
| 4 | 2021-06-20 | Financial District | Manhattan | 10005 | 10005 | All ages | 8757.23 | 907 | 6202 | 7109 | 10.36 | 70.82 | 81.18 | 40.7056 | -74.0083 |
Next, we import the "Educational Attainment by Zipcode" dataset and filter down to the columns that matter for this project.
# get education by zipcodes
educational_attainment = pd.read_csv('people/ACSST5Y2019.S1501_data_with_overlays_2021-06-22T105606.csv', header=1)
# get cols we need for analysis
cols = ['id',
'Geographic Area Name',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!Less than high school graduate',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!High school graduate (includes equivalency)',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!Some college or associate\'s degree',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!Bachelor\'s degree or higher',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!High school graduate (includes equivalency)',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Some college, no degree',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Associate\'s degree',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Bachelor\'s degree',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Graduate or professional degree']
# get filtered dataset
educational_attainment = educational_attainment[cols]
Now that we have the educational attainment data we need, we will total each attainment level by zip code by filtering the relevant columns and summing across them. A small helper function makes the filtering easier, and we will also clean the zip code column so that merging the datasets is easier.
# column filter
def filter_column(df, expressions):
    '''
    df - dataframe
    expressions - str, regex of alternatives to match against column names,
    e.g. 'Dog|dog|cat|Cat'
    '''
    return df.filter(regex=expressions)
# educational attainment
less_than_high_school = filter_column(educational_attainment, 'Less than').sum(axis=1)
high_school = filter_column(educational_attainment,'High').sum(axis=1)
some_college_associates = filter_column(educational_attainment, 'Some|Associate').sum(axis=1)
bachelors = filter_column(educational_attainment, 'Bachelor').sum(axis=1)
graduate_professional_deg = filter_column(educational_attainment, 'Graduate').sum(axis=1)
# create dataframe for educational attainment
degrees = pd.DataFrame({'raw_zip_code': educational_attainment['Geographic Area Name'],
                        'less_than_high_school': less_than_high_school,
                        'high_school': high_school,
                        'some_college_associates': some_college_associates,
                        'bachelors': bachelors,
                        'graduate_professional_deg': graduate_professional_deg})
# clean zip code column
degrees.loc[:, 'raw_zip_code'] = degrees.apply(lambda row: row.raw_zip_code[-5:].strip(), axis=1).astype('int')
# merge covid dataset and education attainment on zip code
covid_vacc_zipcode_education = covid_vacc_zipcode.merge(degrees, how='inner', left_on='MODZCTA', right_on='raw_zip_code')
covid_vacc_zipcode_education.head()
# calculate each educational attainment level as a percentage of the neighborhood population
pct_cols = ['less_than_high_school %', 'high_school %', 'some_college_associates %', 'bachelors %', 'graduate_professional_deg %']
count_cols = ['less_than_high_school', 'high_school', 'some_college_associates', 'bachelors', 'graduate_professional_deg']
covid_vacc_zipcode_education[pct_cols] = covid_vacc_zipcode_education[count_cols].div(covid_vacc_zipcode_education['POP_DENOMINATOR'], axis=0)
covid_vacc_zipcode_education.head()
| | DATE | NEIGHBORHOOD_NAME | BOROUGH | MODZCTA | Label | AGE_GROUP | POP_DENOMINATOR | COUNT_PARTIALLY_CUMULATIVE | COUNT_FULLY_CUMULATIVE | COUNT_1PLUS_CUMULATIVE | ... | less_than_high_school | high_school | some_college_associates | bachelors | graduate_professional_deg | less_than_high_school % | high_school % | some_college_associates % | bachelors % | graduate_professional_deg % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-20 | Chelsea/NoMad/West Chelsea | Manhattan | 10001 | 10001, 10118 | All ages | 27613.09 | 3017 | 21749 | 24766 | ... | 96 | 3016 | 3113 | 8594 | 5443 | 0.003477 | 0.109224 | 0.112736 | 0.311229 | 0.197117 |
| 1 | 2021-06-20 | Chinatown/Lower East Side | Manhattan | 10002 | 10002 | All ages | 75322.71 | 5571 | 48728 | 54299 | ... | 492 | 12522 | 10714 | 14666 | 6605 | 0.006532 | 0.166245 | 0.142241 | 0.194709 | 0.087689 |
| 2 | 2021-06-20 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 10003 | All ages | 53977.81 | 3295 | 33687 | 36982 | ... | 98 | 5316 | 11336 | 18406 | 13762 | 0.001816 | 0.098485 | 0.210012 | 0.340992 | 0.254957 |
| 3 | 2021-06-20 | Financial District | Manhattan | 10004 | 10004 | All ages | 2972.12 | 390 | 2880 | 3270 | ... | 22 | 76 | 131 | 1357 | 1337 | 0.007402 | 0.025571 | 0.044076 | 0.456576 | 0.449847 |
| 4 | 2021-06-20 | Financial District | Manhattan | 10005 | 10005 | All ages | 8757.23 | 907 | 6202 | 7109 | ... | 0 | 413 | 629 | 4071 | 2536 | 0.000000 | 0.047161 | 0.071826 | 0.464873 | 0.289589 |
5 rows × 26 columns
Next, we load the median household income data and clean it before merging it with our dataset.
# median income
median_income_nyc = pd.read_csv('people/Median Incomes.csv')
# rename data column to income
median_income_nyc.rename(columns={'Data':'Income'}, inplace=True)
# zip code rows have a Location value that starts with 'Z';
# find where they begin instead of hardcoding a row index
first_zip_idx = median_income_nyc[median_income_nyc['Location'].str[0]=='Z'].index[0]
# keep only the zip code rows
median_income_nyc = median_income_nyc.loc[first_zip_idx:, :].copy()
# clean zip code: keep the trailing 5 digits and convert to int
median_income_nyc['Location'] = median_income_nyc['Location'].str[-5:].str.strip().astype('int')
# convert income column to float
median_income_nyc['Income'] = median_income_nyc['Income'].astype('float')
# filter to "All Households" and the most recent median household income, which is 2019,
# then merge onto covid_vacc_zipcode_education
income_2019 = median_income_nyc[(median_income_nyc['Household Type']=='All Households') & (median_income_nyc['TimeFrame']==2019)]
covid_vacc_zipcode_education_income = covid_vacc_zipcode_education.merge(income_2019, how='inner', left_on='MODZCTA', right_on='Location')
covid_vacc_zipcode_education_income.head()
| | DATE | NEIGHBORHOOD_NAME | BOROUGH | MODZCTA | Label | AGE_GROUP | POP_DENOMINATOR | COUNT_PARTIALLY_CUMULATIVE | COUNT_FULLY_CUMULATIVE | COUNT_1PLUS_CUMULATIVE | ... | high_school % | some_college_associates % | bachelors % | graduate_professional_deg % | Location | Household Type | TimeFrame | DataFormat | Income | Fips |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-20 | Chelsea/NoMad/West Chelsea | Manhattan | 10001 | 10001, 10118 | All ages | 27613.09 | 3017 | 21749 | 24766 | ... | 0.109224 | 0.112736 | 0.311229 | 0.197117 | 10001 | All Households | 2019 | Dollars | 92840.0 | 10001 |
| 1 | 2021-06-20 | Chinatown/Lower East Side | Manhattan | 10002 | 10002 | All ages | 75322.71 | 5571 | 48728 | 54299 | ... | 0.166245 | 0.142241 | 0.194709 | 0.087689 | 10002 | All Households | 2019 | Dollars | 36982.0 | 10002 |
| 2 | 2021-06-20 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 10003 | All ages | 53977.81 | 3295 | 33687 | 36982 | ... | 0.098485 | 0.210012 | 0.340992 | 0.254957 | 10003 | All Households | 2019 | Dollars | 118161.0 | 10003 |
| 3 | 2021-06-20 | Financial District | Manhattan | 10004 | 10004 | All ages | 2972.12 | 390 | 2880 | 3270 | ... | 0.025571 | 0.044076 | 0.456576 | 0.449847 | 10004 | All Households | 2019 | Dollars | 190223.0 | 10004 |
| 4 | 2021-06-20 | Financial District | Manhattan | 10005 | 10005 | All ages | 8757.23 | 907 | 6202 | 7109 | ... | 0.047161 | 0.071826 | 0.464873 | 0.289589 | 10005 | All Households | 2019 | Dollars | 189702.0 | 10005 |
5 rows × 32 columns
Now we will import the COVID-19 case and death dataset and merge it into our covid_vacc_zipcode_education_income dataset.
covid_death = pd.read_csv('coronavirus-data/totals/data-by-modzcta.csv')
covid_death.head()
| | MODIFIED_ZCTA | NEIGHBORHOOD_NAME | BOROUGH_GROUP | label | lat | lon | COVID_CASE_COUNT | COVID_CASE_RATE | POP_DENOMINATOR | COVID_DEATH_COUNT | COVID_DEATH_RATE | PERCENT_POSITIVE | TOTAL_COVID_TESTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | Chelsea/NoMad/West Chelsea | Manhattan | 10001, 10118 | 40.750693 | -73.997137 | 1631 | 5906.62 | 27613.09 | 35 | 126.75 | 7.15 | 22799 |
| 1 | 10002 | Chinatown/Lower East Side | Manhattan | 10002 | 40.715781 | -73.986176 | 6118 | 8122.38 | 75322.71 | 289 | 383.68 | 11.86 | 51736 |
| 2 | 10003 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 40.731825 | -73.989164 | 2934 | 5435.57 | 53977.81 | 49 | 90.78 | 6.76 | 43349 |
| 3 | 10004 | Financial District | Manhattan | 10004 | 40.703675 | -74.013106 | 254 | 8546.10 | 2972.12 | 2 | 67.29 | 6.55 | 3896 |
| 4 | 10005 | Financial District | Manhattan | 10005 | 40.706092 | -74.008861 | 431 | 4921.65 | 8757.23 | 0 | 0.00 | 6.34 | 6738 |
Clean dataset
# convert to int
covid_death['MODIFIED_ZCTA'] = covid_death['MODIFIED_ZCTA'].astype('int')
# merge datasets
covid_vacc_zipcode_education_income_deaths = covid_vacc_zipcode_education_income.merge(covid_death, how='inner', left_on='MODZCTA', right_on='MODIFIED_ZCTA')
# this dataset contains lat and longitude so we could've just used that for tableau
covid_vacc_zipcode_education_income_deaths.head()
| | DATE | NEIGHBORHOOD_NAME_x | BOROUGH | MODZCTA | Label | AGE_GROUP | POP_DENOMINATOR_x | COUNT_PARTIALLY_CUMULATIVE | COUNT_FULLY_CUMULATIVE | COUNT_1PLUS_CUMULATIVE | ... | label | lat | lon | COVID_CASE_COUNT | COVID_CASE_RATE | POP_DENOMINATOR_y | COVID_DEATH_COUNT | COVID_DEATH_RATE | PERCENT_POSITIVE | TOTAL_COVID_TESTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-20 | Chelsea/NoMad/West Chelsea | Manhattan | 10001 | 10001, 10118 | All ages | 27613.09 | 3017 | 21749 | 24766 | ... | 10001, 10118 | 40.750693 | -73.997137 | 1631 | 5906.62 | 27613.09 | 35 | 126.75 | 7.15 | 22799 |
| 1 | 2021-06-20 | Chinatown/Lower East Side | Manhattan | 10002 | 10002 | All ages | 75322.71 | 5571 | 48728 | 54299 | ... | 10002 | 40.715781 | -73.986176 | 6118 | 8122.38 | 75322.71 | 289 | 383.68 | 11.86 | 51736 |
| 2 | 2021-06-20 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 10003 | All ages | 53977.81 | 3295 | 33687 | 36982 | ... | 10003 | 40.731825 | -73.989164 | 2934 | 5435.57 | 53977.81 | 49 | 90.78 | 6.76 | 43349 |
| 3 | 2021-06-20 | Financial District | Manhattan | 10004 | 10004 | All ages | 2972.12 | 390 | 2880 | 3270 | ... | 10004 | 40.703675 | -74.013106 | 254 | 8546.10 | 2972.12 | 2 | 67.29 | 6.55 | 3896 |
| 4 | 2021-06-20 | Financial District | Manhattan | 10005 | 10005 | All ages | 8757.23 | 907 | 6202 | 7109 | ... | 10005 | 40.706092 | -74.008861 | 431 | 4921.65 | 8757.23 | 0 | 0.00 | 6.34 | 6738 |
5 rows × 45 columns
# fortunately, we have no missing datapoints
covid_vacc_zipcode_education_income_deaths.isna().sum()
DATE 0
NEIGHBORHOOD_NAME_x 0
BOROUGH 0
MODZCTA 0
Label 0
AGE_GROUP 0
POP_DENOMINATOR_x 0
COUNT_PARTIALLY_CUMULATIVE 0
COUNT_FULLY_CUMULATIVE 0
COUNT_1PLUS_CUMULATIVE 0
PERC_PARTIALLY 0
PERC_FULLY 0
PERC_1PLUS 0
LATITUDE 0
LONGITUDE 0
raw_zip_code 0
less_than_high_school 0
high_school 0
some_college_associates 0
bachelors 0
graduate_professional_deg 0
less_than_high_school % 0
high_school % 0
some_college_associates % 0
bachelors % 0
graduate_professional_deg % 0
Location 0
Household Type 0
TimeFrame 0
DataFormat 0
Income 0
Fips 0
MODIFIED_ZCTA 0
NEIGHBORHOOD_NAME_y 0
BOROUGH_GROUP 0
label 0
lat 0
lon 0
COVID_CASE_COUNT 0
COVID_CASE_RATE 0
POP_DENOMINATOR_y 0
COVID_DEATH_COUNT 0
COVID_DEATH_RATE 0
PERCENT_POSITIVE 0
TOTAL_COVID_TESTS 0
dtype: int64
This plot shows the relationship between median household income and the percentage of fully vaccinated individuals in NYC neighborhoods; point size reflects each neighborhood's COVID-19 death count.
plt.figure(figsize=(20,15))
scatter = plt.scatter(covid_vacc_zipcode_education_income_deaths['PERC_FULLY'], covid_vacc_zipcode_education_income_deaths['Income'], c=covid_vacc_zipcode_education_income_deaths['Income'], s=covid_vacc_zipcode_education_income_deaths['COVID_DEATH_COUNT'])
plt.colorbar()
plt.xlabel('Fully Vacc %')
plt.ylabel('Income')
plt.title('Income vs. COVID-19 Fully Vaccinated %')
handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)
legend2 = plt.legend(handles, labels, loc="upper left", title="COVID-19 Death Count")
plt.show()
The next plot shows the relationship between median household income and the percentage of individuals who have received at least one vaccination shot in NYC neighborhoods.
plt.figure(figsize=(20,15))
scatter = plt.scatter(covid_vacc_zipcode_education_income_deaths['PERC_1PLUS'], covid_vacc_zipcode_education_income_deaths['Income'], c=covid_vacc_zipcode_education_income_deaths['Income'], s=covid_vacc_zipcode_education_income_deaths['COVID_DEATH_COUNT'])
plt.colorbar()
plt.xlabel('1 Plus Vacc %')
plt.ylabel('Income')
plt.title('Income vs. COVID-19 1 Plus Vacc %')
handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)
legend2 = plt.legend(handles, labels, loc="upper left", title="COVID-19 Death Count")
plt.show()
And here, the relationship between median household income and the COVID-19 death rate in NYC neighborhoods; point size reflects the fully vaccinated percentage.
plt.figure(figsize=(10,10))
scatter = plt.scatter(covid_vacc_zipcode_education_income_deaths['COVID_DEATH_RATE'], covid_vacc_zipcode_education_income_deaths['Income'], c=covid_vacc_zipcode_education_income_deaths['Income'], s=covid_vacc_zipcode_education_income_deaths['PERC_FULLY'])
plt.colorbar()
plt.xlabel('Death Rate')
plt.ylabel('Income')
plt.title('Income vs. COVID-19 Death Rate')
handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)
legend2 = plt.legend(handles, labels, loc="upper right", title="COVID-19 FULLY VACC %")
plt.show()
Areas that were hit hardest by COVID-19 are less likely to be vaccinated.
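Before mapping this, a quick sanity check: the pairwise correlations between the vaccination rate, the death rate, and income point in the same direction (the exact values depend on the data snapshot used):
# pairwise correlations between vaccination rate, death rate, and income
covid_vacc_zipcode_education_income_deaths[['PERC_FULLY', 'COVID_DEATH_RATE', 'Income']].corr()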
import geopandas as gpd
from geopandas import GeoDataFrame
from shapely.geometry import Point
nyc_neighborhood = gpd.read_file('Neighborhood Tabulation Areas (NTA)/geo_export_ab6fc730-beb8-4922-8a7a-24de1a3fdc50.shp')
geometry = [Point(xy) for xy in zip(covid_vacc_zipcode_education_income_deaths['LONGITUDE'], covid_vacc_zipcode_education_income_deaths['LATITUDE'])]
gdf = GeoDataFrame(covid_vacc_zipcode_education_income_deaths, geometry=geometry)
import matplotlib as mpl
cmap = mpl.colors.ListedColormap(['red', 'orange', 'green'])
import contextily as ctx
ax = nyc_neighborhood.boundary.plot(figsize=(20,15), color='grey')
plt.axis('off')
plt.title('COVID-19 VACCINATION NYC', fontsize=20)
# plot the scatter again on top of the shapefile to get a size legend
scatter = plt.scatter('LONGITUDE', 'LATITUDE', data=covid_vacc_zipcode_education_income_deaths, s='COVID_DEATH_COUNT', c='PERC_FULLY',cmap=cmap)
plt.legend(*scatter.legend_elements("sizes", num=6, alpha=0.6), title='COVID-19 DEATH COUNT')
plt.colorbar(scatter, label='Fully Vaccinated %')
ctx.add_basemap(ax=ax, crs='epsg:4326')
We are going to try several regression models, both linear and non-linear, to see if we can predict the number of people who will be fully vaccinated.
from sklearn.model_selection import train_test_split
import re
# create functions for training regression models
# split into test and train datasets
def modelBuilder(X, y, seed=7, test_size=0.1):
    '''returns X_train, y_train, X_test, y_test'''
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
    return X_train, y_train, X_test, y_test
def modelsTrainer(X_train, y_train, X_test, y_test, models):
    '''Returns a list of {model_name: [training_score, testing_score, delta]}'''
    # create an empty list for results
    results = []
    # loop over models
    for model in models:
        # if you get a ValueError, make sure the function received its parameters in the correct order
        model.fit(X_train, y_train)
        # bias
        training_score = model.score(X_train, y_train)
        # variance
        testing_score = model.score(X_test, y_test)
        # difference between training and testing score
        delta = training_score - testing_score
        # get the raw model name
        model_name = re.search('[a-zA-Z]+', str(model))[0]
        # append scores to results
        results.append({model_name: [training_score, testing_score, delta]})
    return results
Set up train and test datasets
# import all of the models we want to test
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from xgboost import XGBRegressor
X = covid_vacc_zipcode_education_income_deaths[['POP_DENOMINATOR_x', 'Income','bachelors', 'COVID_DEATH_COUNT']]
y = covid_vacc_zipcode_education_income_deaths['COUNT_FULLY_CUMULATIVE']
X_train, y_train, X_test, y_test = modelBuilder(X, y)
Before we choose a model we have to deal with the bias-variance tradeoff. Every model choice introduces some bias: a model with high bias risks underfitting the data, while a model with high variance risks overfitting it. Ideally, we want a balance between the two. We will fit each model, record its training and testing scores, and use the difference between them to decide which model to choose.
# models to train
models = [LinearRegression(), RandomForestRegressor(), ElasticNet(), GradientBoostingRegressor(), XGBRegressor(verbosity=0), Lasso(), Ridge()]
# fit models
# returns training score, testing score and difference between the two
modelsTrainer(X_train, y_train, X_test, y_test, models)
[{'LinearRegression': [0.8349011295818145,
0.805264936984036,
0.0296361925977785]},
{'RandomForestRegressor': [0.9393786736093015,
0.7816819810432987,
0.15769669256600283]},
{'ElasticNet': [0.8349011295434671,
0.8052659443078068,
0.029635185235660222]},
{'GradientBoostingRegressor': [0.9854864931948837,
0.8559366915021283,
0.12954980169275543]},
{'XGBRegressor': [0.9781925378371675,
0.8717985612917483,
0.10639397654541927]},
{'Lasso': [0.8349011295798674, 0.8052651639636368,
0.02963596561623061]},
{'Ridge': [0.8349011295818097, 0.805264948229599,
0.02963618135221069]}]
from sklearn.model_selection import cross_val_score, KFold
cv = KFold(n_splits=10)
# use cross-validation to confirm our linear models have low variance and generalize well
for model in models:
    cross_val2 = cross_val_score(model, X_train, y_train, cv=cv)
    # the linear models have a low standard deviation, meaning their variance is low,
    # so we can conclude our linear model(s) will generalize well
    print(re.search('[a-zA-Z]+', str(model))[0], '-', 'mean:', cross_val2.mean(), 'std:', cross_val2.std())
LinearRegression - mean: 0.7988492776333851 std: 0.09774188277510705
RandomForestRegressor - mean: 0.7321956559173095 std: 0.16819931769007415
ElasticNet - mean: 0.7988515079441996 std: 0.09774302215729834
GradientBoostingRegressor - mean: 0.6875606157113585 std: 0.23012268956944482
XGBRegressor - mean: 0.7207995095977946 std: 0.19388589251272895
Lasso - mean: 0.7988493286692742 std: 0.09774221635911819
Ridge - mean: 0.7988493084839189 std: 0.09774189636351902
lasso = Lasso()
lasso.fit(X_train, y_train)
from yellowbrick.regressor import ResidualsPlot
# if the points are randomly dispersed around the horizontal axis,
# a linear model is our best route
visualizer = ResidualsPlot(lasso, hist=False, qqplot=True)
visualizer.fit(X_train, y_train) # Fit the training data
visualizer.score(X_test, y_test) # Evaluate the model for variance
visualizer.show()
We can see that the linear models performed better than the non-linear models. These linear models are Linear Regression, Elastic Net, Lasso, and Ridge. I went with Lasso regression for a number of reasons:
Regularization prevents overfitting, which keeps our model from having high variance. Lasso does this by adding an L1 penalty, the sum of the absolute values of the weights, to the least-squares loss.
Lasso is robust, meaning it is resistant to outliers. This is helpful because NYC data tends to have outliers, especially for income and population.
Sparsity: Lasso has built-in feature selection, shrinking the coefficients of uninformative independent variables to exactly zero. This reduces the need to manually drop coefficients whose p-value exceeds 0.05, or whichever significance level we set before building the regression model (see the coefficient check below).
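To sanity-check the sparsity claim, we can look at the fitted coefficients; any feature Lasso deems uninformative gets a coefficient of exactly zero. A minimal sketch (the output depends on the data snapshot):
# print each feature next to its fitted Lasso coefficient;
# zeros mark features the L1 penalty eliminated
for name, coef in zip(X.columns, lasso.coef_):
    print(f'{name}: {coef:.4f}')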
Although there are pros to using Lasso regression, there are cons as well: the L1 penalty biases coefficient estimates toward zero, and when features are strongly correlated Lasso tends to keep one and drop the others somewhat arbitrarily. With that in mind, let's tune the regularization strength, alpha.
from sklearn.metrics import mean_squared_error
mse = []
alphas = [0, 0.0001, 0.001, 0.01, 0.1, 1]
# note: alpha=0 is ordinary least squares; scikit-learn recommends LinearRegression for that case
for alpha in alphas:
    model = Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    mse.append(mean_squared_error(y_test, model.predict(X_test)))
# the lowest mean squared error is when alpha is 1
plt.plot(alphas, mse)
from sklearn.model_selection import GridSearchCV
# use grid search to tune our alpha
param_grids = {'alpha': [0, 0.0001, 0.001, 0.01, 0.1, 1]}
# setup grid search
grid_search_lasso = GridSearchCV(estimator=lasso, param_grid=param_grids, cv=cv)
grid_search_lasso.fit(X_train, y_train)
grid_search_lasso.best_score_
0.8001171769697994
# best alpha is indeed 1
grid_search_lasso.best_params_
{'alpha': 1}
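As a small usage note, the refit model with the winning alpha can be pulled straight out of the grid search; best_lasso below is just a name introduced for illustration:
# GridSearchCV refits on the full training set with the best params (refit=True by default)
best_lasso = grid_search_lasso.best_estimator_
print(best_lasso.alpha, best_lasso.score(X_test, y_test))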
Although our model is robust, let's see if scaling our data can improve the accuracy.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# make pipeline
pipe = make_pipeline(StandardScaler(), GridSearchCV(estimator=lasso, param_grid=param_grids, cv=cv))
pipe.fit(X_train, y_train)
# As you can see our model barely improved
pipe.score(X_test, y_test)
0.8053618445370625
I am confident that this model will do fairly well on new data, with at least 80% accuracy. I will save the model through a process called serialization and keep it for later use. I can then take this model, build an application around it, and deploy it into production. Before I do that, though, I would need to write unit tests and logging that alert me whenever the model begins to perform poorly; my threshold will be 75%. At that point I would need to retrain the model on more data. Once this is done, I would probably dockerize the application and deploy it using AWS.
# sklearn.externals.joblib is deprecated; import joblib directly
import joblib
# save model for later use
joblib.dump(lasso, 'lasso_grid_search.pkl')
['lasso_grid_search.pkl']
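As a minimal sketch of that monitoring idea, here is a hypothetical health check; check_model_health, the 0.75 threshold, and the fresh labeled batch (X_new, y_new) are illustrative assumptions, not part of the pipeline above.
import logging
logging.basicConfig(level=logging.INFO)

# hypothetical monitoring helper: reload the pickled model and alert
# if its R^2 score on a fresh labeled batch drops below the threshold
def check_model_health(model_path, X_new, y_new, threshold=0.75):
    model = joblib.load(model_path)
    score = model.score(X_new, y_new)
    if score < threshold:
        logging.warning('model score %.3f fell below threshold %.2f - retrain on more data', score, threshold)
    else:
        logging.info('model score %.3f is healthy', score)
    return score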
COVID-19 vaccination rates are still a problem in low-income areas that were hit the hardest by this pandemic. Areas that are affluent were barely affected by COVID-19 and tend to have some of the highest COVID-19 vaccination rates. There needs to be an increased effort to get more people in low-income neighborhoods to take the vaccine. There is a lot of mistrust between the medical community and low-income communities. Convincing people from low-income communities to take the vaccine starts with trust and transparency. This is one of the many steps we must take as New Yorkers to get back to normal.