Author: David Charles
Date: 07/01/2021
We investigate COVID-19 vaccination rates across New York City and the factors that influence New Yorkers to get vaccinated. Data was collected from several sources, cleaned, and merged into one dataset for the analysis. We examine how vaccination rates relate to median household income, educational attainment, and COVID-19 death rates and counts, and we attempt to predict how many people in a neighborhood will get vaccinated.
COVID-19 changed life as we know it all over the world, and one of the places hit hardest was NYC. For months we watched NYC's death count rise, and it caused a lot of anxiety for New Yorkers. Fast forward to 2021 and we have two major vaccines, Pfizer and Moderna. However, since the rollout began in December 2020, many of the communities that were impacted the most by the pandemic still have low vaccination rates. We will use Python to analyze data collected from sources such as NYC Health, the Citizens' Committee for Children, and the United States Census Bureau.
Our tool of choice for the analysis is Python. First, we import the COVID-19 vaccination data.
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
covid_vacc_zipcode = pd.read_csv('people/coverage-by-modzcta-allages.csv')
covid_vacc_zipcode.head()
| | DATE | NEIGHBORHOOD_NAME | BOROUGH | MODZCTA | Label | AGE_GROUP | POP_DENOMINATOR | COUNT_PARTIALLY_CUMULATIVE | COUNT_FULLY_CUMULATIVE | COUNT_1PLUS_CUMULATIVE | PERC_PARTIALLY | PERC_FULLY | PERC_1PLUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-20 | Chelsea/NoMad/West Chelsea | Manhattan | 10001 | 10001, 10118 | All ages | 27613.09 | 3017 | 21749 | 24766 | 10.93 | 78.76 | 89.69 |
| 1 | 2021-06-20 | Chinatown/Lower East Side | Manhattan | 10002 | 10002 | All ages | 75322.71 | 5571 | 48728 | 54299 | 7.40 | 64.69 | 72.09 |
| 2 | 2021-06-20 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 10003 | All ages | 53977.81 | 3295 | 33687 | 36982 | 6.10 | 62.41 | 68.51 |
| 3 | 2021-06-20 | Financial District | Manhattan | 10004 | 10004 | All ages | 2972.12 | 390 | 2880 | 3270 | 13.12 | 96.90 | 110.02 |
| 4 | 2021-06-20 | Financial District | Manhattan | 10005 | 10005 | All ages | 8757.23 | 907 | 6202 | 7109 | 10.36 | 70.82 | 81.18 |
To plot this data on a map in Tableau, we use pgeocode to look up latitude and longitude for each zip code in the MODZCTA column.
import pgeocode
nomi = pgeocode.Nominatim('US')
# get latitude and longitude from zip code
covid_vacc_zipcode[['LATITUDE', 'LONGITUDE']] = covid_vacc_zipcode['MODZCTA'].apply(lambda x: nomi.query_postal_code(str(x))[['latitude', 'longitude']])
covid_vacc_zipcode.head()
| | DATE | NEIGHBORHOOD_NAME | BOROUGH | MODZCTA | Label | AGE_GROUP | POP_DENOMINATOR | COUNT_PARTIALLY_CUMULATIVE | COUNT_FULLY_CUMULATIVE | COUNT_1PLUS_CUMULATIVE | PERC_PARTIALLY | PERC_FULLY | PERC_1PLUS | LATITUDE | LONGITUDE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-20 | Chelsea/NoMad/West Chelsea | Manhattan | 10001 | 10001, 10118 | All ages | 27613.09 | 3017 | 21749 | 24766 | 10.93 | 78.76 | 89.69 | 40.7484 | -73.9967 |
| 1 | 2021-06-20 | Chinatown/Lower East Side | Manhattan | 10002 | 10002 | All ages | 75322.71 | 5571 | 48728 | 54299 | 7.40 | 64.69 | 72.09 | 40.7152 | -73.9877 |
| 2 | 2021-06-20 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 10003 | All ages | 53977.81 | 3295 | 33687 | 36982 | 6.10 | 62.41 | 68.51 | 40.7313 | -73.9892 |
| 3 | 2021-06-20 | Financial District | Manhattan | 10004 | 10004 | All ages | 2972.12 | 390 | 2880 | 3270 | 13.12 | 96.90 | 110.02 | 40.7143 | -74.0060 |
| 4 | 2021-06-20 | Financial District | Manhattan | 10005 | 10005 | All ages | 8757.23 | 907 | 6202 | 7109 | 10.36 | 70.82 | 81.18 | 40.7056 | -74.0083 |
Next, we import the "Educational Attainment by Zipcode" dataset and filter down to the columns that matter for this project.
# get education by zipcodes
educational_attainment = pd.read_csv('people/ACSST5Y2019.S1501_data_with_overlays_2021-06-22T105606.csv', header=1)
# get cols we need for analysis
cols = ['id',
'Geographic Area Name',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!Less than high school graduate',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!High school graduate (includes equivalency)',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!Some college or associate\'s degree',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!Bachelor\'s degree or higher',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!High school graduate (includes equivalency)',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Some college, no degree',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Associate\'s degree',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Bachelor\'s degree',
'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Graduate or professional degree']
# get filtered dataset
educational_attainment = educational_attainment[cols]
Now that we have the educational attainment data we need, we will total each attainment level by zip code by filtering the relevant columns and summing across them. A small helper function makes the filtering easier, and we will also clean the zip code column so that merging the datasets is easier.
# column filter
def filter_column(df, expressions):
    '''
    df - dataframe
    expressions - str, regex of alternatives to match against column names,
    e.g. 'Dog|dog|cat|Cat'
    '''
    return df.filter(regex=expressions)
# educational attainment
less_than_high_school = filter_column(educational_attainment, 'Less than').sum(axis=1)
high_school = filter_column(educational_attainment,'High').sum(axis=1)
some_college_associates = filter_column(educational_attainment, 'Some|Associate').sum(axis=1)
bachelors = filter_column(educational_attainment, 'Bachelor').sum(axis=1)
graduate_professional_deg = filter_column(educational_attainment, 'Graduate').sum(axis=1)
# create dataframe for educational attainment
degrees = pd.DataFrame({'raw_zip_code': educational_attainment['Geographic Area Name'],
                        'less_than_high_school': less_than_high_school,
                        'high_school': high_school,
                        'some_college_associates': some_college_associates,
                        'bachelors': bachelors,
                        'graduate_professional_deg': graduate_professional_deg})
# clean zip code column
degrees.loc[:, 'raw_zip_code'] = degrees.apply(lambda row: row.raw_zip_code[-5:].strip(), axis=1).astype('int')
# merge covid dataset and education attainment on zip code
covid_vacc_zipcode_education = covid_vacc_zipcode.merge(degrees, how='inner', left_on='MODZCTA', right_on='raw_zip_code')
covid_vacc_zipcode_education.head()
# calculate each educational attainment level as a percentage of the neighborhood population
pct_cols = ['less_than_high_school %', 'high_school %', 'some_college_associates %', 'bachelors %', 'graduate_professional_deg %']
count_cols = ['less_than_high_school', 'high_school', 'some_college_associates', 'bachelors', 'graduate_professional_deg']
covid_vacc_zipcode_education[pct_cols] = covid_vacc_zipcode_education[count_cols].div(covid_vacc_zipcode_education['POP_DENOMINATOR'], axis=0)
covid_vacc_zipcode_education.head()
| | DATE | NEIGHBORHOOD_NAME | BOROUGH | MODZCTA | Label | AGE_GROUP | POP_DENOMINATOR | COUNT_PARTIALLY_CUMULATIVE | COUNT_FULLY_CUMULATIVE | COUNT_1PLUS_CUMULATIVE | ... | less_than_high_school | high_school | some_college_associates | bachelors | graduate_professional_deg | less_than_high_school % | high_school % | some_college_associates % | bachelors % | graduate_professional_deg % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-20 | Chelsea/NoMad/West Chelsea | Manhattan | 10001 | 10001, 10118 | All ages | 27613.09 | 3017 | 21749 | 24766 | ... | 96 | 3016 | 3113 | 8594 | 5443 | 0.003477 | 0.109224 | 0.112736 | 0.311229 | 0.197117 |
| 1 | 2021-06-20 | Chinatown/Lower East Side | Manhattan | 10002 | 10002 | All ages | 75322.71 | 5571 | 48728 | 54299 | ... | 492 | 12522 | 10714 | 14666 | 6605 | 0.006532 | 0.166245 | 0.142241 | 0.194709 | 0.087689 |
| 2 | 2021-06-20 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 10003 | All ages | 53977.81 | 3295 | 33687 | 36982 | ... | 98 | 5316 | 11336 | 18406 | 13762 | 0.001816 | 0.098485 | 0.210012 | 0.340992 | 0.254957 |
| 3 | 2021-06-20 | Financial District | Manhattan | 10004 | 10004 | All ages | 2972.12 | 390 | 2880 | 3270 | ... | 22 | 76 | 131 | 1357 | 1337 | 0.007402 | 0.025571 | 0.044076 | 0.456576 | 0.449847 |
| 4 | 2021-06-20 | Financial District | Manhattan | 10005 | 10005 | All ages | 8757.23 | 907 | 6202 | 7109 | ... | 0 | 413 | 629 | 4071 | 2536 | 0.000000 | 0.047161 | 0.071826 | 0.464873 | 0.289589 |
5 rows × 26 columns
Next, we load the median household income data and clean it before merging it with our dataset.
# median income
median_income_nyc = pd.read_csv('people/Median Incomes.csv')
# rename data column to income
median_income_nyc.rename(columns={'Data':'Income'}, inplace=True)
# zip code rows have a Location value that starts with 'Z';
# find where they begin instead of hardcoding a row index
first_zip_idx = median_income_nyc[median_income_nyc['Location'].str[0]=='Z'].index[0]
# keep only the zip code rows
median_income_nyc = median_income_nyc.loc[first_zip_idx:, :].copy()
# clean zip code: keep the trailing 5 digits and convert to int
median_income_nyc['Location'] = median_income_nyc['Location'].str[-5:].str.strip().astype('int')
# convert income column to float
median_income_nyc['Income'] = median_income_nyc['Income'].astype('float')
# filter to "All Households" and the most recent median household income, which is 2019,
# then merge onto covid_vacc_zipcode_education
income_2019 = median_income_nyc[(median_income_nyc['Household Type']=='All Households') & (median_income_nyc['TimeFrame']==2019)]
covid_vacc_zipcode_education_income = covid_vacc_zipcode_education.merge(income_2019, how='inner', left_on='MODZCTA', right_on='Location')
covid_vacc_zipcode_education_income.head()
| | DATE | NEIGHBORHOOD_NAME | BOROUGH | MODZCTA | Label | AGE_GROUP | POP_DENOMINATOR | COUNT_PARTIALLY_CUMULATIVE | COUNT_FULLY_CUMULATIVE | COUNT_1PLUS_CUMULATIVE | ... | high_school % | some_college_associates % | bachelors % | graduate_professional_deg % | Location | Household Type | TimeFrame | DataFormat | Income | Fips |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-20 | Chelsea/NoMad/West Chelsea | Manhattan | 10001 | 10001, 10118 | All ages | 27613.09 | 3017 | 21749 | 24766 | ... | 0.109224 | 0.112736 | 0.311229 | 0.197117 | 10001 | All Households | 2019 | Dollars | 92840.0 | 10001 |
| 1 | 2021-06-20 | Chinatown/Lower East Side | Manhattan | 10002 | 10002 | All ages | 75322.71 | 5571 | 48728 | 54299 | ... | 0.166245 | 0.142241 | 0.194709 | 0.087689 | 10002 | All Households | 2019 | Dollars | 36982.0 | 10002 |
| 2 | 2021-06-20 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 10003 | All ages | 53977.81 | 3295 | 33687 | 36982 | ... | 0.098485 | 0.210012 | 0.340992 | 0.254957 | 10003 | All Households | 2019 | Dollars | 118161.0 | 10003 |
| 3 | 2021-06-20 | Financial District | Manhattan | 10004 | 10004 | All ages | 2972.12 | 390 | 2880 | 3270 | ... | 0.025571 | 0.044076 | 0.456576 | 0.449847 | 10004 | All Households | 2019 | Dollars | 190223.0 | 10004 |
| 4 | 2021-06-20 | Financial District | Manhattan | 10005 | 10005 | All ages | 8757.23 | 907 | 6202 | 7109 | ... | 0.047161 | 0.071826 | 0.464873 | 0.289589 | 10005 | All Households | 2019 | Dollars | 189702.0 | 10005 |
5 rows × 32 columns
Now we will import the COVID-19 case and death dataset and merge it into our covid_vacc_zipcode_education_income dataset.
covid_death = pd.read_csv('coronavirus-data/totals/data-by-modzcta.csv')
covid_death.head()
| | MODIFIED_ZCTA | NEIGHBORHOOD_NAME | BOROUGH_GROUP | label | lat | lon | COVID_CASE_COUNT | COVID_CASE_RATE | POP_DENOMINATOR | COVID_DEATH_COUNT | COVID_DEATH_RATE | PERCENT_POSITIVE | TOTAL_COVID_TESTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | Chelsea/NoMad/West Chelsea | Manhattan | 10001, 10118 | 40.750693 | -73.997137 | 1631 | 5906.62 | 27613.09 | 35 | 126.75 | 7.15 | 22799 |
| 1 | 10002 | Chinatown/Lower East Side | Manhattan | 10002 | 40.715781 | -73.986176 | 6118 | 8122.38 | 75322.71 | 289 | 383.68 | 11.86 | 51736 |
| 2 | 10003 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 40.731825 | -73.989164 | 2934 | 5435.57 | 53977.81 | 49 | 90.78 | 6.76 | 43349 |
| 3 | 10004 | Financial District | Manhattan | 10004 | 40.703675 | -74.013106 | 254 | 8546.10 | 2972.12 | 2 | 67.29 | 6.55 | 3896 |
| 4 | 10005 | Financial District | Manhattan | 10005 | 40.706092 | -74.008861 | 431 | 4921.65 | 8757.23 | 0 | 0.00 | 6.34 | 6738 |
Clean dataset
# convert to int
covid_death['MODIFIED_ZCTA'] = covid_death['MODIFIED_ZCTA'].astype('int')
# merge datasets
covid_vacc_zipcode_education_income_deaths = covid_vacc_zipcode_education_income.merge(covid_death, how='inner', left_on='MODZCTA', right_on='MODIFIED_ZCTA')
# this dataset contains lat and longitude so we could've just used that for tableau
covid_vacc_zipcode_education_income_deaths.head()
| | DATE | NEIGHBORHOOD_NAME_x | BOROUGH | MODZCTA | Label | AGE_GROUP | POP_DENOMINATOR_x | COUNT_PARTIALLY_CUMULATIVE | COUNT_FULLY_CUMULATIVE | COUNT_1PLUS_CUMULATIVE | ... | label | lat | lon | COVID_CASE_COUNT | COVID_CASE_RATE | POP_DENOMINATOR_y | COVID_DEATH_COUNT | COVID_DEATH_RATE | PERCENT_POSITIVE | TOTAL_COVID_TESTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-20 | Chelsea/NoMad/West Chelsea | Manhattan | 10001 | 10001, 10118 | All ages | 27613.09 | 3017 | 21749 | 24766 | ... | 10001, 10118 | 40.750693 | -73.997137 | 1631 | 5906.62 | 27613.09 | 35 | 126.75 | 7.15 | 22799 |
| 1 | 2021-06-20 | Chinatown/Lower East Side | Manhattan | 10002 | 10002 | All ages | 75322.71 | 5571 | 48728 | 54299 | ... | 10002 | 40.715781 | -73.986176 | 6118 | 8122.38 | 75322.71 | 289 | 383.68 | 11.86 | 51736 |
| 2 | 2021-06-20 | East Village/Gramercy/Greenwich Village | Manhattan | 10003 | 10003 | All ages | 53977.81 | 3295 | 33687 | 36982 | ... | 10003 | 40.731825 | -73.989164 | 2934 | 5435.57 | 53977.81 | 49 | 90.78 | 6.76 | 43349 |
| 3 | 2021-06-20 | Financial District | Manhattan | 10004 | 10004 | All ages | 2972.12 | 390 | 2880 | 3270 | ... | 10004 | 40.703675 | -74.013106 | 254 | 8546.10 | 2972.12 | 2 | 67.29 | 6.55 | 3896 |
| 4 | 2021-06-20 | Financial District | Manhattan | 10005 | 10005 | All ages | 8757.23 | 907 | 6202 | 7109 | ... | 10005 | 40.706092 | -74.008861 | 431 | 4921.65 | 8757.23 | 0 | 0.00 | 6.34 | 6738 |
5 rows × 45 columns
# fortunately, we have no missing datapoints
covid_vacc_zipcode_education_income_deaths.isna().sum()
DATE 0
NEIGHBORHOOD_NAME_x 0
BOROUGH 0
MODZCTA 0
Label 0
AGE_GROUP 0
POP_DENOMINATOR_x 0
COUNT_PARTIALLY_CUMULATIVE 0
COUNT_FULLY_CUMULATIVE 0
COUNT_1PLUS_CUMULATIVE 0
PERC_PARTIALLY 0
PERC_FULLY 0
PERC_1PLUS 0
LATITUDE 0
LONGITUDE 0
raw_zip_code 0
less_than_high_school 0
high_school 0
some_college_associates 0
bachelors 0
graduate_professional_deg 0
less_than_high_school % 0
high_school % 0
some_college_associates % 0
bachelors % 0
graduate_professional_deg % 0
Location 0
Household Type 0
TimeFrame 0
DataFormat 0
Income 0
Fips 0
MODIFIED_ZCTA 0
NEIGHBORHOOD_NAME_y 0
BOROUGH_GROUP 0
label 0
lat 0
lon 0
COVID_CASE_COUNT 0
COVID_CASE_RATE 0
POP_DENOMINATOR_y 0
COVID_DEATH_COUNT 0
COVID_DEATH_RATE 0
PERCENT_POSITIVE 0
TOTAL_COVID_TESTS 0
dtype: int64
This plot shows the relationship between median household income and the percentage of fully vaccinated individuals in NYC neighborhoods; point size reflects each neighborhood's COVID-19 death count.
plt.figure(figsize=(20,15))
scatter = plt.scatter(covid_vacc_zipcode_education_income_deaths['PERC_FULLY'], covid_vacc_zipcode_education_income_deaths['Income'], c=covid_vacc_zipcode_education_income_deaths['Income'], s=covid_vacc_zipcode_education_income_deaths['COVID_DEATH_COUNT'])
plt.colorbar()
plt.xlabel('Fully Vacc %')
plt.ylabel('Income')
plt.title('Income vs. COVID-19 Fully Vaccinated %')
handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)
legend2 = plt.legend(handles, labels, loc="upper left", title="COVID-19 Death Count")
plt.show()
The next plot shows the relationship between median household income and the percentage of individuals who have received at least one vaccination shot in NYC neighborhoods.
plt.figure(figsize=(20,15))
scatter = plt.scatter(covid_vacc_zipcode_education_income_deaths['PERC_1PLUS'], covid_vacc_zipcode_education_income_deaths['Income'], c=covid_vacc_zipcode_education_income_deaths['Income'], s=covid_vacc_zipcode_education_income_deaths['COVID_DEATH_COUNT'])
plt.colorbar()
plt.xlabel('1 Plus Vacc %')
plt.ylabel('Income')
plt.title('Income vs. COVID-19 1 Plus Vacc %')
handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)
legend2 = plt.legend(handles, labels, loc="upper left", title="COVID-19 Death Count")
plt.show()
And here, the relationship between median household income and the COVID-19 death rate in NYC neighborhoods; point size reflects the fully vaccinated percentage.
plt.figure(figsize=(10,10))
scatter = plt.scatter(covid_vacc_zipcode_education_income_deaths['COVID_DEATH_RATE'], covid_vacc_zipcode_education_income_deaths['Income'], c=covid_vacc_zipcode_education_income_deaths['Income'], s=covid_vacc_zipcode_education_income_deaths['PERC_FULLY'])
plt.colorbar()
plt.xlabel('Death Rate')
plt.ylabel('Income')
plt.title('Income vs. COVID-19 Death Rate')
handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)
legend2 = plt.legend(handles, labels, loc="upper right", title="COVID-19 FULLY VACC %")
plt.show()
Areas that were hit hardest by COVID-19 are less likely to be vaccinated.
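Before mapping this, a quick sanity check: the pairwise correlations between the vaccination rate, the death rate, and income point in the same direction (the exact values depend on the data snapshot used):
# pairwise correlations between vaccination rate, death rate, and income
covid_vacc_zipcode_education_income_deaths[['PERC_FULLY', 'COVID_DEATH_RATE', 'Income']].corr()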
import geopandas as gpd
from geopandas import GeoDataFrame
from shapely.geometry import Point
nyc_neighborhood = gpd.read_file('Neighborhood Tabulation Areas (NTA)/geo_export_ab6fc730-beb8-4922-8a7a-24de1a3fdc50.shp')
geometry = [Point(xy) for xy in zip(covid_vacc_zipcode_education_income_deaths['LONGITUDE'], covid_vacc_zipcode_education_income_deaths['LATITUDE'])]
gdf = GeoDataFrame(covid_vacc_zipcode_education_income_deaths, geometry=geometry)
import matplotlib as mpl
cmap = mpl.colors.ListedColormap(['red', 'orange', 'green'])
import contextily as ctx
ax = nyc_neighborhood.boundary.plot(figsize=(20,15), color='grey')
plt.axis('off')
plt.title('COVID-19 VACCINATION NYC', fontsize=20)
# plot the scatter again on top of the shapefile to get a size legend
scatter = plt.scatter('LONGITUDE', 'LATITUDE', data=covid_vacc_zipcode_education_income_deaths, s='COVID_DEATH_COUNT', c='PERC_FULLY',cmap=cmap)
plt.legend(*scatter.legend_elements("sizes", num=6, alpha=0.6), title='COVID-19 DEATH COUNT')
plt.colorbar(scatter, label='Fully Vaccinated %')
ctx.add_basemap(ax=ax, crs='epsg:4326')
We are going to try several regression models, both linear and non-linear, to see if we can predict the number of people who will be fully vaccinated.
from sklearn.model_selection import train_test_split
import re
# create functions for training regression models
# split into test and train datasets
def modelBuilder(X, y, seed=7, test_size=0.1):
    '''returns X_train, y_train, X_test, y_test'''
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
    return X_train, y_train, X_test, y_test
def modelsTrainer(X_train, y_train, X_test, y_test, models):
    '''Returns a list of {model_name: [training_score, testing_score, delta]}'''
    # create an empty list for results
    results = []
    # loop over models
    for model in models:
        # if you get a ValueError, make sure the function received its parameters in the correct order
        model.fit(X_train, y_train)
        # bias
        training_score = model.score(X_train, y_train)
        # variance
        testing_score = model.score(X_test, y_test)
        # difference between training and testing score
        delta = training_score - testing_score
        # get the raw model name
        model_name = re.search('[a-zA-Z]+', str(model))[0]
        # append scores to results
        results.append({model_name: [training_score, testing_score, delta]})
    return results
Set up train and test datasets
# import all of the models we want to test
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, Ridge
from xgboost import XGBRegressor
X = covid_vacc_zipcode_education_income_deaths[['POP_DENOMINATOR_x', 'Income','bachelors', 'COVID_DEATH_COUNT']]
y = covid_vacc_zipcode_education_income_deaths['COUNT_FULLY_CUMULATIVE']
X_train, y_train, X_test, y_test = modelBuilder(X, y)
Before we choose a model we have to deal with the bias-variance tradeoff. Every model choice introduces some bias: a model with high bias risks underfitting the data, while a model with high variance risks overfitting it. Ideally, we want a balance between the two. We will fit each model, record its training and testing scores, and use the difference between them to decide which model to choose.
# models to train
models = [LinearRegression(), RandomForestRegressor(), ElasticNet(), GradientBoostingRegressor(), XGBRegressor(verbosity=0), Lasso(), Ridge()]
# fit models
# returns training score, testing score and difference between the two
modelsTrainer(X_train, y_train, X_test, y_test, models)
[{'LinearRegression': [0.8349011295818145,
0.805264936984036,
0.0296361925977785]},
{'RandomForestRegressor': [0.9393786736093015,
0.7816819810432987,
0.15769669256600283]},
{'ElasticNet': [0.8349011295434671,
0.8052659443078068,
0.029635185235660222]},
{'GradientBoostingRegressor': [0.9854864931948837,
0.8559366915021283,
0.12954980169275543]},
{'XGBRegressor': [0.9781925378371675,
0.8717985612917483,
0.10639397654541927]},
{'Lasso': [0.8349011295798674, 0.8052651639636368,
0.02963596561623061]},
{'Ridge': [0.8349011295818097, 0.805264948229599,
0.02963618135221069]}]
from sklearn.model_selection import cross_val_score, KFold
cv = KFold(n_splits=10)
# use cross-validation to confirm our linear models have low variance and generalize well
for model in models:
    cross_val2 = cross_val_score(model, X_train, y_train, cv=cv)
    # the linear models have a low standard deviation, meaning their variance is low,
    # so we can conclude our linear model(s) will generalize well
    print(re.search('[a-zA-Z]+', str(model))[0], '-', 'mean:', cross_val2.mean(), 'std:', cross_val2.std())
LinearRegression - mean: 0.7988492776333851 std: 0.09774188277510705
RandomForestRegressor - mean: 0.7321956559173095 std: 0.16819931769007415
ElasticNet - mean: 0.7988515079441996 std: 0.09774302215729834
GradientBoostingRegressor - mean: 0.6875606157113585 std: 0.23012268956944482
XGBRegressor - mean: 0.7207995095977946 std: 0.19388589251272895
Lasso - mean: 0.7988493286692742 std: 0.09774221635911819
Ridge - mean: 0.7988493084839189 std: 0.09774189636351902
lasso = Lasso()
lasso.fit(X_train, y_train)
from yellowbrick.regressor import ResidualsPlot
# if the points are randomly dispersed around the horizontal axis,
# a linear model is our best route
visualizer = ResidualsPlot(lasso, hist=False, qqplot=True)
visualizer.fit(X_train, y_train) # Fit the training data
visualizer.score(X_test, y_test) # Evaluate the model for variance
visualizer.show()
We can see that the linear models performed better than the non-linear models. These linear models are Linear Regression, Elastic Net, Lasso, and Ridge. I went with Lasso regression for a number of reasons:
Regularization prevents overfitting, which keeps our model from having high variance. Lasso does this by adding an L1 penalty, the sum of the absolute values of the weights, to the least-squares loss.
Lasso is robust, meaning it is resistant to outliers. This is helpful because NYC data tends to have outliers, especially for income and population.
Sparsity: Lasso has built-in feature selection, shrinking the coefficients of uninformative independent variables to exactly zero. This reduces the need to manually drop coefficients whose p-value exceeds 0.05, or whichever significance level we set before building the regression model (see the coefficient check below).
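To sanity-check the sparsity claim, we can look at the fitted coefficients; any feature Lasso deems uninformative gets a coefficient of exactly zero. A minimal sketch (the output depends on the data snapshot):
# print each feature next to its fitted Lasso coefficient;
# zeros mark features the L1 penalty eliminated
for name, coef in zip(X.columns, lasso.coef_):
    print(f'{name}: {coef:.4f}')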
Although there are pros to using Lasso regression, there are cons as well: the L1 penalty biases coefficient estimates toward zero, and when features are strongly correlated Lasso tends to keep one and drop the others somewhat arbitrarily. With that in mind, let's tune the regularization strength, alpha.
from sklearn.metrics import mean_squared_error
mse = []
alphas = [0, 0.0001, 0.001, 0.01, 0.1, 1]
# note: alpha=0 is ordinary least squares; scikit-learn recommends LinearRegression for that case
for alpha in alphas:
    model = Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    mse.append(mean_squared_error(y_test, model.predict(X_test)))
# the lowest mean squared error is when alpha is 1
plt.plot(alphas, mse)
from sklearn.model_selection import GridSearchCV
# use grid search to tune our alpha
param_grids = {'alpha': [0, 0.0001, 0.001, 0.01, 0.1, 1]}
# setup grid search
grid_search_lasso = GridSearchCV(estimator=lasso, param_grid=param_grids, cv=cv)
grid_search_lasso.fit(X_train, y_train)
grid_search_lasso.best_score_
0.8001171769697994
# best alpha is indeed 1
grid_search_lasso.best_params_
{'alpha': 1}
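As a small usage note, the refit model with the winning alpha can be pulled straight out of the grid search; best_lasso below is just a name introduced for illustration:
# GridSearchCV refits on the full training set with the best params (refit=True by default)
best_lasso = grid_search_lasso.best_estimator_
print(best_lasso.alpha, best_lasso.score(X_test, y_test))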
Although our model is robust, let's see if scaling our data can improve the accuracy.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# make pipeline
pipe = make_pipeline(StandardScaler(), GridSearchCV(estimator=lasso, param_grid=param_grids, cv=cv))
pipe.fit(X_train, y_train)
# As you can see our model barely improved
pipe.score(X_test, y_test)
0.8053618445370625
I am confident that this model will do fairly well on new data, with at least 80% accuracy. I will save the model through a process called serialization and keep it for later use. I can then take this model, build an application around it, and deploy it into production. Before I do that, though, I would need to write unit tests and logging that alert me whenever the model begins to perform poorly; my threshold will be 75%. At that point I would need to retrain the model on more data. Once this is done, I would probably dockerize the application and deploy it using AWS.
# sklearn.externals.joblib is deprecated; import joblib directly
import joblib
# save model for later use
joblib.dump(lasso, 'lasso_grid_search.pkl')
['lasso_grid_search.pkl']
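As a minimal sketch of that monitoring idea, here is a hypothetical health check; check_model_health, the 0.75 threshold, and the fresh labeled batch (X_new, y_new) are illustrative assumptions, not part of the pipeline above.
import logging
logging.basicConfig(level=logging.INFO)

# hypothetical monitoring helper: reload the pickled model and alert
# if its R^2 score on a fresh labeled batch drops below the threshold
def check_model_health(model_path, X_new, y_new, threshold=0.75):
    model = joblib.load(model_path)
    score = model.score(X_new, y_new)
    if score < threshold:
        logging.warning('model score %.3f fell below threshold %.2f - retrain on more data', score, threshold)
    else:
        logging.info('model score %.3f is healthy', score)
    return score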
COVID-19 vaccination rates are still a problem in low-income areas that were hit the hardest by this pandemic. Areas that are affluent were barely affected by COVID-19 and tend to have some of the highest COVID-19 vaccination rates. There needs to be an increased effort to get more people in low-income neighborhoods to take the vaccine. There is a lot of mistrust between the medical community and low-income communities. Convincing people from low-income communities to take the vaccine starts with trust and transparency. This is one of the many steps we must take as New Yorkers to get back to normal.