COVID-19 VACCINATION NYC ANALYSIS

Author: David Charles
Date: 07/01/2021

Abstract

We investigate the factors that influence New Yorkers to get vaccinated against COVID-19. Data was collected from various sources, cleaned, and merged into one dataset in order to conduct the analysis. We examine the correlation between COVID-19 vaccination rates, median household income, educational attainment, and COVID-19 death rate/count, and we also attempt to predict how many people in a neighborhood will get vaccinated.

Introduction

COVID-19 changed life as we know it all over the world. One of the places hit hardest by COVID-19 was NYC. For months we watched NYC's death count rise, and it caused a lot of anxiety for fellow New Yorkers. Fast forward to 2021 and we have two major vaccines, Pfizer and Moderna. However, since the rollout began in December 2020, many of the communities that were impacted the most by the pandemic still have low vaccination rates. We will use Python to analyze the data we've collected from various sources, such as NYC Health, the Citizens' Committee for Children, and the United States Census Bureau.

Methods

Our tool of choice for doing the analysis is Python. We will import COVID-19 vaccination data first.

import pandas as pd
import numpy as np

# suppress warnings (including deprecation warnings) to keep the output clean
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

covid_vacc_zipcode = pd.read_csv('people/coverage-by-modzcta-allages.csv')
covid_vacc_zipcode.head()
DATE NEIGHBORHOOD_NAME BOROUGH MODZCTA Label AGE_GROUP POP_DENOMINATOR COUNT_PARTIALLY_CUMULATIVE COUNT_FULLY_CUMULATIVE COUNT_1PLUS_CUMULATIVE PERC_PARTIALLY PERC_FULLY PERC_1PLUS
0 2021-06-20 Chelsea/NoMad/West Chelsea Manhattan 10001 10001, 10118 All ages 27613.09 3017 21749 24766 10.93 78.76 89.69
1 2021-06-20 Chinatown/Lower East Side Manhattan 10002 10002 All ages 75322.71 5571 48728 54299 7.40 64.69 72.09
2 2021-06-20 East Village/Gramercy/Greenwich Village Manhattan 10003 10003 All ages 53977.81 3295 33687 36982 6.10 62.41 68.51
3 2021-06-20 Financial District Manhattan 10004 10004 All ages 2972.12 390 2880 3270 13.12 96.90 110.02
4 2021-06-20 Financial District Manhattan 10005 10005 All ages 8757.23 907 6202 7109 10.36 70.82 81.18

In order to plot this data on a map in Tableau, pgeocode was used to get latitude and longitude based on the zip code provided in the MODZCTA column.

import pgeocode

nomi = pgeocode.Nominatim('US')
# get latitude and longitude from zip code; the positional slice [-3:-1] pulls the
# latitude and longitude fields from the Series returned by query_postal_code
covid_vacc_zipcode[['LATITUDE', 'LONGITUDE']] = covid_vacc_zipcode['MODZCTA'].apply(lambda x: nomi.query_postal_code(str(x))[-3:-1])

covid_vacc_zipcode.head()
DATE NEIGHBORHOOD_NAME BOROUGH MODZCTA Label AGE_GROUP POP_DENOMINATOR COUNT_PARTIALLY_CUMULATIVE COUNT_FULLY_CUMULATIVE COUNT_1PLUS_CUMULATIVE PERC_PARTIALLY PERC_FULLY PERC_1PLUS LATITUDE LONGITUDE
0 2021-06-20 Chelsea/NoMad/West Chelsea Manhattan 10001 10001, 10118 All ages 27613.09 3017 21749 24766 10.93 78.76 89.69 40.7484 -73.9967
1 2021-06-20 Chinatown/Lower East Side Manhattan 10002 10002 All ages 75322.71 5571 48728 54299 7.40 64.69 72.09 40.7152 -73.9877
2 2021-06-20 East Village/Gramercy/Greenwich Village Manhattan 10003 10003 All ages 53977.81 3295 33687 36982 6.10 62.41 68.51 40.7313 -73.9892
3 2021-06-20 Financial District Manhattan 10004 10004 All ages 2972.12 390 2880 3270 13.12 96.90 110.02 40.7143 -74.0060
4 2021-06-20 Financial District Manhattan 10005 10005 All ages 8757.23 907 6202 7109 10.36 70.82 81.18 40.7056 -74.0083

Educational Attainment by Zipcode

Next we will import the "Educational Attainment by Zipcode" dataset and filter it down to the columns that matter for this project.

# get education by zipcodes
educational_attainment = pd.read_csv('people/ACSST5Y2019.S1501_data_with_overlays_2021-06-22T105606.csv', header=1)

# get cols we need for analysis
cols = ['id',
        'Geographic Area Name',
        'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!Less than high school graduate',
       'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!High school graduate (includes equivalency)',
       'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!Some college or associate\'s degree',
       'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 18 to 24 years!!Bachelor\'s degree or higher',
       'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!High school graduate (includes equivalency)',
       'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Some college, no degree',
       'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Associate\'s degree',
       'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Bachelor\'s degree',
       'Estimate!!Total!!AGE BY EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Graduate or professional degree']

# get filtered dataset
educational_attainment = educational_attainment[cols]

Now that we have all of the data we need for educational attainment, we need the total for each attainment level by zip code. We will filter the matching columns and sum across them for each row, writing a small helper function to make the filtering easier. We will also clean the zip code column to make merging the datasets easier.

# column filter
def filter_column(df, expressions):
    '''
    df - dataframe
    expressions - str, regex pattern of alternatives,
                e.g. 'Dog|dog|cat|Cat'
    '''
    return df.filter(regex=expressions)

# educational attainment
less_than_high_school = filter_column(educational_attainment, 'Less than').sum(axis=1)
high_school = filter_column(educational_attainment,'High').sum(axis=1)
some_college_associates = filter_column(educational_attainment, 'Some|Associate').sum(axis=1)
bachelors = filter_column(educational_attainment, 'Bachelor').sum(axis=1)
graduate_professional_deg = filter_column(educational_attainment, 'Graduate').sum(axis=1)

# create dataframe for educational attainment
degrees = pd.DataFrame({'raw_zip_code': educational_attainment['Geographic Area Name'],
                        'less_than_high_school': less_than_high_school,
                        'high_school': high_school,
                        'some_college_associates': some_college_associates,
                        'bachelors': bachelors,
                        'graduate_professional_deg': graduate_professional_deg})


# clean zip code column: keep the last five characters (the ZIP code itself) and cast to int
degrees.loc[:, 'raw_zip_code'] = degrees.apply(lambda row: row.raw_zip_code[-5:].strip(), axis=1).astype('int')

# merge covid dataset and education attainment on zip code
covid_vacc_zipcode_education = covid_vacc_zipcode.merge(degrees, how='inner', left_on='MODZCTA', right_on='raw_zip_code')
covid_vacc_zipcode_education.head()


# calculate educational attainment percentage by zipcode
# get percentage of educational attainment in neighborhood
covid_vacc_zipcode_education[['less_than_high_school %', 'high_school %', 'some_college_associates %', 'bachelors %', 'graduate_professional_deg %']] = covid_vacc_zipcode_education[['less_than_high_school', 'high_school', 'some_college_associates', 'bachelors', 'graduate_professional_deg']].div(covid_vacc_zipcode_education['POP_DENOMINATOR'], axis=0)
covid_vacc_zipcode_education.head()
DATE NEIGHBORHOOD_NAME BOROUGH MODZCTA Label AGE_GROUP POP_DENOMINATOR COUNT_PARTIALLY_CUMULATIVE COUNT_FULLY_CUMULATIVE COUNT_1PLUS_CUMULATIVE ... less_than_high_school high_school some_college_associates bachelors graduate_professional_deg less_than_high_school % high_school % some_college_associates % bachelors % graduate_professional_deg %
0 2021-06-20 Chelsea/NoMad/West Chelsea Manhattan 10001 10001, 10118 All ages 27613.09 3017 21749 24766 ... 96 3016 3113 8594 5443 0.003477 0.109224 0.112736 0.311229 0.197117
1 2021-06-20 Chinatown/Lower East Side Manhattan 10002 10002 All ages 75322.71 5571 48728 54299 ... 492 12522 10714 14666 6605 0.006532 0.166245 0.142241 0.194709 0.087689
2 2021-06-20 East Village/Gramercy/Greenwich Village Manhattan 10003 10003 All ages 53977.81 3295 33687 36982 ... 98 5316 11336 18406 13762 0.001816 0.098485 0.210012 0.340992 0.254957
3 2021-06-20 Financial District Manhattan 10004 10004 All ages 2972.12 390 2880 3270 ... 22 76 131 1357 1337 0.007402 0.025571 0.044076 0.456576 0.449847
4 2021-06-20 Financial District Manhattan 10005 10005 All ages 8757.23 907 6202 7109 ... 0 413 629 4071 2536 0.000000 0.047161 0.071826 0.464873 0.289589

5 rows × 26 columns

NYC Median Household Income

We will now grab the median household income data and clean it before merging it with our dataset.

# median income
median_income_nyc = pd.read_csv('people/Median Incomes.csv')

# rename data column to income
median_income_nyc.rename(columns={'Data':'Income'}, inplace=True)


# find where the zip code rows start: Locations that are zip codes begin with 'Z'
median_income_nyc[median_income_nyc['Location'].str[0]=='Z'].index[0]  # first zip code row is index 3900

# clean zip code: keep the last five characters of the Location string
median_income_nyc.loc[3900:, 'Location'] = median_income_nyc[median_income_nyc['Location'].str[0]=='Z'].apply(lambda zc: zc.Location[-5:].strip(), axis=1)

# keep only the zip-code-level rows
median_income_nyc = median_income_nyc.loc[3900:, :].copy()

# convert income column to float
median_income_nyc['Income'] = median_income_nyc['Income'].astype('float')

# convert object to int
median_income_nyc.loc[:, 'Location'] = median_income_nyc.loc[:, 'Location'].astype('int')

# filter for "All Households" and the most recent (2019) median household income,
# then merge the result with covid_vacc_zipcode_education
covid_vacc_zipcode_education_income = covid_vacc_zipcode_education.merge(median_income_nyc.loc[(median_income_nyc['Household Type']=='All Households') & (median_income_nyc['TimeFrame']==2019), :], how='inner', left_on='MODZCTA', right_on='Location')

covid_vacc_zipcode_education_income.head()
DATE NEIGHBORHOOD_NAME BOROUGH MODZCTA Label AGE_GROUP POP_DENOMINATOR COUNT_PARTIALLY_CUMULATIVE COUNT_FULLY_CUMULATIVE COUNT_1PLUS_CUMULATIVE ... high_school % some_college_associates % bachelors % graduate_professional_deg % Location Household Type TimeFrame DataFormat Income Fips
0 2021-06-20 Chelsea/NoMad/West Chelsea Manhattan 10001 10001, 10118 All ages 27613.09 3017 21749 24766 ... 0.109224 0.112736 0.311229 0.197117 10001 All Households 2019 Dollars 92840.0 10001
1 2021-06-20 Chinatown/Lower East Side Manhattan 10002 10002 All ages 75322.71 5571 48728 54299 ... 0.166245 0.142241 0.194709 0.087689 10002 All Households 2019 Dollars 36982.0 10002
2 2021-06-20 East Village/Gramercy/Greenwich Village Manhattan 10003 10003 All ages 53977.81 3295 33687 36982 ... 0.098485 0.210012 0.340992 0.254957 10003 All Households 2019 Dollars 118161.0 10003
3 2021-06-20 Financial District Manhattan 10004 10004 All ages 2972.12 390 2880 3270 ... 0.025571 0.044076 0.456576 0.449847 10004 All Households 2019 Dollars 190223.0 10004
4 2021-06-20 Financial District Manhattan 10005 10005 All ages 8757.23 907 6202 7109 ... 0.047161 0.071826 0.464873 0.289589 10005 All Households 2019 Dollars 189702.0 10005

5 rows × 32 columns

COVID-19 Deaths data

Now we will import the COVID-19 death count dataset and merge it with our covid_vacc_zipcode_education_income dataset.

covid_death = pd.read_csv('coronavirus-data/totals/data-by-modzcta.csv')

covid_death.head()
MODIFIED_ZCTA NEIGHBORHOOD_NAME BOROUGH_GROUP label lat lon COVID_CASE_COUNT COVID_CASE_RATE POP_DENOMINATOR COVID_DEATH_COUNT COVID_DEATH_RATE PERCENT_POSITIVE TOTAL_COVID_TESTS
0 10001 Chelsea/NoMad/West Chelsea Manhattan 10001, 10118 40.750693 -73.997137 1631 5906.62 27613.09 35 126.75 7.15 22799
1 10002 Chinatown/Lower East Side Manhattan 10002 40.715781 -73.986176 6118 8122.38 75322.71 289 383.68 11.86 51736
2 10003 East Village/Gramercy/Greenwich Village Manhattan 10003 40.731825 -73.989164 2934 5435.57 53977.81 49 90.78 6.76 43349
3 10004 Financial District Manhattan 10004 40.703675 -74.013106 254 8546.10 2972.12 2 67.29 6.55 3896
4 10005 Financial District Manhattan 10005 40.706092 -74.008861 431 4921.65 8757.23 0 0.00 6.34 6738

Clean dataset

# convert to int
covid_death['MODIFIED_ZCTA'] = covid_death['MODIFIED_ZCTA'].astype('int')

# merge datasets
covid_vacc_zipcode_education_income_deaths = covid_vacc_zipcode_education_income.merge(covid_death, how='inner', left_on='MODZCTA', right_on='MODIFIED_ZCTA')

# this dataset contains lat and longitude so we could've just used that for tableau
covid_vacc_zipcode_education_income_deaths.head()
DATE NEIGHBORHOOD_NAME_x BOROUGH MODZCTA Label AGE_GROUP POP_DENOMINATOR_x COUNT_PARTIALLY_CUMULATIVE COUNT_FULLY_CUMULATIVE COUNT_1PLUS_CUMULATIVE ... label lat lon COVID_CASE_COUNT COVID_CASE_RATE POP_DENOMINATOR_y COVID_DEATH_COUNT COVID_DEATH_RATE PERCENT_POSITIVE TOTAL_COVID_TESTS
0 2021-06-20 Chelsea/NoMad/West Chelsea Manhattan 10001 10001, 10118 All ages 27613.09 3017 21749 24766 ... 10001, 10118 40.750693 -73.997137 1631 5906.62 27613.09 35 126.75 7.15 22799
1 2021-06-20 Chinatown/Lower East Side Manhattan 10002 10002 All ages 75322.71 5571 48728 54299 ... 10002 40.715781 -73.986176 6118 8122.38 75322.71 289 383.68 11.86 51736
2 2021-06-20 East Village/Gramercy/Greenwich Village Manhattan 10003 10003 All ages 53977.81 3295 33687 36982 ... 10003 40.731825 -73.989164 2934 5435.57 53977.81 49 90.78 6.76 43349
3 2021-06-20 Financial District Manhattan 10004 10004 All ages 2972.12 390 2880 3270 ... 10004 40.703675 -74.013106 254 8546.10 2972.12 2 67.29 6.55 3896
4 2021-06-20 Financial District Manhattan 10005 10005 All ages 8757.23 907 6202 7109 ... 10005 40.706092 -74.008861 431 4921.65 8757.23 0 0.00 6.34 6738

5 rows × 45 columns

Check for missing data

# fortunately, we have no missing datapoints
covid_vacc_zipcode_education_income_deaths.isna().sum()
DATE                           0
NEIGHBORHOOD_NAME_x            0
BOROUGH                        0
MODZCTA                        0
Label                          0
AGE_GROUP                      0
POP_DENOMINATOR_x              0
COUNT_PARTIALLY_CUMULATIVE     0
COUNT_FULLY_CUMULATIVE         0
COUNT_1PLUS_CUMULATIVE         0
PERC_PARTIALLY                 0
PERC_FULLY                     0
PERC_1PLUS                     0
LATITUDE                       0
LONGITUDE                      0
raw_zip_code                   0
less_than_high_school          0
high_school                    0
some_college_associates        0
bachelors                      0
graduate_professional_deg      0
less_than_high_school %        0
high_school %                  0
some_college_associates %      0
bachelors %                    0
graduate_professional_deg %    0
Location                       0
Household Type                 0
TimeFrame                      0
DataFormat                     0
Income                         0
Fips                           0
MODIFIED_ZCTA                  0
NEIGHBORHOOD_NAME_y            0
BOROUGH_GROUP                  0
label                          0
lat                            0
lon                            0
COVID_CASE_COUNT               0
COVID_CASE_RATE                0
POP_DENOMINATOR_y              0
COVID_DEATH_COUNT              0
COVID_DEATH_RATE               0
PERCENT_POSITIVE               0
TOTAL_COVID_TESTS              0
dtype: int64

Results

Income vs. COVID 19 Fully Vaccinated

This plot shows the correlation between median household income and the percentage of individuals that are vaccinated in NYC neighborhoods.

plt.figure(figsize=(20,15))
scatter = plt.scatter(covid_vacc_zipcode_education_income_deaths['PERC_FULLY'], covid_vacc_zipcode_education_income_deaths['Income'], c=covid_vacc_zipcode_education_income_deaths['Income'], s=covid_vacc_zipcode_education_income_deaths['COVID_DEATH_COUNT'])
plt.colorbar()
plt.xlabel('Fully Vacc %')
plt.ylabel('Income')
plt.title('Income vs. COVID 19 Fully Vaccinated')

handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)
legend2 = plt.legend(handles, labels, loc="upper left", title="COVID-19 Death Count")
plt.show()

Income vs. COVID 19 1 Dosage Plus

Correlation between median household income and the percentage of individuals that have received at least one vaccination shot in NYC neighborhoods.

plt.figure(figsize=(20,15))
scatter = plt.scatter(covid_vacc_zipcode_education_income_deaths['PERC_1PLUS'], covid_vacc_zipcode_education_income_deaths['Income'], c=covid_vacc_zipcode_education_income_deaths['Income'], s=covid_vacc_zipcode_education_income_deaths['COVID_DEATH_COUNT'])
plt.colorbar()
plt.xlabel('1 Plus Vacc %')
plt.ylabel('Income')
plt.title('Income vs. COVID 19 1 Plus')

handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)

legend2 = plt.legend(handles, labels, loc="upper left", title="COVID-19 Death Count")
plt.show()

Income vs. COVID 19 Death Rate

Correlation between median household income and the COVID 19 death rate in NYC neighborhoods.

plt.figure(figsize=(10,10))
scatter = plt.scatter(covid_vacc_zipcode_education_income_deaths['COVID_DEATH_RATE'], covid_vacc_zipcode_education_income_deaths['Income'], c=covid_vacc_zipcode_education_income_deaths['Income'], s=covid_vacc_zipcode_education_income_deaths['PERC_FULLY'])
plt.colorbar()
plt.xlabel('Death Rate')
plt.ylabel('Income')
plt.title('Income vs. COVID 19 Death Rate')

handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)
legend2 = plt.legend(handles, labels, loc="upper right", title="COVID-19 FULLY VACC %")
plt.show()

Plot points on map

Areas that were hit hardest by COVID-19 are less likely to be vaccinated.

import geopandas as gpd
from geopandas import GeoDataFrame
from shapely.geometry import Point

nyc_neighborhood = gpd.read_file('Neighborhood Tabulation Areas (NTA)/geo_export_ab6fc730-beb8-4922-8a7a-24de1a3fdc50.shp')

geometry = [Point(xy) for xy in zip(covid_vacc_zipcode_education_income_deaths['LONGITUDE'], covid_vacc_zipcode_education_income_deaths['LATITUDE'])]
gdf = GeoDataFrame(covid_vacc_zipcode_education_income_deaths, geometry=geometry)

import matplotlib as mpl
cmap = mpl.colors.ListedColormap(['red', 'orange',
                                  'green'])

import contextily as ctx


ax = nyc_neighborhood.boundary.plot(figsize=(20,15), color='grey')

plt.axis('off')
plt.title('COVID-19 VACCINATION NYC', fontsize=20)

# plot scatter again on shapefile to get size legend
scatter = plt.scatter('LONGITUDE', 'LATITUDE', data=covid_vacc_zipcode_education_income_deaths, s='COVID_DEATH_COUNT', c='PERC_FULLY',cmap=cmap)
plt.legend(*scatter.legend_elements("sizes", num=6, alpha=0.6), title='COVID-19 DEATH COUNT')

plt.colorbar(cmap=cmap, boundaries=covid_vacc_zipcode_education_income_deaths['PERC_FULLY'].sort_values(ascending=True).to_list())

ctx.add_basemap(ax=ax, crs='epsg:4326')

Here is a link to these plots in a Tableau dashboard:

Regression

We are going to use different regression models, both linear and non-linear, to see if we can predict the number of people who will be fully vaccinated.

Functions

from sklearn.model_selection import train_test_split
import re
# create functions for training regression models
# split into train and test datasets
def modelBuilder(X, y, seed=7, test_size=0.1):
    '''returns X_train, y_train, X_test, y_test'''
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

    return X_train, y_train, X_test, y_test

def modelsTrainer(X_train, y_train, X_test, y_test, models):
    '''Fits each model and returns its training score, testing score, and the gap between them.'''
    # create an empty list for results
    results = []
    # loop over models
    for model in models:
        model.fit(X_train, y_train)

        # training score (a low value signals high bias / underfitting)
        training_score = model.score(X_train, y_train)

        # testing score (a low value relative to training signals high variance / overfitting)
        testing_score = model.score(X_test, y_test)

        # difference between training and testing score
        delta = training_score - testing_score

        # get raw model name
        model_name = re.search('[a-zA-Z]+', str(model))[0]
        # append scores to results
        results.append({model_name: [training_score, testing_score, delta]})

    return results

Set up train and test datasets

# import all of the models we want to test
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Lasso, Ridge
from xgboost import XGBRegressor

X = covid_vacc_zipcode_education_income_deaths[['POP_DENOMINATOR_x', 'Income','bachelors', 'COVID_DEATH_COUNT']]
y = covid_vacc_zipcode_education_income_deaths['COUNT_FULLY_CUMULATIVE']


X_train, y_train, X_test, y_test = modelBuilder(X, y)

Before we choose a model we have to deal with the bias-variance tradeoff. When we select a model we introduce a bias. If the model has high bias, there is a chance it will underfit our data; if it has high variance, there is a chance it will overfit. Ideally, we want a balance between bias and variance. We will fit each model to our data, get the training and testing scores, and then use the difference between the two to decide which model to choose.

# models to train
models = [LinearRegression(), RandomForestRegressor(), ElasticNet(), GradientBoostingRegressor(), XGBRegressor(verbosity=0), Lasso(), Ridge()]

# fit models
# returns training score, testing score and difference between the two
modelsTrainer(X_train, y_train, X_test, y_test, models)
[{'LinearRegression': [0.8349011295818145,
   0.805264936984036,
   0.0296361925977785]},
 {'RandomForestRegressor': [0.9393786736093015,
   0.7816819810432987,
   0.15769669256600283]},
 {'ElasticNet': [0.8349011295434671,
   0.8052659443078068,
   0.029635185235660222]},
 {'GradientBoostingRegressor': [0.9854864931948837,
   0.8559366915021283,
   0.12954980169275543]},
 {'XGBRegressor': [0.9781925378371675,
   0.8717985612917483,
   0.10639397654541927]},
 {'Lasso': [0.8349011295798674,
   0.8052651639636368,
   0.02963596561623061]},
 {'Ridge': [0.8349011295818097,
   0.805264948229599,
   0.02963618135221069]}]
from sklearn.model_selection import cross_val_score, KFold

cv = KFold(n_splits=10)

# we can use cross-validation to confirm that our linear models have low variance and generalize well
for model in models:
  cross_val2 = cross_val_score(model, X_train, y_train, cv=cv)
  # the linear models have a low standard deviation, meaning their variance is low
  # we can conclude our linear model(s) will do well at generalizing to new data
  print(re.search('[a-zA-Z]+', str(model))[0], '-', 'mean:', cross_val2.mean(), 'std:', cross_val2.std())
LinearRegression - mean: 0.7988492776333851 std: 0.09774188277510705
RandomForestRegressor - mean: 0.7321956559173095 std: 0.16819931769007415
ElasticNet - mean: 0.7988515079441996 std: 0.09774302215729834
GradientBoostingRegressor - mean: 0.6875606157113585 std: 0.23012268956944482
XGBRegressor - mean: 0.7207995095977946 std: 0.19388589251272895
Lasso - mean: 0.7988493286692742 std: 0.09774221635911819
Ridge - mean: 0.7988493084839189 std: 0.09774189636351902

lasso = Lasso()

lasso.fit(X_train, y_train)

# points are randomly dispersed around the horizontal axis
# a linear model is our best route
from yellowbrick.regressor import ResidualsPlot
visualizer = ResidualsPlot(lasso, hist=False, qqplot=True)
visualizer.fit(X_train, y_train)  # Fit the training data
visualizer.score(X_test, y_test)  # Evaluate the model for variance
visualizer.show()

We can see that the linear models generalized better than the non-linear models: their cross-validation scores have higher means and much lower standard deviations.

These linear models are: Linear Regression, Elastic Net, Lasso & Ridge.

I went with Lasso Regression for a number of reasons:

  • Regularization prevents overfitting, which keeps our model from having high variance. The L1 penalty does this by penalizing the absolute value of the weights.

  • Lasso is robust, meaning it is resistant to outliers. This is helpful because NYC data tend to have outliers especially when it comes to income, population, etc.

  • Sparsity - Lasso regression has built-in feature selection: coefficients of uninformative features are shrunk all the way to zero. This eliminates the need to drop coefficients whose p-value is more than 0.05, or whichever significance level we set before building our regression model. The sketch below shows how to inspect which features survive.
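
To see this sparsity in practice, we can print the fitted coefficients; anything Lasso shrank to exactly zero was selected out. A minimal sketch, reusing the lasso object fitted above:

# print each feature's Lasso coefficient; zeroed coefficients were dropped by the L1 penalty
for name, coef in zip(X_train.columns, lasso.coef_):
    print(f'{name}: {coef:.4f}' + ('  (dropped)' if coef == 0 else ''))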

Although there are some pros to using Lasso regression, here are some cons:

  • It has no closed-form solution, since the L1 penalty takes the absolute value of the coefficients. The absolute value creates a sharp corner in our loss function, making it non-differentiable at 0. As a result, we cannot simply solve for the minimum or use plain Gradient Descent, and scikit-learn instead minimizes the loss function with coordinate descent, which can be computationally more expensive down the road. A toy sketch of the idea follows.
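
As a rough illustration (a toy sketch, not scikit-learn's actual implementation), coordinate descent minimizes the Lasso loss one weight at a time; the non-differentiable kink at zero shows up as the max(...) in the soft-thresholding step:

import numpy as np

def soft_threshold(rho, alpha):
    # closed-form minimizer of the one-dimensional Lasso subproblem;
    # the kink at zero is exactly where the absolute value is non-differentiable
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    '''Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1 one coordinate at a time.'''
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            # partial residual with feature j's current contribution removed
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j / n      # correlation of feature j with the residual
            z = X[:, j] @ X[:, j] / n    # normalizer for feature j
            w[j] = soft_threshold(rho, alpha) / z
    return w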

Model Evaluation/Tuning

from sklearn.metrics import mean_squared_error


mse = []
for alpha in [0, 0.0001, 0.001, 0.01, 0.1, 1]:
  model = Lasso(alpha=alpha)
  model.fit(X_train, y_train)
  mse.append(mean_squared_error(y_test, model.predict(X_test)))
# lowest mean squared error is when alpha is 1
plt.plot([0, 0.0001, 0.001, 0.01, 0.1, 1], mse)
from sklearn.model_selection import GridSearchCV
# use grid search to tune our alpha
param_grids = {'alpha': [0, 0.0001, 0.001, 0.01, 0.1, 1]}

# setup grid search
grid_search_lasso = GridSearchCV(estimator=lasso, param_grid=param_grids, cv=cv)

grid_search_lasso.fit(X_train, y_train)

grid_search_lasso.best_score_
0.8001171769697994
# best alpha is indeed 1
grid_search_lasso.best_params_
{'alpha': 1}

Although our model is robust, let's see if scaling our data can improve the accuracy.

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# make pipeline
pipe = make_pipeline(StandardScaler(), GridSearchCV(estimator=lasso, param_grid=param_grids, cv=cv))
pipe.fit(X_train, y_train)

# As you can see our model barely improved
pipe.score(X_test, y_test)
0.8053618445370625

Final model

\( y = 0.26838108\,x_1 + 0.00910082\,x_2 + 0.99014931\,x_3 + 7.89187482\,x_4 + 156.9164 \)

where \( x_1, x_2, x_3, x_4 \) are POP_DENOMINATOR_x, Income, bachelors, and COVID_DEATH_COUNT, in the order the features were passed to the model.
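
As a quick sanity check, here is a minimal sketch of applying the equation by hand to one neighborhood (assuming the coefficients line up with the unscaled feature columns, as above):

# apply the final linear model by hand (coefficients copied from the equation above)
coefs = [0.26838108, 0.00910082, 0.99014931, 7.89187482]
intercept = 156.9164
features = ['POP_DENOMINATOR_x', 'Income', 'bachelors', 'COVID_DEATH_COUNT']

row = covid_vacc_zipcode_education_income_deaths.iloc[0]
y_hat = intercept + sum(c * row[f] for c, f in zip(coefs, features))
print(f'predicted fully vaccinated count: {y_hat:.0f}')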

I am confident that this model will do fairly well on new data, with at least 80% accuracy. I will save my model through a process called serialization and keep it for later use. I can then take this model, build an application around it, and deploy it into production. Before I can do that, though, I would need to write some unit tests and logging that alerts me whenever my model begins to perform poorly; my threshold will be 75%. At that point I would need to retrain my model on more data; a sketch of such a check follows the save step below. Once this is done I will probably dockerize the application and deploy it using AWS.

# import joblib directly (sklearn.externals.joblib is deprecated)
import joblib

# save model for later use
joblib.dump(lasso, 'lasso_grid_search.pkl')
['lasso_grid_search.pkl']
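
A minimal sketch of the monitoring check described above, assuming the pickle path from the previous step and a held-out test set:

# reload the serialized model and alert when its score drops below the 75% threshold
model = joblib.load('lasso_grid_search.pkl')
score = model.score(X_test, y_test)
if score < 0.75:
    print(f'ALERT: model score fell to {score:.3f} -- retrain on more data')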

Conclusion

Low COVID-19 vaccination rates remain a problem in the low-income areas that were hit hardest by this pandemic. Affluent areas were barely affected by COVID-19 and tend to have some of the highest vaccination rates. There needs to be an increased effort to get more people in low-income neighborhoods vaccinated. There is a lot of mistrust between the medical community and low-income communities, and convincing people from these communities to take the vaccine starts with trust and transparency. This is one of the many steps we must take as New Yorkers to get back to normal.


References

  • https://data.cccnewyork.org/data/table/66/median-incomes#66/107/62/a/a
  • https://www1.nyc.gov/site/doh/covid/covid-19-data-vaccines.page#byzip
  • https://github.com/nychealth/covid-vaccine-data
  • https://data.census.gov/cedsci/table?q=educational%20attainment&t=Educational%20Attainment&g=0400000US36.860000&tid=ACSST5Y2019.S1501
  • https://github.com/nychealth/coronavirus-data