3  Results

3.1 Part 1: Study which segments of the population exhibited greatest COVID vaccine hesitancy following the deployment of mass vaccinations in the US in early 2021.

The national average COVID hesitance rate reported during the first quarter of the data collection period (April - July 2021) was ~18%. In other words, 18% of those surveyed after the mass deployment of vaccinations responded that they would probably or definitely not get vaccinated. In Part 1, we will break down this percentage across U.S. geographies, identifying the states with the highest and lowest hesitancy rates. We will then analyze the types of population demographics (e.g., age, vaccination history, ethnicity, etc.) that had a strong association with COVID vaccine hesitancy.

Our analysis of variations in vaccine hesitance begins by comparing hesitancy across states during the April-July 21’ timeframe. As illustrated in the map below, vaccine hesitance varied significantly by geography. States such as Hawaii, New York, and Vermont exhibited the lowest hesitance, while other states like Wyoming, North Dakota, and West Virginia exhibited the highest hesitance. It is worth noting that there are clusters of high hesitance, namely the South East and Northern Mountain West regions, whereas low hesitance was more prominent in the North East and South West regions.

Code
# Import libraries

# install.packages("choroplethr")
# install.packages("RColorBrewer")
# install.packages("GGally")
# install.packages("devtools")
# install.packages("remotes")
# install.packages("choroplethrMaps")
# devtools::install_github("timelyportfolio/parcoords")
# install.packages("ggpmisc")

library(ggpmisc)
library(tidyverse)
library(data.table)
library(vcd)
library(choroplethrMaps)
library(GGally)
library(parcoords)
library(RColorBrewer)
library(choroplethr)
library(lubridate)
library(stringr)

# Import starting dataset and filter for vaccine hesitance indicator and monthly samples


# Load vaccine hesitancy data
covid <- fread("../covid_dataset/covid.csv")

# Filter for Monthly sample period and vaccine hesitance responses
covid_hes <- covid |>
  filter(`Time Type` == "Monthly",
         `Indicator Category` == "Probably or definitely will not get vaccinated")

# Prep data for analysis. The following chunk truncates the names of population segments for data visualization purposes

# Create vector of shortened group names
group_name_shortening <- c("Age" = "Age",
                           "Age by race/ethnicity" = "Age.Race",
                           "All adults 18+" = "18+ Adults",
                           "Born in the U.S." = "US.Born",
                           "Concern about getting COVID-19 disease" = "Covid.Concern",
                           "Confidence in COVID-19 vaccine safety" = "Vaccine.Confiden",
                           "Confidence that COVID-19 vaccine is important" = "Vaccine.Imprtnt",
                           "COVID-19 vaccination status of friends and family" = "Fr.Fam.Vaccd",
                           "COVID-19 vaccine information seeking frequency in past month" = "Vacc.Info.Seek",
                           "Disability status (any)" = "Disability",
                           "Essential worker group" = "Ess.Wrkr.grp",
                           "Ever had COVID-19 disease (self-report)" = "Past.Covid",
                           "Gender identity" = "G.Identity",
                           "Have a regular physician or provider for primary care" = "Prim.Care.Providr.Status",
                           "Health condition associated with higher risk for COVID-19 (any)" = "Comorbidity",
                           "Healthcare provider recommended I get a COVID-19 booster" = "Dr.Recommend.Boostr",
                           "Healthcare provider recommended I get a COVID-19 vaccine" = "Dr.Recommend.Vacc",
                           "Insurance status" = "Insurance",
                           "Language of interview" = "Language",
                           "Mask-wearing frequency" = "Mask.Use",
                           "Mental health status" = "Mental.Health",
                           "Metropolitan statistical area" = "Rural.Urban",
                            "Political leaning of county of residence" = "Political",
                           "Poverty status" = "Pov.Status",
                           "Pregnancy status (females age 18 – 49 years)" = "Pregnancy",
                           "Race/Ethnicity (4 level)" = "Race.4lvl",
                           "Race/Ethnicity (7 level)" = "Race.7lvl",
                           "Received a flu vaccination" = "Flu.Vacc",
                           "Received non-COVID-19 vaccine(s) in past two years" = "Past.Vaccin",
                           "Self-perceived equity in seeking healthcare compared to other races/ethnicities" = "Racial.Equity",
                           "Sex" = "Sex",
                           "Sexual orientation" = "S.Orientation",
                           "Social Vulnerability Index (SVI) of county of residence" = "SVI",
                           "Speaks a language other than English at home" = "Non.Englsh.Language",
                           "Work or school requires COVID-19 vaccination" = "Work.School")

# Construct dataframe from vector
group_name_shortened <- data.frame(`Group Name` = names(group_name_shortening), shortened_group = group_name_shortening, stringsAsFactors = FALSE)|>
  dplyr::rename(`Group Name` = Group.Name)

# Create vector of shortened group category names
group_cat_shortening <- c("18 – 49 years" = "18 – 49",
                          "50 – 64 years" = "50 – 64",
                          "65+ years" = "65+",
                          "18 – 49 years, Black, non-Hispanic" = "18 – 49, Black",
                          "18 – 49 years, Hispanic" = "18 – 49, Hispanic",
                          "18 – 49 years, Other or multiple races, non-Hispanic" = "18 – 49, Other",
                          "18 – 49 years, White, non-Hispanic" = "18 – 49, White",
                          "50 – 64 years, Black, non-Hispanic" = "50 – 64, Black",
                          "50 – 64 years, Hispanic" = "50 – 64, Hispanic",
                          "50 – 64 years, Other or multiple races, non-Hispanic" = "50 – 64, Other",
                          "50 – 64 years, White, non-Hispanic" = "50 – 64, White",
                          "65+ years, Black, non-Hispanic" = "65+, Black",
                          "65+ years, Hispanic" = "65+, Hispanic",
                          "65+ years, Other or multiple races, non-Hispanic" = "65+, Other",
                          "65+ years, White, non-Hispanic" = "65+, White",
                          "Born in the U.S." = "US Born",
                          "Not born in the U.S." = "Not US Born",
                          "Many or almost all family or friends have received vaccine" = "Many",
                          "Some or no family or friends have been vaccinated" = "No_some","A little or not at all concerned" = "A little_not",
                          "Very or moderately concerned" = "Very_Moderately",
                          "Somewhat or not at all safe for me" = "Not at all_Somewhat",
                          "Very or completely safe for me" = "Very",
                          "A little or not at all important to protect me from COVID-19" = "A little_not at all",
                          "Somewhat or very important to protect me from COVID-19" = "Somewhat_very",
                          "No" = "No",
                          "Yes" = "Yes",
                          "Cisgender" = "Cisgender",
                          "Don't know/refused" = "Dont know",
                          "Transgender/Non-binary" = "TG/Non bin",
                          "Insured" = "Insured",
                          "Not insured" = "Not insured",
                          "English" = "English",
                          "Other" = "Other",
                          "Spanish" = "Spanish",
                          "Rural" = "Rural",
                          "Suburban" = "Suburban",
                          "Urban" = "Urban",
                          "Democrat-leaning" = "Democrat",
                          "Not Democrat-leaning or Republican-leaning" = "None",
                          "Republican-leaning" = "Republican",
                          "Above poverty, income <$75k" = "<$75k",
                          "Above poverty, income >=$75k" = ">=$75k",
                          "Below poverty" = "Below poverty",
                          "Unknown income" = "Unknown",
                          "Breastfeeding" = "Breastfeeding",
                          "Not pregnant, trying to get pregnant, or breastfeeding" = "Not pregnant",
                          "Pregnant" = "Pregnant",
                          "Trying to get pregnant" = "Trying to get pregnant",
                          "American Indian/Alaska Native, non-Hispanic" = "Native",
                          "Asian, non-Hispanic" = "Asian",
                          "Black, non-Hispanic" = "Black",
                          "Hispanic" = "Hispanic",
                          "Native Hawaiian/Pacific Islander, non-Hispanic" = "Pacific",
                          "Other or multiple races, non-Hispanic" = "Other",
                          "White, non-Hispanic" = "White",
                          "Female" = "Female",
                          "Male" = "Male",
                          "Gay/Lesbian/Bisexual/Other" = "LGB",
                          "Heterosexual/Straight" = "Hetero",
                          "High SVI" = "High",
                          "Low SVI" = "Low",
                          "Moderate SVI" = "Moderate",
                          "SVI unknown" = "Unknown",
                          "Not applicable/unemployed" = "Unemployed",
                          "All adults age 18+ years" = "18+ Adults",
                          "Essential healthcare" = "Ess.Healthcare",
                          "Other essential worker" = "Ess.Other",
                          "Other frontline worker" = "Frontline.Other",
                          "Persons who are not essential workers" = "Not.Ess.Wrkr",
                          "School and childcare" = "School.Childcare",
                          "Better than other races or ethnicities" = "Better",
                          "Same as other races or ethnicities" = "Equal",
                          "Worse than other races or ethnicities" = "Worse",
                          "Did not receive flu vaccination" = "No",
                          "Received a flu vaccination" = "Yes",
                          "Political leaning unknown" = "Unknown",
                          "Often or sometimes" = "Often.Sometimes",
                          "Rarely or never" = "Rarely.Never",
                          "Always or often wears mask" = "Always.Often",
                          "Sometimes, rarely, or never wears mask" = "Occasional.Never",
                          "Excellent, very good, or good" = "Excellent.Good",
                          "Fair or poor" = "Fair.Poor")

# Construct dataframe from vector
group_cat_shortened <- data.frame(`Group Category` = names(group_cat_shortening), shortened_cat = group_cat_shortening, stringsAsFactors = FALSE) |>
  dplyr::rename(`Group Category` = Group.Category)

# Join shortened group names and categories to master table
covid_hes_prepped <- left_join(left_join(covid_hes, group_name_shortened,
                                         by = "Group Name"),
                                         group_cat_shortened, by = "Group Category")


# Create combinations of group names and categories via paste function
# Include string wrapped version of text for visualization purposes
covid_hes_prepped <- covid_hes_prepped |>
  mutate(concat = paste(shortened_group, shortened_cat, sep = " - "),
         concat_wrap = str_wrap(concat, width = 10))


# Store values of months that will be combined to analyze hesitancy when vaccines were initially released in 2021
periods_to_combine_2021 <- c("April 22 - May 29","May 30 - June 26","June 27 - July 31")

# Calculate overall nationwide hesitancy
national_hesitance_rate <- covid_hes_prepped |>
  filter(`Year` == 2021,
         `Time Period` %in% periods_to_combine_2021,
         `Group Category`== "All adults age 18+ years",
         Geography == "National") |>
  mutate(n_hesitant = round(`Estimate (%)`/100 * `Sample Size`,0)) |>
  group_by(Geography) |>
  summarise(`Sample Size` = sum(`Sample Size`), n_hesitant = sum(n_hesitant)) |>
  mutate(`Estimate (%)` = n_hesitant / `Sample Size` * 100,
         `Time Period` = "2021 analysis period",
         Year = 2021) 


# Calculate early 2021 hesitance by state
covid_2021_analysis_allstates <- covid_hes_prepped |>
  filter(`Year` == 2021,
         `Time Period` %in% periods_to_combine_2021,
         `Group Category`== "All adults age 18+ years",
         `Geography Type` %in% c("Jurisdictional Estimates"),
         !grepl("-", Geography)) |>
  mutate(n_hesitant = round(`Estimate (%)`/100 * `Sample Size`,0)) |>
  group_by(Geography) |>
  summarise(`Sample Size` = sum(`Sample Size`),
            n_hesitant = sum(n_hesitant)) |>
  mutate(`Estimate (%)` = n_hesitant / `Sample Size` * 100,
         `Time Period` = "2021 analysis period",
         Year = 2021)

# Prep dataframe to be used in the state_choropleth function
covid_2021_analysis_allstates_map <- covid_2021_analysis_allstates |>
  select(Geography, `Estimate (%)`) |>
  transmute(region = tolower(Geography), value = `Estimate (%)`)


# Display state heatmap of hesitance rates
state_choropleth(covid_2021_analysis_allstates_map,
                 title = "   April-July 2021 COVID-19 Vaccine Hesitance Rates by State",
                 legend = "Hesitancy Rates (%) ")

While hesitance rates by state appear to be normally distributed, with no clear outliers per the histogram and boxplot below, the aforementioned variation in hesitance between states like Wyoming and New York can still be used as a starting point for where to prioritize vaccination campaign efforts in the event of a future COVID-19 outbreak.

Code
# Plot histogram of state hesitance rates
ggplot(covid_2021_analysis_allstates, aes(x = `Estimate (%)`)) +
  geom_histogram(bins = 8, fill = "blue", color = "black", alpha = 0.7) +
  geom_boxplot() +
  labs(title = "Distribution of Statewide Hesitance Rates", x = "Hesitance %", y = "Number of states") +
  theme_bw()

Code
# Search for outliers
n_outliers <- covid_2021_analysis_allstates |>
  mutate(lower_bound = quantile(`Estimate (%)`, 0.25) - 1.5*IQR(`Estimate (%)`),
         upper_bound = quantile(`Estimate (%)`, 0.75) + 1.5*IQR(`Estimate (%)`)) |>
  filter(`Estimate (%)` < lower_bound & `Estimate (%)` > upper_bound)

This finding brings us to our next layer of analysis: do population segments within each state have similar vaccine hesitance rates? Before diving into each state further, we first identified the population segments with relatively higher nationwide vaccine hesitance rates from April-July 2021. Across every state, it can be seen that the 10 segments displayed below demonstrated the highest hesitance.

Code
# Create vector of all group names as represented in the data
group_name_toloop <- c("Sex","Age","Race/Ethnicity (7 level)","Age by race/ethnicity","Sexual orientation","Gender identity","Metropolitan statistical area","Born in the U.S.", "Language of interview","Poverty status", "Insurance status","Social Vulnerability Index (SVI) of county of residence", "Political leaning of county of residence", "Received non-COVID-19 vaccine(s) in past two years", "Health condition associated with higher risk for COVID-19 (any)","Disability status (any)", "Pregnancy status (females age 18 – 49 years)", "Ever had COVID-19 disease (self-report)","Concern about getting COVID-19 disease", "Confidence in COVID-19 vaccine safety",  "Confidence that COVID-19 vaccine is important", "Healthcare provider recommended I get a COVID-19 vaccine",  "COVID-19 vaccination status of friends and family","Work or school requires COVID-19 vaccination" )


group_name_toloop_short <- lapply(group_name_toloop, function(x) ifelse(x %in% names(group_name_shortening), group_name_shortening[[x]], x))
  

# Calculate vaccine hesitance by population segment
covid_2021_analysis <- covid |>
  filter(`Year` == 2021, `Time Period` %in% periods_to_combine_2021,
         `Group Name` %in% group_name_toloop,
         `Geography Type` %in% c("Jurisdictional Estimates",
                                 "National Estimates"),
         `Indicator Category` == "Probably or definitely will not get vaccinated",
         !grepl("-", Geography)) |>
  mutate(n_hesitant = round(`Estimate (%)`/100 * `Sample Size`,0)) |>
  group_by(pick(c(1:6))) |>
  summarise(`Sample Size` = sum(`Sample Size`),
            n_hesitant = sum(n_hesitant)) |>
  filter(Geography == "National") |>
  mutate(`Estimate (%)` = n_hesitant / `Sample Size` * 100,
         `Time Period` = "2021 analysis period",Year = 2021)


# Join shortened group names and categories to master table
covid_2021_analysis <- left_join(left_join(covid_2021_analysis, group_name_shortened,
                                         by = "Group Name"),
                                         group_cat_shortened, by = "Group Category")


# Create combinations of group names and categories via paste function
# Include string wrapped version of text for visualization purposes
covid_2021_analysis <- covid_2021_analysis |>
  mutate(concat = paste(shortened_group, shortened_cat, sep = " - "),
         concat_wrap = str_wrap(concat, width = 10),
         `Group Name` = shortened_group,
         `Group Category` = shortened_cat) |>
  select(-c(shortened_group, shortened_cat))


# Visualize the 10 categories with the highest hesitance rates
covid_2021_analysis |> ungroup() |> top_n(10, `Estimate (%)`) |>
  ggplot(aes(x = fct_reorder(concat_wrap, -(`Estimate (%)`)),
             y = `Estimate (%)`)) +
  geom_bar(stat = "identity", fill = "darkgreen", color = "black") +
  labs(title = "Top 10 categories with the highest hesitance in 2021",
       x = "Group categories", y = "2021 hesitance (%)") +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1,size=6),
        axis.text.y = element_text(size = 8),
        axis.title = element_text(size = 8),
        plot.title = element_text(size = 12))

Code
# Store the categories with the highest hesitance rates in a single-column dataframe
top_10_highest_hesitance_cat_list <- covid_2021_analysis |>
  ungroup() |>
  top_n(10, `Estimate (%)`) |> select(concat)

Each segment represents a group of individuals who participated in the survey and who aligns to a certain type of demographic, such as an ethnicity or a group individuals who share a certain opinion about vaccines. For instance, “Vaccine.Imprtnt - A little_not at all” represents individuals who do not believe vaccines are essential to healthcare, whereas “Fr.Fam.Vaccd - No_some” corresponds to individuals who responded “No or some” to whether their friends and/or family are vaccinated. A noteworthy observation is that political views may have influenced vaccine hesitancy, evidenced by a high hesitance rate among Republican survey participants.

To investigate how population-specific hesitance varies by state, we focused on the 10 most and 10 least hesitant states and generated two corresponding parallel coordinate plots, where elements on the x-axis represent some of the most hesitant population segments.

Code
# Calculate hesitance over three months in 2021 by state and population segment
# Clean state column
state_group_analysis <- covid |>
  filter(`Year` == 2021,
         `Time Period` %in% periods_to_combine_2021,
         `Geography Type` == "Jurisdictional Estimates",
         `Indicator Category` == "Probably or definitely will not get vaccinated",
         !grepl("-", Geography)) |>
  mutate(n_hesitant = round(`Estimate (%)`/100 * `Sample Size`, 0)) |>
  group_by(Geography, `Group Name`, `Group Category`) |>
  summarise(`Sample Size` = sum(`Sample Size`),
            n_hesitant = sum(n_hesitant)) |>
  mutate(`Estimate (%)` = n_hesitant / `Sample Size` * 100,
         `Time Period` = "2021 analysis period",
         Year = 2021) |>
  ungroup()


# Shorten group names
state_group_analysis <- state_group_analysis |>
  mutate_at(vars(`Group Name`), ~recode(., !!!group_name_shortening))

state_group_analysis <- state_group_analysis |>
  mutate_at(vars(`Group Category`), ~recode(., !!!group_cat_shortening))

state_group_analysis <- state_group_analysis |> mutate(concat = paste(`Group Name`, `Group Category`, sep = " - "))

state_group_analysis$concat_wrap <- str_wrap(state_group_analysis$concat, width = 10)


# Filter groups for top most hesitant
# Pivot wider to convert row-level categories to columns containing hesitancy rates
state_group_analysis_pivot <- state_group_analysis |>
  filter(concat %in% top_10_highest_hesitance_cat_list$concat) |>
  select(Geography, `Estimate (%)`, concat_wrap) |>
  pivot_wider(id_cols = Geography, names_from = "concat_wrap",
              values_from = "Estimate (%)")

# Pregnancy - pregnant is not available at the state level during the filtered timeframe
state_not_in_top_10 <- top_10_highest_hesitance_cat_list |>
  filter(!concat %in% state_group_analysis$concat)

pregnancy_check <- covid |>
  filter(`Group Category` == "Pregnant") |>
  distinct(`Geography Type`, Geography, `Group Name`, `Group Category`,
           `Time Period`, Year)

# Identify top most hesitant states
top_10_hesistant_states <- covid_2021_analysis_allstates |>
  arrange(desc(`Estimate (%)`)) |>
  filter(!Geography %in% c("Guam","Puerto Rico")) |>
  slice(1:10)
  
# Identify top least hesitant states
bottom_10_hesistant_states <- covid_2021_analysis_allstates |>
  arrange(`Estimate (%)`)|>
  filter(!Geography %in% c("Guam","Puerto Rico")) |>
  slice(1:10)

# Filter pivoted data on most hesitant states
top_10_parallel_analaysis <- state_group_analysis_pivot |>
  filter(Geography %in% top_10_hesistant_states$Geography)


# Display parallel coordinates plot for most hesitant states
ggparcoord(top_10_parallel_analaysis |> select(where(~!any(is.na(.)))), columns = c(2:8),
           scale = "globalminmax",
           alpha = 0.4) +
  labs(title = "2021 Hesitance Rates by Population Segment",
       subtitle = "10 Highest Hesitance States",
       x = "Population Segments", y = "2021 Hesitance (%)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
# Filter pivoted data on least hesitant states
bottom_10_parallel_analaysis <- state_group_analysis_pivot |>
  filter(Geography %in% bottom_10_hesistant_states$Geography)


# Display parallel coordinates plot for least hesitant states
ggparcoord(bottom_10_parallel_analaysis |> select(where(~!any(is.na(.)))), 
           columns = c(2:7),   
           scale = "globalminmax",
           alpha = 0.4) +
    labs(title = "2021 Hesitance Rates by Population Segment",
    subtitle = "10 Lowest Hesitance States",
    x = "Population Segments", y = "2021 Hesitance (%)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In both cases, the hesitance rates within states were not uniform across all population segments. In particular, the group of individuals who responded “A little or not at all” to whether vaccination is important to them showed remarkably higher hesitance rates at the state level, which held true for every state.

This observation suggests that vaccine hesitance may not differ solely based on geography and ties in closely with the key question we posed in our research plan: are certain types of demographics more strongly associated with COVID-19 vaccine hesitance than others? To address this question, we proceeded to conduct Chi-Squared tests at the national level for every demographic specified in the Data section of this report. Based on the statistical significance of each test, we then ranked the demographics to identify those that had a stronger association with COVID-19 vaccine hesitance.

Code
# Define function to run Chi Square test for a given demographic
get_chi_sq <- function(df, geography, time_period, year, group_name, hesitant_group) {
  
  temp <- df |> filter(Geography == geography,
                       `Indicator Name`== "Vaccination uptake and intention",
                       `Time Period` == time_period,
                       Year == year,
                       `Group Name`== group_name) |>
    mutate(hesitant = ifelse(`Indicator Category` %in% hesitant_group,1,0)) |>
    group_by(`Group Category`,hesitant) |>
    summarise(hesitance_rate = sum(`Estimate (%)`),
              sample_size = first(`Sample Size`)) |>
    mutate(n_hesitance = round(hesitance_rate/100 * sample_size,0)) |>
    filter(hesitant == 1) |>
    select(`Group Category`,n_hesitance,sample_size)
  
  if(nrow(temp) > 0) {
    expected_proportion <- sum(temp$n_hesitance)/sum(temp$sample_size)

  temp <- temp |>
    mutate(expected = sample_size*expected_proportion)
  
  res = chisq.test(x = temp$n_hesitance, p = temp$expected/sum(temp$expected))
  output <- list(round(res$statistic,digits = 0), res$p.value)
    
  }
  else {
    output <- list(0,0)
    
  }
  
  
   return(output)
}

group_name_toloop <- c("Sex", "Age", "Race/Ethnicity (7 level)",
                       "Age by race/ethnicity", "Sexual orientation",
                       "Gender identity", "Metropolitan statistical area",
                       "Born in the U.S.", "Language of interview",
                       "Poverty status", "Insurance status",
                       "Social Vulnerability Index (SVI) of county of residence",
                       "Political leaning of county of residence",
                       "Received non-COVID-19 vaccine(s) in past two years",
                       "Health condition associated with higher risk for COVID-19 (any)",
                       "Disability status (any)",
                       "Pregnancy status (females age 18 – 49 years)",
                       "Ever had COVID-19 disease (self-report)",
                       "Concern about getting COVID-19 disease",
                       "Confidence in COVID-19 vaccine safety",
                       "Confidence that COVID-19 vaccine is important",
                       "Healthcare provider recommended I get a COVID-19 vaccine",
                       "COVID-19 vaccination status of friends and family",
                       "Work or school requires COVID-19 vaccination" )

group_name_toloop_short <- lapply(group_name_toloop, function(x) ifelse(x %in% names(group_name_shortening), group_name_shortening[[x]], x))
  

result_df2 <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(result_df2) <- c("group_name", "X2_test_stat", "p_value")

for (x in group_name_toloop_short) {
  res <- get_chi_sq(covid_2021_analysis,geography = "National",
                    time_period = "2021 analysis period",
                    year = 2021,
                    group_name = x,
                    hesitant_group = c("Probably or definitely will not get vaccinated"))
  result_df2[nrow(result_df2) + 1,] = c(x,res[1],res[2])

}

result_df2 <- arrange(result_df2, desc(X2_test_stat))

result_df2 |>
  ungroup() |>
  top_n(10,X2_test_stat) |>
  dplyr::rename(`Demographic Type` = group_name)
    Demographic Type X2_test_stat p_value
1    Vaccine.Imprtnt        78443       0
2   Vaccine.Confiden        45851       0
3       Fr.Fam.Vaccd        33687       0
4        Past.Vaccin        12886       0
5      Covid.Concern         7882       0
6           Age.Race         6893       0
7                Age         5585       0
8          Political         4616       0
9        Work.School         3969       0
10 Dr.Recommend.Vacc         3823       0

The top 10 most statistically significant types of demographics are displayed above. These results provide key insights into understanding COVID-19 vaccine hesitance, such as:

  • Whether an individual has been vaccinated in the past may determine their hesitance toward the COVID-19 vaccine (see Chi-Squared for “Past.Vaccin”).

  • One’s age and race may also influence hesitance (see “Age.Race” above). We previously saw that white individuals between the ages of 18-49 were more hesitant than others.

  • Moreover, whether an individual associates themselves with the Republican party could also impact willingness to receive the COVID-19 vaccine (see Chi-Squared “Political”).

Two illustrative mosaic plots are shown below to confirm the Chi-Square test results above. The first plot helps visualize the relationship between COVID-19 vaccine hesitance vs the self-perceived importance of vaccines in general (the group with the highest Chi-Square score). The second plot visualizes the association between hesitance and gender identity (the group with the lowest Chi-Square score).

Code
mosaic_2var <- function(df,indep_var_group_name,n_chars) {
  truncate_string <- function(x) {
  str_sub(x, 1, n_chars)}

indep_var_group_name_no_spaces <- gsub("[()\\[/\\\\ ]", ".", indep_var_group_name)
indep_var_group_name_no_spaces <- truncate_string(indep_var_group_name_no_spaces)

temp <- df |>
  ungroup() |>
  filter(`Group Name` == indep_var_group_name) |>
  mutate(n_not_hesitant = `Sample Size` - n_hesitant) |>
  select(`Group Category`, n_hesitant, n_not_hesitant,`Sample Size`) |>
  rename(!!indep_var_group_name_no_spaces := `Group Category`,
         hesitant = `n_hesitant`,
         `not hesitant` = n_not_hesitant) |> 
  pivot_longer(cols = hesitant:`not hesitant`,names_to = "group", values_to = "Freq")


temp[[indep_var_group_name_no_spaces]] =  substr(temp[[indep_var_group_name_no_spaces]], 1, n_chars)


formula <- as.formula(paste("group", "~", indep_var_group_name_no_spaces))



return(mosaic(formula, direction = c("v", "h"), temp,
       highlighting_fill = c("grey80", "cornflowerblue"),
       gp_varnames = gpar(fontsize = 10, fontface = 1),
       gp_labels = gpar(fontsize = 6)))

}

mosaic_2var(covid_2021_analysis,"Vaccine.Imprtnt", 100)

Code
mosaic_2var(covid_2021_analysis,"G.Identity", 100)

It can be seen in the first plot that those who consider vaccines as important had significantly lower hesitance than those who did not. On the other hand, gender identity yielded a plot that exactly matches a mosaic plot of expected values, thus confirming its statistically insignificant Chi-Square test result.

While mosaic plots can be analyzed for every type of demographic, having the list of most statistically significant groups is beneficial on its own for furthering our exploration of vaccine hesitance. For instance, the significant results we obtained may spark an interest in researching how race affected hesitance in 2021. We can also begin to research the influence of being U.S. born or living in a rural community on hesitance, as both these categories of demographics had a statistically significant Chi-Squared.

Now that we have a better understanding of the relationships between geography, demographics, and 2021 COVID-19 vaccine hesitance, we will move onto the second part of our research, which looks at how vaccine hesitance has changed over time.

3.2 Part 2: Analysis of segments that showed the greatest reduction in COVID vaccine hesitancy between 2021 and 2023

Across the nation, vaccine hesitance declined from around 18% in April/May 2021 to 11% in May/June 2023, as shown in the previously-seen graph below.

Code
# Filter for the vaccination uptake and intent indicator
# Filter for responses corresponding to vaccine hesitancy
# Include all adults 18+
# Set time period to monthly (excluding weekly samples) and filter for nationwide samples
vacc_intention_time <- covid |>
  filter(`Indicator Name` == 'Vaccination uptake and intention') |>
  filter(`Indicator Category` == 'Probably or definitely will not get vaccinated') |>
  filter(`Group Name` == 'All adults 18+') |>
  filter(`Time Type` == 'Monthly') |>
  filter(Geography == 'National') |>
  arrange(`Geography Type`, Geography, `Time Period`)

# Derive month of data collection
vacc_intention_time <- vacc_intention_time |>
  mutate(month_split = paste0(str_split(`Time Period`, "-",
                                        simplify = TRUE)[, 2],
         ", ", Year),
         month = mdy(month_split))

# Plot hesitancy over time
ggplot(vacc_intention_time, aes(x = month, y = `Estimate (%)`,
                                color = `Indicator Category`)) +
  geom_line() +
  labs(title = "U.S. Vaccination Response Rates Over Time",
       x = "Month",
       y = "Response Rate (%)") +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(title = "Indicator Category:"))

However, some segments of the population demonstrated a greater decline in hesitance than others. To better understand population-specific changes in hesitance, we developed the following cleveland dot plot, which breaks down vaccine hesitance for 20 population segments that had the highest initial hesitance 2021.

Code
annual_monthly_snapshots = c("April 22 - May 29","May 1 - May 28",
                             "April 30 - May 27")

snapshots <- covid |>
  filter(`Time Period` %in% annual_monthly_snapshots,
         `Group Name` %in% group_name_toloop,
         Geography == "National",
         `Indicator Category` == "Probably or definitely will not get vaccinated" ) |>
  mutate(Year = as.factor(Year)) |>
  mutate_at(vars(`Group Name`), ~recode(., !!!group_name_shortening)) |>
  mutate_at(vars(`Group Category`), ~recode(., !!!group_cat_shortening)) |>
  mutate(rows = paste(`Group Name`, `Group Category`, sep = " - ")) |>
  select(rows,Year,`Estimate (%)`) |>
  arrange(rows, Year) 

row_order <- snapshots |>
  pivot_wider(names_from = Year, values_from = `Estimate (%)`) |>
  drop_na(`2021`) |>
  drop_na(`2023`) |>
  mutate(decline_21_23 = `2021` - `2023`) |>
  arrange((decline_21_23))

row_order_list_chg_21_23 <- c(row_order$rows)

row_order_list_2021 <- row_order |>
  arrange(`2021`) |>
  top_n(20,`2021`) |>
  select(rows)

row_order_list_2021 <- row_order_list_2021$rows               

row_order_list_2023 <- c(row_order |>
                           arrange((`2023`)) |>
                           select(rows))

snapshots <- snapshots[snapshots$rows %in% row_order_list_chg_21_23, ]


snapshots |> filter(rows %in% row_order_list_2021) |>
  ggplot(aes(x = `Estimate (%)`, y = fct_relevel(rows,(row_order_list_2021)),
             color = Year)) +
  geom_point(size = 1) +
  labs(title = "20 Most Hesitant Segments in 2021, and Changes in 2022-23",
       x = "Vaccine hesitance (%)", y = "Population Segments") +
  theme(legend.position = "bottom",axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 6), plot.title = element_text(size = 10))

It can be seen from the cleveland dot plot that the change in hesitance of the top 20 most hesitant segments in 2021 varies by population segment. To visualize this change better, we also created the bar graph below, which shows the 10 segments that demonstrated the largest decreases in hesitance between 2021 and 2023.

Code
# Bar chart of changes

row_order$rows <- str_wrap(row_order$rows, width = 10)

row_order |>
  top_n(10, decline_21_23) |>
  ggplot(aes(x = fct_reorder(rows,rev(decline_21_23)), y = decline_21_23)) +
  geom_bar(stat = "identity", fill = "navy", color = "black") +
  labs(title = "Top 10 Segments with the Largest 2021-23 Decreases",
       x = "Population Segments", y = "2021-23 % Decrease in Hesitance") +
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 0, vjust = 1, hjust=0.5,size=6),
        axis.text.y = element_text(size=8),
        axis.title = element_text(size = 8),
        plot.title = element_text(size = 8))

Code
# Uncomment code below to view dataframe of combinations of state and population segments that had large increases or decreases in hesitance

# covid_hes_prepped <- covid_hes_prepped |>
#   mutate(month_split = paste0(str_split(`Time Period`, "-",
#                                         simplify = TRUE)[, 2],
#          ", ", Year),
#          month = mdy(month_split))
# 
# 
# state_group <- covid_hes_prepped |>
#   filter(`Geography Type` == "Jurisdictional Estimates",
#          !grepl("-", Geography)) |>
#   mutate(n_hesitant = round(`Estimate (%)`/100 * `Sample Size`, 0)) |>
#   group_by(Geography, `Group Name`, `Group Category`, `Time Period`, Year,
#            concat, concat_wrap, month) |>
#   summarise(`Sample Size` = sum(`Sample Size`),
#             n_hesitant = sum(n_hesitant)) |>
#   mutate(`Estimate (%)` = n_hesitant / `Sample Size` * 100) |>
#   ungroup() |>
#   group_by(Geography, `Group Name`, `Group Category`, concat) |>
#   mutate(min_month = min(month),
#          min_year = year(min_month),
#          max_month = max(month),
#          max_year = year(max_month)) |>
#   filter(month == min_month | month == max_month) |>
#   filter(!min_year == max_year) |>
#   ungroup()
# 
# 
# state_group_pivot <- state_group |>
#   pivot_wider(id_cols = c(Geography, concat), names_from = "Year", values_from = `Estimate (%)`)
# 
# state_group_pivot <- state_group_pivot |>
#   filter(!(`2023` == "NULL" | `2021` == "NULL")) |>
#   mutate(delta = as.numeric(`2023`) - as.numeric(`2021`))

The bar chart shows that individuals who did not perceive vaccines as essential had the sharpest decline in hesitance. Other groups who also had large decreases are those who had low confidence in vaccine safety/efficacy and those who contracted COVID.

While more research is required to understand why some of these segments had more substantial decreases in hesitance, we can analyze whether initial hesitance in 2021 is an associated factor. The scatterplot below shows the relationship between hesitance rates in 2021 and the change in hesitance from 2021-23 across population segments.

Code
row_order %>% ggplot(aes(x = `2021`, y = decline_21_23)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") + 
  labs(title = "Scatterplot of 2021 hesitance vs. 2021-23 decline by population segment",
       x = "2021 Hesitance (%)",
       y = "Decline in hesitance (2021-23) (%)")+
  stat_poly_eq(formula = y ~ x,   # Add R-squared value
               aes(label = paste( ..rr.label.., sep = "~~~~")),
               parse = TRUE,
               size = 4)

Segments with the highest initial hesitance generally showed the greatest reduction in hesitance, with a high R2 of 0.75. This could be because those with the highest hesitance received targeted public healthcare interventions, or simply because they had the greatest scope for reduction in the first place. Determining the exact reason requires further investigation that is beyond the scope of this project.

Similar to Part 1 of our analysis, we are also interested in understanding changes in hesitance by geography. To answer this question, we constructed the following heatmap of changes in hesitance by state.

Code
comparison_baseline_2021 <- covid |>
  filter(`Year` == 2021, `Time Type` == "Monthly",
         `Time Period` == "April 22 - May 29",
         `Group Category`== "All adults age 18+ years",
         `Geography Type` %in% c("Jurisdictional Estimates"),
         `Indicator Category` == "Probably or definitely will not get vaccinated",
         !grepl("-", Geography)) |>
  mutate(n_hesitant = round(`Estimate (%)`/100 * `Sample Size`, 0)) |>
  group_by(Geography) |>
  summarise(`Sample Size` = sum(`Sample Size`),
            n_hesitant = sum(n_hesitant)) |>
  mutate(`Estimate (%)` = n_hesitant / `Sample Size` * 100,
         `Time Period` = "2021 most recent snapshot",
         Year = 2021)

covid_2023_analysis_allstates <- covid |>
  filter(`Year` == 2023, `Time Type` == "Monthly",
         `Time Period` == "May 28 - June 30",
         `Group Category`== "All adults age 18+ years",
         `Geography Type` %in% c("Jurisdictional Estimates"),
         `Indicator Category` == "Probably or definitely will not get vaccinated",
         !grepl("-", Geography)) |>
  mutate(n_hesitant = round(`Estimate (%)`/100 * `Sample Size`, 0)) |>
  group_by(Geography) |>
  summarise(`Sample Size` = sum(`Sample Size`),
            n_hesitant = sum(n_hesitant)) |>
  mutate(`Estimate (%)` = n_hesitant / `Sample Size` * 100,
         `Time Period` = "2023 most recent snapshot",
         Year = 2023)


state_delta <- left_join(covid_2023_analysis_allstates,
                         comparison_baseline_2021 |>
                           select(Geography, estimate_2021 = `Estimate (%)`),
                         by = "Geography")

state_delta <- state_delta |>
  mutate(`Change in Hesitancy Rates` = `Estimate (%)` - estimate_2021)

state_delta_map <- state_delta |>
  select(Geography, `Change in Hesitancy Rates`) |>
  transmute(region = tolower(Geography),
            value = as.numeric(`Change in Hesitancy Rates`))


# red_green_palette <- brewer.pal(n = 7, name = "RdYlGn")

state_choropleth(state_delta_map,
                 title = "   2021-2023 Changes in Vaccine Hesitance by State",
                 legend = "Change in \nHesitancy Rates   ",
                 num_colors = 0)

The heatmap indicates that West Virginia exhibited the greatest overall change in hesitance (-23.3%). Other geographies that had higher hesitance rates in 2021, such as the South East and Northern Mountain West, also had more notable decreases in hesitance. Conversely, some states had almost no change in hesitance. New Mexico, in particular, experienced a slight increase in hesitance of 0.3%.

The scatterplot below shows the relationship between initial hesitance in 2021 and the change in hesitance between 2021-23 by state.

Code
state_delta %>% mutate(`Decline in Hesitancy Rates` = -1*`Change in Hesitancy Rates`) %>% ggplot(aes(x = estimate_2021, y = `Decline in Hesitancy Rates`)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") + 
  labs(title = "Scatterplot of 2021 Hesitance vs. 2021-23 Decline by State",
       x = "2021 Hesitance (%)",
       y = "2021-23 Decline in Hesitance (%)")+
  stat_poly_eq(formula = y ~ x,   # Add R-squared value
               aes(label = paste( ..rr.label.., sep = "~~~~")),
               parse = TRUE,
               size = 4)

As seen in the heatmap, states with the highest initial hesitance generally showed the greatest decline in hesitance; although, the R-Squared statistic is low (0.15). This relationship is significantly weaker than a similar relationship seen for population segments where the R-Squared was 0.75.

To view the latest distribution of vaccine hesitance by state, we produced one additional heatmap of statewide hesitance rates as of June 30th, 2023.

Code
# Prep dataframe to be used in the state_choropleth function
covid_2023_analysis_allstates_map <- covid_2023_analysis_allstates |>
  select(Geography, `Estimate (%)`) |>
  transmute(region = tolower(Geography), value = `Estimate (%)`)


# Display state heatmap of hesitance rates
state_choropleth(covid_2023_analysis_allstates_map,
                 title = "   June 2023 COVID-19 Vaccine Hesitance Rates by State",
                 legend = "Hesitancy Rates (%) ")

This map shows vaccine hesitance is still prevalent in some states, namely in southern states, as well as Wyoming, Wisconsin, Indiana, and Ohio.

In the following section, we summarize our takeaways from Part 1 as well as the time series analysis from Part 2.

3.3 Part 3: High-level recommendations for segments of the population that need further public health interventions.

Our analyses highlighted several key findings:

  1. In 2021, COVID-19 vaccine hesitance was present in all states and many segments of the population. However, some regions and groups were more susceptible to hesitance, as seen by the disparities in hesitance rates between states and population segments.

  2. Some differences between population demographics may have been drivers of vaccine hesitance in 2021. For example, population characteristics such as one’s perception of the importance, safety, and efficacy of vaccines or having unvaccinated friends and family appear to be linked with hesitancy. We recommend further analysis into how someone’s willingness to receive the COVID vaccine may be impacted by the following characteristics:

  • General perception of vaccination importance
  • Past history of vaccination
  • Race/ethnicity
  • Political views
  • Living in a rural vs urban community
  • Being U.S. Born or not
  1. Three years after the beginning of the pandemic, vaccine hesitance persists in most states and some segments of the populations. However, hesitance can be reduced. There was a history of successful reductions in hesitance between 2021-23 nationally (-7%), state-wise (up to -23%), and by population segment (up to -32%). This was especially visible in states like West Virginia, and segments such as those who had reservations about the importance or the safety and efficacy of vaccines. This indicates potential to further reduce hesitance in these segments; although, further study into the causes of these declines in hesitance is required. As of June 2023, the geographies and population segments listed below exhibit the largest hesitance. In the event that future COVID outbreaks were to occur, we recommend prioritizing outreach to these groups to reduce hesitance and mitigate further spread of the disease.
  • Residents of states with high hesitance rates: Texas, Oklahoma, Arkansas, Wyoming, Wisconsin, Indiana, and Ohio
  • Individuals who have not been previously vaccinated
  • Individuals who are not insured
  • Women who are pregnant and have not been vaccinated
  • Individuals who are white and between the age of 18 and 49
  • Individuals living in rural communities

Additional trends in vaccination hesitance over time may exist. In the following section of this report, we provide a tool that visualizes month-by-month hesitance by state and population segment.

Code
# # Uncomment the code below to create inputs files to interactive D3 graphic
# 
# 
# hesitance_state_group_cat <- covid_hes_prepped |>
#   filter(`Geography Type` %in% c("Jurisdictional Estimates"),
#          !grepl("-", Geography)) |>
#   mutate(month_split = paste0(str_split(`Time Period`, "-",
#                                         simplify = TRUE)[, 2], ", ", Year),
#          month = mdy(month_split),
#          n_hesitant = round(`Estimate (%)`/100 * `Sample Size`, 0)) |>
#   group_by(Geography, `Group Name`, `Group Category`, concat, month) |>
#   summarise(`Sample Size` = sum(`Sample Size`),
#             n_hesitant = sum(n_hesitant)) |>
#   mutate(`Estimate (%)` = n_hesitant / `Sample Size` * 100,
#          concat_no_state = concat,
#          concat = paste0(Geography, ": ", concat)) |>
#   ungroup()
# 
# 
# d3_input_prep <- hesitance_state_group_cat |>
#   pivot_wider(id_cols = month, names_from = "concat", values_from = `Estimate (%)`) |>
#   dplyr::rename(Category = month)
# 
# d3_input_prep[is.na(d3_input_prep)] <- 0
# 
# 
# write_csv(d3_input_prep, "D3 Inputs/All State-Category Values.csv")
# 
# 
# dropdown_selections <- hesitance_state_group_cat |>
#   distinct(Geography, concat_no_state)
# 
# write_csv(dropdown_selections, "D3 Inputs/Dropdown Choices.csv")