Version 1.0 | First Created March 10, 2025 | Updated May 16, 2025
Chakraborty (2021) investigates the relationships between COVID-19 incidence rates and several demographic characteristics of people with disabilities by county in the continental United States. The aim of the study is to investigate whether people with disabilities (PwDs) face disproportionate challenges due to COVID-19.
Continued, from “Reproduction of Chakraborty 2021: An intracategorical analysis of COVID-19 and people with disabilities” (Holler et al.): To do so, Chakraborty examines the statistical relationship between county incidence rates of COVID-19 cases and county-level percentages of people with disabilities and different socio-demographic characteristics. Specifically, Chakraborty tests county-level bivariate correlations between COVID-19 incidence against the percentage of disability as one hypothesis, and tests correlation between COVID-19 incidence and percentage of people with disabilities in 18 different socio-demographic categories of race, ethnicity, poverty status, age, and biological sex. Chakraborty then re-tests for the same county-level associations while controlling for spatial dependence. Spatial dependence is controlled by constructing generalized estimating equation (GEE) models using a combination of state and spatial clusters of COVID-19 incidence as to define the GEE clusters. One GEE model is constructed for each of the four types of socio-demographic category: race, ethnicity, age, and biological sex. Chakraborty (2021) finds significant positive relationships between COVID-19 rates and socially vulnerable demographic categories of race, ethnicity, poverty status, age, and biological sex.
This study is a reproduction, with the goal of examining Chakraborty’s study design and its impact particularly in public policy as well as in fields such as research and education. This reproduction will attempt to reproduce the original study’s results.
This will include the map of county level distribution of COVID-19 incidence rates (Fig. 1), the summary statistics for disability and socio-demographic variables and bivariate correlations with county-level COVID-19 incidence rate (Table 1), and the GEE models for predicting COVID-19 county-level incidence rate (Table 2). A successful reproduction should be able to generate identical results as published by Chakraborty (2021).
Chakraborty, J. 2021. Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S. Disability and Health Journal 14:1-5. https://doi.org/10.1016/j.dhjo.2020.101007
The American Community Survey (ACS) five-year estimate (2014-2018) variables used in the study are outlined in the table below. Details on ACS data collection can be found at https://www.census.gov/topics/health/disability/guidance/data-collection-acs.html and details on sampling methods and accuracy can be found at https://www.census.gov/programs-surveys/acs/technical-documentation/code-lists.html.
Variable Name in Study | ACS Variable name |
---|---|
percent of total civilian non-institutionalized population with a disability | S1810_C03_001E |
Race | |
percent w disability: White alone | S1810_C03_004E |
percent w disability: Black alone | S1810_C03_005E |
percent w disability: Native American | S1810_C03_006E |
percent w disability: Asian alone | S1810_C03_007E |
percent w disability: Other race | S1810_C03_009E |
Ethnicity | |
percent w disability: Non-Hispanic White | S1810_C03_0011E |
percent w disability: Hispanic | S1810_C03_012E |
percent w disability: Non-Hispanic non-White | (S1810_C02_001E - S1810_C02_011E - S1810_C02_012E) / (S1810_C01_001E - S1810_C01_011E - S1810_C01_012E) * 100 |
percent w disability: Other race | S1810_C03_009E |
Poverty | |
percent w disability: Below poverty level | (C18130_004E + C18130_011E + C18130_018E) / C18130_001E * 100 |
percent w disability: Above poverty level | (C18130_005E + C18130_012E + C18130_019E) / C18130_001E * 100 |
Age | |
percent w disability: 5-17 | S1810_C03_014E |
percent w disability: 18-34 | S1810_C03_015E |
percent w disability: 35-64 | S1810_C03_016E |
percent w disability: 65-74 | S1810_C03_017E |
percent w disability: 75+ | S1810_C03_018E |
Biological sex | |
percent w disability: male | S1810_C03_001E |
percent w disability: female | S1810_C03_003E |
American Community Survey (ACS) data for sociodemographic
subcategories of people with disabilities can be accessed by using the
tidycensus
package to query the Census API. This requires
an API key which can be acquired at api.census.gov/data/key_signup.html.
Data on COVID-19 cases from the Johns Hopkins University dashboard
have been provided directly with the research compendium because the
data is no longer available online in the state in which it was
downloaded on August 1, 2020. The dashboard and cumulative counts of
COVID-19 cases and deaths were continually updated, so an exact
reproduction required communication with the original author, Jayajit
Chakraborty, for assistance with provision of data from August 1, 2020.
The data includes an estimate of the total population
(POP_ESTIMA
) and confirmed COVID-19 cases
(Confirmed
). The COVID-19 case data expresses cumulative
count of reported COVID-19 from 1/22/2020 to 8/1/2020. Although metadata
for this particular resource is no longer available from the original
source, one can reasonably assume that the total population estimate was
based on the 2014-2018 5-year ACS estimate, as the 2019 estimates data
had not been released yet.
Versions of the data can be found at the John Hopkins CCSE COVID-19 Data Repository (https://github.com/CSSEGISandData/COVID-19). However, archived data only provides summaries at the national scale. We received the COVID-19 case data through 8/1/2020 at the county level from the author, as there is no readily apparent way to access archived data from the Johns Hopkins University Center for Systems Science Engineering database.
Data on COVID-19 cases from the Johns Hopkins University dashboard have been provided directly with the research compendium because the data is no longer available online in the state in which it was downloaded on August 1, 2020. The dashboard and cumulative counts of COVID-19 cases and deaths were continually updated, so an exact reproduction required communication with the original author, Jayajit Chakraborty, for assistance with provision of data from August 1, 2020.
The data includes an estimate of the total population
(POP_ESTIMA
) and confirmed COVID-19 cases
(Confirmed
). The COVID-19 case data expresses cumulative
count of reported COVID-19 from 1/22/2020 to 8/1/2020. Although metadata
for this particular resource is no longer available from the original
source, one can reasonably assume that the total population estimate was
based on the 2014-2018 5-year ACS estimate, as the 2019 estimates data
had not been released yet.
Versions of the data can be found at the John Hopkins CCSE COVID-19 Data Repository (https://github.com/CSSEGISandData/COVID-19). However, archived data only provides summaries at the national scale. We received the COVID-19 case data through 8/1/2020 at the county level from the author, as there is no readily apparent way to access archived data from the Johns Hopkins University Center for Systems Science Engineering database.
Calculate incidence rate (cases per 100,000 people)
Convert the COVID data to a non-geographic data frame.
Examine ACS variables from relevant tables
American Community Survey (ACS) data for sociodemographic
subcategories of people with disabilities can be accessed by using the
tidycensus
package to query the Census API. This requires
an API key which can be acquired at api.census.gov/data/key_signup.html.
Create variable csv tables
Query census (don’t run if data is already saved)
To be consistent with study extent, exclude Alaska, Hawai’i and Puerto Rico from the dataset.
Also, join poverty ACS variable to other ACS data using GEOID to create one dataset with all of the study variables.
Finally, transform the ACS geographic data into Contiguous USA Albers Equal Area projection and fix geometry errors.
Join variables from acs_vars_S1810
and
acs5_c18130
Calculate independent socio-demographic variables of people with disabilities as percentages for each sub-category of disability (race, ethnicity, poverty, age, and biological sex) and remove raw census data from the data frame (workflow step 4).
Reproject the data into an Albers equal area conic projection.
Number of PwDs / total from each PwD variable * 100
Join dependent COVID data to independent ACS demographic data.
This was not mentioned in the original study or in our pre-analysis plan. However, we replace the missing data with zeros, producing results identical to Chakraborty’s.
fips | statefp | county | county_st | covid_rate | dis_pct | white_pct | black_pct | native_pct | asian_pct | other_pct | non_hisp_white_pct | hisp_pct | non_hisp_non_white_pct | bpov_pct | apov_pct | pct_5_17 | pct_18_34 | pct_35_64 | pct_65_74 | pct_75 | male_pct | female_pct | pop | cases | x | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35039 | 35 | Rio Arriba | Rio Arriba County, New Mexico | 751.17 | 16.06467 | 10.77458 | 0.038371 | 2.744807 | 0.038371 | 2.468536 | 2.355981 | 11.39619 | 2.312494 | NA | NA | 0.3069682 | 1.25857 | 6.781439 | 3.391998 | 4.279648 | 8.556738 | 7.50793 | 39006 | 293 | -106.6932 | 36.50962 |
Replace the NA
values with 0
.
The purpose of this step is to improve our understanding of the geographic patterns and relationships of between the overarching independent variable (percentage of people with disability) and the dependent variable (COVID-19 incidence rate).
summarise()
Calculate summary statistics for dependent COVID-19 rate and independent PwD socio-demographic characteristics (Min, Max, Mean, SD) to reproduce columns of original study table 1
min | max | mean | SD | |
---|---|---|---|---|
covid_rate | 0.00 | 14257.17 | 966.90 | 1003.96 |
dis_pct | 3.83 | 33.71 | 15.95 | 4.40 |
white_pct | 0.85 | 33.26 | 13.55 | 4.63 |
black_pct | 0.00 | 20.70 | 1.48 | 2.66 |
native_pct | 0.00 | 13.74 | 0.28 | 0.94 |
asian_pct | 0.00 | 3.45 | 0.09 | 0.18 |
other_pct | 0.00 | 15.24 | 0.55 | 0.65 |
non_hisp_white_pct | 0.10 | 33.16 | 12.84 | 4.81 |
hisp_pct | 0.00 | 25.26 | 0.99 | 2.15 |
non_hisp_non_white_pct | 0.00 | 20.93 | 2.13 | 2.75 |
bpov_pct | 0.00 | 14.97 | 3.57 | 1.85 |
apov_pct | 0.00 | 27.30 | 12.48 | 3.06 |
pct_5_17 | 0.00 | 5.08 | 1.03 | 0.48 |
pct_18_34 | 0.00 | 5.59 | 1.56 | 0.67 |
pct_35_64 | 1.01 | 18.36 | 6.35 | 2.30 |
pct_65_74 | 0.00 | 12.73 | 3.09 | 1.16 |
pct_75 | 0.00 | 11.13 | 3.87 | 1.19 |
male_pct | 1.30 | 18.19 | 8.06 | 2.37 |
female_pct | 1.91 | 19.94 | 7.90 | 2.26 |
Correlate cor()
COVID cases, pct disability for each
disability variable
Mutate()
to produce Pearson’s r column in Table 1mutate( t = abs(r) / sqrt((1 - r\^2) / (df)), p = pt(t, df, lower.tail = FALSE) )
variable | r | t | p |
---|---|---|---|
dis_pct | -0.060 | 3.350 | 0.000 |
white_pct | -0.332 | 19.612 | 0.000 |
black_pct | 0.460 | 28.847 | 0.000 |
native_pct | 0.019 | 1.072 | 0.142 |
asian_pct | 0.094 | 5.272 | 0.000 |
other_pct | 0.026 | 1.460 | 0.072 |
non_hisp_white_pct | -0.361 | 21.545 | 0.000 |
hisp_pct | 0.119 | 6.686 | 0.000 |
non_hisp_non_white_pct | 0.442 | 27.429 | 0.000 |
bpov_pct | 0.106 | 5.914 | 0.000 |
apov_pct | -0.151 | 8.513 | 0.000 |
pct_5_17 | 0.084 | 4.688 | 0.000 |
pct_18_34 | 0.063 | 3.493 | 0.000 |
pct_35_64 | -0.008 | 0.460 | 0.323 |
pct_65_74 | -0.091 | 5.113 | 0.000 |
pct_75 | -0.186 | 10.541 | 0.000 |
male_pct | -0.134 | 7.519 | 0.000 |
female_pct | 0.023 | 1.305 | 0.096 |
pop | 0.128 | 7.215 | 0.000 |
cases | 0.209 | 11.891 | 0.000 |
x | 0.099 | 5.540 | 0.000 |
y | -0.412 | 25.195 | 0.000 |
Subtract results from Chakraborty’s Table 1
Compare reproduced descriptive statistics to original descriptive statistics. Difference is calculated as ‘reproduction study - original study’. Identical results will result in zero.
min | max | mean | SD | |
---|---|---|---|---|
covid_rate | 0 | 0.17 | -0.1 | -0.04 |
dis_pct | 0 | 0.00 | 0.0 | 0.00 |
white_pct | 0 | 0.00 | 0.0 | 0.00 |
black_pct | 0 | 0.00 | 0.0 | 0.00 |
native_pct | 0 | 0.00 | 0.0 | 0.00 |
asian_pct | 0 | 0.00 | 0.0 | 0.00 |
other_pct | 0 | 0.00 | 0.0 | 0.00 |
non_hisp_white_pct | 0 | 0.00 | 0.0 | 0.00 |
hisp_pct | 0 | 0.00 | 0.0 | 0.00 |
non_hisp_non_white_pct | 0 | 0.00 | 0.0 | 0.00 |
bpov_pct | 0 | 0.00 | 0.0 | 0.00 |
apov_pct | 0 | 0.00 | 0.0 | 0.00 |
pct_5_17 | 0 | 0.00 | 0.0 | 0.00 |
pct_18_34 | 0 | 0.00 | 0.0 | 0.00 |
pct_35_64 | 0 | 0.00 | 0.0 | 0.00 |
pct_65_74 | 0 | 0.00 | 0.0 | 0.00 |
pct_75 | 0 | 0.00 | 0.0 | 0.00 |
male_pct | 0 | 0.00 | 0.0 | 0.00 |
female_pct | 0 | 0.00 | 0.0 | 0.00 |
Compare the reproduced Pearson’s r correlation coefficients
to the original study’s Pearson’s r correlation coefficients.
(Stars indicates the significance level with two stars for
p < 0.01
and one star for p < 0.05
.)
Note: Correlation difference rp_r_diff
is calculated
between the reproduction study rp_r
and original study
or_r
as rp_r_diff = rp_r - or_r
Direction
difference rp_dir_diff
is calculated as
(rp_r > 0) - (or_r > 0)
, giving 0
if
both coefficients have the same direction, 1
if the
reproduction is positive and the original is negative, and
-1
if the reproduction is negative but the original is
positive.
Variable | R | Sig. Level | R | Sig. Level | R | Sig. Level | Direction |
---|---|---|---|---|---|---|---|
dis_pct | -0.056 | 2 | -0.060 | 2 | -0.004 | 0 | 0 |
white_pct | -0.326 | 2 | -0.332 | 2 | -0.006 | 0 | 0 |
black_pct | 0.456 | 2 | 0.460 | 2 | 0.004 | 0 | 0 |
native_pct | 0.020 | 0 | 0.019 | 0 | -0.001 | 0 | 0 |
asian_pct | 0.097 | 2 | 0.094 | 2 | -0.003 | 0 | 0 |
other_pct | 0.028 | 0 | 0.026 | 0 | -0.002 | 0 | 0 |
non_hisp_white_pct | -0.355 | 2 | -0.361 | 2 | -0.006 | 0 | 0 |
hisp_pct | 0.119 | 2 | 0.119 | 2 | 0.000 | 0 | 0 |
non_hisp_non_white_pct | 0.439 | 2 | 0.442 | 2 | 0.003 | 0 | 0 |
bpov_pct | 0.108 | 2 | 0.106 | 2 | -0.002 | 0 | 0 |
apov_pct | -0.146 | 2 | -0.151 | 2 | -0.005 | 0 | 0 |
pct_5_17 | 0.083 | 2 | 0.084 | 2 | 0.001 | 0 | 0 |
pct_18_34 | 0.066 | 1 | 0.063 | 2 | -0.003 | 1 | 0 |
pct_35_64 | -0.005 | 0 | -0.008 | 0 | -0.003 | 0 | 0 |
pct_65_74 | -0.089 | 2 | -0.091 | 2 | -0.002 | 0 | 0 |
pct_75 | -0.181 | 2 | -0.186 | 2 | -0.005 | 0 | 0 |
male_pct | -0.131 | 2 | -0.134 | 2 | -0.003 | 0 | 0 |
female_pct | 0.028 | 0 | 0.023 | 0 | -0.005 | 0 | 0 |
Reproduction correlation coefficients varied slightly from the original study coefficients by +/- 0.006.
All Pearson’s correlation coefficients were significant to the same level, except for age 18 to 34.
Counter-intuitively, the correlation coefficient was slightly closer to 0 but the p value was also found to be more significant, suggesting a difference in the estimation of t and/or p, or a typographical error. All of the coefficients had the same direction.
Unplanned Deviation for Reproduction: We should
expect identical results for this correlation test, so we loaded the
original author’s data from Aug1GEEdata.csv
to re-test the
statistic, calculated as unplanned_r
below.
variable | unplanned_r | or_r | diff |
---|---|---|---|
dis_pct | -0.056 | -0.056 | 0 |
white_pct | -0.326 | -0.326 | 0 |
black_pct | 0.456 | 0.456 | 0 |
native_pct | 0.020 | 0.020 | 0 |
asian_pct | 0.097 | 0.097 | 0 |
other_pct | 0.028 | 0.028 | 0 |
non_hisp_white_pct | -0.355 | -0.355 | 0 |
hisp_pct | 0.119 | 0.119 | 0 |
non_hisp_non_white_pct | 0.439 | 0.439 | 0 |
bpov_pct | 0.108 | 0.108 | 0 |
apov_pct | -0.146 | -0.146 | 0 |
pct_5_17 | 0.083 | 0.083 | 0 |
pct_18_34 | 0.066 | 0.066 | 0 |
pct_35_64 | -0.005 | -0.005 | 0 |
pct_65_74 | -0.089 | -0.089 | 0 |
pct_75 | -0.181 | -0.181 | 0 |
male_pct | -0.131 | -0.131 | 0 |
female_pct | 0.028 | 0.028 | 0 |
The author’s original data produced coefficients identical to the original study.
Unplanned Deviation for Reproduction: Considering the precise bitwise reproduction of descriptive statistics and of correlation statistics from author-provided data, we will recalculate the COVID-19 incidence rate with author-provided case and population data for comparison to the author-provided incidence rate.
FIPS | State | County | Population | Cases | OR_Incidence | RPr_Incidence | Difference |
---|---|---|---|---|---|---|---|
1115 | Alabama | St. Clair | 88690 | 1151 | 1349.52 | 1297.78 | -51.74 |
1117 | Alabama | Shelby | 215707 | 2911 | 1297.78 | 1349.52 | 51.74 |
5123 | Arkansas | St. Francis | 25439 | 1112 | 704.16 | 4371.24 | 3667.08 |
5125 | Arkansas | Saline | 121421 | 855 | 397.33 | 704.16 | 306.83 |
5127 | Arkansas | Scott | 10319 | 41 | 314.15 | 397.33 | 83.18 |
5129 | Arkansas | Searcy | 7958 | 25 | 1322.08 | 314.15 | -1007.93 |
5131 | Arkansas | Sebastian | 127753 | 1689 | 5420.39 | 1322.08 | -4098.31 |
5133 | Arkansas | Sevier | 17139 | 929 | 570.08 | 5420.39 | 4850.31 |
5135 | Arkansas | Sharp | 17366 | 99 | 4371.24 | 570.08 | -3801.16 |
8039 | Colorado | Elbert | 26282 | 85 | 626.60 | 323.42 | -303.18 |
8041 | Colorado | El Paso | 713856 | 4473 | 323.42 | 626.60 | 303.18 |
8065 | Colorado | Lake | 7824 | 70 | 349.85 | 894.68 | 544.83 |
8067 | Colorado | La Plata | 56310 | 197 | 894.68 | 349.85 | -544.83 |
We found that 13 counties had incorrect COVID-19 incidence scores, and the scores seem to be transposed from other counties, such that the overall descriptive statistics were accurate but the correlation coefficients were inaccurate. This finding implies that subsequent analyses using the COVID-19 Incidence rate will be slightly different and more accurate in this reproduction study than in the original study.
Unplanned deviation for reproduction: Join the original author’s Incidence data into our reproduction data frame so that we can later test for sensitivity to this error. Then report any counties for which the reproduced COVID incidence rate differs from the original author’s COVID incidence rate.
county_st | covid_rate | or_incidence |
---|---|---|
St. Clair County, Alabama | 1297.78 | 1349.52 |
Shelby County, Alabama | 1349.52 | 1297.78 |
St. Francis County, Arkansas | 4371.24 | 704.16 |
Saline County, Arkansas | 704.16 | 397.33 |
Scott County, Arkansas | 397.33 | 314.15 |
Searcy County, Arkansas | 314.15 | 1322.08 |
Sebastian County, Arkansas | 1322.08 | 5420.39 |
Sevier County, Arkansas | 5420.39 | 570.08 |
Sharp County, Arkansas | 570.08 | 4371.24 |
Elbert County, Colorado | 323.42 | 626.60 |
El Paso County, Colorado | 626.60 | 323.42 |
Lake County, Colorado | 894.68 | 349.85 |
La Plata County, Colorado | 349.85 | 894.68 |
The join worked, highlighting the same 13 counties with inconsistent incidence rates. This also confirms that our reproduced dependent variable is identical to the original dependent variable with the exception of these three counties.
SpatialEpi
packageFirst, calculate the Kulldorff spatial scan statistic using SpatialEpi. Optionally, skip this code block due to long run times of more than 10 minutes.
Load pre-calculated Kulldorff results
Report Kulldorff spatial scan results.
## [1] Most likely cluster:
## $location.IDs.included
## [1] 1824 1835 1797 1818 1825 1749 1854 1742 1837 1747 1838 280 1760 1846 1756
##
## $population
## [1] 16949211
##
## $number.of.cases
## [1] 469091
##
## $expected.cases
## [1] 233805.6
##
## $SMR
## [1] 2.006329
##
## $log.likelihood.ratio
## [1] 97983.07
##
## $monte.carlo.rank
## [1] 1
##
## $p.value
## [1] 0.001
## [1] Number of Secondary clusters: 136
Assign unique cluster ID’s to each county within a cluster. Clusters include the county at the center of a cluster and all of the other counties within the cluster radius. Therefore, we use the FIPS code of the county at the center of each cluster as the unique cluster ID.
First, we must join the Kulldorff spatial scan cluster IDs to the acs_covid simple features dataframe. Although this was planned in workflow step 9, the order of operations between steps 9 and steps 7 and 8 is not important.
Next, calculate a new field isCluster
to identify
counties in COVID-19 clusters. Additionally, distinguish between
counties defining the center of a cluster from counties constituting
other parts of a cluster by comparing the cluster ID (equivalent to the
center county’s fips code) to the county fips code.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_tm_polygons()`: migrate the argument(s) related to the scale of
## the visual variable `fill` namely 'palette' (rename to 'values') to fill.scale
## = tm_scale(<HERE>).
## [v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
## (instead of 'col'), and 'col' for the outlines (instead of 'border.col').
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [tm_polygons()] Argument `value.na` unknown.
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.
##
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "-Oranges" is named
## "oranges" (in long format "brewer.oranges")
(local cases/local population) / (global cases - local cases/global population - local population)
Classify counties using mutate(case_when())
The SpatialEpi
implementation of Kulldorff spatial scan
statistics does not calculate local relative risk or cluster relative
risk. Therefore, the next step is to calculate local and cluster
relative risk (workflow step 7).
Classify relative risk on a scale from 1 to 6 (workflow step 8). Risk is classified according to this table:
Relative Risk Values | Relative Risk Class |
---|---|
Outside of cluster | 1 |
RR < 1 | 1 |
1 <= RR < 2 | 2 |
2 <= RR < 3 | 3 |
3 <= RR < 4 | 4 |
4 <= RR < 5 | 5 |
5 <= RR < 6 | 6 |
Counties falling outside of any cluster are assigned a score of 1.
c(state + relative risk)
to produce unique cluster
identification codes
The original study did not directly report any results from the
Kulldorff spatial scan statistic. However, the Kulldorff cluster
relative risk scores were combined with states to create clusters for
GEE models, hereafter called “GEE clusters”. The original study reported
102
unique GEE clusters having a range of 1
to
245
counties in each cluster.
In order to compare results, we first create cluster IDs as combinations of the state ID and COVID relative risk class. The first clustering ID (State) and second clustering score (COVID relative risk class) were combined to form IDs for each unique combination of state and relative risk class. Then, we find the number of unique clusters and frequency counties per cluster in our reproduction study for comparison to the original study.
## 111 unique clusters based on spatialEpi CLUSTER relative risk
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 7.00 27.56 50.50 159.00
We failed to reproduce the same configuration of GEE clusters as the original study, finding 9 more clusters than the original study and a much smaller maximum cluster of 159 counties compared to 245 counties.
Reproduce Kulldorff spatial scan statistic in SaTScan
Unplanned deviation for reproduction: Upon failing
to reproduce an identical number of GEE clusters using SpatialEpi in R,
we reproduced the procedure in the free but not open SaTScan software,
using the current software version 10.1. The input data files
(case
, Coordinates.geo
, and
Population.pop
), and output data files
(sat_scan_rpr.txt
, sat_scan_rpr.col.shp
, and
sat_scan_rpr.gis.shp
) are found in the
data/derived/public/satscan
directory. The
sat_scan_rpr.txt
file reports the model parameters used in
addition to results.
Although it is not ideal to intercede with this unplanned deviation at this step, is the first step in the methodology following the Kulldorff spatial scan statistic with a result reported in the original publication.
First, load and verify whether our SaTScan reproduction data compares to the author-provided SaTScan data.
## 96 reproduced relative risk observations
## 96 author-provided relative risk observations
## 96 reproduced relative risk values match the original author's relative risk values
Our SaTScan results exactly reproduced the author-provided SaTScan results data.
Join the SaTScan results to acs_covid
for mapping and
analysis.
## Joining 96 records with 96 unique LOC_ID county values
Unplanned deviation for reproduction: Visualize the spatial distribution of the author-provided Kulldorff COVID-19 Clusters.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [tm_polygons()] Arguments `values` and `value.na` unknown.
In the map above, clusters containing only one county have no visible circle. Clusters containing two counties are encircled, but have no label. Clusters containing three or more counties are encircled and labelled with the number of counties.
Note that this version of data only includes the 96 counties defining cluster centers, visualized with fill colors above. The data excludes all of the non-center counties in clusters with more than one county. The extent of these larger clusters is visualized by unfilled circles defined by cluster radii.
Aug1GEE
)The generalized estimating equation (GEE) models were used to test association between intra-categorical rates of disability and COVID-19 incidence rates while accounting for spatial clustering. A separate hypothesis was formulated for each type of subcategorization of PwDs, numbered H2.1 through H2.5 in Table 4.
As specified by the author, “GEEs extend the generalized linear model to accommodate clustered data, in addition to relaxing several assumptions of traditional regression (i.e., normality)”. Additionally, the author noted that “clusters of observations must be defined based on the assumption that observations within a cluster are correlated while observations from different clusters are independent.” All five GEE models were specified with exchangeable correlation matrices, gamma distributions, and logarithmic link function. These specifications were chosen after testing each alternative and choosing the models with the best quasilikelihood under the independence model criterion (QIC).
This accomplishes the step 11 of the workflow diagram.
Unplanned deviation for reanalysis: Based on the three observations above, we think that it would be more valid to choose one set of secondary clusters based on a single method rather than combining a set of hierarchical clusters with a set of GINI optimized clusters. We also think that it would be more valid to include risk levels for all counties within a cluster (i.e. all counties within any of the circles above), rather than only the county at the center of a cluster. Finally, we think it would be more valid to treat clusters as a single category rather than five tiers of above-normal risk.
To complete the reproduction/reanalysis study, we will therefore calculate and compare multiple versions of the GEE models:
First, calculate GEE cluster IDs.
We have already calculated: - rp_clusID
based on our
SpatialEpi clusters - ss_clusID
based on our SaTScan
cluster centers, and shown to be identical to the original author’s data
- gini_clusID
based on our SaTScan GINI-optimized
clusters
Second, filter the data for non-zero COVID-19 rates and z-score standardize the independent variables. This accomplishes step 10 of the workflow diagram.
Unplanned deviation for reproduction: We assumed that we should filter for COVID rates > 0 first and then calculate z-scores, however after comparing data in the next code block, we realized that the original study had first calculated z-scores and then filtered for COVID rates > 0. Therefore, to align with the original study, in the next code block we first calculate z-scores and then filter for COVID rates > 0.
Compare independent variables for GEE models by subtracting the original values from the reproduced values, and finding the average and standard deviation of difference for each variable.
## Summary of difference between reproduction independent variables and original independent variables
## Mean:
## z_white_pct z_black_pct z_native_pct
## 0 0 0
## z_asian_pct z_other_pct z_non_hisp_white_pct
## 0 0 0
## z_hisp_pct z_non_hisp_non_white_pct z_bpov_pct
## 0 0 0
## z_apov_pct z_pct_5_17 z_pct_18_34
## 0 0 0
## z_pct_35_64 z_pct_65_74 z_pct_75
## 0 0 0
## z_male_pct z_female_pct
## 0 0
##
## Standard deviation:
## z_white_pct z_black_pct z_native_pct
## 0.000 0.000 0.000
## z_asian_pct z_other_pct z_non_hisp_white_pct
## 0.001 0.001 0.000
## z_hisp_pct z_non_hisp_non_white_pct z_bpov_pct
## 0.000 0.000 0.000
## z_apov_pct z_pct_5_17 z_pct_18_34
## 0.000 0.001 0.001
## z_pct_35_64 z_pct_65_74 z_pct_75
## 0.000 0.000 0.000
## z_male_pct z_female_pct
## 0.000 0.000
When we had filtered for COVID rates > 0 first and then z-score standardized second, the means of differences ranged from -0.012 to 0.004, and standard deviations of differences ranged from 0.000 to 0.016.
After changing the order to first z-score standardize and then filter for COVID rates > 0, we observed no mean difference between our reproduced variables and the original variables, and we find no standard deviation > 0.001 for the difference between reproduction independent variables and original variables. There are no major differences between the independent variables.
Optionally, you may save the preprocessed data to
data/raw/public/gee_data.gpkg
Optionally, you may load the preprocessed data from
data/raw/public/gee_data.gpkg
Generalized Estimating Equation parameters:
“The ‘exchangeable’ correlation matrix was selected for the results reported here, since this specification yielded the best statistical fit based on the QIC (quasi- likelihood under the independence) model criterion.” (Chakraborty 2021, Methods paragraph 5)
“The gamma distribution with logarithmic link function was chosen for all GEEs since this model specification provided the lowest QIC value.” (Chakraborty 2021, Methods paragraph 5)
Load digitized version of Table 2 from the original publication.
term | estimate | std.error | conf.low | conf.high | wald_chi_square | p_stars |
---|---|---|---|---|---|---|
Race model intercept | 7.106 | 0.083 | 6.997 | 7.322 | 7465.214 | 2 |
White | -0.203 | 0.020 | -0.242 | -0.164 | 102.958 | 2 |
Black | 0.111 | 0.016 | 0.079 | 0.143 | 46.214 | 2 |
Native American | 0.051 | 0.009 | 0.033 | 0.069 | 31.438 | 2 |
Asian | 0.080 | 0.018 | 0.046 | 0.115 | 21.060 | 2 |
Other race | 0.077 | 0.017 | 0.044 | 0.115 | 21.030 | 2 |
Ethnicity model intercept | 7.186 | 0.083 | 7.023 | 7.348 | 7525.648 | 2 |
Non-Hispanic White | -0.237 | 0.022 | -0.280 | -0.194 | 116.954 | 2 |
Hispanic | 0.119 | 0.031 | 0.058 | 0.180 | 14.708 | 2 |
Non-Hispanic non-White | 0.118 | 0.016 | 0.086 | 0.149 | 53.248 | 2 |
Poverty model intercept | 7.183 | 0.072 | 7.043 | 7.324 | 10066.930 | 2 |
Below poverty level | 0.148 | 0.022 | 0.105 | 0.190 | 46.913 | 2 |
Above poverty level | -0.267 | 0.023 | -0.312 | -0.222 | 134.297 | 2 |
Age model intercept | 7.242 | 0.076 | 7.093 | 7.391 | 9093.179 | 2 |
Age 5-17 | 0.047 | 0.016 | 0.016 | 0.078 | 8.732 | 2 |
Age 18-34 | 0.038 | 0.022 | -0.005 | 0.081 | 3.008 | 0 |
Age 35-64 | -0.026 | 0.023 | -0.071 | 0.019 | 1.300 | 0 |
Age 65-74 | -0.089 | 0.021 | -0.131 | -0.047 | 17.597 | 2 |
Age 75 or more | -0.108 | 0.020 | -0.148 | -0.069 | 29.479 | 2 |
Biological sex model intercept | 7.223 | 0.072 | 7.081 | 7.365 | 9963.672 | 2 |
Male | -0.298 | 0.028 | -0.353 | -0.243 | 113.460 | 2 |
Female | 0.153 | 0.029 | 0.097 | 0.209 | 28.577 | 2 |
Define a function for calculating and summarizing five GEE models
Calculate GEE models with: - Clustering: Reproduced SpatialEpi clusters & State ID - Dependent variable: reproduced COVID-19 incidence (fixed errors).
term | estimate | std.error | conf.low | conf.high | stars | p.value |
---|---|---|---|---|---|---|
Race model intercept | 6.755 | 0.020 | 6.716 | 6.795 | 2 | 0.000 |
White | -0.204 | 0.024 | -0.250 | -0.158 | 2 | 0.000 |
Black | 0.265 | 0.021 | 0.224 | 0.305 | 2 | 0.000 |
Native American | 0.020 | 0.023 | -0.024 | 0.065 | 0 | 0.368 |
Asian | 0.051 | 0.020 | 0.012 | 0.090 | 1 | 0.011 |
Other race | 0.086 | 0.018 | 0.050 | 0.122 | 2 | 0.000 |
Ethnicity model intercept | 6.742 | 0.020 | 6.703 | 6.781 | 2 | 0.000 |
Non-Hispanic White | -0.230 | 0.024 | -0.276 | -0.184 | 2 | 0.000 |
Hispanic | 0.115 | 0.016 | 0.083 | 0.146 | 2 | 0.000 |
Non-Hispanic non-White | 0.267 | 0.019 | 0.229 | 0.304 | 2 | 0.000 |
Poverty model intercept | 6.816 | 0.021 | 6.774 | 6.858 | 2 | 0.000 |
Below poverty level | 0.219 | 0.029 | 0.162 | 0.277 | 2 | 0.000 |
Above poverty level | -0.285 | 0.031 | -0.346 | -0.224 | 2 | 0.000 |
Age model intercept | 6.823 | 0.022 | 6.781 | 6.865 | 2 | 0.000 |
Age 5-17 | 0.063 | 0.023 | 0.018 | 0.108 | 2 | 0.006 |
Age 18-34 | 0.032 | 0.035 | -0.036 | 0.101 | 0 | 0.355 |
Age 35-64 | -0.005 | 0.041 | -0.086 | 0.076 | 0 | 0.903 |
Age 65-74 | -0.025 | 0.041 | -0.105 | 0.055 | 0 | 0.543 |
Age 75 or more | -0.160 | 0.027 | -0.214 | -0.107 | 2 | 0.000 |
Biological sex model intercept | 6.816 | 0.021 | 6.774 | 6.858 | 2 | 0.000 |
Male | -0.368 | 0.039 | -0.445 | -0.292 | 2 | 0.000 |
Female | 0.266 | 0.041 | 0.185 | 0.347 | 2 | 0.000 |
Present maps, takeaways from comparing data with original study results.
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "-Oranges" is named
## "oranges" (in long format "brewer.oranges")
The resulting figures likely indicate a “success” for this partial reproduction, due to the fact that most of the reproduced tables’ values were close to the original study. Further, the reproduction methods uncovered the fact that several counties had incorrect COVID incidence scores, resulting in inaccurate correlation coefficients, an important finding. Although this procedure code had many deviations from the original workflow, some planned and others unplanned, this reproduction was also successful overall in expanding the visualization of the original study’s methods by including maps of disability rates and clusters. This reproduction also compared the results of SpatialEpi Kulldorf clusters, which were similar, yet more accurate than the author’s original SaTScan clusters.
This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI:[10.17605/OSF.IO/W29MQ](DOI:%5B10.17605/OSF.IO/W29MQ){.uri}
The majority of this code is adapted from a reproduction of the same study by Joseph Holler, Junyi Zhou, Peter Kedron, Drew An-Pham, Derrick Burt.