Abstract

This study is a reproduction of: Tuholske, C., Lynch, V.D., Spriggs, R. et al. Hazardous heat exposure among incarcerated people in the United States. Nat Sustain 7, 394–398 (2024). https://doi.org/10.1038/s41893-024-01293-y Git Repository

By reproducing the exploratory data analysis done in Tuholske et al. (2024), we seek to:

1. Identify the temporal resolution at which the authors use population data to calculate population-weighted hazardous heat days.

2. Understand the population weighting mechanism in state-level hazardous heat calculations.

3. Evaluate the effectiveness of the authors' methods compared to similar research.

Study metadata

Original study spatio-temporal metadata

  • Spatial Coverage: United States Lower 48
  • Spatial Resolution: Carceral facility points and States
  • Spatial Reference System: spatial reference system of original study
  • Temporal Coverage: 1982-2020
  • Temporal Resolution: 1 year

Study design

This is a reproduction study; the original study is exploratory in design. The primary objective of the original study was to examine exposure to hazardous heat in carceral facilities in the continental U.S. The authors examined how exposure to hazardous heat changed from 1982 to 2020, and how exposure within carceral facilities compared to exposure in the rest of each state. A further objective of the original paper was to determine the spatial distribution of carceral facilities with higher levels of hazardous heat exposure.

Materials and procedure

Computational environment

Original study computational environment

The original study data transformations and analysis were completed primarily in R using Rmd documents, as well as in Python. The versions of R and Python used are not disclosed, but would have been R 4.3.3 or earlier, and Python 3.12 or earlier.

In the original study, R packages are loaded across different scripts. The packages most relevant to this study appear to be:

original_study_packages <- c(
  "dplyr", 
  "data.table", 
  "maptools",
  "mapproj",
  "rgeos",
  "rgdal",
  "RColorBrewer",
  "ggplot2",
  "raster", # planned deviation: we will be using `stars` in our reproduction
  "sp", # planned deviation: we will be using `sf` in our reproduction
  "plyr",
  "graticule",
  "zoo",
  "purrr",
  "cowplot",
  "janitor"
  )

Prepare reproduction computational environment

For the reproduction study, we will be using R version 4.4.2, and the groundhog package to maintain package consistency. All packages used will be up to date as of 2025-02-01.

We plan to use the packages tidyverse, here, markdown, htmltools, dplyr, sf, and stars. If we need other packages as we implement the code, we will note them as unplanned deviations.
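As a minimal sketch of how the planned packages might be pinned with groundhog, the snippet below wraps the call in a helper. The helper name `load_reproduction_packages` and the two object names are our own illustrative choices; the package list and date are the ones stated above, and the sketch assumes groundhog is already installed.

```r
# Planned packages and snapshot date, as stated in the plan above.
reproduction_packages <- c(
  "tidyverse", "here", "markdown", "htmltools", "dplyr", "sf", "stars"
)
groundhog_date <- "2025-02-01"

# Hypothetical helper: loads each package at the version that was current
# on CRAN on `date`. Assumes groundhog itself is already installed.
load_reproduction_packages <- function(pkgs = reproduction_packages,
                                       date = groundhog_date) {
  if (!requireNamespace("groundhog", quietly = TRUE)) {
    stop("Install groundhog first: install.packages('groundhog')")
  }
  groundhog::groundhog.library(pkgs, date)
}
```

Calling `load_reproduction_packages()` once at the top of the analysis document would then replace individual `library()` calls.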

Data and variables

We will use data from the original study’s git repository (linked in the top-level README). This includes:

- Population data for the study period

- Prison boundary polygons with facility information

- State polygons

- WBGT data, at prison point and state levels

Population Data

  • Title: population (pre_1990 & vintage_2020)
  • Abstract: Population counts by sex and age group (10-year increments), from 0-5 years old up to 85 years old
  • Spatial Coverage: Continental U.S.
  • Spatial Resolution: County by FIPS Code
  • Spatial Representation Type: N/A
  • Spatial Reference System: N/A
  • Temporal Coverage: Each year, 1982-2020
  • Temporal Resolution: Month
  • Lineage: Acquired from census, pre-1990 and post-1990 data standardized
  • Distribution: Data available in original study’s git repository
  • Constraints: Public domain
  • Data Quality: Unclear lineage documentation
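| Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
|---|---|---|---|---|---|---|---|
| year | | observance year | integer | | | | |
| fips | | county FIPS code | integer | | | | |
| sex | | 1 is male, 2 is female | integer | | | | |
| age | | 10-year age group | integer | | | | |
| month | | month of year | integer | | | | |
| pop | | group’s population in county | integer | | | | |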

Prison Boundaries

  • Title: Prison_Boundaries.shp
  • Abstract: Shapefile containing prison boundary polygons including geographic, type, operation, population, capacity, and other data
  • Spatial Coverage: United States of America (including Alaska, Hawaii, DC, and territories)
  • Spatial Resolution: parcel/building sized polygon (effectively points)
  • Spatial Representation Type: vector MULTIPOLYGON
  • Spatial Reference System: CRS 3857 Spherical/Web Mercator
  • Temporal Coverage: unclear - appears to represent data as of 6/6/2020
  • Temporal Resolution: n/a
  • Lineage: Refer to metadata_Prison_Boundaries_WebDownload.pdf
  • Distribution: Data available in original study’s git repository
  • Constraints: Public Domain
  • Data Quality: Unclear lineage documentation; much facility information is missing
  • Variables:
| Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
|---|---|---|---|---|---|---|---|
| status | | describes facility status as open, closed or not available | | | | | |
| population | | population of facility, -999 represents missing data | | | | | |
| capacity | | total capacity of facility, -999 represents missing data | | | | | |

State Boundaries

  • Title: states.shp
  • Abstract: state boundary polygons with region
  • Spatial Coverage: The 50 US states and Washington DC
  • Spatial Resolution: US state
  • Spatial Representation Type: vector MULTIPOLYGON
  • Spatial Reference System: EPSG 4269
  • Temporal Coverage: n/a
  • Temporal Resolution: n/a
  • Lineage: unknown
  • Distribution: Data available in original study’s git repository
  • Constraints: Public domain
  • Data Quality: n/a
  • Variables:
| Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
|---|---|---|---|---|---|---|---|
| STATE_NAME | Name | Name of state | character string | n/a | US state names | n/a | n/a |
| DRAWSEQ | Draw Sequence | unknown | integer | n/a | 1-51 | n/a | n/a |
| STATE_FIPS | FIPS Code | State FIPS code (two digit) | integer | n/a | 01-56 | n/a | n/a |
| SUB_REGION | Sub-Region | Sub-Region of the US | character string | n/a | n/a | n/a | n/a |
| STATE_ABBR | Abbreviation | Two letter abbreviation | character string | n/a | two-letter postal abbreviations | | |

WBGT Data

WBGT Data - Prison Level

  • Title: wbgt_raw/prison/weighted_area_raster_prison_wbgtmax_daily_(year).rds
  • Abstract: Daily WBGTmax, weighted by area, from 1982-2020
  • Spatial Coverage: United States Lower 48
  • Spatial Resolution: Prison by prison ID
  • Spatial Representation Type: N/A
  • Spatial Reference System: N/A
  • Temporal Coverage: Each year, 1982-2020
  • Temporal Resolution: Day of year
  • Lineage:
    • Heat data acquired from Parameter-elevation Regressions on Independent Slopes Model (PRISM) dataset
    • Prison data acquired from Homeland Infrastructure Foundation-Level Data (HIFLD), produced by the Department of Homeland Security
    • WBGTmax estimated by authors using high-resolution (4 km) daily maximum 2 m air temperature data (Tmax), and maximum vapour pressure deficit data (VPDmax)
  • Distribution: Data available in original study’s git repository
  • Constraints: Open access Creative Commons Attribution 4.0 International License
  • Data Quality: Data lacks sufficient documentation in the original study repository/resources. The original link to the HIFLD data (from the citation) no longer works; however, HIFLD data can now be found here.
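| Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
|---|---|---|---|---|---|---|---|
| prison_id | | unique prison id | integer | | 6640 prisons | | |
| wbgtmax | | WBGTmax estimated for specified day | integer | | | missing data not included | |
| date | | day of year (dd/mm/yyyy) | character string | | | | |
| day | | day | integer | | | | |
| month | | month | integer | | | | |
| year | | year | integer | | | | |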

WBGT Data - State Level

  • Title: wbgt_raw/state/weighted_area_raster_fips_wbgtmax_daily_(year).rds
  • Abstract: Daily WBGTmax, weighted by area, from 1982-2020
  • Spatial Coverage: United States Lower 48
  • Spatial Resolution: County by FIPS Code
  • Spatial Representation Type: N/A
  • Spatial Reference System: N/A
  • Temporal Coverage: Each year, 1982-2020
  • Temporal Resolution: Day of year
  • Lineage:
    • Acquired from Parameter-elevation Regressions on Independent Slopes Model (PRISM) dataset
    • Unknown exactly how WBGTmax 4 km raster was summarized by county polygons using the daily maximum 2 m air temperature data (Tmax) and maximum vapour pressure deficit data (VPDmax)
  • Distribution: Data available in original study’s git repository
  • Constraints: Open access Creative Commons Attribution 4.0 International License
  • Data Quality: Data lacks sufficient documentation in the original study repository/resources.
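| Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
|---|---|---|---|---|---|---|---|
| fips | | county FIPS code | integer | | | | |
| wbgtmax | | WBGTmax estimated for specified day | integer | | | missing data not included | |
| date | | day of year (dd/mm/yyyy) | character string | | | | |
| day | | day | integer | | | | |
| month | | month | integer | | | | |
| year | | year | integer | | | | |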
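Because the `date` column in both WBGT tables is documented as a day-first character string (dd/mm/yyyy), it needs an explicit format when parsed. The sketch below uses invented values and a hypothetical object name, `wbgt_example`, purely for illustration.

```r
# Illustrative values only; the column name `date` follows the tables above.
wbgt_example <- data.frame(
  date = c("01/07/1982", "15/08/1982"),
  stringsAsFactors = FALSE
)

# Parse day-first strings with an explicit format, since the default
# as.Date() formats are year-first and may misread these values.
wbgt_example$date_parsed <- as.Date(wbgt_example$date, format = "%d/%m/%Y")

# Derive year and month from the parsed date as a consistency check
# against the year and month columns shipped with the data.
wbgt_example$year <- as.integer(format(wbgt_example$date_parsed, "%Y"))
wbgt_example$month <- as.integer(format(wbgt_example$date_parsed, "%m"))
```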

Prior observations

At the time of this pre-analysis plan, we have the derived data to work from and have examined some of the CSV tables. We have neither visualized nor analyzed the prison data or the WBGTmax temperature data.

Bias and threats to validity

There are no statistical tests in this study, so issues such as spatial heterogeneity, anisotropy, and autocorrelation are not a direct concern. Scale could be a threat to validity because county populations are aggregated to calculate the number of population-weighted heat days in each state. There is also a scale mismatch between micro-climate conditions at prison boundaries and the 4 km temperature data. Further, there is no specification of how heat days are calculated within each county, given that counties do not map neatly onto the 4 km by 4 km grid used to calculate hazardous heat days. The way in which county boundaries are drawn also supports the argument that there is a Modifiable Areal Unit Problem.

Both the scale and boundary issues also have a temporal component that may create threats to validity.

Data transformations

Planned deviation:

We will not attempt to produce the original study’s WBGTmax grid because the methods are unclear, and therefore we will skip to joining the author-provided WBGTmax by day grid data to the prison points.

(When implementing plan) Explain what we believe the authors did to produce the WBGTmax grid and preliminary steps

Transform data to create Figures 1a, 1b

(More descriptive segment of original study’s workflow)

Step 1: Join author-provided WBGTmax by day grid data to author-provided carceral facility point data

Use st_join() or st_extract()

Result: Rds table of WBGTmax by day by prison

Step 2: Filter result by days when WBGTmax exceeded 28 degrees C

Step 3: Group by carceral facility type and year

Count to produce summary of days exceeded per year by facility

Result: Rds table with variables
- prison facility

- facility type

- prison population

- n days exceeding 28 degrees

- year
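Steps 2 and 3 can be sketched in base R as follows. The table `wbgt_prison` is a toy stand-in for the Step 1 result, with invented values; the column names `prison_id`, `wbgtmax`, and `year` follow the data tables above, while `threshold_c`, `hazardous`, and `n_days_exceeded` are names we introduce here. As one possible design choice, the sketch flags days rather than dropping rows, so that facility-years with no hazardous days keep a count of zero.

```r
# Toy stand-in for the Step 1 result (WBGTmax by day by prison).
# All values are invented for illustration.
wbgt_prison <- data.frame(
  prison_id = c(1, 1, 1, 2, 2, 2),
  year      = c(1982, 1982, 1983, 1982, 1982, 1983),
  wbgtmax   = c(27.5, 28.4, 30.1, 29.0, 31.2, 26.0)
)

threshold_c <- 28  # hazardous heat threshold in degrees C (Step 2)

# Flag each day instead of filtering, so facility-years with zero
# hazardous days are retained with a count of 0.
wbgt_prison$hazardous <- wbgt_prison$wbgtmax > threshold_c

# Step 3: count hazardous days per facility per year.
days_exceeded <- aggregate(
  hazardous ~ prison_id + year,
  data = wbgt_prison,
  FUN = sum
)
names(days_exceeded)[names(days_exceeded) == "hazardous"] <- "n_days_exceeded"
```

Facility type and population would then be attached by joining this table to the prison boundary attributes on the facility identifier.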

Analysis

Analyze data to create Figures 2a, 2b and 2c

Figure 2b, 2c - results are based on linear regression models
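The plan states only that Figures 2b and 2c rest on linear regression models; the exact specification is not given here. As a hedged sketch, one plausible form regresses annual days exceeded on year for a single unit. The data are invented, and the object names (`trend_example`, `fit`, `slope_days_per_year`, `total_change`) are our own.

```r
# Invented annual counts for one hypothetical facility or state.
trend_example <- data.frame(
  year = 1982:1991,
  n_days_exceeded = c(40, 42, 41, 45, 44, 47, 46, 50, 49, 52)
)

# Assumed specification: days exceeded as a linear function of year.
fit <- lm(n_days_exceeded ~ year, data = trend_example)

# The slope is the estimated change in hazardous days per year.
slope_days_per_year <- unname(coef(fit)["year"])

# Estimated total change over the period covered by the toy data.
total_change <- slope_days_per_year *
  (max(trend_example$year) - min(trend_example$year))
```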

Step 4: Begin with result from Step 3

Step 5: Spatial join county population by year data

Result: Rds table with variables
- prison facility

- facility type

- prison population

- county

- population

- n days exceeding 28 degrees

- year

Step 6: Population-weighted aggregation

Aggregate data into states

Weighted sum of days exceeded across all counties of the state

For each county, multiply days exceeded by the county's share of the state population (county population / state population), then sum these products across the state's counties
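A minimal sketch of the Step 6 weighting for a single year, using invented county values. The columns `fips` and `pop` follow the population table; `state`, `weight`, `weighted_days`, and the table names are illustrative assumptions. Which population year the authors pair with each heat year is one of the questions this reproduction aims to answer, so the sketch does not presume it.

```r
# Invented county-level inputs for one year.
county_days <- data.frame(
  state = c("A", "A", "B", "B"),
  fips = c(1001, 1003, 2001, 2003),
  pop = c(100, 300, 50, 50),
  n_days_exceeded = c(10, 30, 20, 40)
)

# County share of its state's population (weights sum to 1 per state).
state_pop <- ave(county_days$pop, county_days$state, FUN = sum)
county_days$weight <- county_days$pop / state_pop

# Weighted sum of days exceeded across the counties of each state.
county_days$weighted_days <- county_days$weight * county_days$n_days_exceeded
state_days <- aggregate(weighted_days ~ state, data = county_days, FUN = sum)
```

Because the weights sum to one within each state, the result stays in units of days and can be compared directly with facility-level counts.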

Planned deviations for reproduction:

Deviation 1: Investigating temperature threshold

Repeat workflow for reproducing figures 1 and 2, instead filtering for days when WBGTmax exceeded 29.4 degrees C (85 degrees F standard informed by other literature)
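To make the relationship between the two thresholds explicit, the sketch below converts 85 degrees F to Celsius and counts exceedance days under each threshold. The helper `fahrenheit_to_celsius` and the sample values are illustrative assumptions, not part of the original workflow.

```r
# Standard Fahrenheit-to-Celsius conversion.
fahrenheit_to_celsius <- function(temp_f) (temp_f - 32) * 5 / 9

# 85 degrees F is about 29.44 degrees C; rounding gives the 29.4 used above.
alt_threshold_c <- round(fahrenheit_to_celsius(85), 1)

# Keeping the threshold as a parameter lets the same workflow run twice.
thresholds_c <- c(original = 28, alternative = alt_threshold_c)

# Invented daily WBGTmax values for one facility.
wbgtmax_example <- c(27.9, 28.3, 29.0, 29.5, 31.0)

# Count days exceeding each threshold.
days_by_threshold <- sapply(
  thresholds_c,
  function(th) sum(wbgtmax_example > th)
)
```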

Deviation 2: Investigating sources of uncertainty/error/bias

How many open facilities had a population of -999? (Potential source of uncertainty)

Step 7: Select facilities with population of -999 from author-provided carceral facilities data.

Report number of facilities compared to authors’ number.
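Step 7 might look like the following sketch. The records are invented, `facility_id` is a hypothetical identifier, and the spelling of the `status` values is an assumption to be checked against the shapefile; `status` and `population` follow the Prison Boundaries variables described above.

```r
# Invented records standing in for the Prison Boundaries attribute table.
facilities <- data.frame(
  facility_id = 1:5,
  status = c("OPEN", "OPEN", "CLOSED", "OPEN", "NOT AVAILABLE"),
  population = c(250, -999, -999, 1200, -999)
)

# Recode the -999 sentinel to NA so it cannot enter sums or means.
facilities$population[facilities$population == -999] <- NA

# Open facilities with missing population.
open_missing <- subset(facilities, status == "OPEN" & is.na(population))
n_open_missing <- nrow(open_missing)

# Share of open facilities affected.
share_open_missing <- n_open_missing / sum(facilities$status == "OPEN")
```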

Results: Present figures (reproduced from original study and planned deviations).

Create Fig. 1a and b using results from step 3

Create Fig. 2a, b and c using results from step 6

Map result from step 7 to examine data quality issue

Discussion

What are the implications of us being able to recreate or not recreate the figures? Why does it matter for the original study to be reproducible? Mention significance of groundhog usage for sustainable reproduction. Discuss research suggesting maximum daily temperature doesn’t matter as much for heat stress, and how long stretches of night time lows may be more serious.

Integrity Statement

This is the first version of our pre-analysis plan. Any deviations in our workflow will be documented as unplanned deviations.

Acknowledgements

This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI: 10.17605/OSF.IO/W29MQ

References

Kedron, P., & Holler, J. (2023). Template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences. https://doi.org/10.17605/OSF.IO/W29MQ

Cheng, Joe, Carson Sievert, Barret Schloerke, Winston Chang, Yihui Xie, and Jeff Allen. 2024. Htmltools: Tools for HTML. https://github.com/rstudio/htmltools.
Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://here.r-lib.org/.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
———. 2024a. Sf: Simple Features for r. https://r-spatial.github.io/sf/.
———. 2024b. Stars: Spatiotemporal Arrays, Raster and Vector Data Cubes. https://r-spatial.github.io/stars/.
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With Applications in R. London: Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org.
Xie, Yihui, JJ Allaire, and Jeffrey Horner. 2024. Markdown: Render Markdown with Commonmark. https://github.com/rstudio/markdown.