Abstract

This is a study of gerrymandering in Alabama. We will test three shape-based compactness scores and assess the representativeness of districts based on prior presidential election results and race. We will then extend prior studies by calculating the representativeness of the convex hull of each district polygon.

Study Metadata

Study design

This is an original study based on literature on gerrymandering metrics.

This study is exploratory in design. Its goal is to evaluate the usefulness of a new gerrymandering metric based on the convex hull of a congressional district, comparing representativeness inside the convex hull with representativeness inside the district itself.

Materials and procedure

Computational environment

I plan on using package … for …

Data and variables

We plan on using data sources…

Layers from districts.gpkg

Precincts 2020

  • Title: Voting Precincts 2020
  • Abstract: Alabama voting data for 2020 elections by precinct.
  • Spatial Coverage: Alabama OSM:161950
  • Spatial Resolution: voting precincts
  • Spatial Reference System: EPSG: 4269, NAD 1983 geographic coordinate system
  • Temporal Coverage: voting precincts used for tabulating the 2020 election
  • Temporal Resolution: annual election (2020)
  • Lineage: Saved in geopackage format. Processing prior to download is explained in the validation report and readme.
  • Distribution: Data available at Redistricting Data Hub with free login.
  • Constraints: Permitted for noncommercial and nonpartisan use only. Copyright and use constraints explained here
  • Data Quality: State any planned quality assessment
  • Variables: For each variable, enter the following information. If you have two or more variables per data source, you may want to present this information in table form (shown below)
    • Label: variable name as used in the data or code
    • Alias: intuitive natural language name
    • Definition: Short description or definition of the variable. Include measurement units in description.
    • Type: data type, e.g. character string, integer, real
    • Accuracy: e.g. uncertainty of measurements
    • Domain: Expected range of Maximum and Minimum of numerical data, or codes or categories of nominal data, or reference to a standard codebook
    • Missing Data Value(s): Values used to represent missing data and frequency of missing data observations
    • Missing Data Frequency: Frequency of missing data observations: not yet known for data to be collected
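| Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
|---|---|---|---|---|---|---|---|
| VTDST20 | Voting district ID | | | | | | |
| GEOID20 | Unique geographic ID | | | | | | |
| G20PRETRU | total votes for Trump in 2020 | | | | | | |
| G20PREBID | total votes for Biden in 2020 | | | | | | |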

Districts 2023

  • Title: US Congressional Districts 2023
  • Abstract: Alabama congressional districts for the 2024 election.
  • Spatial Coverage: Alabama OSM:161950
  • Spatial Resolution: congressional districts
  • Spatial Reference System: EPSG: 3857, WGS 1984 Web Mercator projection
  • Temporal Coverage: districts approved in 2023 for use in 2024.
  • Temporal Resolution:
  • Lineage: Loaded into QGIS as an ArcGIS feature service layer and saved in geopackage format. Extraneous data fields were removed and the Fix Geometries tool was used to correct geometry errors.
  • Distribution: Alabama State GIS via ESRI feature service
  • Constraints: Public Domain data free for use and redistribution.
  • Data Quality: State any planned quality assessment
  • Variables: see the table below; the fields are as defined under Precincts 2020.
| Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
|---|---|---|---|---|---|---|---|
| DISTRICT | US Congressional District Number | | | | | | |
| POPULATION | total population (2020 census) | | | | | | |
| WHITE | total white population (2020 census) | | | | | | |
| BLACK | total Black or African American population (2020 census) | | | | | | |

Blockgroups 2020

  • Title: Block Groups 2020
  • Abstract: Vector polygon geopackage layer of Census block groups and demographic data.
  • Spatial Coverage: Alabama OSM:161950
  • Spatial Resolution: census block groups
  • Spatial Reference System: EPSG: 4269, NAD 1983 geographic coordinate system
  • Temporal Coverage: 2020 census
  • Temporal Resolution: 10 year census (2020)
  • Lineage: Data downloaded from US Census API “pl” public law summary file using tidycensus in R
  • Distribution: US Census API
  • Constraints: Public Domain data free for use and redistribution.
  • Data Quality: State any planned quality assessment
  • Variables: see the table below; the fields are as defined under Precincts 2020.
| Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
|---|---|---|---|---|---|---|---|
| GEOID | code to uniquely identify block groups | | | | | | |
| P4_001N | total population, 18 years or older | | | | | | |
| P4_006N | total: not Hispanic or Latino, Population of one race, Black or African American alone, 18 years or older | | | | | | |
| P5_003N | Total institutionalized population in correctional facilities for adults, 18 years or older | | | | | | |

Prior observations

I have previously conducted an analogous analysis of these data in QGIS, measuring compactness alongside race and party affiliation. However, that analysis used only an area-weighted re-aggregation approach and did not incorporate the convex hull.

At the time of this study pre-registration, the authors had _____ prior knowledge of the geography of the study region with regards to the ____ phenomena to be studied. This study is related to ____ prior studies by the authors

For each primary data source, declare the extent to which authors had already engaged with the data:

For each secondary source, declare the extent to which authors had already engaged with the data:

If pilot test data has been collected or acquired, describe how the researchers observed and analyzed the pilot test, and the extent to which the pilot test influenced the research design.

Bias and threats to validity

Given the research design and primary data to be collected and/or secondary data to be used, discuss common threats to validity and the approach to mitigating those threats, with an emphasis on geographic threats to validity.

These include:

  • uneven primary data collection due to geographic inaccessibility or other constraints
  • multiple hypothesis testing
  • edge or boundary effects
  • the modifiable areal unit problem
  • nonstationarity
  • spatial dependence or autocorrelation
  • temporal dependence or autocorrelation
  • spatial scale dependency
  • spatial anisotropies
  • confusion of spatial and a-spatial causation
  • ecological fallacy
  • uncertainty, e.g. from spatial disaggregation, anonymization, differential privacy

Data transformations

blockgroups2020 needs to be acquired using the tidycensus package in R

districts23 needs to be reprojected to EPSG:4269 for geodesic analysis

Area needs to be calculated for districts23 and blockgroups2020

Area-weighted re-aggregation needs to be conducted from blockgroups2020 to districts23
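
The arithmetic behind area-weighted re-aggregation can be sketched as follows. This is an illustration only, written in Python so that it is self-contained, while the study itself plans to use R. The fragment records, their areas, and the function name `area_weighted_totals` are hypothetical. The sketch assumes each block group has already been intersected with the districts, and that population is spread evenly within a block group.

```python
# Hypothetical fragments produced by intersecting block groups with districts.
# Each tuple: (block group id, district, fragment area, block group area, population)
fragments = [
    ("bg1", 1, 2.0, 2.0, 100),  # bg1 lies entirely inside district 1
    ("bg2", 1, 1.0, 4.0, 200),  # one quarter of bg2 lies in district 1
    ("bg2", 2, 3.0, 4.0, 200),  # three quarters of bg2 lie in district 2
]


def area_weighted_totals(fragments):
    """Sum population per district, weighting each block group by its area share."""
    totals = {}
    for _bg, district, frag_area, bg_area, population in fragments:
        weight = frag_area / bg_area  # share of the block group inside the district
        totals[district] = totals.get(district, 0.0) + population * weight
    return totals


district_population = area_weighted_totals(fragments)
```

The weights for any one block group sum to 1 across districts, so the statewide total is preserved. The even-spread assumption is the main source of error in this method.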

Compactness needs to be calculated for districts23
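
As an illustration of a shape-based compactness score, the sketch below computes the Polsby-Popper score, 4πA/P², for a simple polygon. The source does not name its three scores, so the choice of Polsby-Popper here is an assumption. The function names are hypothetical, and the code uses planar coordinates for simplicity, whereas the study plans geodesic measurements in R.

```python
import math


def ring_area(ring):
    """Planar area of a ring (vertex list, closing vertex not repeated), shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def ring_perimeter(ring):
    """Planar perimeter of the ring, including the closing edge."""
    return sum(math.dist(p, q) for p, q in zip(ring, ring[1:] + ring[:1]))


def polsby_popper(ring):
    """Polsby-Popper score: 4 * pi * area / perimeter^2, between 0 and 1."""
    return 4.0 * math.pi * ring_area(ring) / ring_perimeter(ring) ** 2


# Hypothetical test shapes: a unit square and a long, thin rectangle.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
sliver = [(0, 0), (10, 0), (10, 0.1), (0, 0.1)]

square_score = polsby_popper(square)
sliver_score = polsby_popper(sliver)
```

A circle scores 1 and the unit square scores π/4. The elongated rectangle scores much lower, which is the behaviour a compactness measure is meant to capture.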

Convex hull needs to be calculated for districts23
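
The convex hull step and a hull-based area ratio can be sketched as below. This is a minimal planar illustration in Python using Andrew's monotone chain algorithm, not the study's planned R workflow. The L-shaped polygon and all function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull by Andrew's monotone chain, returned counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def ring_area(ring):
    """Planar area of a ring (closing vertex not repeated), shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical L-shaped "district" with one concave corner.
district = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
hull = convex_hull(district)
hull_ratio = ring_area(district) / ring_area(hull)
```

A ratio of 1 means the district is already convex. Lower values mean the hull takes in more territory outside the district, and it is the population of that territory that a hull-based representativeness comparison would examine.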

Race, compactness and voting data need to be joined together to produce a final table.
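
The final join can be pictured as a key-based merge on the district number. The sketch below is a hypothetical Python illustration, while the study itself plans to work in R. The field names DISTRICT, POPULATION, BLACK, G20PRETRU and G20PREBID come from the data tables above. All values, the derived column names `pct_black` and `pct_biden`, and the function `build_table` are invented for the example.

```python
# Hypothetical per-district inputs keyed by district number.
race = {
    1: {"POPULATION": 1000, "BLACK": 250},
    2: {"POPULATION": 800, "BLACK": 400},
}
votes = {
    1: {"G20PRETRU": 300, "G20PREBID": 200},
    2: {"G20PRETRU": 100, "G20PREBID": 300},
}
compactness = {1: 0.25, 2: 0.40}


def build_table(race, votes, compactness):
    """Join the three inputs on district number and derive share variables."""
    rows = []
    # Keep only districts present in all three inputs (an inner join).
    for district in sorted(race.keys() & votes.keys() & compactness.keys()):
        r, v = race[district], votes[district]
        two_party_votes = v["G20PRETRU"] + v["G20PREBID"]
        rows.append({
            "DISTRICT": district,
            "pct_black": r["BLACK"] / r["POPULATION"],
            "pct_biden": v["G20PREBID"] / two_party_votes,
            "compactness": compactness[district],
        })
    return rows


final_table = build_table(race, votes, compactness)
```

An inner join silently drops districts missing from any input, so it is worth comparing the row count of the result with the number of districts.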

Describe all data transformations planned to prepare data sources for analysis. This section should explain with the fullest detail possible how to transform data from the raw state at the time of acquisition or observation, to the pre-processed derived state ready for the main analysis. Including steps to check and mitigate sources of bias and threats to validity. The method may anticipate contingencies, e.g. tests for normality and alternative decisions to make based on the results of the test. More specifically, all the geographic and variable transformations required to prepare input data as described in the data and variables section above to match the study’s spatio-temporal characteristics as described in the study metadata and study design sections. Visual workflow diagrams may help communicate the methodology in this section.

Examples of geographic transformations include coordinate system transformations, aggregation, disaggregation, spatial interpolation, distance calculations, zonal statistics, etc.

Examples of variable transformations include standardization, normalization, constructed variables, imputation, classification, etc.

Be sure to include any steps planned to exclude observations with missing or outlier data, to group observations by attribute or geographic criteria, or to impute missing data or apply spatial or temporal interpolation.

Analysis

Describe the methods of analysis that will directly test the hypotheses or provide results to answer the research questions. This section should explicitly define any spatial / statistical models and their parameters, including grouping criteria, weighting criteria, and significance thresholds. Also explain any follow-up analyses or validations.

Results

Describe how results are to be presented.

Discussion

Describe how the results are to be interpreted vis a vis each hypothesis or research question.

Integrity Statement

The authors of this preregistration state that they completed this preregistration to the best of their knowledge and that no other preregistration exists pertaining to the same hypotheses and research.

Acknowledgements

This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI: 10.17605/OSF.IO/W29MQ

References

Müller, Kirill. 2020. Here: A Simpler Way to Find Your Files. https://here.r-lib.org/.
Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.
———. 2024. sf: Simple Features for R. https://r-spatial.github.io/sf/.
Pebesma, Edzer, and Roger Bivand. 2023. Spatial Data Science: With applications in R. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Tennekes, Martijn. 2018. “tmap: Thematic Maps in R.” Journal of Statistical Software 84 (6): 1–39. https://doi.org/10.18637/jss.v084.i06.
———. 2025. Tmap: Thematic Maps. https://github.com/r-tmap/tmap.
Walker, Kyle, and Matt Herman. 2025. Tidycensus: Load US Census Boundary and Attribute Data as Tidyverse and Sf-Ready Data Frames. https://walker-data.com/tidycensus/.
Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.