Working with Messy Data

Christopher Teixeira
November 15, 2023

A little about me…




Christopher Teixeira

Principal Data Scientist

The MITRE Corporation



Interests

  • Data Analytics
  • Applied Statistics
  • Operations Research

Education

  • MS in Operations Research, George Mason University
  • BSc in Mathematics, Worcester Polytechnic Institute

Reading in data

Code
# URL for the CMS data used in these slides
url <- "https://data.cms.gov/data-api/v1/dataset/8889d81e-2ee7-448f-8713-f071038289b5/data?size=5000"

# Download and convert the JSON file to a data frame
df <- jsonlite::fromJSON(url)

# View the data frame in a friendly format for Quarto
knitr::kable(head(df), format="html")
(Table: the first six rows of the raw CMS provider data. Each row is one rendering provider and spans roughly 70 columns: the provider's NPI, name, credentials, address, specialty, and Medicare participation indicator; total HCPCS codes, beneficiaries, services, and submitted/allowed/paid/standardized amounts; drug and medical subtotals with suppression indicators; and beneficiary demographics, chronic-condition percentages, and average risk score. Suppressed cells appear as "*", "#", or blanks.)
Code
import pandas as pd

# URL for the CMS data used in these slides
url = "https://data.cms.gov/data-api/v1/dataset/8889d81e-2ee7-448f-8713-f071038289b5/data?size=5000"

# Download and convert the JSON file to a data frame
df = pd.read_json(url)

# View the data frame in a friendly format for Quarto
df.to_html()

Changing data types

Code
# Subset the data down to a select few variables to work with.
# Then convert "Bene_" and "Tot_" variables to numeric.
# The ID and zip code variables get converted to factors.
library(dplyr)

df.subset <- df |> 
    select(Rndrng_NPI,
           Rndrng_Prvdr_Zip5,
           Tot_Benes,
           Tot_Srvcs,
           starts_with("Bene_")) |>
    mutate(across(starts_with(c("Bene_","Tot_")), as.numeric),
           across(starts_with("Rndrng_"), factor))

# View the data frame in a friendly format for Quarto
knitr::kable(head(df.subset), format="html")
(Table: the first six rows of df.subset, now limited to Rndrng_NPI, Rndrng_Prvdr_Zip5, Tot_Benes, Tot_Srvcs, and the Bene_* demographic, chronic-condition, and risk-score columns; values that were suppressed or blank in the raw data now show as NA.)
Code
# Subset down to the columns of interest (selected by position to match the R subset).
cols = [0,12,19,20,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72]
df_subset = df.iloc[:, cols].copy()

# Convert character columns to numeric, coercing unparseable values to NaN.
numeric_cols = df_subset.columns.drop(["Rndrng_NPI", "Rndrng_Prvdr_Zip5"])
df_subset[numeric_cols] = df_subset[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Convert ID and zip code columns to categorical.
categorical_cols = ["Rndrng_NPI", "Rndrng_Prvdr_Zip5"]
df_subset[categorical_cols] = df_subset[categorical_cols].apply(pd.Categorical)

Exploratory data analysis

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. 1



Four primary types of EDA:

  1. Univariate non-graphical: Describe the data and find patterns that exist within a single variable.
  2. Univariate graphical: For a single variable, explore the values visually using graphs like box plots or histograms (see the sketch below).
  3. Multivariate non-graphical: Describe the relationships between two or more variables in the data.
  4. Multivariate graphical: Visualize the relationships between two or more variables through graphs like scatter plots or heat maps.
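
As a quick illustration of the univariate graphical type, here is a minimal sketch, assuming the df.subset data frame built in the previous section:

Code
library(ggplot2)

# Histogram of the average beneficiary age for each provider.
ggplot(df.subset, aes(x = Bene_Avg_Age)) +
    geom_histogram(bins = 30) +
    labs(x = "Average Beneficiary Age", y = "Number of Providers")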

Applying EDA: Univariate

Code
# Use the skimr package to examine the dataset. 
# Produces a high level summary (# of rows/columns, column types)
# For each column type, it produces details about each variable.

library(skimr)
skim(df.subset)
Data summary
Name df.subset
Number of rows 5000
Number of columns 36
_______________________
Column type frequency:
factor 2
numeric 34
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Rndrng_NPI 0 1 FALSE 5000 100: 1, 100: 1, 100: 1, 100: 1
Rndrng_Prvdr_Zip5 0 1 FALSE 2809 770: 30, 559: 25, 021: 23, 100: 22

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Tot_Benes 0 1.00 283.78 511.06 11.00 60.00 148.00 305.00 9843.00 ▇▁▁▁▁
Tot_Srvcs 0 1.00 1718.13 10740.56 11.00 157.00 430.50 1151.25 462645.00 ▇▁▁▁▁
Bene_Avg_Age 0 1.00 71.43 5.54 11.00 70.00 72.00 74.00 89.00 ▁▁▁▇▅
Bene_Age_LT_65_Cnt 2011 0.60 53.89 74.11 0.00 21.00 35.00 58.00 1582.00 ▇▁▁▁▁
Bene_Age_65_74_Cnt 792 0.84 142.20 237.40 0.00 40.00 79.00 145.00 4089.00 ▇▁▁▁▁
Bene_Age_75_84_Cnt 1442 0.71 118.40 192.40 0.00 37.00 66.00 121.75 3734.00 ▇▁▁▁▁
Bene_Age_GT_84_Cnt 2170 0.57 65.21 105.65 0.00 19.00 34.00 69.00 1708.00 ▇▁▁▁▁
Bene_Feml_Cnt 613 0.88 180.66 307.23 0.00 49.00 101.00 192.00 5020.00 ▇▁▁▁▁
Bene_Male_Cnt 613 0.88 138.98 237.98 0.00 35.00 75.00 147.50 4823.00 ▇▁▁▁▁
Bene_Race_Wht_Cnt 1592 0.68 305.49 496.67 0.00 86.00 175.00 329.25 7942.00 ▇▁▁▁▁
Bene_Race_Black_Cnt 3332 0.33 62.96 124.81 0.00 16.00 31.00 61.00 1781.00 ▇▁▁▁▁
Bene_Race_API_Cnt 4094 0.18 23.43 52.69 0.00 0.00 11.00 26.00 741.00 ▇▁▁▁▁
Bene_Race_Hspnc_Cnt 3665 0.27 48.70 88.57 0.00 13.00 24.00 47.00 928.00 ▇▁▁▁▁
Bene_Race_NatInd_Cnt 3348 0.33 1.98 13.68 0.00 0.00 0.00 0.00 275.00 ▇▁▁▁▁
Bene_Race_Othr_Cnt 4139 0.17 21.88 26.99 0.00 0.00 16.00 28.00 239.00 ▇▁▁▁▁
Bene_Dual_Cnt 1321 0.74 80.12 137.01 0.00 22.00 42.00 84.00 2869.00 ▇▁▁▁▁
Bene_Ndual_Cnt 1321 0.74 282.12 482.44 0.00 74.00 156.00 296.00 8396.00 ▇▁▁▁▁
Bene_CC_AF_Pct 1577 0.68 0.17 0.10 0.00 0.10 0.15 0.23 0.75 ▇▆▁▁▁
Bene_CC_Alzhmr_Pct 1623 0.68 0.21 0.16 0.00 0.09 0.16 0.29 0.75 ▇▅▂▁▁
Bene_CC_Asthma_Pct 2008 0.60 0.10 0.05 0.00 0.07 0.09 0.12 0.46 ▇▇▁▁▁
Bene_CC_Cncr_Pct 1623 0.68 0.15 0.11 0.00 0.10 0.13 0.17 0.75 ▇▃▁▁▁
Bene_CC_CHF_Pct 1208 0.76 0.28 0.17 0.00 0.15 0.25 0.41 0.75 ▇▇▅▃▁
Bene_CC_CKD_Pct 745 0.85 0.44 0.18 0.00 0.29 0.42 0.58 0.75 ▁▇▇▇▆
Bene_CC_COPD_Pct 1452 0.71 0.20 0.11 0.00 0.12 0.19 0.27 0.75 ▇▇▂▁▁
Bene_CC_Dprssn_Pct 751 0.85 0.33 0.15 0.00 0.23 0.31 0.40 0.75 ▁▇▆▂▁
Bene_CC_Dbts_Pct 753 0.85 0.37 0.13 0.00 0.27 0.35 0.45 0.75 ▁▆▇▅▁
Bene_CC_Hyplpdma_Pct 398 0.92 0.62 0.12 0.00 0.55 0.65 0.73 0.75 ▁▁▁▃▇
Bene_CC_Hyprtnsn_Pct 299 0.94 0.68 0.11 0.00 0.63 0.74 0.75 0.75 ▁▁▁▂▇
Bene_CC_IHD_Pct 774 0.85 0.42 0.16 0.00 0.29 0.40 0.53 0.75 ▁▆▇▆▃
Bene_CC_Opo_Pct 1810 0.64 0.11 0.06 0.00 0.08 0.11 0.14 0.75 ▇▂▁▁▁
Bene_CC_RAOA_Pct 528 0.89 0.49 0.13 0.00 0.40 0.48 0.56 0.75 ▁▁▆▇▃
Bene_CC_Sz_Pct 2872 0.43 0.07 0.13 0.00 0.00 0.04 0.08 0.75 ▇▁▁▁▁
Bene_CC_Strok_Pct 2339 0.53 0.11 0.10 0.00 0.04 0.09 0.15 0.75 ▇▂▁▁▁
Bene_Avg_Risk_Scre 0 1.00 1.66 0.82 0.48 1.07 1.46 2.07 10.72 ▇▁▁▁▁

Applying EDA: Multivariate

Code
library(ggplot2)
library(ggExtra)
library(ggthemes)
library(scales)

# Create a scatter plot with the number of beneficiaries by gender.
g <- ggplot(df.subset, 
            aes(x=Bene_Feml_Cnt, 
                y=Bene_Male_Cnt)) + 
    geom_point() +
    labs(x="Total Female Beneficiaries",
         y="Total Male Beneficiaries") +
    scale_x_continuous(labels=label_number(scale_cut=cut_short_scale())) +
    scale_y_continuous(labels=label_number(scale_cut=cut_short_scale())) +
    theme_hc() +
    theme(legend.position="none")

ggMarginal(g, type="histogram")

Working with missing data

Questions to ask when working with missing data

  • Does “missing” mean something different from “0”?
    • If you have data on the amount of candy sold per day, does a missing value mean no candy was sold, or that the amount of candy sold is unknown?
  • Is “missing” captured in another way?
    • Sometimes negative values or “99” can imply a value is missing (see the recoding sketch after this list).
  • Was there a change in how data was being captured?
    • For long-standing data capture initiatives (e.g., surveys), the data collection methods can change without notice to the analysts.
      • Was the way data was being captured changed?
      • Did the range of values change?
      • Do the values represent something different?
  • Does it make sense to replace missing values?
    • If a variable is mostly missing, replacing it with any method could lead to false conclusions.
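
For instance, here is a minimal sketch of recoding sentinel values to NA; the candy data frame and the sentinel codes (-1 and 99) are hypothetical, echoing the candy example above:

Code
library(dplyr)

# Hypothetical example: -1 and 99 were used as "unknown" codes in a sales column.
candy <- data.frame(day = 1:5, pieces_sold = c(12, -1, 0, 99, 7))

# Recode the sentinel values to NA so they are treated as missing downstream.
candy_clean <- candy |>
    mutate(pieces_sold = if_else(pieces_sold < 0 | pieces_sold == 99,
                                 NA_real_, pieces_sold))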

Imputing missing data

There are two general approaches:

  • Overly simple approach: replace missing values with the mean, median, or mode (sketched below)
  • Sophisticated approach: replace missing values by modeling each variable with missing data using the rest of the dataset
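
A minimal sketch of the simple approach, assuming the df.subset data frame from earlier (the sophisticated approach is shown next):

Code
library(dplyr)

# Replace each missing numeric value with that column's mean.
df.mean.imputed <- df.subset |>
    mutate(across(where(is.numeric),
                  \(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)))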

Multivariate Imputation by Chained Equations (MICE)

Code
library(mice)

# Remove the high-cardinality ID and zip code variables before imputing
df.to.impute <- df.subset |> select(-Rndrng_NPI,-Rndrng_Prvdr_Zip5)

# Impute missing data using predictive mean matching
imputed <- df.to.impute |>
    mice(m=1, maxit=10, seed=42, method="pmm")

# Show the complete dataset including imputed values
complete(imputed)

Cautionary tales in working with data

Data drift

Data can change over time, resulting in “data drift” that degrades model performance or skews downstream decisions. These shifts are easy to overlook if only near-term changes are considered.

Four different types of data drift described visually.
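
One lightweight way to check for drift is to compare a variable's distribution across two time windows, for example with a two-sample Kolmogorov-Smirnov test. A minimal sketch using simulated stand-in data (not the CMS data):

Code
# Hypothetical stand-in data: one variable measured in an old and a new time window.
set.seed(42)
old_window <- rnorm(1000, mean = 1.5, sd = 0.5)
new_window <- rnorm(1000, mean = 1.8, sd = 0.6)

# A small p-value suggests the two windows follow different distributions,
# i.e., the variable has drifted.
ks.test(old_window, new_window)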

Cognitive biases

Cognitive biases are systematic patterns of deviation from norm and/or rationality in judgment. They are often studied in psychology, sociology and behavioral economics.1

Some biases to be aware of:

  • Survivorship bias: Analyzing just the data that is available without analyzing the larger situation.
  • False causality: Seeing correlation between two variables does not imply one causes the other to occur. 2
  • Availability bias: Drawing conclusions from limited, readily available data.
  • Confirmation bias: Manipulating data to confirm your own hypothesis.

Simpson’s paradox

Simpson’s paradox occurs when groups of data show one particular trend, but this trend reverses when the groups are combined. Understanding and identifying this paradox is important for correctly interpreting data. 1


A baseball player can have a higher batting average than another player in each of two years, yet a lower average when the two years are combined. In one case, David Justice had a higher batting average than Derek Jeter in both 1995 and 1996, but across the two years, Jeter’s average was higher. 2

Picture explaining Simpson's paradox.
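
To see the arithmetic behind the reversal, here is a sketch using the hit and at-bat figures commonly cited for this example:

Code
# Hits and at-bats commonly cited for the Jeter/Justice example.
jeter_hits   <- c(`1995` = 12,  `1996` = 183)
jeter_ab     <- c(`1995` = 48,  `1996` = 582)
justice_hits <- c(`1995` = 104, `1996` = 45)
justice_ab   <- c(`1995` = 411, `1996` = 140)

# Justice has the higher average in each individual year...
justice_hits / justice_ab   # 0.253, 0.321
jeter_hits   / jeter_ab     # 0.250, 0.314

# ...but Jeter has the higher average once the two years are combined.
sum(justice_hits) / sum(justice_ab)   # 0.270
sum(jeter_hits)   / sum(jeter_ab)     # 0.310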

Goodhart’s law

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. 1



Or, as it is more often phrased:



When a measure becomes a target, it ceases to be a good measure. 2

Sketchplanations cartoon explaining Goodhart's Law

Sharing your data with others

Reproducibility and Replicability

Reproducibility is the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators. Hence, reproducibility requires that another, independent team of investigators be able to conduct the same experiment. 1

Easy tips for enabling others to reproduce your work:

  • Use a static random seed
    • in R: set.seed(42)
    • in Python: random.seed(42)
  • Document your environment
    • in R: the renv package (e.g., renv::snapshot())
    • in Python: pip freeze > requirements.txt
  • Use version control (e.g., GitHub, Bitbucket)
  • Use notebooks (e.g., Quarto, Jupyter)


Using pins

The pins package publishes data, models, and other R objects, making it easy to share them across projects and with your colleagues. You can pin objects to a variety of pin boards, including:

  • folders (to share on a networked drive or with services like DropBox)
  • Posit Connect
  • Amazon S3
  • Google Cloud Storage
  • Azure storage
  • Microsoft 365 (OneDrive and SharePoint).

Pins can be automatically versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes.1 Pins is available in R and Python.
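
A minimal sketch of the pins workflow in R (the folder path and pin name here are placeholders):

Code
library(pins)

# Register a board backed by a shared folder (Posit Connect, S3, and the
# other boards listed above work the same way).
board <- board_folder("~/shared-data", versioned = TRUE)

# Publish the cleaned dataset so colleagues (and future you) read the same version.
board |> pin_write(df.subset, name = "cms_provider_subset")

# Later, or from another project:
df.subset <- pin_read(board, "cms_provider_subset")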

Discussion / Contact Info



Christopher Teixeira

christopherteixeira.com