Working with Messy Data

Christopher Teixeira
November 15, 2023

A little about me…

Christopher Teixeira

Principal Data Scientist

The MITRE Corporation

Interests

Data Analytics
Applied Statistics
Operations Research

Education

MS in Operations Research, George Mason University
BSc in Mathematics, Worcester Polytechnic Institute

Code

# URL for the CMS data used in these slides
url <- "https://data.cms.gov/data-api/v1/dataset/8889d81e-2ee7-448f-8713-f071038289b5/data?size=5000"

# Download and convert the JSON file to a data frame
df <- jsonlite::fromJSON(url)

# View the data frame in a friendly format for Quarto
knitr::kable(head(df), format="html")

Rndrng_NPI	Rndrng_Prvdr_Last_Org_Name	Rndrng_Prvdr_First_Name	Rndrng_Prvdr_MI	Rndrng_Prvdr_Crdntls	Rndrng_Prvdr_Gndr	Rndrng_Prvdr_Ent_Cd	Rndrng_Prvdr_St1	Rndrng_Prvdr_St2	Rndrng_Prvdr_City	Rndrng_Prvdr_State_Abrvtn	Rndrng_Prvdr_State_FIPS	Rndrng_Prvdr_Zip5	Rndrng_Prvdr_RUCA	Rndrng_Prvdr_RUCA_Desc	Rndrng_Prvdr_Cntry	Rndrng_Prvdr_Type	Rndrng_Prvdr_Mdcr_Prtcptg_Ind	Tot_HCPCS_Cds	Tot_Benes	Tot_Srvcs	Tot_Sbmtd_Chrg	Tot_Mdcr_Alowd_Amt	Tot_Mdcr_Pymt_Amt	Tot_Mdcr_Stdzd_Amt	Drug_Sprsn_Ind	Drug_Tot_HCPCS_Cds	Drug_Tot_Benes	Drug_Tot_Srvcs	Drug_Sbmtd_Chrg	Drug_Mdcr_Alowd_Amt	Drug_Mdcr_Pymt_Amt	Drug_Mdcr_Stdzd_Amt	Med_Sprsn_Ind	Med_Tot_HCPCS_Cds	Med_Tot_Benes	Med_Tot_Srvcs	Med_Sbmtd_Chrg	Med_Mdcr_Alowd_Amt	Med_Mdcr_Pymt_Amt	Med_Mdcr_Stdzd_Amt	Bene_Avg_Age	Bene_Age_LT_65_Cnt	Bene_Age_65_74_Cnt	Bene_Age_75_84_Cnt	Bene_Age_GT_84_Cnt	Bene_Feml_Cnt	Bene_Male_Cnt	Bene_Race_Wht_Cnt	Bene_Race_Black_Cnt	Bene_Race_API_Cnt	Bene_Race_Hspnc_Cnt	Bene_Race_NatInd_Cnt	Bene_Race_Othr_Cnt	Bene_Dual_Cnt	Bene_Ndual_Cnt	Bene_CC_AF_Pct	Bene_CC_Alzhmr_Pct	Bene_CC_Asthma_Pct	Bene_CC_Cncr_Pct	Bene_CC_CHF_Pct	Bene_CC_CKD_Pct	Bene_CC_COPD_Pct	Bene_CC_Dprssn_Pct	Bene_CC_Dbts_Pct	Bene_CC_Hyplpdma_Pct	Bene_CC_Hyprtnsn_Pct	Bene_CC_IHD_Pct	Bene_CC_Opo_Pct	Bene_CC_RAOA_Pct	Bene_CC_Sz_Pct	Bene_CC_Strok_Pct	Bene_Avg_Risk_Scre
1003000126	Enkeshafi	Ardalan		M.D.	M	I	6410 Rockledge Dr Ste 304		Bethesda	MD	24	20817	1	Metropolitan area core: primary flow within an urbanized area of 50,000 and greater	US	Internal Medicine	Y	36	661	3749	515976.55	289822.37	231289.23	198266.66	*								#								78	42	197	247	175	378	283	469	83	62	14	0	33	188	473	0.24	0.3	0.12	0.16	0.39	0.52	0.24	0.37	0.44	0.75	0.75	0.62	0.11	0.58	0.06	0.14	1.8026
1003000134	Cibull	Thomas	L	M.D.	M	I	2650 Ridge Ave	Evanston Hospital	Evanston	IL	17	60201	1	Metropolitan area core: primary flow within an urbanized area of 50,000 and greater	US	Pathology	Y	21	3216	7359	1136431.94	273928.38	203094.28	191013.47		0	0	0	0	0	0	0		21	3216	7359	1136431.94	273928.38	203094.28	191013.47	76	88	1396	1198	534	1698	1518	2920	28	66	47	0	155	190	3026	0.1	0.07	0.05	0.14	0.11	0.24	0.06	0.15	0.2	0.52	0.51	0.24	0.11	0.38	0.01	0.03	1.0785
1003000142	Khalil	Rashid		M.D.	M	I	4126 N Holland Sylvania Rd	Suite 220	Toledo	OH	39	43623	1	Metropolitan area core: primary flow within an urbanized area of 50,000 and greater	US	Anesthesiology	Y	52	239	1932	295950.73	122731.08	93430.64	95509.35		5	32	898	8254.52	2818.25	2306.71	3229.16		47	239	1034	287696.21	119912.83	91123.93	92280.19	68	77	98	52	12	147	92	171	55			0		74	165	0.09	0.06	0.09	0.11	0.21	0.37	0.21	0.39	0.38	0.58	0.7	0.33	0.09	0.75		0.06	1.492
1003000423	Velotta	Jennifer	A	M.D.	F	I	11100 Euclid Ave		Cleveland	OH	39	44106	1	Metropolitan area core: primary flow within an urbanized area of 50,000 and greater	US	Obstetrics & Gynecology	Y	19	69	738	21300	8017.57	6735.97	6892.83	*								#								66		47			69	0							14	55						0.16		0.28	0.17	0.55	0.46	0.16	0.17	0.38			0.6362
1003000480	Rothchild	Kevin	B	MD	M	I	12605 E 16th Ave		Aurora	CO	08	80045	1	Metropolitan area core: primary flow within an urbanized area of 50,000 and greater	US	General Surgery	Y	29	112	162	180891	35872.83	28071.92	27509.07		0	0	0	0	0	0	0		29	112	162	180891	35872.83	28071.92	27509.07	65	39	55			64	48	82			15	0		46	66			0.18		0.21	0.35	0.13	0.32	0.4	0.48	0.65	0.31	0.12	0.59			1.8233
1003000530	Semonche	Amanda	M	DO	F	I	1021 Park Ave	Suite 203	Quakertown	PA	42	18951	1	Metropolitan area core: primary flow within an urbanized area of 50,000 and greater	US	Internal Medicine	Y	31	404	1487	257901	180753.38	140941.85	129609.8		5	106	120	14114	10903.23	10903.23	10684.71		26	404	1367	243787	169850.15	130038.62	118925.09	72	50	206	110	38	237	167	383						51	353	0.08	0.07	0.06	0.11	0.11	0.29	0.1	0.27	0.25	0.74	0.69	0.22	0.07	0.33	0.03	0.04	1.1156

Code

import pandas as pd

# URL for the CMS data used in these slides
url = "https://data.cms.gov/data-api/v1/dataset/8889d81e-2ee7-448f-8713-f071038289b5/data?size=5000"

# Download and convert the JSON file to a data frame
df = pd.read_json(url)

# View the data frame in a friendly format for Quarto
df.to_html()

Changing data types

R
Python

Code

# Subset the data down to a select few to work with. 
# Then convert "Bene_" and "Tot_" variables to numeric. 
# The ID and zip code variables get converted to factors.
df.subset <- df |> 
    select(Rndrng_NPI,
           Rndrng_Prvdr_Zip5,
           Tot_Benes,
           Tot_Srvcs,
           starts_with("Bene_")) |>
    mutate(across(starts_with(c("Bene_","Tot_")), as.numeric),
           across(starts_with("Rndrng_"), factor))

# View the data frame in a friendly format for Quarto
knitr::kable(head(df.subset), format="html")

Rndrng_NPI	Rndrng_Prvdr_Zip5	Tot_Benes	Tot_Srvcs	Bene_Avg_Age	Bene_Age_LT_65_Cnt	Bene_Age_65_74_Cnt	Bene_Age_75_84_Cnt	Bene_Age_GT_84_Cnt	Bene_Feml_Cnt	Bene_Male_Cnt	Bene_Race_Wht_Cnt	Bene_Race_Black_Cnt	Bene_Race_API_Cnt	Bene_Race_Hspnc_Cnt	Bene_Race_NatInd_Cnt	Bene_Race_Othr_Cnt	Bene_Dual_Cnt	Bene_Ndual_Cnt	Bene_CC_AF_Pct	Bene_CC_Alzhmr_Pct	Bene_CC_Asthma_Pct	Bene_CC_Cncr_Pct	Bene_CC_CHF_Pct	Bene_CC_CKD_Pct	Bene_CC_COPD_Pct	Bene_CC_Dprssn_Pct	Bene_CC_Dbts_Pct	Bene_CC_Hyplpdma_Pct	Bene_CC_Hyprtnsn_Pct	Bene_CC_IHD_Pct	Bene_CC_Opo_Pct	Bene_CC_RAOA_Pct	Bene_CC_Sz_Pct	Bene_CC_Strok_Pct	Bene_Avg_Risk_Scre
1003000126	20817	661	3749	78	42	197	247	175	378	283	469	83	62	14	0	33	188	473	0.24	0.30	0.12	0.16	0.39	0.52	0.24	0.37	0.44	0.75	0.75	0.62	0.11	0.58	0.06	0.14	1.8026
1003000134	60201	3216	7359	76	88	1396	1198	534	1698	1518	2920	28	66	47	0	155	190	3026	0.10	0.07	0.05	0.14	0.11	0.24	0.06	0.15	0.20	0.52	0.51	0.24	0.11	0.38	0.01	0.03	1.0785
1003000142	43623	239	1932	68	77	98	52	12	147	92	171	55	NA	NA	0	NA	74	165	0.09	0.06	0.09	0.11	0.21	0.37	0.21	0.39	0.38	0.58	0.70	0.33	0.09	0.75	NA	0.06	1.4920
1003000423	44106	69	738	66	NA	47	NA	NA	69	0	NA	NA	NA	NA	NA	NA	14	55	NA	NA	NA	NA	NA	0.16	NA	0.28	0.17	0.55	0.46	0.16	0.17	0.38	NA	NA	0.6362
1003000480	80045	112	162	65	39	55	NA	NA	64	48	82	NA	NA	15	0	NA	46	66	NA	NA	0.18	NA	0.21	0.35	0.13	0.32	0.40	0.48	0.65	0.31	0.12	0.59	NA	NA	1.8233
1003000530	18951	404	1487	72	50	206	110	38	237	167	383	NA	NA	NA	NA	NA	51	353	0.08	0.07	0.06	0.11	0.11	0.29	0.10	0.27	0.25	0.74	0.69	0.22	0.07	0.33	0.03	0.04	1.1156

Code

# Subset down to the columns of interest.
cols = [0,12,19,20,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72]
df_subset = df.iloc[:,cols]

# Convert character to numeric columns. 
numeric_cols = df_subset.columns.drop(["Rndrng_NPI","Rndrng_Prvdr_Zip5"])
df_subset[numeric_cols]=df_subset[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Convert ID and zip code columns to categorical. 
categorical_cols = ["Rndrng_NPI","Rndrng_Prvdr_Zip5"]
df_subset[categorical_cols]=df_subset[categorical_cols].apply(pd.Categorical)

Exploratory data analysis

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. ¹

Four primary types of EDA:

Univariate non-graphical: Describe the data and find patterns that exist within a variable.
Univariate graphical: For a single variable, explore the values visaully using graphs like box plots or histograms.
Multivariate nongraphical: Describe the relationships between two or more variables in the data.
Multivariate graphical: Visualize the relationships between two or more variables through graphs like scatter plots or heat maps,

Applying EDA: Univariate

R
Python

Code

# Use the skimr package to examine the dataset. 
# Produces a high level summary (# of rows/columns, column types)
# For each column type, it produces details about each variable.

library(skimr)
skim(df.subset)

Data summary
Name	df.subset
Number of rows	5000
Number of columns	36
_______________________
Column type frequency:
factor	2
numeric	34
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Rndrng_NPI	0	1	FALSE	5000	100: 1, 100: 1, 100: 1, 100: 1
Rndrng_Prvdr_Zip5	0	1	FALSE	2809	770: 30, 559: 25, 021: 23, 100: 22

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Tot_Benes	0	1.00	283.78	511.06	11.00	60.00	148.00	305.00	9843.00	▇▁▁▁▁
Tot_Srvcs	0	1.00	1718.13	10740.56	11.00	157.00	430.50	1151.25	462645.00	▇▁▁▁▁
Bene_Avg_Age	0	1.00	71.43	5.54	11.00	70.00	72.00	74.00	89.00	▁▁▁▇▅
Bene_Age_LT_65_Cnt	2011	0.60	53.89	74.11	0.00	21.00	35.00	58.00	1582.00	▇▁▁▁▁
Bene_Age_65_74_Cnt	792	0.84	142.20	237.40	0.00	40.00	79.00	145.00	4089.00	▇▁▁▁▁
Bene_Age_75_84_Cnt	1442	0.71	118.40	192.40	0.00	37.00	66.00	121.75	3734.00	▇▁▁▁▁
Bene_Age_GT_84_Cnt	2170	0.57	65.21	105.65	0.00	19.00	34.00	69.00	1708.00	▇▁▁▁▁
Bene_Feml_Cnt	613	0.88	180.66	307.23	0.00	49.00	101.00	192.00	5020.00	▇▁▁▁▁
Bene_Male_Cnt	613	0.88	138.98	237.98	0.00	35.00	75.00	147.50	4823.00	▇▁▁▁▁
Bene_Race_Wht_Cnt	1592	0.68	305.49	496.67	0.00	86.00	175.00	329.25	7942.00	▇▁▁▁▁
Bene_Race_Black_Cnt	3332	0.33	62.96	124.81	0.00	16.00	31.00	61.00	1781.00	▇▁▁▁▁
Bene_Race_API_Cnt	4094	0.18	23.43	52.69	0.00	0.00	11.00	26.00	741.00	▇▁▁▁▁
Bene_Race_Hspnc_Cnt	3665	0.27	48.70	88.57	0.00	13.00	24.00	47.00	928.00	▇▁▁▁▁
Bene_Race_NatInd_Cnt	3348	0.33	1.98	13.68	0.00	0.00	0.00	0.00	275.00	▇▁▁▁▁
Bene_Race_Othr_Cnt	4139	0.17	21.88	26.99	0.00	0.00	16.00	28.00	239.00	▇▁▁▁▁
Bene_Dual_Cnt	1321	0.74	80.12	137.01	0.00	22.00	42.00	84.00	2869.00	▇▁▁▁▁
Bene_Ndual_Cnt	1321	0.74	282.12	482.44	0.00	74.00	156.00	296.00	8396.00	▇▁▁▁▁
Bene_CC_AF_Pct	1577	0.68	0.17	0.10	0.00	0.10	0.15	0.23	0.75	▇▆▁▁▁
Bene_CC_Alzhmr_Pct	1623	0.68	0.21	0.16	0.00	0.09	0.16	0.29	0.75	▇▅▂▁▁
Bene_CC_Asthma_Pct	2008	0.60	0.10	0.05	0.00	0.07	0.09	0.12	0.46	▇▇▁▁▁
Bene_CC_Cncr_Pct	1623	0.68	0.15	0.11	0.00	0.10	0.13	0.17	0.75	▇▃▁▁▁
Bene_CC_CHF_Pct	1208	0.76	0.28	0.17	0.00	0.15	0.25	0.41	0.75	▇▇▅▃▁
Bene_CC_CKD_Pct	745	0.85	0.44	0.18	0.00	0.29	0.42	0.58	0.75	▁▇▇▇▆
Bene_CC_COPD_Pct	1452	0.71	0.20	0.11	0.00	0.12	0.19	0.27	0.75	▇▇▂▁▁
Bene_CC_Dprssn_Pct	751	0.85	0.33	0.15	0.00	0.23	0.31	0.40	0.75	▁▇▆▂▁
Bene_CC_Dbts_Pct	753	0.85	0.37	0.13	0.00	0.27	0.35	0.45	0.75	▁▆▇▅▁
Bene_CC_Hyplpdma_Pct	398	0.92	0.62	0.12	0.00	0.55	0.65	0.73	0.75	▁▁▁▃▇
Bene_CC_Hyprtnsn_Pct	299	0.94	0.68	0.11	0.00	0.63	0.74	0.75	0.75	▁▁▁▂▇
Bene_CC_IHD_Pct	774	0.85	0.42	0.16	0.00	0.29	0.40	0.53	0.75	▁▆▇▆▃
Bene_CC_Opo_Pct	1810	0.64	0.11	0.06	0.00	0.08	0.11	0.14	0.75	▇▂▁▁▁
Bene_CC_RAOA_Pct	528	0.89	0.49	0.13	0.00	0.40	0.48	0.56	0.75	▁▁▆▇▃
Bene_CC_Sz_Pct	2872	0.43	0.07	0.13	0.00	0.00	0.04	0.08	0.75	▇▁▁▁▁
Bene_CC_Strok_Pct	2339	0.53	0.11	0.10	0.00	0.04	0.09	0.15	0.75	▇▂▁▁▁
Bene_Avg_Risk_Scre	0	1.00	1.66	0.82	0.48	1.07	1.46	2.07	10.72	▇▁▁▁▁

Applying EDA: Multivariate

R
Python

Code

library(ggplot2)
library(ggExtra)
library(ggthemes)

# Create a scatter plot with the number of beneficiaries by gender.
g <- ggplot(df.subset, 
            aes(x=Bene_Feml_Cnt, 
                y=Bene_Male_Cnt)) + 
    geom_point() +
    theme(legend.position="none") +
    labs(x="Total Female Beneficiaries",
         y="Total Male Beneficiaries") +
    scale_x_continuous(labels=label_number(scale_cut=cut_short_scale())) +
    scale_y_continuous(labels=label_number(scale_cut=cut_short_scale())) +
    theme_hc()

ggMarginal(g, type="histogram")

Working with missing data

Questions to ask when working with missing data

Does “missing” mean something different from “0”?
- If you have data on the amount of candy sold per day, does a missing value mean no candy was sold? or the amount of candy sold is unknown?
Is “missing” captured in another way?
- Sometimes negative values or “99” can imply a value is missing.
Was there a change in how data was being captured?
- For long standing data capture initiatives (e.g., surveys), the data collection methods can change without notice to the analysts.
  - Was the way data was being capture changed?
  - Did the range of values change?
  - Do the values represent something different?
Does it make sense to replace missing values?
- If a variable is mostly missing, replacing it with any method could lead false conclusions.

Imputing missing data

There are two general approaches:

Overly simple approach: replace missing values with mean, median, or mode
Sophisticated approach: replacing missing values by analyzing the full dataset and building a model per variable with missing data

Multivariate Imputation by Chained Equations (MICE)

R
Python

Code

library(mice)

# Remove large dimensional variables
df.to.impute <- df.subset |> select(-Rndrng_NPI,-Rndrng_Prvdr_Zip5)

# Impute missing data using predictive mean matching
imputed <- df.to.impute |>
    mice(m=1, maxit=10, seed=42, method="pmm")

# Show the complete dataset including imputed values
complete(imputed)

Cautionary tales in working with data

Data drift

Changes in the data can happen over time, resulting in “data drift” that can impact model performance or other decisions that can be overlooked if only near term changes are considered.

Cognitive biases

Cognitive biases are systematic patterns of deviation from norm and/or rationality in judgment. They are often studied in psychology, sociology and behavioral economics.¹

Some biases to be aware of:

Survivorship bias: Analyzing just the data that is available without analyzing the larger situation.
False causality: Seeing correlation between two variables does not imply one causes the other to occur. ²
Availability bias: Drawing conclusions on limited data.
Confirmation bias: Manipulating data to confirm your own hypothesis.

Simpson’s paradox

Simpson’s paradox occurs when groups of data show one particular trend, but this trend is reversed when the groups are combined together. Understanding and identifying this paradox is important for correctly interpreting data. ¹

A baseball player can have higher batting average than another on each of two years, but lower than the other when the two are combined. In one case, David Justice had a higher batting average than Derek Jeter in 1995 and 1996, but across the two years, Jeter’s average was higher. ²

Goodhart’s law

Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. ¹

or a better way of putting it is:

When a measure becomes a target, it ceases to be a good measure. ²

Sketchplanations cartoon explaining Goodhart's Law

Reproducibility and Replicability

Reproducibility is the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators. Hence, reproducibility requires that another, independent team of investigators have to conduct the same experiment. ¹

Easy tips for enabling others to reproduce your work:

Use a static random seed
- in R: set.seed(42)
- in Python: random.seed(42)
Document your environment
- in R: library(renv)
- in Python: pip freeze > requirements.txt

Use version control (e.g., Github, Bitbucket)
Use notebooks
- in R: R Markdown
- in Python: Jupyter
- in either: Quarto

Using pins

The pins package publishes data, models, and other R objects, making it easy to share them across projects and with your colleagues. You can pin objects to a variety of pin boards, including:

folders (to share on a networked drive or with services like DropBox)
Posit Connect
Amazon S3
Google Cloud Storage
Azure storage
Microsoft 365 (OneDrive and SharePoint).

Pins can be automatically versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes.¹ Pins is available in R and Python.

Discussion / Contact Info

Christopher Teixeira

christopherteixeira.com

chris@christopherteixeira.com

in/christopherteixeira

ct-analytics

Working with Messy Data

A little about me…

Christopher Teixeira

Principal Data Scientist

The MITRE Corporation

Interests

Education

Reading in data

Changing data types

Exploratory data analysis

Applying EDA: Univariate

Applying EDA: Multivariate

Working with missing data

Questions to ask when working with missing data

Imputing missing data

Multivariate Imputation by Chained Equations (MICE)

Cautionary tales in working with data

Data drift

Cognitive biases

Simpson’s paradox

Goodhart’s law

Sharing your data with others

Reproducibility and Replicability

Using pins

Discussion / Contact Info

Christopher Teixeira