Christopher Teixeira
November 15, 2023
The author’s affiliation with The MITRE Corporation is provided for identification pusposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions or viewpoints expressed by the author.
# URL for the CMS data used in these slides
url <- "https://data.cms.gov/data-api/v1/dataset/8889d81e-2ee7-448f-8713-f071038289b5/data?size=5000"
# Download and convert the JSON file to a data frame
df <- jsonlite::fromJSON(url)
# View the data frame in a friendly format for Quarto
knitr::kable(head(df), format="html")
Rndrng_NPI | Rndrng_Prvdr_Last_Org_Name | Rndrng_Prvdr_First_Name | Rndrng_Prvdr_MI | Rndrng_Prvdr_Crdntls | Rndrng_Prvdr_Gndr | Rndrng_Prvdr_Ent_Cd | Rndrng_Prvdr_St1 | Rndrng_Prvdr_St2 | Rndrng_Prvdr_City | Rndrng_Prvdr_State_Abrvtn | Rndrng_Prvdr_State_FIPS | Rndrng_Prvdr_Zip5 | Rndrng_Prvdr_RUCA | Rndrng_Prvdr_RUCA_Desc | Rndrng_Prvdr_Cntry | Rndrng_Prvdr_Type | Rndrng_Prvdr_Mdcr_Prtcptg_Ind | Tot_HCPCS_Cds | Tot_Benes | Tot_Srvcs | Tot_Sbmtd_Chrg | Tot_Mdcr_Alowd_Amt | Tot_Mdcr_Pymt_Amt | Tot_Mdcr_Stdzd_Amt | Drug_Sprsn_Ind | Drug_Tot_HCPCS_Cds | Drug_Tot_Benes | Drug_Tot_Srvcs | Drug_Sbmtd_Chrg | Drug_Mdcr_Alowd_Amt | Drug_Mdcr_Pymt_Amt | Drug_Mdcr_Stdzd_Amt | Med_Sprsn_Ind | Med_Tot_HCPCS_Cds | Med_Tot_Benes | Med_Tot_Srvcs | Med_Sbmtd_Chrg | Med_Mdcr_Alowd_Amt | Med_Mdcr_Pymt_Amt | Med_Mdcr_Stdzd_Amt | Bene_Avg_Age | Bene_Age_LT_65_Cnt | Bene_Age_65_74_Cnt | Bene_Age_75_84_Cnt | Bene_Age_GT_84_Cnt | Bene_Feml_Cnt | Bene_Male_Cnt | Bene_Race_Wht_Cnt | Bene_Race_Black_Cnt | Bene_Race_API_Cnt | Bene_Race_Hspnc_Cnt | Bene_Race_NatInd_Cnt | Bene_Race_Othr_Cnt | Bene_Dual_Cnt | Bene_Ndual_Cnt | Bene_CC_AF_Pct | Bene_CC_Alzhmr_Pct | Bene_CC_Asthma_Pct | Bene_CC_Cncr_Pct | Bene_CC_CHF_Pct | Bene_CC_CKD_Pct | Bene_CC_COPD_Pct | Bene_CC_Dprssn_Pct | Bene_CC_Dbts_Pct | Bene_CC_Hyplpdma_Pct | Bene_CC_Hyprtnsn_Pct | Bene_CC_IHD_Pct | Bene_CC_Opo_Pct | Bene_CC_RAOA_Pct | Bene_CC_Sz_Pct | Bene_CC_Strok_Pct | Bene_Avg_Risk_Scre |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1003000126 | Enkeshafi | Ardalan | M.D. | M | I | 6410 Rockledge Dr Ste 304 | Bethesda | MD | 24 | 20817 | 1 | Metropolitan area core: primary flow within an urbanized area of 50,000 and greater | US | Internal Medicine | Y | 36 | 661 | 3749 | 515976.55 | 289822.37 | 231289.23 | 198266.66 | * | # | 78 | 42 | 197 | 247 | 175 | 378 | 283 | 469 | 83 | 62 | 14 | 0 | 33 | 188 | 473 | 0.24 | 0.3 | 0.12 | 0.16 | 0.39 | 0.52 | 0.24 | 0.37 | 0.44 | 0.75 | 0.75 | 0.62 | 0.11 | 0.58 | 0.06 | 0.14 | 1.8026 | ||||||||||||||||
1003000134 | Cibull | Thomas | L | M.D. | M | I | 2650 Ridge Ave | Evanston Hospital | Evanston | IL | 17 | 60201 | 1 | Metropolitan area core: primary flow within an urbanized area of 50,000 and greater | US | Pathology | Y | 21 | 3216 | 7359 | 1136431.94 | 273928.38 | 203094.28 | 191013.47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 21 | 3216 | 7359 | 1136431.94 | 273928.38 | 203094.28 | 191013.47 | 76 | 88 | 1396 | 1198 | 534 | 1698 | 1518 | 2920 | 28 | 66 | 47 | 0 | 155 | 190 | 3026 | 0.1 | 0.07 | 0.05 | 0.14 | 0.11 | 0.24 | 0.06 | 0.15 | 0.2 | 0.52 | 0.51 | 0.24 | 0.11 | 0.38 | 0.01 | 0.03 | 1.0785 | ||
1003000142 | Khalil | Rashid | M.D. | M | I | 4126 N Holland Sylvania Rd | Suite 220 | Toledo | OH | 39 | 43623 | 1 | Metropolitan area core: primary flow within an urbanized area of 50,000 and greater | US | Anesthesiology | Y | 52 | 239 | 1932 | 295950.73 | 122731.08 | 93430.64 | 95509.35 | 5 | 32 | 898 | 8254.52 | 2818.25 | 2306.71 | 3229.16 | 47 | 239 | 1034 | 287696.21 | 119912.83 | 91123.93 | 92280.19 | 68 | 77 | 98 | 52 | 12 | 147 | 92 | 171 | 55 | 0 | 74 | 165 | 0.09 | 0.06 | 0.09 | 0.11 | 0.21 | 0.37 | 0.21 | 0.39 | 0.38 | 0.58 | 0.7 | 0.33 | 0.09 | 0.75 | 0.06 | 1.492 | |||||||
1003000423 | Velotta | Jennifer | A | M.D. | F | I | 11100 Euclid Ave | Cleveland | OH | 39 | 44106 | 1 | Metropolitan area core: primary flow within an urbanized area of 50,000 and greater | US | Obstetrics & Gynecology | Y | 19 | 69 | 738 | 21300 | 8017.57 | 6735.97 | 6892.83 | * | # | 66 | 47 | 69 | 0 | 14 | 55 | 0.16 | 0.28 | 0.17 | 0.55 | 0.46 | 0.16 | 0.17 | 0.38 | 0.6362 | ||||||||||||||||||||||||||||||||
1003000480 | Rothchild | Kevin | B | MD | M | I | 12605 E 16th Ave | Aurora | CO | 08 | 80045 | 1 | Metropolitan area core: primary flow within an urbanized area of 50,000 and greater | US | General Surgery | Y | 29 | 112 | 162 | 180891 | 35872.83 | 28071.92 | 27509.07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29 | 112 | 162 | 180891 | 35872.83 | 28071.92 | 27509.07 | 65 | 39 | 55 | 64 | 48 | 82 | 15 | 0 | 46 | 66 | 0.18 | 0.21 | 0.35 | 0.13 | 0.32 | 0.4 | 0.48 | 0.65 | 0.31 | 0.12 | 0.59 | 1.8233 | |||||||||||||
1003000530 | Semonche | Amanda | M | DO | F | I | 1021 Park Ave | Suite 203 | Quakertown | PA | 42 | 18951 | 1 | Metropolitan area core: primary flow within an urbanized area of 50,000 and greater | US | Internal Medicine | Y | 31 | 404 | 1487 | 257901 | 180753.38 | 140941.85 | 129609.8 | 5 | 106 | 120 | 14114 | 10903.23 | 10903.23 | 10684.71 | 26 | 404 | 1367 | 243787 | 169850.15 | 130038.62 | 118925.09 | 72 | 50 | 206 | 110 | 38 | 237 | 167 | 383 | 51 | 353 | 0.08 | 0.07 | 0.06 | 0.11 | 0.11 | 0.29 | 0.1 | 0.27 | 0.25 | 0.74 | 0.69 | 0.22 | 0.07 | 0.33 | 0.03 | 0.04 | 1.1156 |
import pandas as pd
# URL for the CMS data used in these slides
url = "https://data.cms.gov/data-api/v1/dataset/8889d81e-2ee7-448f-8713-f071038289b5/data?size=5000"
# Download and convert the JSON file to a data frame
df = pd.read_json(url)
# View the data frame in a friendly format for Quarto
df.to_html()
# Subset the data down to a select few to work with.
# Then convert "Bene_" and "Tot_" variables to numeric.
# The ID and zip code variables get converted to factors.
df.subset <- df |>
select(Rndrng_NPI,
Rndrng_Prvdr_Zip5,
Tot_Benes,
Tot_Srvcs,
starts_with("Bene_")) |>
mutate(across(starts_with(c("Bene_","Tot_")), as.numeric),
across(starts_with("Rndrng_"), factor))
# View the data frame in a friendly format for Quarto
knitr::kable(head(df.subset), format="html")
Rndrng_NPI | Rndrng_Prvdr_Zip5 | Tot_Benes | Tot_Srvcs | Bene_Avg_Age | Bene_Age_LT_65_Cnt | Bene_Age_65_74_Cnt | Bene_Age_75_84_Cnt | Bene_Age_GT_84_Cnt | Bene_Feml_Cnt | Bene_Male_Cnt | Bene_Race_Wht_Cnt | Bene_Race_Black_Cnt | Bene_Race_API_Cnt | Bene_Race_Hspnc_Cnt | Bene_Race_NatInd_Cnt | Bene_Race_Othr_Cnt | Bene_Dual_Cnt | Bene_Ndual_Cnt | Bene_CC_AF_Pct | Bene_CC_Alzhmr_Pct | Bene_CC_Asthma_Pct | Bene_CC_Cncr_Pct | Bene_CC_CHF_Pct | Bene_CC_CKD_Pct | Bene_CC_COPD_Pct | Bene_CC_Dprssn_Pct | Bene_CC_Dbts_Pct | Bene_CC_Hyplpdma_Pct | Bene_CC_Hyprtnsn_Pct | Bene_CC_IHD_Pct | Bene_CC_Opo_Pct | Bene_CC_RAOA_Pct | Bene_CC_Sz_Pct | Bene_CC_Strok_Pct | Bene_Avg_Risk_Scre |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1003000126 | 20817 | 661 | 3749 | 78 | 42 | 197 | 247 | 175 | 378 | 283 | 469 | 83 | 62 | 14 | 0 | 33 | 188 | 473 | 0.24 | 0.30 | 0.12 | 0.16 | 0.39 | 0.52 | 0.24 | 0.37 | 0.44 | 0.75 | 0.75 | 0.62 | 0.11 | 0.58 | 0.06 | 0.14 | 1.8026 |
1003000134 | 60201 | 3216 | 7359 | 76 | 88 | 1396 | 1198 | 534 | 1698 | 1518 | 2920 | 28 | 66 | 47 | 0 | 155 | 190 | 3026 | 0.10 | 0.07 | 0.05 | 0.14 | 0.11 | 0.24 | 0.06 | 0.15 | 0.20 | 0.52 | 0.51 | 0.24 | 0.11 | 0.38 | 0.01 | 0.03 | 1.0785 |
1003000142 | 43623 | 239 | 1932 | 68 | 77 | 98 | 52 | 12 | 147 | 92 | 171 | 55 | NA | NA | 0 | NA | 74 | 165 | 0.09 | 0.06 | 0.09 | 0.11 | 0.21 | 0.37 | 0.21 | 0.39 | 0.38 | 0.58 | 0.70 | 0.33 | 0.09 | 0.75 | NA | 0.06 | 1.4920 |
1003000423 | 44106 | 69 | 738 | 66 | NA | 47 | NA | NA | 69 | 0 | NA | NA | NA | NA | NA | NA | 14 | 55 | NA | NA | NA | NA | NA | 0.16 | NA | 0.28 | 0.17 | 0.55 | 0.46 | 0.16 | 0.17 | 0.38 | NA | NA | 0.6362 |
1003000480 | 80045 | 112 | 162 | 65 | 39 | 55 | NA | NA | 64 | 48 | 82 | NA | NA | 15 | 0 | NA | 46 | 66 | NA | NA | 0.18 | NA | 0.21 | 0.35 | 0.13 | 0.32 | 0.40 | 0.48 | 0.65 | 0.31 | 0.12 | 0.59 | NA | NA | 1.8233 |
1003000530 | 18951 | 404 | 1487 | 72 | 50 | 206 | 110 | 38 | 237 | 167 | 383 | NA | NA | NA | NA | NA | 51 | 353 | 0.08 | 0.07 | 0.06 | 0.11 | 0.11 | 0.29 | 0.10 | 0.27 | 0.25 | 0.74 | 0.69 | 0.22 | 0.07 | 0.33 | 0.03 | 0.04 | 1.1156 |
# Subset down to the columns of interest.
cols = [0,12,19,20,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72]
df_subset = df.iloc[:,cols]
# Convert character to numeric columns.
numeric_cols = df_subset.columns.drop(["Rndrng_NPI","Rndrng_Prvdr_Zip5"])
df_subset[numeric_cols]=df_subset[numeric_cols].apply(pd.to_numeric, errors='coerce')
# Convert ID and zip code columns to categorical.
categorical_cols = ["Rndrng_NPI","Rndrng_Prvdr_Zip5"]
df_subset[categorical_cols]=df_subset[categorical_cols].apply(pd.Categorical)
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. 1
Four primary types of EDA:
Name | df.subset |
Number of rows | 5000 |
Number of columns | 36 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 34 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Rndrng_NPI | 0 | 1 | FALSE | 5000 | 100: 1, 100: 1, 100: 1, 100: 1 |
Rndrng_Prvdr_Zip5 | 0 | 1 | FALSE | 2809 | 770: 30, 559: 25, 021: 23, 100: 22 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Tot_Benes | 0 | 1.00 | 283.78 | 511.06 | 11.00 | 60.00 | 148.00 | 305.00 | 9843.00 | ▇▁▁▁▁ |
Tot_Srvcs | 0 | 1.00 | 1718.13 | 10740.56 | 11.00 | 157.00 | 430.50 | 1151.25 | 462645.00 | ▇▁▁▁▁ |
Bene_Avg_Age | 0 | 1.00 | 71.43 | 5.54 | 11.00 | 70.00 | 72.00 | 74.00 | 89.00 | ▁▁▁▇▅ |
Bene_Age_LT_65_Cnt | 2011 | 0.60 | 53.89 | 74.11 | 0.00 | 21.00 | 35.00 | 58.00 | 1582.00 | ▇▁▁▁▁ |
Bene_Age_65_74_Cnt | 792 | 0.84 | 142.20 | 237.40 | 0.00 | 40.00 | 79.00 | 145.00 | 4089.00 | ▇▁▁▁▁ |
Bene_Age_75_84_Cnt | 1442 | 0.71 | 118.40 | 192.40 | 0.00 | 37.00 | 66.00 | 121.75 | 3734.00 | ▇▁▁▁▁ |
Bene_Age_GT_84_Cnt | 2170 | 0.57 | 65.21 | 105.65 | 0.00 | 19.00 | 34.00 | 69.00 | 1708.00 | ▇▁▁▁▁ |
Bene_Feml_Cnt | 613 | 0.88 | 180.66 | 307.23 | 0.00 | 49.00 | 101.00 | 192.00 | 5020.00 | ▇▁▁▁▁ |
Bene_Male_Cnt | 613 | 0.88 | 138.98 | 237.98 | 0.00 | 35.00 | 75.00 | 147.50 | 4823.00 | ▇▁▁▁▁ |
Bene_Race_Wht_Cnt | 1592 | 0.68 | 305.49 | 496.67 | 0.00 | 86.00 | 175.00 | 329.25 | 7942.00 | ▇▁▁▁▁ |
Bene_Race_Black_Cnt | 3332 | 0.33 | 62.96 | 124.81 | 0.00 | 16.00 | 31.00 | 61.00 | 1781.00 | ▇▁▁▁▁ |
Bene_Race_API_Cnt | 4094 | 0.18 | 23.43 | 52.69 | 0.00 | 0.00 | 11.00 | 26.00 | 741.00 | ▇▁▁▁▁ |
Bene_Race_Hspnc_Cnt | 3665 | 0.27 | 48.70 | 88.57 | 0.00 | 13.00 | 24.00 | 47.00 | 928.00 | ▇▁▁▁▁ |
Bene_Race_NatInd_Cnt | 3348 | 0.33 | 1.98 | 13.68 | 0.00 | 0.00 | 0.00 | 0.00 | 275.00 | ▇▁▁▁▁ |
Bene_Race_Othr_Cnt | 4139 | 0.17 | 21.88 | 26.99 | 0.00 | 0.00 | 16.00 | 28.00 | 239.00 | ▇▁▁▁▁ |
Bene_Dual_Cnt | 1321 | 0.74 | 80.12 | 137.01 | 0.00 | 22.00 | 42.00 | 84.00 | 2869.00 | ▇▁▁▁▁ |
Bene_Ndual_Cnt | 1321 | 0.74 | 282.12 | 482.44 | 0.00 | 74.00 | 156.00 | 296.00 | 8396.00 | ▇▁▁▁▁ |
Bene_CC_AF_Pct | 1577 | 0.68 | 0.17 | 0.10 | 0.00 | 0.10 | 0.15 | 0.23 | 0.75 | ▇▆▁▁▁ |
Bene_CC_Alzhmr_Pct | 1623 | 0.68 | 0.21 | 0.16 | 0.00 | 0.09 | 0.16 | 0.29 | 0.75 | ▇▅▂▁▁ |
Bene_CC_Asthma_Pct | 2008 | 0.60 | 0.10 | 0.05 | 0.00 | 0.07 | 0.09 | 0.12 | 0.46 | ▇▇▁▁▁ |
Bene_CC_Cncr_Pct | 1623 | 0.68 | 0.15 | 0.11 | 0.00 | 0.10 | 0.13 | 0.17 | 0.75 | ▇▃▁▁▁ |
Bene_CC_CHF_Pct | 1208 | 0.76 | 0.28 | 0.17 | 0.00 | 0.15 | 0.25 | 0.41 | 0.75 | ▇▇▅▃▁ |
Bene_CC_CKD_Pct | 745 | 0.85 | 0.44 | 0.18 | 0.00 | 0.29 | 0.42 | 0.58 | 0.75 | ▁▇▇▇▆ |
Bene_CC_COPD_Pct | 1452 | 0.71 | 0.20 | 0.11 | 0.00 | 0.12 | 0.19 | 0.27 | 0.75 | ▇▇▂▁▁ |
Bene_CC_Dprssn_Pct | 751 | 0.85 | 0.33 | 0.15 | 0.00 | 0.23 | 0.31 | 0.40 | 0.75 | ▁▇▆▂▁ |
Bene_CC_Dbts_Pct | 753 | 0.85 | 0.37 | 0.13 | 0.00 | 0.27 | 0.35 | 0.45 | 0.75 | ▁▆▇▅▁ |
Bene_CC_Hyplpdma_Pct | 398 | 0.92 | 0.62 | 0.12 | 0.00 | 0.55 | 0.65 | 0.73 | 0.75 | ▁▁▁▃▇ |
Bene_CC_Hyprtnsn_Pct | 299 | 0.94 | 0.68 | 0.11 | 0.00 | 0.63 | 0.74 | 0.75 | 0.75 | ▁▁▁▂▇ |
Bene_CC_IHD_Pct | 774 | 0.85 | 0.42 | 0.16 | 0.00 | 0.29 | 0.40 | 0.53 | 0.75 | ▁▆▇▆▃ |
Bene_CC_Opo_Pct | 1810 | 0.64 | 0.11 | 0.06 | 0.00 | 0.08 | 0.11 | 0.14 | 0.75 | ▇▂▁▁▁ |
Bene_CC_RAOA_Pct | 528 | 0.89 | 0.49 | 0.13 | 0.00 | 0.40 | 0.48 | 0.56 | 0.75 | ▁▁▆▇▃ |
Bene_CC_Sz_Pct | 2872 | 0.43 | 0.07 | 0.13 | 0.00 | 0.00 | 0.04 | 0.08 | 0.75 | ▇▁▁▁▁ |
Bene_CC_Strok_Pct | 2339 | 0.53 | 0.11 | 0.10 | 0.00 | 0.04 | 0.09 | 0.15 | 0.75 | ▇▂▁▁▁ |
Bene_Avg_Risk_Scre | 0 | 1.00 | 1.66 | 0.82 | 0.48 | 1.07 | 1.46 | 2.07 | 10.72 | ▇▁▁▁▁ |
library(ggplot2)
library(ggExtra)
library(ggthemes)
# Create a scatter plot with the number of beneficiaries by gender.
g <- ggplot(df.subset,
aes(x=Bene_Feml_Cnt,
y=Bene_Male_Cnt)) +
geom_point() +
theme(legend.position="none") +
labs(x="Total Female Beneficiaries",
y="Total Male Beneficiaries") +
scale_x_continuous(labels=label_number(scale_cut=cut_short_scale())) +
scale_y_continuous(labels=label_number(scale_cut=cut_short_scale())) +
theme_hc()
ggMarginal(g, type="histogram")
There are two general approaches:
library(mice)
# Remove large dimensional variables
df.to.impute <- df.subset |> select(-Rndrng_NPI,-Rndrng_Prvdr_Zip5)
# Impute missing data using predictive mean matching
imputed <- df.to.impute |>
mice(m=1, maxit=10, seed=42, method="pmm")
# Show the complete dataset including imputed values
complete(imputed)
Changes in the data can happen over time, resulting in “data drift” that can impact model performance or other decisions that can be overlooked if only near term changes are considered.
Cognitive biases are systematic patterns of deviation from norm and/or rationality in judgment. They are often studied in psychology, sociology and behavioral economics.1
Some biases to be aware of:
Simpson’s paradox occurs when groups of data show one particular trend, but this trend is reversed when the groups are combined together. Understanding and identifying this paradox is important for correctly interpreting data. 1
A baseball player can have higher batting average than another on each of two years, but lower than the other when the two are combined. In one case, David Justice had a higher batting average than Derek Jeter in 1995 and 1996, but across the two years, Jeter’s average was higher. 2
Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. 1
or a better way of putting it is:
When a measure becomes a target, it ceases to be a good measure. 2
Reproducibility is the ability of independent investigators to draw the same conclusions from an experiment by following the documentation shared by the original investigators. Hence, reproducibility requires that another, independent team of investigators have to conduct the same experiment. 1
Easy tips for enabling others to reproduce your work:
set.seed(42)
random.seed(42)
library(renv)
pip freeze > requirements.txt
2
The pins package publishes data, models, and other R objects, making it easy to share them across projects and with your colleagues. You can pin objects to a variety of pin boards, including:
Pins can be automatically versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes.1 Pins is available in R and Python.
© 2023 THE MITRE CORPORATION. ALL RIGHTS RESERVED. Approved for public release; distribution unlimited. Case number 23-03859-1.