---
title: "Categorizing Pitches in Baseball"
format:
revealjs:
theme: [default, custom.scss]
logo: img/icon.png
footer: "Christopher Teixeira"
width: 1350
code-fold: false
disable-layout: false
self-contained: true
incremental: false
# pptx:
# logo: img/icon.png
# footer: "Christopher Teixeira"
# incremental: false
editor_options:
chunk_output_type: console
---
## Introduction to the problem {visibility="hidden"}
{{< video img/video.mp4 height="550" >}}
Source: [MLB on Youtube](https://www.youtube.com/watch?v=0DFYJkneoMo)
## {background-video="img/video.mp4"}
## How is pitch type detected today?
There are three key features of every pitch to watch for:
::: {.incremental}
1. **Velocity** - the speed of the pitch
2. **Movement** - the general movement of the pitch
3. **Break** - the shift in direction
:::
. . .
...and by asking the pitcher what types he throws.
Source: [Baseball pitches illustrated by Lokesh Dhakar](https://lokeshdhakar.com/baseball-pitches-illustrated/)
## What types of pitches are there? {.smaller}
:::: {.columns}
::: {.column .subtitle .nonincremental width="33%"}
#### Fastballs
Four-seam
Two-seam
Cutter
Sinker
![](img/four_seam_fastball.svg){fig-align="center" fig-alt="Four-seam fastball path"}
:::
::: {.column .subtitle .nonincremental width="34%"}
#### Breaking balls
Curveball
Slider
Slurve
Screwball
![](img/curveball.svg){fig-align="center" fig-alt="Curveball breaking ball path"}
:::
::: {.column .subtitle .nonincremental width="33%"}
#### Offspeed
Changeup
Palmball
Circle change
Splitter
![](img/changeup.svg){fig-align="center" fig-alt="Changeup offspeed pitch path"}
:::
::::
There's another couple of pitches that don't quite fall into any of those categories: eephus, knuckle, fork, gyro.
Source: [Baseball pitches illustrated by Lokesh Dhakar](https://lokeshdhakar.com/baseball-pitches-illustrated/)
## Set up code {.smaller visibility="hidden"}
```{r setup}
#| label: read in data
library(tidyverse)
library(ggthemes)
library(tidymodels)
library(tidyclust)
set.seed(42)
pitchtypes.raw <- "*CH - Changeup - Offspeed\n*CU - Curveball - Breaking\n*FA - Fastball - Fastball\n*FT - Two Seamer - Fastball\n*FF - Four Seamer - Fastball\n*FC - Cutter - Fastball\n*SL - Slider - Breaking\n*FS - Splitter - Offspeed\n*SI - Sinker - Fastball\n*FO - Forkball - Other\n*KN - Knuckleball - Other\n*KC - Knuckle Curve - Breaking\n*SC - Screwball - Breaking\n*GY - Gyroball - Other\n*EP - Eephus - Other\n*PO - Pitchout - Other\n*IN - Intentional Ball - Other\n*AB - Automatic Ball - Other\n*AS - Automatic Strike - Other\n*UN - Unknown - Other" %>%
str_match_all("(\\w{2}) - (\\w+\\s?\\w+) - (\\w+)")
pitchtypes <- pitchtypes.raw[[1]] %>% as_tibble() %>% select(pitchType=V2,pitch=V3,category=V4)
data.2013 <- read_csv("../data/2013.csv")
data.2014 <- read_csv("../data/2014.csv")
data.2015 <- read_csv("../data/2015.csv")
data.all <- bind_rows(data.2013,data.2014,data.2015)
```
## A quick look at data {.smaller visibility="hidden"}
```{r glimpse_data}
#| label: skim data
#| eval: false
data.all |>
select(pitcherId,pitchType,releaseVelocity,spinRate,spinDir,px,pz,szt,szb,x0,y0,z0,vx0,vy0,vz0,ax,ay,az) |>
mutate(pitchType=factor(pitchType)) |>
# skimr::skim()
glimpse()
```
## How many pitch types exist? {.smaller}
```{r pitch_distribution}
#| label: glimpse data
#| fig-align: center
#| fig-alt: "Distribution of pitch types in data"
#| echo: true
#| code-fold: true
pitch.types.df <- data.all |>
left_join(pitchtypes) |>
mutate(pitch=factor(pitch)) |>
group_by(pitch,category) |>
count() |>
arrange(desc(n))
pitch.types.df |>
ggplot(aes(x=reorder(pitch,-n),y=n)) +
geom_bar(stat="identity",fill="#999999") +
geom_text(aes(label=pitch,y=0),angle=90,nudge_y = 5000,hjust=0)+
labs(x="Pitch Type",
y="Number of Pitches") +
scale_y_continuous(labels=scales::label_comma()) +
ggthemes::theme_hc() +
theme(axis.ticks.x = element_blank(),
axis.text.x = element_blank())
```
## What data do we have to work with? {.smaller .nonincremental}
```{r clean_data}
#| label: filter data
pitch.data <- data.all |>
select(pitcherId,pitchType,releaseVelocity,spinRate,spinDir,px,pz,x0,z0,vx0,vy0,vz0,ax,ay,az) |>
left_join(pitchtypes) |>
tidyr::drop_na()
```
Variables that I use from the dataset:
::: {.fragment .highlight-current-blue}
- Pitcher ID (_pitcherId_)
- Pitch type (_pitchType_, _pitch_, _category_)
:::
::: {.fragment .highlight-current-blue}
- Release velocity (_releaseVelocity_)
- Velocity along the X axis of the field at the first measurement (_vx0_, _vy0_, _vz0_)
- Acceleration of the ball (_ax_, _ay_, _az_)
- Spin rate (_spinRate_)
- Spin direction (_spinDir_)
:::
::: {.fragment .highlight-current-blue}
- Distance from the horizontal center of the plate as the ball crosses the front plane of the plate (_px_)
- Height above the ground as the ball crosses the front plane of the plate (_pz_)
- Distance from the center of the plate at the first measurement (_x0_, _y0_, _z0_)
:::
## How many clusters should I use? {.smaller}
:::: {.columns}
::: {.column width="55%"}
```{r clustering}
#| eval: false
#| echo: true
#| code-fold: false
#| code-overflow: scroll
#| code-line-numbers: 2|7-8|14
pitch.data.cv <- vfold_cv(pitch.data, v=5)
km.stats.spec <- k_means(mode="partition",engine="stats",num_clusters=tune())
pitch.data.recipe <- recipe(~ .,
data = pitch.data) |>
step_rm(pitcherId, pitchType, pitch, category)
km.workflow <- workflow(pitch.data.recipe, km.stats.spec)
cluster.grid <- grid_regular(num_clusters(range=c(4,15)),
levels=12)
km.results <- tune_cluster(
km.workflow,
resamples=pitch.data.cv,
grid=cluster.grid,
control=control_grid(save_pred=TRUE, extract=identity),
metrics=cluster_metric_set(sse_within_total, sse_total, sse_ratio)
)
km.performance <- km.results |> collect_metrics()
km.performance |>
filter(.metric=="sse_ratio") |>
ggplot(aes(x=num_clusters, y=mean)) +
geom_point() +
geom_line() +
ggthemes::theme_hc() +
labs(x="Number of Clusters",
y="Mean WSS/TSS ratio over 5 folds") +
scale_x_continuous(breaks=4:15)
```
:::
::: {.column width="45%"}
![](img/elbow12.png)
:::
::::
## Clustering pitches together {.smaller}
```{r kmeans}
#| eval: true
#| echo: true
#| code-fold: true
#| code-overflow: scroll
n_clusts <- 10
km.spec.n <- k_means(mode="partition", engine="stats", num_clusters = n_clusts)
km.fit <- km.spec.n |>
fit(~ . -pitcherId -pitchType -pitch -category, data=pitch.data)
km.fit |> tidy() |>
select(cluster, size, releaseVelocity, spinRate, spinDir, px, pz, x0, z0, vx0, vy0, vz0, ax, ay, az) |>
knitr::kable(digits=2,format.args = list(big.mark = ",",scientific = FALSE)) |>
kableExtra::kable_styling(latex_options = "scale_down")
```
```{r distances}
#| eval: true
#| echo: false
#| code-fold: true
#| code-overflow: scroll
dists <- Rfast::dista(km.fit |>
extract_centroids() |>
select(-.cluster),
pitch.data |> select(-pitcherId, -pitchType, -pitch, -category),
trans = FALSE) |>
as_tibble() |>
setNames(km.fit |>
extract_centroids() |>
select(.cluster) |>
mutate(.cluster=str_replace(.cluster,"_"," ")) |>
pull())
# dists <- dists |>
# rowwise() |>
# mutate(dist_from_centroid = min(c_across(starts_with("C"))))
dists$dist_from_centroid <- apply(dists,1,FUN=min)
dists$.cluster_pred <- extract_cluster_assignment(km.fit) |>
mutate(.cluster=str_replace(.cluster,"_"," "),
.cluster=factor(.cluster,levels=sapply(seq(1,n_clusts),\(x) paste("Cluster", x)))) |>
pull()
dists$pitchType <- factor(pitch.data$pitch, levels=c("Four Seamer","Two Seamer", "Cutter", "Sinker", "Fastball", "Curveball", "Slider", "Knuckle Curve", "Screwball", "Changeup", "Splitter", "Knuckleball", "Forkball", "Eephus", "Intentional Ball", "Pitchout", "Automatic Ball", "Unknown"))
dists$category <- factor(pitch.data$category,levels=c("Fastball","Breaking","Offspeed","Other"))
```
## How did the model perform? {.smaller}
```{r model_performance}
#| eval: true
#| echo: true
#| code-fold: true
#| code-overflow: scroll
#| fig-align: center
#| fig-height: 7
#| #| fig-width: 10
dists |>
group_by(category, pitchType,.cluster_pred) |>
count() |>
ggplot(aes(x=pitchType,
y=.cluster_pred,
fill=.cluster_pred,
alpha=n,
label=scales::label_number(accuracy=1,scale_cut = cut_short_scale())(n),
group=category)) +
geom_tile() +
geom_text(aes(color=ifelse(n<15000,"white","black"))) +
labs(x="Pitch Type",
y="") +
scale_fill_discrete(type="viridis") +
scale_x_discrete(limits=rev) +
scale_color_identity() +
facet_grid(category~., scales="free", switch="y", space="free_y") +
coord_flip() +
guides(fill="none",alpha="none") +
ggthemes::theme_hc() +
theme(strip.placement="outside",
strip.background = element_blank(),
panel.grid.major.y = element_blank())
```
## What about those pitches that don't really fit? {.smaller}
```{r}
#| label: outliers
#| eval: true
#| echo: true
#| code-fold: true
#| code-overflow: scroll
#| fig-align: center
#| fig-height: 7
outliers <- dists |>
group_by(.cluster_pred) |>
mutate(outlier = ifelse(dist_from_centroid > quantile(dist_from_centroid, 0.75)+1.5*IQR(dist_from_centroid),"#DC3220","#212529")) |>
mutate(outlier = ifelse(dist_from_centroid < quantile(dist_from_centroid, 0.25)-1.5*IQR(dist_from_centroid),"#DC3220",outlier))
outliers |>
ggplot(aes(x=.cluster_pred,y=dist_from_centroid)) +
geom_boxplot(outlier.shape=NA) +
geom_jitter(data=outliers |> filter(outlier=="#DC3220"), aes(color=outlier), width=.15, alpha=.1) +
labs(y="Distance from Centroid",
x="") +
scale_color_identity() +
guides(color="none") +
ggthemes::theme_hc()
```
## What do those outliers look like? {.smaller}
::: {.r-stack}
::: {.fragment}
```{r}
#| eval: true
#| echo: true
#| code-fold: true
#| code-overflow: scroll
#| fig-align: center
#| fig-height: 6
outliers |>
filter(.cluster_pred=="Cluster 1") |>
group_by(category,pitchType,outlier) |>
count() |> group_by(outlier) |>
mutate(pct=n/sum(n),
pct=ifelse(outlier=="#212529",-1*pct,pct)) |>
ggplot(aes(x=pitchType,y=pct,fill=outlier)) +
geom_col() +
facet_grid(category~., scales="free", switch="y", space="free_y") +
scale_y_continuous(breaks = c(-1,-.5,0,.5,1),
limits = c(-1,1),
labels=c("100%","50%","0%","50%","100%")) +
scale_fill_identity() +
coord_flip() +
labs(x="Pitch Type",
y="",
title="Proportion of Pitch Types assigned to Cluster 1") +
ggthemes::theme_hc() +
theme(strip.placement="outside",
strip.background = element_blank())
```
:::
::: {.fragment .fade-in-then-out}
```{r}
#| eval: true
#| echo: true
#| code-fold: true
#| code-overflow: scroll
#| fig-align: center
#| fig-height: 6
outliers |>
filter(.cluster_pred=="Cluster 5") |>
group_by(category,pitchType,outlier) |>
count() |> group_by(outlier) |>
mutate(pct=n/sum(n),
pct=ifelse(outlier=="#212529",-1*pct,pct)) |>
ggplot(aes(x=pitchType,y=pct,fill=outlier)) +
geom_col() +
facet_grid(category~., scales="free", switch="y", space="free_y") +
scale_y_continuous(breaks = c(-1,-.5,0,.5,1),
limits = c(-1,1),
labels=c("100%","50%","0%","50%","100%")) +
scale_fill_identity() +
coord_flip() +
labs(x="Pitch Type",
y="",
title="Proportion of Pitch Types assigned to Cluster 5") +
ggthemes::theme_hc() +
theme(strip.placement="outside",
strip.background = element_blank())
```
:::
::: {.fragment .fade-in-then-out}
```{r}
#| eval: true
#| echo: true
#| code-fold: true
#| code-overflow: scroll
#| fig-align: center
#| fig-height: 6
outliers |>
filter(.cluster_pred=="Cluster 10") |>
group_by(category,pitchType,outlier) |>
count() |> group_by(outlier) |>
mutate(pct=n/sum(n),
pct=ifelse(outlier=="#212529",-1*pct,pct)) |>
ggplot(aes(x=pitchType,y=pct,fill=outlier)) +
geom_col() +
facet_grid(category~., scales="free", switch="y", space="free_y") +
scale_y_continuous(breaks = c(-1,-.5,0,.5,1),
limits = c(-1,1),
labels=c("100%","50%","0%","50%","100%")) +
scale_fill_identity() +
coord_flip() +
labs(x="Pitch Type",
y="",
title="Proportion of Pitch Types assigned to Cluster 10") +
ggthemes::theme_hc() +
theme(strip.placement="outside",
strip.background = element_blank())
```
:::
:::
## Thank you
::: {id="credit"}
### Christopher Teixeira
[christopherteixeira.com](https://www.christopherteixeira.com)
:::
[{{< fa envelope size=xl >}} chris@christopherteixeira.com](mailto:chris@christopherteixeira.com)
[{{< fa brands linkedin size=xl >}} in/christopherteixeira](https://www.linkedin.com/in/christopherteixeira/)
[{{< fa brands github size=xl >}} ct-analytics](https://github.com/ct-analytics)