How To: Analyze and Plot Three Point Shooting Trends

Sravan January 31, 2024 [NBA] #shooting #3PT #win-loss #trends #tutorial

This tutorial goes through the process of generating figures in my Three Point Shooting Trends blog. Please read the blog first before continuing on to the tutorial. In this tutorial you will learn the following:

  1. Scrape league average stats for several seasons using nba_api.
  2. Plot using plotnine which is a Python port of R's ggplot. This package is how I generate most of my visualizations.
  3. Saving data in different formats including: csv, parquet & feather. We'll go through pros and cons of each.
  4. Processing data using pandas

If you want to run the code yourself while reading the tutorial, you can find the notebook version of this tutorial on my github:

https://github.com/sravanpannala/NBA-Tutorials/blob/main/three_pt_shooting_trends/plotting_three_pt_shooting_trends.ipynb

First, let's start by loading the basic packages required for this tutorial.

import pandas as pd  # for processing data
import numpy as np  # for numerical operations on arrays
from tqdm import tqdm  # gives up progress bar

# don't raise warnings when chaining pandas operations
pd.options.mode.chained_assignment = None

data_DIR = "./data/"

Scraping the Data

Now, let's load the function required for scraping league average data.

from nba_api.stats.endpoints import leaguedashteamstats
year = 2023
# the function requires season to be in the form "2023-24" instead of "2024"
season = str(year) + "-" + str(year + 1)[-2:]
# Get per game stats for a regular season
stats = leaguedashteamstats.LeagueDashTeamStats(per_mode_detailed="PerGame", season_type_all_star="Regular Season", season = season)
# Get data frames
df = stats.get_data_frames()[0]
# We don't need all columns, so we load only a subset 
df = df[["TEAM_NAME","W","L","W_PCT","FG3M","FG3A","FG3_PCT"]]
df.head(2)
TEAM_NAME W L W_PCT FG3M FG3A FG3_PCT
0 Atlanta Hawks 20 27 0.426 13.6 37.7 0.361
1 Boston Celtics 37 11 0.771 16.2 42.6 0.380
# Sum all the columns and append sum as a new row
df.loc["Total"] = df.sum()
df.tail(1)
TEAM_NAME W L W_PCT FG3M FG3A FG3_PCT
Total Atlanta HawksBoston CelticsBrooklyn NetsCharlo... 701 701 14.981 384.6 1049.6 10.997
# Average the makes and attempts for 30 teams
df.loc["Total","FG3M"] = df.loc["Total","FG3M"]/30
df.loc["Total","FG3A"] = df.loc["Total","FG3A"]/30 
# Get the percentage by dividing makes column with attempts column
df.loc["Total","FG3_PCT"] = df.loc["Total","FG3M"]/df.loc["Total","FG3A"]
# Rename the team_name column entry of total row to the season
df.loc["Total","TEAM_NAME"] = year +1
# round the columns to three decimals
df.iloc[:,4:] = df.iloc[:,4:].round(3)
# get only the last row i.e. the total row
dft = df.tail(1)
dft = dft[["TEAM_NAME","FG3M", "FG3A", "FG3_PCT"]]
dft
TEAM_NAME FG3M FG3A FG3_PCT
Total 2024 12.82 34.987 0.366

Now let's run the above code for the past 10 seasons

dfa = []
for year in tqdm(range(2014,2024)):
    # the function requires season to be in the form "2023-24" instead of "2024"
    season = str(year) + "-" + str(year + 1)[-2:]
    # Get per game stats for a regular season
    stats = leaguedashteamstats.LeagueDashTeamStats(per_mode_detailed="PerGame", season_type_all_star="Regular Season", season = season)
    # Get data frames
    df = stats.get_data_frames()[0]
    # We don't need all columns, so we load only a subset 
    df = df[["TEAM_NAME","W","L","W_PCT","FG3M","FG3A","FG3_PCT"]]
    
    df.loc["Total"] = df.sum()
    df.loc["Total","FG3M"] = df.loc["Total","FG3M"]/30
    df.loc["Total","FG3A"] = df.loc["Total","FG3A"]/30 
    df.loc["Total","FG3_PCT"] = df.loc["Total","FG3M"]/df.loc["Total","FG3A"]
    df.loc["Total","TEAM_NAME"] = year +1
    df.iloc[:,4:] = df.iloc[:,4:].round(3)
    dft = df.tail(1)
    dft = dft[["TEAM_NAME","FG3M", "FG3A", "FG3_PCT"]]
    dfa.append(dft)
df_3pct = pd.concat(dfa)
 80%|████████  | 8/10 [00:06<00:01,  1.14it/s]

100%|██████████| 10/10 [00:08<00:00,  1.17it/s]
# rebane team name column to season
df_3pct = df_3pct.rename(columns={"TEAM_NAME":"Season"}).reset_index(drop=True)
df_3pct
Season FG3M FG3A FG3_PCT
0 2015 7.850 22.410 0.350
1 2016 8.507 24.083 0.353
2 2017 9.650 27.003 0.357
3 2018 10.487 28.997 0.362
4 2019 11.360 32.007 0.355
5 2020 12.193 34.103 0.358
6 2021 12.707 34.637 0.367
7 2022 12.440 35.180 0.354
8 2023 12.340 34.207 0.361
9 2024 12.820 34.987 0.366

Plotting the Data using Plotnine

Now that we have the data, let's plot it. We'll do it using plotnine package. I recommend reading the documentation and going through the tutorials there.
Let's load the modules from plotnine, we'll use in plotting:

from plotnine import ggplot, aes, geom_line, geom_point, labs
p = (
    # load the dataframe here
    ggplot(df_3pct)
    # define the x and y data i.e. aesthetics
    + aes(x="Season", y= "FG3A")
    # line plot
    + geom_line(group=1)
    # scatter plot
    + geom_point()
    # Set title and caption
    + labs(
            title= "3PT Attempts per Game 2014-2023",
            caption="@SravanNBA | source:nba.com/stats",
        )
)
# draw the plot
p.draw()

png

Wow, that was easy. The default plot here is much better formatted compared to a default matplotlib plot. This is what makes using plotnine fun and easy. We need to just worry about the data and then plotnine takes care about the formatting.
Now, lets make the plot even better looking by adding a theme. We need to import some things again from plotnine

from plotnine import themes, theme, element_text

We are going to be using the xkcd theme. For a list of available themes refer to plotnine themes documentation. Let's define the theme now:

theme_idv = themes.theme_xkcd(base_size=11)

I want to slightly modify the theme by making the title bold and increase the title font size to 16. That can be done by updating the theme like so:

theme_idv += theme(
    plot_title=element_text(face="bold", size=16),
)

Now, we add the theme to the previous plot by just adding this line to the code + theme_idv

p = (
    ggplot(df_3pct)
    + aes(x="Season", y= "FG3A")
    + geom_line(group=1)
    + geom_point()
    + labs(
            title= "3PT Attempts per Game 2014-2023",
            caption="@SravanNBA | source:nba.com/stats",
        )
    # adding the new theme
    + theme_idv
)
p.draw()

png

Wasn't that easy? Now the graph with xkcd looks much better than the vanilla theme we saw before. Similarly we can generate plot for 3PT Makes by just changing y="FGM" in the code. For generating 3PT percentage trends, we set y="FG3_PCT". Once you start getting used to plotnine, visualizing data will become much easier.

p = (
    ggplot(df_3pct)
    # we change the y aesthetic here
    + aes(x="Season", y= "FG3_PCT")
    + geom_line(group=1)
    + geom_point()
    + labs(
            title= "3PT Attempts per Game 2014-2023",
            caption="@SravanNBA | source:nba.com/stats",
        )
    + theme_idv
)
p.draw()

png

3FG% vs Wins Year to Year when Team 3P% > Opponent 3P% + 5%

Scraping Data and saving them in different file formats

Now, let's load the function required for boxscores, which are required to get individual game 3FG% and whether a team has won or not

from nba_api.stats.endpoints import leaguegamelog
season = "2022"
# api call for getting the data
stats = leaguegamelog.LeagueGameLog(
    player_or_team_abbreviation="T",
    season=season,
    season_type_all_star="Regular Season",
)
df = stats.get_data_frames()[0]
# define the file name string
file_name = data_DIR + "NBA_BoxScores_" + "Standard" + "_" + season 

pandas offers us several ways of saving dataframes. The most commonly used one is to_csv() which saves the data in a csv format. csv I want to introduce you to other formats of storing data including parquet and feather.

df.to_csv(file_name + ".csv")
df.to_parquet(file_name + ".parquet")
df.to_feather(file_name + ".feather")

New, let's try to evaluate how long it takes to save the file by using timeit magic function. Read more about jupyter/IPython built-in magic commands here.

%%timeit
df.to_csv(file_name + ".csv")
32.8 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.to_parquet(file_name + ".parquet")
19.3 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.to_feather(file_name + ".feather")
10.3 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Time taken to save: feather <parquet < csv
Let's look at the file sizes

import os
file_stats = os.stat(file_name + ".csv")
file_size = file_stats.st_size >> 10
print(f"{file_size} kB")
355 kB
file_stats = os.stat(file_name + ".parquet")
file_size = file_stats.st_size >> 10
print(f"{file_size} kB")
88 kB
file_stats = os.stat(file_name + ".feather")
file_size = file_stats.st_size >> 10
print(f"{file_size} kB")
256 kB

So,File size: parquet < feather < csv

I personally use parquet files over csv for the below reasons:

Parquet Files

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Now let's back to scraping the data:
First we save the data in parquet format for all 10 seasons

season_start = 2014
season_end = 2024
seasons = np.arange(season_start, season_end, 1).astype(str)
for season in tqdm(seasons):
    stats = leaguegamelog.LeagueGameLog(
        player_or_team_abbreviation="T",
        season=season,
        season_type_all_star="Regular Season",
    )
    df = stats.get_data_frames()[0]
    df.to_parquet(data_DIR + "NBA_BoxScores_" + "Standard" + "_" + season + ".parquet")
100%|██████████| 10/10 [00:11<00:00,  1.17s/it]

Processing Data

Let's load data for a season

year = 2023
df = pd.read_parquet(data_DIR + f"NBA_BoxScores_Standard_{year}.parquet")
df.head(2)
SEASON_ID TEAM_ID TEAM_ABBREVIATION TEAM_NAME GAME_ID GAME_DATE MATCHUP WL MIN FGM ... DREB REB AST STL BLK TOV PF PTS PLUS_MINUS VIDEO_AVAILABLE
0 22023 1610612747 LAL Los Angeles Lakers 0022300061 2023-10-24 LAL @ DEN L 240 41 ... 31 44 23 5 4 12 18 107 -12 1
1 22023 1610612743 DEN Denver Nuggets 0022300061 2023-10-24 DEN vs. LAL W 240 48 ... 33 42 29 9 6 12 15 119 12 1

2 rows × 29 columns

Each row, only has that team's data but we need both teams data in a single row to compare 3PT shooting between two teams. To have data for both teams in the same row, we follow same steps as in the previous tutorial.

df = df[["GAME_ID","TEAM_NAME","FG3A","FG3M","FG3_PCT","PLUS_MINUS"]]
df1 = df.groupby("GAME_ID")
df1_1 = df1.nth(0)
df1_2 = df1.nth(1)
df1_1.columns = ["GAME_ID"] + [s + "1" for s in df1_1.columns if s != "GAME_ID"]
df1_2.columns = ["GAME_ID"] + [s + "2" for s in df1_2.columns if s != "GAME_ID"]
df1_3 = pd.merge(df1_1,df1_2)
df1_4 = df1.nth(1)
df1_5 = df1.nth(0)
df1_4.columns = ["GAME_ID"] + [s + "1" for s in df1_4.columns if s != "GAME_ID"]
df1_5.columns = ["GAME_ID"] + [s + "2" for s in df1_5.columns if s != "GAME_ID"]
df1_6 = pd.merge(df1_4,df1_5)
df2 = pd.concat([df1_3,df1_6]).sort_values("GAME_ID").reset_index(drop=True)
df2.head(2)
GAME_ID TEAM_NAME1 FG3A1 FG3M1 FG3_PCT1 PLUS_MINUS1 TEAM_NAME2 FG3A2 FG3M2 FG3_PCT2 PLUS_MINUS2
0 0022300001 Indiana Pacers 31 15 0.484 5 Cleveland Cavaliers 28 8 0.286 -5
1 0022300001 Cleveland Cavaliers 28 8 0.286 -5 Indiana Pacers 31 15 0.484 5
df3 = df2[["TEAM_NAME1","TEAM_NAME2","FG3_PCT1", "FG3_PCT2","PLUS_MINUS1","PLUS_MINUS2"]]
# 3 Point shooting of team > 5% of opponent
df3["More_3PT_PCT"] = df3["FG3_PCT1"] > (df3["FG3_PCT2"] + 0.05) 
# Define Win and Loss
df3["Win"] = df3["PLUS_MINUS1"] > 0
df3["Loss"] = df3["PLUS_MINUS1"] < 0
# Wins when 3FG% of team > 3FG% of opponent + 5%
df3["Win_More_3PT_PCT"] = df3["More_3PT_PCT"] & df3["Win"]
df3["Loss_More_3PT_PCT"] = df3["More_3PT_PCT"] & df3["Loss"]
# Aggregate all values together for a team
df4 = df3.groupby("TEAM_NAME1")[["Win_More_3PT_PCT","Loss_More_3PT_PCT"]]\
    .agg({"Win_More_3PT_PCT":["sum"],"Loss_More_3PT_PCT":["sum"]})
df4.columns = ["Win_More_3PT_PCT","Loss_More_3PT_PCT"]
df4 = df4.sort_values(by="Win_More_3PT_PCT",ascending=False).reset_index().rename(columns={"TEAM_NAME1":"Team"})
# Sum values from all teams to total row for the season
df4.loc['Total']= df4.sum()
df4.loc['Total','Team'] = "Total"
df4.loc['Total','Team'] = year +1
# get the total row out as a dataframe
dft = df4.tail(1)

Now, we have to run the same process for all 10 seasons

dfa = []
for year in range(2014,2024):
    df = pd.read_parquet(data_DIR + f"NBA_BoxScores_Standard_{year}.parquet")
    df = df[["GAME_ID","TEAM_NAME","FG3A","FG3M","FG3_PCT","PLUS_MINUS"]]
    df1 = df.groupby("GAME_ID")
    df1_1 = df1.nth(0)
    df1_2 = df1.nth(1)
    df1_1.columns = ["GAME_ID"] + [s + "1" for s in df1_1.columns if s != "GAME_ID"]
    df1_2.columns = ["GAME_ID"] + [s + "2" for s in df1_2.columns if s != "GAME_ID"]
    df1_3 = pd.merge(df1_1,df1_2)
    df1_4 = df1.nth(1)
    df1_5 = df1.nth(0)
    df1_4.columns = ["GAME_ID"] + [s + "1" for s in df1_4.columns if s != "GAME_ID"]
    df1_5.columns = ["GAME_ID"] + [s + "2" for s in df1_5.columns if s != "GAME_ID"]
    df1_6 = pd.merge(df1_4,df1_5)
    df2 = pd.concat([df1_3,df1_6]).sort_values("GAME_ID").reset_index(drop=True)
    df3 = df2[["TEAM_NAME1","TEAM_NAME2","FG3_PCT1", "FG3_PCT2","PLUS_MINUS1","PLUS_MINUS2"]]
    df3["More_3PT_PCT"] = df3["FG3_PCT1"] > (df3["FG3_PCT2"] + 0.05) 
    df3["Win"] = df3["PLUS_MINUS1"] > 0
    df3["Loss"] = df3["PLUS_MINUS1"] < 0
    df3["Win_More_3PT_PCT"] = df3["More_3PT_PCT"] & df3["Win"]
    df3["Loss_More_3PT_PCT"] = df3["More_3PT_PCT"] & df3["Loss"]
    df4 = df3.groupby("TEAM_NAME1")[["Win_More_3PT_PCT","Loss_More_3PT_PCT"]]\
        .agg({"Win_More_3PT_PCT":["sum"],"Loss_More_3PT_PCT":["sum"]})
    df4.columns = ["Win_More_3PT_PCT","Loss_More_3PT_PCT"]
    df4 = df4.sort_values(by="Win_More_3PT_PCT",ascending=False).reset_index().rename(columns={"TEAM_NAME1":"Team"})
    df4.loc['Total']= df4.sum()
    df4.loc['Total','Team'] = "Total"
    df4.loc['Total','Team'] = year +1
    dft = df4.tail(1)
    dfa.append(dft)
df_pw = pd.concat(dfa).reset_index(drop=True)
df_pw = df_pw.rename(columns={"Team":"Season"})
df_pw["Win_PCT"] = (df_pw["Win_More_3PT_PCT"]/(df_pw["Win_More_3PT_PCT"]+df_pw["Loss_More_3PT_PCT"])).round(3)

Here is the dataframe which we will plot next

df_pw
Season Win_More_3PT_PCT Loss_More_3PT_PCT Win_PCT
0 2015 655 252 0.722
1 2016 656 222 0.747
2 2017 693 195 0.780
3 2018 635 210 0.751
4 2019 661 189 0.778
5 2020 565 171 0.768
6 2021 600 143 0.808
7 2022 655 166 0.798
8 2023 635 163 0.796
9 2024 351 98 0.782
p = (
    ggplot(df_pw)
    + aes(x="Season", y= "Win_PCT")
    + geom_line(group=1)
    + geom_point()
    + labs(
            title= "Team Win % When Having 5% higher 3PT% than Opponent",
            caption="@SravanNBA | source:nba.com/stats",
            y = "Win %",
        )
    + theme_idv
    + theme(plot_title=element_text(face="bold", size=15))
)
p.draw()

png

Now, lets format the Y axis to show data in percentage instead of fraction

from mizani.formatters import percent_format
from plotnine import scale_y_continuous
p = (
    ggplot(df_pw)
    + aes(x="Season", y= "Win_PCT")
    + geom_line(group=1)
    + geom_point()
    # Change Y scale to percent
    + scale_y_continuous(labels=percent_format())
    + labs(
            title= "Team Win % When Having 5% higher 3PT% than Opponent",
            caption="@SravanNBA | source:nba.com/stats",
            y = "Win %",
        )
    + theme_idv
    + theme(plot_title=element_text(face="bold", size=15))
)
p.draw()

png

Finally you can save the figure using the save command

p.save("FG3PCT_Wins_seasons.png", width=6.2, height=4.8, dpi=300)
c:\Users\pansr\AppData\Local\pypoetry\Cache\virtualenvs\nba-ub9Z_EQq-py3.11\Lib\site-packages\plotnine\ggplot.py:587: PlotnineWarning: Saving 6.2 x 4.8 in image.
c:\Users\pansr\AppData\Local\pypoetry\Cache\virtualenvs\nba-ub9Z_EQq-py3.11\Lib\site-packages\plotnine\ggplot.py:588: PlotnineWarning: Filename: FG3PCT_Wins_seasons.png

Reminder

You can find the notebook version of this tutorial on my github: https://github.com/sravanpannala/NBA-Tutorials/blob/main/three_pt_shooting_trends/plotting_three_pt_shooting_trends.ipynb