How To: Adjusting NBA Teams' Offensive and Defensive Ratings using Strength of Schedule

Sravan January 01, 2024 [NBA] #strength-of-schedule #team-ratings #tutorial

This tutorial walks through my process for adjusting NBA teams' offensive and defensive ratings for strength of schedule (SoS). Read my blog post on the same topic before going any further. It explains the details, including the math needed to understand the code in this tutorial.

The code for the RAPM-style approach is adapted from Ryan Davis' RAPM tutorial, and I suggest you read that tutorial before continuing.

If you want to run the code yourself while reading the tutorial, you can find the notebook version of this tutorial on my GitHub:

https://github.com/sravanpannala/NBA-Tutorials/blob/main/sos_adjusted_ratings/how_to_adjust_nba_team_ratings_for_sos.ipynb

First let's import the necessary packages to run this code:

import pandas as pd  # for processing data
import numpy as np  # for numerical operations on arrays
from tqdm import tqdm  # gives us a progress bar
import time  # for time related stuff
from sklearn.linear_model import RidgeCV

# don't raise warnings when chaining pandas operations
pd.options.mode.chained_assignment = None

Then we will load the team information into two variables. There are 30 teams in the NBA, and each team has a name and a team ID:

  1. teams_list will have a list of all team IDs
  2. teams_dict is a dictionary mapping the team IDs to the team names.
team_data = pd.read_csv("../data/NBA_teams_database.csv")
teams_list = team_data["TeamID"].tolist()
team_dict1 = team_data.to_dict(orient="records")
teams_dict = {team["TeamID"]: team["Team"] for team in team_dict1}
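To make the two lookups concrete, here is a minimal sketch on a hypothetical two-row stand-in for NBA_teams_database.csv. The column names TeamID and Team come from the code above. The inline rows and the toy_ names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the CSV (two of the 30 teams)
toy_team_data = pd.DataFrame(
    {
        "TeamID": [1610612754, 1610612739],
        "Team": ["Indiana Pacers", "Cleveland Cavaliers"],
    }
)
toy_teams_list = toy_team_data["TeamID"].tolist()
# dict(zip(...)) builds the same ID -> name mapping in one step
toy_teams_dict = dict(zip(toy_team_data["TeamID"], toy_team_data["Team"]))
print(toy_teams_list)
print(toy_teams_dict[1610612754])
```

The order of toy_teams_list matters later on, because a team's position in that list decides which matrix column it gets.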

Scraping the Data Required

This section covers the scraping part of the tutorial. You can skip it and go to the next section if you wish, since the data has already been scraped and is available for the 2023-24 season in the data folder.
We will be using nba_api to get the necessary data. It should already be installed if you followed the instructions in the Readme. The team ratings, i.e. offensive, defensive and net ratings, can be found for each game by using the boxscoreadvancedv3 endpoint. This endpoint needs the GameID to get the box scores for both teams in that game. To get the GameIDs for all games played in the 2023-24 season, we will use the leaguegamelog endpoint.

from nba_api.stats.endpoints import leaguegamelog, boxscoreadvancedv3

# for 2023-24 season
season = "2023"
# get the information
stats = leaguegamelog.LeagueGameLog(
    player_or_team_abbreviation="T",
    season=season,
    season_type_all_star="Regular Season",
)
# output the information as pandas dataframe
df = stats.get_data_frames()[0]
# get the GameIDs as a list
game_ids = df["GAME_ID"].tolist()
# GameIDs are repeated twice, once for the home team and once for the away team
# We can use numpy unique to remove the duplicates
game_ids = np.unique(game_ids)
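As a quick illustration of the de-duplication step, here is a small sketch on a hypothetical list of GameIDs (the values are made up to mimic the format). Note that np.unique also sorts its result, so the games come out in ID order.

```python
import numpy as np

# Hypothetical GAME_ID column: one entry per team, so each ID shows up twice
toy_game_ids = ["0022300002", "0022300001", "0022300001", "0022300002"]
unique_ids = np.unique(toy_game_ids)
print(unique_ids.tolist())
```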

Now we have a list of game_ids to use with the boxscoreadvancedv3 endpoint. We put the game_ids in a for loop to get the data for each game as a dataframe, and append each game's dataframe to a list of dataframes, dfa. Finally, we use pandas.concat to concatenate all the dataframes into a single dataframe for the season.
This process might take a while (10-20 minutes, depending on the number of games played), so grab a coffee or a snack and come back after some time. There is a small (maybe big) issue if you just run a vanilla for loop. The stats.nba.com endpoint we use to scrape the data times out when it receives too many requests in a short period of time, which results in an error:

HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)

Any error will stop the for loop and we have to repeat again. To prevent this issue, we wrap the call to the endpoint in try except blocks and retry the endpoint for that gameId till it succeeds.
I found an elegant solution for this issue while creating this tutorial which is to use the tenacity package.

  1. We import the necessary modules from tenacity:
    1. retry: decorator to enable retries on the function
    2. stop_after_attempt: to define the maximum number of attempts. I set it as 5
    3. wait_fixed: to wait for a certain amount of fixed time before retrying. The number I use is 0.6 seconds as recommended by the authors of the nba_api
from tenacity import retry
from tenacity.stop import stop_after_attempt
from tenacity.wait import wait_fixed
  2. We add the retry decorator with the necessary options to the get_boxscores function, which has the try except block to handle errors
@retry(stop=stop_after_attempt(5), wait=wait_fixed(0.6))
def get_boxscores(game_id):
    try:
        stats = boxscoreadvancedv3.BoxScoreAdvancedV3(game_id=game_id)
        df1 = stats.get_data_frames()[1]
    except Exception as error:
        # log the error, then re-raise it so that tenacity retries the call
        print(error)
        raise
    return df1
  3. Now we run the for loop with the decorated get_boxscores function. Finally, we save the scraped data as a csv file in the data folder.
dfa = []
for game_id in tqdm(game_ids):
    df1 = get_boxscores(game_id)
    dfa.append(df1)
df = pd.concat(dfa)
df.to_csv(f"./data/NBA_BoxScores_Adv_{season}.csv")
 13%|█▎        | 49/363 [00:59<06:26,  1.23s/it]

HTTPSConnectionPool(host='stats.nba.com', port=443): Max retries exceeded with url: /stats/boxscoreadvancedv3?EndPeriod=0&EndRange=0&GameID=0022300050&RangeType=0&StartPeriod=0&StartRange=0 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x00000207782F5090>, 'Connection to stats.nba.com timed out. (connect timeout=30)'))


100%|██████████| 363/363 [05:27<00:00,  1.11it/s]
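To show what the retry behaviour amounts to, here is a hand-rolled sketch that uses only the standard library. It is not tenacity's implementation, and tenacity's own behaviour once the attempts are exhausted may differ. The names retry_fixed and flaky_fetch are hypothetical, and the simulated endpoint simply fails twice before succeeding.

```python
import time
from functools import wraps


def retry_fixed(max_attempts=5, wait_seconds=0.6):
    """Hypothetical stand-in for a retry decorator with a fixed wait."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as error:
                    print(f"attempt {attempt} failed: {error}")
                    if attempt == max_attempts:
                        raise  # give up after the last attempt
                    time.sleep(wait_seconds)

        return wrapper

    return decorator


calls = {"count": 0}


@retry_fixed(max_attempts=5, wait_seconds=0.01)
def flaky_fetch(game_id):
    # Simulated endpoint: times out twice, then succeeds
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("Read timed out.")
    return f"boxscore for {game_id}"


result = flaky_fetch("0022300050")
print(result, "after", calls["count"], "calls")
```

The key point is that a failed request only costs a short wait and another attempt for that one game, instead of aborting the whole loop.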

Loading and Pre-Processing the Data

Now let's load the data. The data has a lot of columns we don't use, so we import only the necessary columns by using the usecols option in pandas.read_csv().

season = "2023"
cols = [
    "gameId",
    "teamName",
    "teamId",
    "offensiveRating",
    "defensiveRating",
    "netRating",
    "possessions",
]
df = pd.read_csv(f"./data/NBA_BoxScores_Adv_{season}.csv", usecols=cols)
cols = ["gameId", "tId", "team", "ORtg", "DRtg", "NRtg", "poss"]
df.columns = cols
df.head(4)
gameId tId team ORtg DRtg NRtg poss
0 22300001 1610612754 Pacers 118.6 112.6 6.0 102.0
1 22300001 1610612739 Cavaliers 112.6 118.6 -6.0 103.0
2 22300002 1610612749 Bucks 110.0 104.0 6.0 100.0
3 22300002 1610612752 Knicks 104.0 110.0 -6.0 101.0

As you can see in the printed table, each gameId has two entries, one for each team in the game. Each row has only the information for that team. What we need is a combined row that also holds the opponent's information.
We will use pandas.groupby to achieve that, grouping on gameId. This operation creates a groupby object, on which further operations can be run.

df1 = df.groupby("gameId")
df1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000207787B7710>

We then use the nth operation to get the 1st and 2nd rows of each game.

df1_1 = df1.nth(0)
df1_2 = df1.nth(1)
display(df1_1.head(2))
display(df1_2.head(2))
gameId tId team ORtg DRtg NRtg poss
0 22300001 1610612754 Pacers 118.6 112.6 6.0 102.0
2 22300002 1610612749 Bucks 110.0 104.0 6.0 100.0
gameId tId team ORtg DRtg NRtg poss
1 22300001 1610612739 Cavaliers 112.6 118.6 -6.0 103.0
3 22300002 1610612752 Knicks 104.0 110.0 -6.0 101.0

We can then rename the columns of the 1st dataframe, appending the suffix 1 to all its column names except the gameId column (which is needed for the merging operation later). For the 2nd dataframe, we similarly append the suffix 2 to the column names.

df1_1.columns = ["gameId"] + [s + "1" for s in df1_1.columns if s != "gameId"]
df1_2.columns = ["gameId"] + [s + "2" for s in df1_2.columns if s != "gameId"]
display(df1_1.head(2))
display(df1_2.head(2))
gameId tId1 team1 ORtg1 DRtg1 NRtg1 poss1
0 22300001 1610612754 Pacers 118.6 112.6 6.0 102.0
2 22300002 1610612749 Bucks 110.0 104.0 6.0 100.0
gameId tId2 team2 ORtg2 DRtg2 NRtg2 poss2
1 22300001 1610612739 Cavaliers 112.6 118.6 -6.0 103.0
3 22300002 1610612752 Knicks 104.0 110.0 -6.0 101.0

We then merge the two dataframes df1_1 and df1_2 on the column gameId, generating the dataframe we need.

df1_3 = pd.merge(df1_1, df1_2, on="gameId")
display(df1_3.head(2))
gameId tId1 team1 ORtg1 DRtg1 NRtg1 poss1 tId2 team2 ORtg2 DRtg2 NRtg2 poss2
0 22300001 1610612754 Pacers 118.6 112.6 6.0 102.0 1610612739 Cavaliers 112.6 118.6 -6.0 103.0
1 22300002 1610612749 Bucks 110.0 104.0 6.0 100.0 1610612752 Knicks 104.0 110.0 -6.0 101.0

One more step remains. What we have right now is one row for each game, but what we need is two rows for each game, as described in my blog post. To get that dataframe, we repeat the process above with 0 and 1 flipped when performing the nth operation. Finally, we concatenate the two dataframes df1_3 and df1_6 to get the combined dataframe with two rows for each game.

df1_4 = df1.nth(1)
df1_5 = df1.nth(0)
df1_4.columns = ["gameId"] + [s + "1" for s in df1_4.columns if s != "gameId"]
df1_5.columns = ["gameId"] + [s + "2" for s in df1_5.columns if s != "gameId"]
df1_6 = pd.merge(df1_4, df1_5, on="gameId")
df2 = pd.concat([df1_3, df1_6]).sort_values(by="gameId").reset_index(drop=True)
data = df2.copy()
data.head(4)
gameId tId1 team1 ORtg1 DRtg1 NRtg1 poss1 tId2 team2 ORtg2 DRtg2 NRtg2 poss2
0 22300001 1610612754 Pacers 118.6 112.6 6.0 102.0 1610612739 Cavaliers 112.6 118.6 -6.0 103.0
1 22300001 1610612739 Cavaliers 112.6 118.6 -6.0 103.0 1610612754 Pacers 118.6 112.6 6.0 102.0
2 22300002 1610612752 Knicks 104.0 110.0 -6.0 101.0 1610612749 Bucks 110.0 104.0 6.0 100.0
3 22300002 1610612749 Bucks 110.0 104.0 6.0 100.0 1610612752 Knicks 104.0 110.0 -6.0 101.0
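The same two-rows-per-game layout can also be sketched as a single self-merge. This is an alternative, not the approach used in this tutorial: it pairs every row with every row of the same game and then drops the rows where a team is paired with itself. The games dataframe below is a hypothetical slice with only three of the columns.

```python
import pandas as pd

# Hypothetical two-game slice with the same column names as df above
games = pd.DataFrame(
    {
        "gameId": [22300001, 22300001, 22300002, 22300002],
        "tId": [1610612754, 1610612739, 1610612749, 1610612752],
        "ORtg": [118.6, 112.6, 110.0, 104.0],
    }
)
# Pair every row with every row of the same game, then drop self-pairings
paired = games.merge(games, on="gameId", suffixes=("1", "2"))
paired = paired[paired["tId1"] != paired["tId2"]].reset_index(drop=True)
print(paired)
```

Each game ends up with two rows, one from each team's point of view, which is the shape the regression needs.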

Processing the Data

To process the data in a format required by the Ridge Regression algorithm RidgeCV, we define the following functions:

map_teams()

  1. Makes the matrix rows to be used in ridge regression
  2. The weight for each team is 1/2
  3. The equations per game, where hats denote the adjusted ratings being estimated, are:
    $$\frac{1}{2}\hat{Team}^1_{OFF} + \frac{1}{2}\hat{Team}^2_{DEF} = Team^1_{OFF} $$ $$\frac{1}{2}\hat{Team}^2_{OFF} + \frac{1}{2}\hat{Team}^1_{DEF} = Team^2_{OFF} $$
  4. The reason for doing this is that for the unadjusted values of a game: $$ Team^1_{OFF} = Team^2_{DEF} $$
  5. So, $$ Team^1_{OFF} = 0.5\times Team^1_{OFF} + 0.5\times Team^2_{DEF} $$
  6. Therefore I use a similar structure for estimating the adjusted ratings
def map_teams(row_in, teams, scale):
    t1 = row_in[0]
    t2 = row_in[1]

    rowOut = np.zeros([len(teams) * 2])
    rowOut[teams.index(t1)] = scale
    rowOut[teams.index(t2) + len(teams)] = scale

    return rowOut
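A small worked example may help here. The sketch below repeats the same logic on a hypothetical three-team league (the IDs 10, 20 and 30 are made up), with team 20 on offense against team 30.

```python
import numpy as np


def map_teams(row_in, teams, scale):
    # Same logic as above: offense slot for team 1, defense slot for team 2
    row_out = np.zeros(len(teams) * 2)
    row_out[teams.index(row_in[0])] = scale
    row_out[teams.index(row_in[1]) + len(teams)] = scale
    return row_out


toy_teams = [10, 20, 30]  # hypothetical team IDs
row = map_teams([20, 30], toy_teams, scale=0.5)
print(row)
```

The first three entries are the offense slots and the last three are the defense slots, so the row is 0.5 at index 1 (team 20's offense) and at index 5 (team 30's defense), and zero elsewhere.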

convert_to_matrices()

  1. Extracts the team ID columns tId1 and tId2 from each row of the data dataframe
  2. Maps those rows using the map_teams function to get the rows of matrix X
  3. Gets the Y rows. Here Y is ORtg1, i.e. we are trying to predict the offensive rating of the 1st team for every row
def convert_to_matricies(possessions, name, teams, scale=1):
    # extract only the columns we need
    # Convert the columns of team ids into a numpy matrix
    stints_x_base = possessions[["tId1", "tId2"]].to_numpy()
    # Apply our mapping function to each row of the numpy matrix
    stint_X_rows = np.apply_along_axis(map_teams, 1, stints_x_base, teams, scale=scale)
    # Convert the column of target values into a numpy array
    stint_Y_rows = possessions[name].to_numpy()

    # return the X matrix and the Y target array
    return stint_X_rows, stint_Y_rows
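For readers who want to see the shape of the resulting matrix, here is a sketch that builds the same kind of X with NumPy indexing instead of apply_along_axis. The function name build_design_matrix and the toy team IDs are hypothetical, and this is an illustrative alternative rather than the code used in this tutorial.

```python
import numpy as np


def build_design_matrix(t1_ids, t2_ids, teams, scale=0.5):
    """Hypothetical vectorized equivalent of map_teams applied row by row."""
    position = {team: i for i, team in enumerate(teams)}
    n_teams = len(teams)
    rows = np.arange(len(t1_ids))
    X = np.zeros((len(t1_ids), 2 * n_teams))
    X[rows, [position[t] for t in t1_ids]] = scale  # offense block
    X[rows, [position[t] + n_teams for t in t2_ids]] = scale  # defense block
    return X


toy_teams = [10, 20, 30]
X = build_design_matrix([10, 20, 30], [20, 30, 10], toy_teams)
print(X.shape)  # (3, 6): one row per team-game, two columns per team
print(X.sum(axis=1))  # every row sums to 2 * scale
```

With 30 teams the real matrix has 60 columns, and every row has exactly two non-zero entries.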

lambda_to_alpha()

def lambda_to_alpha(lambda_value, samples):
    return (lambda_value * samples) / 2.0
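As a worked example of this conversion, the sketch below plugs in the three lambda values used later in this tutorial. The sample count is an assumption: 363 scraped games (the total shown in the progress bar above) with two rows per game.

```python
def lambda_to_alpha(lambda_value, samples):
    return (lambda_value * samples) / 2.0


# Assumption: 363 games scraped, two rows per game
n_samples = 363 * 2
for lam in [0.015, 0.075, 0.15]:
    print(lam, "->", round(lambda_to_alpha(lam, n_samples), 3))
```

Under that assumption the alphas come out to roughly 5.4, 27.2 and 54.5, so the penalty handed to the regression grows with the number of rows in the data.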

calculate_netrtg()

  1. Converts lambdas to alphas using lambda_to_alpha function
  2. Defines the ridge regression problem using scikit-learn's RidgeCV algorithm
  3. cv=5 is chosen i.e. k-fold cross-validation splitting strategy using k=5
  4. fit_intercept is set to True. The intercept is added later to our estimated coefficients to get the offensive and defensive ratings.
  5. Gets the coefficients and the intercept
  6. Adds the intercept to the coefficients to get the adjusted ratings. Uses the adjusted offensive and defensive ratings to calculate the adjusted net rating.
  7. Create and return adjusted ratings dataframe
def calculate_netrtg(train_x, train_y, lambdas, teams_list):
    alphas = [lambda_to_alpha(l, train_x.shape[0]) for l in lambdas]
    # create a 5 fold CV ridgeCV model. Our target data is not centered at 0, so we want to fit to an intercept.
    clf = RidgeCV(alphas=alphas, cv=5, fit_intercept=True)

    # fit our training data
    model = clf.fit(
        train_x,
        train_y,
    )

    # convert our list of teams into an mx1 matrix
    team_arr = np.transpose(np.array(teams_list).reshape(1, len(teams_list)))

    # extract our coefficients into the offensive and defensive parts
    coef_offensive_array = model.coef_[0 : len(teams_list)][np.newaxis].T
    coef_defensive_array = model.coef_[len(teams_list) : 2 * len(teams_list)][
        np.newaxis
    ].T
    # concatenate the offensive and defensive values with the team ids into an mx3 matrix
    team_id_with_coef = np.concatenate(
        [team_arr, coef_offensive_array, coef_defensive_array], axis=1
    )
    # build a dataframe from our matrix
    teams_coef = pd.DataFrame(team_id_with_coef)
    intercept = model.intercept_
    teams_coef.columns = ["tId", "aOFF", "aDEF"]
    teams_coef["aNET"] = teams_coef["aOFF"] - teams_coef["aDEF"]
    teams_coef["aOFF"] = teams_coef["aOFF"] + intercept
    teams_coef["aDEF"] = teams_coef["aDEF"] + intercept
    teams_coef["Team"] = teams_coef["tId"].map(teams_dict)
    results = teams_coef[["tId", "Team", "aOFF", "aDEF", "aNET"]]
    results = results.sort_values(by=["aNET"], ascending=False).reset_index(drop=True)
    return results, model, intercept
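To see why adding the intercept to the coefficients yields ratings, here is a sketch on a hypothetical three-team league. It does not call RidgeCV. Instead it solves a ridge problem in closed form with NumPy, on the assumption that this mirrors a ridge fit with an unpenalized intercept. The team strengths, the league average of 110 and the tiny alpha are all made up for illustration.

```python
import numpy as np

# Hypothetical 3-team league: deviations from a league average of 110
true_off = np.array([4.0, 0.0, -4.0])
true_def = np.array([-2.0, 0.0, 2.0])  # negative = allows fewer points
league_avg = 110.0

rows, targets = [], []
for i in range(3):
    for j in range(3):
        if i == j:
            continue
        x = np.zeros(6)
        x[i] = 0.5  # offense slot of the team with the ball
        x[3 + j] = 0.5  # defense slot of its opponent
        rows.append(x)
        targets.append(league_avg + 0.5 * true_off[i] + 0.5 * true_def[j])
X, y = np.array(rows), np.array(targets)

# Ridge with an unpenalized intercept: center, then solve the normal equations
alpha = 1e-6  # tiny penalty so the toy ratings are recovered almost exactly
Xc, yc = X - X.mean(axis=0), y - y.mean()
coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(6), Xc.T @ yc)
intercept = y.mean() - X.mean(axis=0) @ coef

adj_off = coef[:3] + intercept
adj_def = coef[3:] + intercept
print(np.round(adj_off, 2), np.round(adj_def, 2), round(intercept, 2))
```

In this toy setup the intercept lands at the league average and each coefficient is a team's deviation from it, which is why coefficient plus intercept reads as a rating. A larger alpha would shrink the coefficients toward zero, i.e. pull every team toward the league average.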

Estimating Adjusted Ratings

Next, we run the functions defined above to generate the adjusted ratings

train_x, train_y = convert_to_matricies(data, "ORtg1", teams_list, scale=0.5)
lambdas_net = [0.015, 0.075, 0.15]
results_adj, model, intercept = calculate_netrtg(
    train_x, train_y, lambdas_net, teams_list
)
print(f"Intercept = {intercept}")
Intercept = 114.2197043446658

The intercept here can be interpreted as the league average offensive/defensive rating. Here are the adjusted ratings.

results_adj
tId Team aOFF aDEF aNET
0 1.610613e+09 Philadelphia 76ers 121.065207 110.772873 10.292335
1 1.610613e+09 Boston Celtics 118.828331 108.764236 10.064095
2 1.610613e+09 Oklahoma City Thunder 117.702690 110.814878 6.887812
3 1.610613e+09 Minnesota Timberwolves 113.207243 106.628440 6.578803
4 1.610613e+09 Denver Nuggets 118.395144 113.090987 5.304157
5 1.610613e+09 LA Clippers 115.366218 111.064890 4.301329
6 1.610613e+09 Orlando Magic 113.446035 109.345141 4.100894
7 1.610613e+09 New York Knicks 117.214095 113.291210 3.922885
8 1.610613e+09 Houston Rockets 111.781191 107.967128 3.814063
9 1.610613e+09 Milwaukee Bucks 118.657846 115.338466 3.319381
10 1.610613e+09 Brooklyn Nets 116.937071 114.575413 2.361658
11 1.610613e+09 Indiana Pacers 122.514626 120.553872 1.960754
12 1.610613e+09 Dallas Mavericks 118.932015 117.355888 1.576127
13 1.610613e+09 New Orleans Pelicans 114.101714 113.092424 1.009290
14 1.610613e+09 Golden State Warriors 114.940190 114.182474 0.757716
15 1.610613e+09 Miami Heat 114.132399 113.518409 0.613991
16 1.610613e+09 Atlanta Hawks 118.941745 118.485097 0.456648
17 1.610613e+09 Phoenix Suns 116.960966 116.528575 0.432391
18 1.610613e+09 Cleveland Cavaliers 111.023001 110.911526 0.111475
19 1.610613e+09 Los Angeles Lakers 112.332601 112.222641 0.109959
20 1.610613e+09 Sacramento Kings 115.306403 115.211154 0.095249
21 1.610613e+09 Toronto Raptors 112.389538 114.521311 -2.131772
22 1.610613e+09 Chicago Bulls 111.271702 115.736148 -4.464446
23 1.610613e+09 Memphis Grizzlies 106.322251 113.173911 -6.851659
24 1.610613e+09 Portland Trail Blazers 106.938770 114.757369 -7.818600
25 1.610613e+09 Charlotte Hornets 112.456339 120.395203 -7.938864
26 1.610613e+09 Utah Jazz 110.533878 118.914809 -8.380931
27 1.610613e+09 Washington Wizards 111.230457 120.661742 -9.431285
28 1.610613e+09 San Antonio Spurs 107.330475 117.157667 -9.827192
29 1.610613e+09 Detroit Pistons 106.330988 117.557249 -11.226261

Finishing Touches

We're not done yet. Now we need to compare the adjusted ratings with the unadjusted ones. But, we haven't calculated the unadjusted ratings yet. Let's do it now.

For a single game: $$ PTS_{OFF} \times 100 = ORtg^1 \times poss^1 $$ $$ PTS_{DEF} \times 100 = DRtg^1 \times poss^1 $$

Applying these operations on the data dataframe:

data["pts_off"] = data["ORtg1"] * data["poss1"]
data["pts_def"] = data["DRtg1"] * data["poss1"]

We have to use the groupby operation again, now on the tId1 column. After the groupby operation, we chain an agg (aggregate) operation, which applies a function to all rows of each group. The function we choose here is sum, which adds up all the pts and poss for a team.

off_p = data.groupby(["tId1"])[["poss1", "pts_off"]].agg("sum").reset_index()
def_p = data.groupby(["tId1"])[["poss1", "pts_def"]].agg("sum").reset_index()

The unadjusted team ratings would then be: $$ OFF = \frac{PTS_{OFF}^{Total}}{poss^{Total}} $$ $$ DEF = \frac{PTS_{DEF}^{Total}}{poss^{Total}} $$

off_p["OFF"] = off_p["pts_off"] / off_p["poss1"]
off_p = off_p[["tId1", "OFF"]]
def_p["DEF"] = def_p["pts_def"] / def_p["poss1"]
def_p = def_p[["tId1", "DEF"]]
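The effect of summing before dividing is that games with more possessions count for more. A minimal sketch with one hypothetical team and two made-up games shows the difference from a plain per-game average:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "tId1": [10, 10],
        "ORtg1": [120.0, 100.0],
        "poss1": [100.0, 50.0],
    }
)
toy["pts_off"] = toy["ORtg1"] * toy["poss1"]  # points scored x 100
totals = toy.groupby("tId1")[["poss1", "pts_off"]].sum()
season_off = totals["pts_off"] / totals["poss1"]
print(round(season_off.loc[10], 2))  # possession-weighted rating
print(toy["ORtg1"].mean())  # naive per-game average, for comparison
```

Here the weighted rating is 17000 / 150, about 113.3, while the naive average of the two games is 110.0.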

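We then merge these ratings with the results_adj dataframe and compute the offensive, defensive and total strength of schedule (oSOS, dSOS and SOS) as the differences between the adjusted and unadjusted ratings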

results_net = pd.merge(off_p, def_p, on=["tId1"])
results_net["NET"] = results_net["OFF"] - results_net["DEF"]
results_net.rename(columns={"tId1": "tId"}, inplace=True)
results_net = results_net.astype(float).round(2)
results_net["tId"] = results_net["tId"].astype(int)
results_adj["tId"] = results_adj["tId"].astype(int)
results_comb = pd.merge(results_net, results_adj, on=["tId"])
results_comb["oSOS"] = results_comb["aOFF"] - results_comb["OFF"]
results_comb["dSOS"] = results_comb["DEF"] - results_comb["aDEF"]
results_comb["SOS"] = results_comb["oSOS"] + results_comb["dSOS"]
results_comb.iloc[:, 1:] = results_comb.iloc[:, 1:].round(1)
results = results_comb[
    ["Team", "OFF", "oSOS", "aOFF", "DEF", "dSOS", "aDEF", "NET", "SOS", "aNET"]
]
results = results.sort_values(by="aNET", ascending=False).reset_index(drop=True)
results.index = results.index + 1
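As a quick check on the sign conventions, the sketch below redoes the arithmetic for Denver's row in the final table further down, using the values already rounded to one decimal. With these definitions the adjusted net rating equals the unadjusted net rating plus SOS.

```python
# Denver's row from the final table (values already rounded to one decimal)
off, a_off = 117.3, 118.4
deff, a_def = 112.6, 113.1

o_sos = a_off - off  # offense looks better after adjustment
d_sos = deff - a_def  # defense looks worse after adjustment
sos = o_sos + d_sos
net = off - deff
a_net = a_off - a_def
print(round(o_sos, 1), round(d_sos, 1), round(sos, 1))
print(round(net + sos, 1), round(a_net, 1))
```

A positive SOS means the schedule so far has been harder than average, so the adjustment moves the team's net rating up. For other rows the identity can be off by 0.1, because the table rounds each column separately.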

Reminder

You can find the notebook version of this tutorial on my GitHub: https://github.com/sravanpannala/NBA-Tutorials/blob/main/sos_adjusted_ratings/how_to_adjust_nba_team_ratings_for_sos.ipynb

Final Combined Data table:

You can save it as a csv file and then use some fancy visualization tool to create a pretty-looking table and/or an efficiency landscape graph

results
Team OFF oSOS aOFF DEF dSOS aDEF NET SOS aNET
1 Philadelphia 76ers 121.2 -0.1 121.1 110.9 0.2 110.8 10.3 0.0 10.3
2 Boston Celtics 118.3 0.5 118.8 109.6 0.9 108.8 8.7 1.4 10.1
3 Oklahoma City Thunder 117.6 0.1 117.7 110.6 -0.2 110.8 7.0 -0.1 6.9
4 Minnesota Timberwolves 113.3 -0.1 113.2 106.6 -0.1 106.6 6.7 -0.2 6.6
5 Denver Nuggets 117.3 1.1 118.4 112.6 -0.5 113.1 4.7 0.6 5.3
6 LA Clippers 115.4 -0.1 115.4 110.6 -0.5 111.1 4.8 -0.5 4.3
7 Orlando Magic 113.8 -0.4 113.4 109.5 0.2 109.3 4.3 -0.2 4.1
8 New York Knicks 117.3 -0.1 117.2 113.3 0.0 113.3 4.0 -0.1 3.9
9 Houston Rockets 111.4 0.3 111.8 107.4 -0.5 108.0 4.0 -0.2 3.8
10 Milwaukee Bucks 119.3 -0.7 118.7 115.7 0.4 115.3 3.6 -0.3 3.3
11 Brooklyn Nets 116.9 0.0 116.9 115.0 0.4 114.6 2.0 0.4 2.4
12 Indiana Pacers 122.4 0.1 122.5 120.1 -0.4 120.6 2.3 -0.3 2.0
13 Dallas Mavericks 118.5 0.4 118.9 116.4 -1.0 117.4 2.2 -0.6 1.6
14 New Orleans Pelicans 114.1 0.0 114.1 113.1 -0.0 113.1 1.0 -0.0 1.0
15 Golden State Warriors 114.0 1.0 114.9 114.2 0.0 114.2 -0.2 1.0 0.8
16 Miami Heat 114.7 -0.6 114.1 113.5 -0.0 113.5 1.2 -0.6 0.6
17 Atlanta Hawks 118.7 0.2 118.9 118.8 0.4 118.5 -0.1 0.6 0.5
18 Phoenix Suns 116.6 0.4 117.0 115.4 -1.1 116.5 1.2 -0.7 0.4
19 Los Angeles Lakers 112.2 0.1 112.3 111.8 -0.5 112.2 0.4 -0.3 0.1
20 Sacramento Kings 114.6 0.7 115.3 114.9 -0.3 115.2 -0.2 0.3 0.1
21 Cleveland Cavaliers 111.1 -0.0 111.0 111.5 0.6 110.9 -0.5 0.6 0.1
22 Toronto Raptors 112.7 -0.3 112.4 114.9 0.4 114.5 -2.2 0.1 -2.1
23 Chicago Bulls 111.6 -0.3 111.3 115.7 -0.1 115.7 -4.1 -0.4 -4.5
24 Memphis Grizzlies 106.5 -0.2 106.3 112.7 -0.5 113.2 -6.2 -0.7 -6.9
25 Portland Trail Blazers 107.2 -0.3 106.9 114.2 -0.5 114.8 -7.0 -0.8 -7.8
26 Charlotte Hornets 112.6 -0.1 112.5 120.4 -0.0 120.4 -7.8 -0.2 -7.9
27 Utah Jazz 110.4 0.1 110.5 118.1 -0.8 118.9 -7.6 -0.7 -8.4
28 Washington Wizards 111.8 -0.6 111.2 121.4 0.7 120.7 -9.6 0.1 -9.4
29 San Antonio Spurs 107.0 0.3 107.3 117.3 0.1 117.2 -10.3 0.5 -9.8
30 Detroit Pistons 106.8 -0.5 106.3 117.8 0.2 117.6 -11.0 -0.2 -11.2