How To: Adjusting NBA Teams' Offensive and Defensive Ratings using Strength of Schedule

Sravan January 01, 2024 [NBA] #strength-of-schedule #team-ratings #tutorial

This tutorial walks through my process for adjusting NBA teams' offensive and defensive ratings for strength of schedule (SoS). Read my blog post on the same topic first; it explains the details, including the math needed to understand the code in this tutorial.

The code for the RAPM-style approach is adapted from Ryan Davis' RAPM Tutorial, and I suggest you read that tutorial before continuing.

If you want to run the code yourself while reading the tutorial, you can find the notebook version of this tutorial on my github:

https://github.com/sravanpannala/NBA-Tutorials/blob/main/sos_adjusted_ratings/how_to_adjust_nba_team_ratings_for_sos.ipynb

First let's import the necessary packages to run this code:

import pandas as pd  # for processing data
import numpy as np  # for numerical operations on arrays
from tqdm import tqdm  # gives us a progress bar
import time  # for time related stuff
from sklearn.linear_model import RidgeCV

# don't raise warnings when chaining pandas operations
pd.options.mode.chained_assignment = None

Then we will load the team information into two variables. There are 30 teams in the NBA, and each team has a name and a team ID.

teams_list will have a list of all team IDs
teams_dict is a dictionary mapping the team IDs to the team names.

team_data = pd.read_csv("../data/NBA_teams_database.csv")
teams_list = team_data["TeamID"].tolist()
team_dict1 = team_data.to_dict(orient="records")
teams_dict = {team["TeamID"]: team["Team"] for team in team_dict1}
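To make the shape of these two variables concrete, here is a minimal sketch that uses a made-up three-row stand-in for NBA_teams_database.csv. The column names TeamID and Team come from the code above; the three rows are only an illustrative subset, with IDs and names taken from the tables later in this tutorial. It builds the dictionary with zip, which is a small variation on the to_dict approach above, not a replacement for it.

```python
import pandas as pd

# Hypothetical stand-in for NBA_teams_database.csv (three teams only)
team_data = pd.DataFrame(
    {
        "TeamID": [1610612754, 1610612739, 1610612749],
        "Team": ["Indiana Pacers", "Cleveland Cavaliers", "Milwaukee Bucks"],
    }
)

# list of team IDs, in file order
teams_list = team_data["TeamID"].tolist()
# mapping from team ID to team name
teams_dict = dict(zip(team_data["TeamID"].tolist(), team_data["Team"].tolist()))

print(teams_list)
print(teams_dict[1610612754])
```

The order of teams_list matters later on, because it decides which column of the regression matrix belongs to which team.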

Scraping the Data Required

This section covers the scraping part of the tutorial. You can skip it and go to the next section if you wish: the data has already been scraped and is available for the 2023-24 season in the data folder.
We will be using the nba_api package to get the necessary data. It should already be installed if you followed the instructions in the Readme. The team ratings, i.e. offensive, defensive and net ratings, can be found for each game using the boxscoreadvancedv3 endpoint. This endpoint needs the GameID to get the box scores for both teams in that game. To get the GameIDs for all games played in the 2023-24 season, we will use the leaguegamelog endpoint.

from nba_api.stats.endpoints import leaguegamelog, boxscoreadvancedv3

# for 2023-24 season
season = "2023"
# get the information
stats = leaguegamelog.LeagueGameLog(
    player_or_team_abbreviation="T",
    season=season,
    season_type_all_star="Regular Season",
)
# output the information as pandas dataframe
df = stats.get_data_frames()[0]
# get the GameIDs as a list
game_ids = df["GAME_ID"].tolist()
# GameIDs are repeated twice, once for the home team and once for the away team
# We can use numpy unique to remove the duplicates
game_ids = np.unique(game_ids)
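As a small illustration of the de-duplication step, the sketch below runs np.unique on a hand-written list of GameIDs. The IDs are made up but follow the ten-character string format visible in the request URL later in this section. Note that np.unique also sorts its output and returns a numpy array rather than a list.

```python
import numpy as np

# Hypothetical GameIDs: each game appears twice, once per team
game_ids_demo = ["0022300002", "0022300001", "0022300002", "0022300001"]

unique_ids = np.unique(game_ids_demo)
print(unique_ids.tolist())  # ['0022300001', '0022300002']
```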

Now we have a list of game_ids to use with the boxscoreadvancedv3 endpoint. We just put the game_ids in a for loop to get the data for each game as a dataframe. We append the dataframe generated for each game to a list of dataframes, dfa. Finally, we use pandas.concat to concatenate all the dataframes into a single dataframe for the season.
This process might take a while (10-20 minutes, depending on the number of games played), so grab a coffee or a snack and come back after some time. There is a small (maybe big) issue if you just run a vanilla for loop. The stats.nba.com endpoint we use to scrape the data times out when it receives too many requests in a short period of time, which results in an error:

HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)

Any error will stop the for loop, and we would have to start over. To prevent this, we wrap the call to the endpoint in a try/except block and retry the request for that GameID until it succeeds.
While creating this tutorial, I found an elegant solution for this issue: the tenacity package.

We import the necessary modules from tenacity:
1. retry: decorator to enable retries on the function
2. stop_after_attempt: to define the maximum number of attempts. I set it as 5
3. wait_fixed: to wait for a certain amount of fixed time before retrying. The number I use is 0.6 seconds as recommended by the authors of the nba_api

from tenacity import retry
from tenacity.stop import stop_after_attempt
from tenacity.wait import wait_fixed

We add the retry decorator with the necessary options to the get_boxscores function, which has the try except block to handle errors

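@retry(stop=stop_after_attempt(5), wait=wait_fixed(0.6))
def get_boxscores(game_id):
    try:
        stats = boxscoreadvancedv3.BoxScoreAdvancedV3(game_id=game_id)
        # the second dataframe holds the team-level stats
        df1 = stats.get_data_frames()[1]
    except Exception as error:
        print(error)
        # re-raise so that tenacity sees the failure and retries;
        # without this, `return df1` fails with UnboundLocalError instead
        raise
    return df1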
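If you are curious what the decorator is doing for us, the sketch below is a rough plain-Python equivalent of "retry up to 5 times, waiting a fixed time between attempts". It is only an illustration: retry_fixed and flaky are hypothetical names, the flaky function stands in for the real endpoint call, and tenacity itself behaves slightly differently at the end (by default it raises its own RetryError after the last failed attempt).

```python
import time


def retry_fixed(func, attempts=5, wait=0.6):
    """Call func() up to `attempts` times, sleeping `wait` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
            if attempt == attempts:
                # out of attempts: let the last error propagate
                raise
            time.sleep(wait)


# Hypothetical flaky call: times out twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("Read timed out.")
    return "boxscore"


result = retry_fixed(flaky, attempts=5, wait=0.01)
print(result, calls["n"])
```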

Now we run the for loop with the decorated get_boxscores function. Finally, we save the scraped data as a csv file in the data folder.

dfa = []
for game_id in tqdm(game_ids):
    df1 = get_boxscores(game_id)
    dfa.append(df1)
df = pd.concat(dfa)
df.to_csv(f"./data/NBA_BoxScores_Adv_{season}.csv")

 13%|█▎        | 49/363 [00:59<06:26,  1.23s/it]

HTTPSConnectionPool(host='stats.nba.com', port=443): Max retries exceeded with url: /stats/boxscoreadvancedv3?EndPeriod=0&EndRange=0&GameID=0022300050&RangeType=0&StartPeriod=0&StartRange=0 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x00000207782F5090>, 'Connection to stats.nba.com timed out. (connect timeout=30)'))


100%|██████████| 363/363 [05:27<00:00,  1.11it/s]

Loading and Pre-Processing the Data

Now let's load the data. It has a lot of columns we don't use, so we import only the necessary ones using the usecols option in pandas.read_csv().

season = "2023"
cols = [
    "gameId",
    "teamName",
    "teamId",
    "offensiveRating",
    "defensiveRating",
    "netRating",
    "possessions",
]
df = pd.read_csv(f"./data/NBA_BoxScores_Adv_{season}.csv", usecols=cols)
cols = ["gameId", "tId", "team", "ORtg", "DRtg", "NRtg", "poss"]
df.columns = cols
df.head(4)

	gameId	tId	team	ORtg	DRtg	NRtg	poss
0	22300001	1610612754	Pacers	118.6	112.6	6.0	102.0
1	22300001	1610612739	Cavaliers	112.6	118.6	-6.0	103.0
2	22300002	1610612749	Bucks	110.0	104.0	6.0	100.0
3	22300002	1610612752	Knicks	104.0	110.0	-6.0	101.0
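One detail worth knowing about the loading step: pandas.read_csv returns the selected columns in file order, not in the order given in usecols. That is why the positional rename above lists tId before team even though usecols lists teamName before teamId. The sketch below demonstrates this on a tiny in-memory CSV (the CSV text is made up) and shows a name-based rename that does not depend on column position; treat it as an optional alternative, not as the way the notebook does it.

```python
import io

import pandas as pd

# Tiny made-up CSV with the columns in "file order"
csv_text = (
    "gameId,teamId,teamName,offensiveRating\n"
    "22300001,1610612754,Pacers,118.6\n"
)

# usecols is given in a different order than the file
df_demo = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["teamName", "gameId", "teamId", "offensiveRating"],
)
cols_read = list(df_demo.columns)
print(cols_read)  # file order is kept

# renaming by name is independent of the column order
df_demo = df_demo.rename(
    columns={"teamId": "tId", "teamName": "team", "offensiveRating": "ORtg"}
)
print(list(df_demo.columns))
```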

As you can see in the printed table, each gameId has two entries, one for each team in the game. Each row only has the information for that team, but what we need is a combined row that also contains the opponent's information.
We will use pandas.groupby to achieve that, grouping on gameId. This creates a groupby object on which further operations can be run.

df1 = df.groupby("gameId")
df1

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000207787B7710>

We then use the nth operation to get the 1st and 2nd rows of each game.

df1_1 = df1.nth(0)
df1_2 = df1.nth(1)
display(df1_1.head(2))
display(df1_2.head(2))

	gameId	tId	team	ORtg	DRtg	NRtg	poss
0	22300001	1610612754	Pacers	118.6	112.6	6.0	102.0
2	22300002	1610612749	Bucks	110.0	104.0	6.0	100.0

	gameId	tId	team	ORtg	DRtg	NRtg	poss
1	22300001	1610612739	Cavaliers	112.6	118.6	-6.0	103.0
3	22300002	1610612752	Knicks	104.0	110.0	-6.0	101.0

We can then rename the columns of the 1st dataframe by appending "1" to every column name except gameId (which is needed for the merge later). Similarly, we append "2" to the column names of the 2nd dataframe.

df1_1.columns = ["gameId"] + [s + "1" for s in df1_1.columns if s != "gameId"]
df1_2.columns = ["gameId"] + [s + "2" for s in df1_2.columns if s != "gameId"]
display(df1_1.head(2))
display(df1_2.head(2))

	gameId	tId1	team1	ORtg1	DRtg1	NRtg1	poss1
0	22300001	1610612754	Pacers	118.6	112.6	6.0	102.0
2	22300002	1610612749	Bucks	110.0	104.0	6.0	100.0

	gameId	tId2	team2	ORtg2	DRtg2	NRtg2	poss2
1	22300001	1610612739	Cavaliers	112.6	118.6	-6.0	103.0
3	22300002	1610612752	Knicks	104.0	110.0	-6.0	101.0

We then merge the two dataframes df1_1 and df1_2 on the column gameId, generating the dataframe we need.

df1_3 = pd.merge(df1_1, df1_2, on="gameId")
display(df1_3.head(2))

	gameId	tId1	team1	ORtg1	DRtg1	NRtg1	poss1	tId2	team2	ORtg2	DRtg2	NRtg2	poss2
0	22300001	1610612754	Pacers	118.6	112.6	6.0	102.0	1610612739	Cavaliers	112.6	118.6	-6.0	103.0
1	22300002	1610612749	Bucks	110.0	104.0	6.0	100.0	1610612752	Knicks	104.0	110.0	-6.0	101.0

One more step remains. What we have right now is one row for each game, but what we need is two rows for each game, as described in my blog post. To get that dataframe, we repeat the process above with 0 and 1 flipped in the nth operation, producing df1_6. Finally, we concatenate the two dataframes df1_3 and df1_6 to get the combined dataframe with two rows for each game.

df1_4 = df1.nth(1)
df1_5 = df1.nth(0)
df1_4.columns = ["gameId"] + [s + "1" for s in df1_4.columns if s != "gameId"]
df1_5.columns = ["gameId"] + [s + "2" for s in df1_5.columns if s != "gameId"]
df1_6 = pd.merge(df1_4, df1_5, on="gameId")
df2 = pd.concat([df1_3, df1_6]).sort_values(by="gameId").reset_index(drop=True)
data = df2.copy()
data.head(4)

	gameId	tId1	team1	ORtg1	DRtg1	NRtg1	poss1	tId2	team2	ORtg2	DRtg2	NRtg2	poss2
0	22300001	1610612754	Pacers	118.6	112.6	6.0	102.0	1610612739	Cavaliers	112.6	118.6	-6.0	103.0
1	22300001	1610612739	Cavaliers	112.6	118.6	-6.0	103.0	1610612754	Pacers	118.6	112.6	6.0	102.0
2	22300002	1610612752	Knicks	104.0	110.0	-6.0	101.0	1610612749	Bucks	110.0	104.0	6.0	100.0
3	22300002	1610612749	Bucks	110.0	104.0	6.0	100.0	1610612752	Knicks	104.0	110.0	-6.0	101.0
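For comparison, the same two-rows-per-game layout can also be produced with a single self-merge. The sketch below is an alternative I am offering for illustration, not the approach used in the notebook. It uses the four rows printed earlier (with fewer columns) and relies on every game having exactly two teams.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "gameId": [22300001, 22300001, 22300002, 22300002],
        "tId": [1610612754, 1610612739, 1610612749, 1610612752],
        "team": ["Pacers", "Cavaliers", "Bucks", "Knicks"],
        "ORtg": [118.6, 112.6, 110.0, 104.0],
        "DRtg": [112.6, 118.6, 104.0, 110.0],
    }
)

# pair every row with every row of the same game ...
paired = df_demo.merge(df_demo, on="gameId", suffixes=("1", "2"))
# ... then drop the rows where a team is paired with itself
paired = paired[paired["tId1"] != paired["tId2"]].reset_index(drop=True)

print(paired[["gameId", "team1", "ORtg1", "team2", "DRtg2"]])
```

In every resulting row, ORtg1 equals DRtg2, which is the property the regression setup in the next section relies on.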

Processing the Data

To process the data in a format required by the Ridge Regression algorithm RidgeCV, we define the following functions:

map_teams()

Makes the matrix rows to be used in ridge regression
The weights for each team = 1/2
Equations per game are:
$$\frac{1}{2}\hat{Team}^1_{OFF} + \frac{1}{2}\hat{Team}^2_{DEF} = Team^1_{OFF} $$ $$\frac{1}{2}\hat{Team}^2_{OFF} + \frac{1}{2}\hat{Team}^1_{DEF} = Team^2_{OFF} $$
The reason for doing this is that for unadjusted values of a game: $$ Team^1_{OFF} = Team^2_{DEF} $$
So, $$ Team^1_{OFF} = 0.5\times Team^1_{OFF} + 0.5\times Team^2_{DEF} $$
Therefore I use a similar structure for estimating adjusted ratings

def map_teams(row_in, teams, scale):
    t1 = row_in[0]
    t2 = row_in[1]

    rowOut = np.zeros([len(teams) * 2])
    rowOut[teams.index(t1)] = scale
    rowOut[teams.index(t2) + len(teams)] = scale

    return rowOut
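To see what a single matrix row looks like, here is a small worked example with three made-up team IDs (10, 20, 30). The function body is the same as above, only lightly restyled. The first three slots are the offensive coefficients and the last three are the defensive ones.

```python
import numpy as np


def map_teams(row_in, teams, scale):
    t1, t2 = row_in[0], row_in[1]
    row_out = np.zeros(len(teams) * 2)
    row_out[teams.index(t1)] = scale  # offense slot of team 1
    row_out[teams.index(t2) + len(teams)] = scale  # defense slot of team 2
    return row_out


teams = [10, 20, 30]  # hypothetical team IDs
row = map_teams([20, 30], teams, scale=0.5)
print(row.tolist())  # [0.0, 0.5, 0.0, 0.0, 0.0, 0.5]
```

Team 20 is on offense, so slot 1 is set; team 30 is on defense, so slot 2 + 3 = 5 is set.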

convert_to_matricies()

Extracts the team ID columns (tId1, tId2) from each row of the data dataframe (stints_x_base in the code).
Then maps those rows using map_teams function to get matrix X rows
Gets Y rows. Here Y is ORtg1 i.e. we are trying to predict the offensive rating of the 1st team for every row

def convert_to_matricies(possessions, name, teams, scale=1):
    # Extract the two team ID columns and convert them into a numpy matrix
    stints_x_base = possessions[["tId1", "tId2"]].to_numpy()
    # Apply our mapping function to each row of the numpy matrix
    stint_X_rows = np.apply_along_axis(map_teams, 1, stints_x_base, teams, scale=scale)
    # Convert the column of target values into a numpy array
    stint_Y_rows = possessions[name].to_numpy()

    # return the X matrix and the Y array
    return stint_X_rows, stint_Y_rows
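The sketch below builds the same kind of X and Y on a toy schedule, using numpy fancy indexing instead of np.apply_along_axis. It is an optional alternative under made-up inputs (three hypothetical team IDs and four rows), meant to show the shape and content of the matrices rather than to replace the function above.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical team IDs and ratings
games = pd.DataFrame(
    {
        "tId1": [10, 20, 30, 10],
        "tId2": [20, 10, 10, 30],
        "ORtg1": [118.6, 112.6, 104.0, 110.0],
    }
)
teams = [10, 20, 30]
n = len(teams)

# position of each team in the teams list
pos = {team_id: i for i, team_id in enumerate(teams)}
off_idx = games["tId1"].map(pos).to_numpy()  # offense columns: 0 .. n-1
def_idx = games["tId2"].map(pos).to_numpy() + n  # defense columns: n .. 2n-1

X = np.zeros((len(games), 2 * n))
rows = np.arange(len(games))
X[rows, off_idx] = 0.5
X[rows, def_idx] = 0.5
y = games["ORtg1"].to_numpy()

print(X)
print(y)
```

Each row of X has exactly two entries of 0.5, one in the offensive half and one in the defensive half.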

lambda_to_alpha()

In the stats world (R), glmnet() is used for ridge regression and takes the parameter $\lambda$. Most NBA stats people use this $\lambda$ when discussing the regularization parameter. But sklearn.linear_model.RidgeCV() takes a parameter $\alpha$, which isn't the same.
So we need to convert $\lambda$ to the $\alpha$ needed by RidgeCV. More details here

def lambda_to_alpha(lambda_value, samples):
    return (lambda_value * samples) / 2.0
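As a quick worked example, here is the conversion applied to the three $\lambda$ values used later in this tutorial. The sample count is an assumption: I take the 363 games shown in the scraping progress bar, with two rows per game, so 726 rows. With a different number of games the alphas scale accordingly.

```python
def lambda_to_alpha(lambda_value, samples):
    return (lambda_value * samples) / 2.0


samples = 363 * 2  # assumed: 363 games, two rows per game
lambdas = [0.015, 0.075, 0.15]
alphas = [lambda_to_alpha(lam, samples) for lam in lambdas]
print([round(a, 3) for a in alphas])  # [5.445, 27.225, 54.45]
```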

calculate_netrtg()

Converts lambdas to alphas using lambda_to_alpha function
Defines the ridge regression problem using scikit-learn's RidgeCV algorithm
cv=5 is chosen i.e. k-fold cross-validation splitting strategy using k=5
Intercept is set as true. This value is to be added later to our estimation results to get Offensive and Defensive ratings.
Gets coefficients and intercept
Adds the intercept to the coefficients to get adjusted ratings. Uses the adjusted off and def ratings to calculate the adjusted net rating.
Create and return adjusted ratings dataframe

def calculate_netrtg(train_x, train_y, lambdas, teams_list):
    alphas = [lambda_to_alpha(l, train_x.shape[0]) for l in lambdas]
    # create a 5 fold CV ridgeCV model. Our target data is not centered at 0, so we want to fit to an intercept.
    clf = RidgeCV(alphas=alphas, cv=5, fit_intercept=True)

    # fit our training data
    model = clf.fit(
        train_x,
        train_y,
    )

    # convert our list of teams into an mx1 matrix
    team_arr = np.transpose(np.array(teams_list).reshape(1, len(teams_list)))

    # extract our coefficients into the offensive and defensive parts
    coef_offensive_array = model.coef_[0 : len(teams_list)][np.newaxis].T
    coef_defensive_array = model.coef_[len(teams_list) : 2 * len(teams_list)][
        np.newaxis
    ].T
    # concatenate the offensive and defensive values with the team ids into an mx3 matrix
    team_id_with_coef = np.concatenate(
        [team_arr, coef_offensive_array, coef_defensive_array], axis=1
    )
    # build a dataframe from our matrix
    teams_coef = pd.DataFrame(team_id_with_coef)
    intercept = model.intercept_
    teams_coef.columns = ["tId", "aOFF", "aDEF"]
    teams_coef["aNET"] = teams_coef["aOFF"] - teams_coef["aDEF"]
    teams_coef["aOFF"] = teams_coef["aOFF"] + intercept
    teams_coef["aDEF"] = teams_coef["aDEF"] + intercept
    teams_coef["Team"] = teams_coef["tId"].map(teams_dict)
    results = teams_coef[["tId", "Team", "aOFF", "aDEF", "aNET"]]
    results = results.sort_values(by=["aNET"], ascending=False).reset_index(drop=True)
    return results, model, intercept
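The post-processing inside calculate_netrtg can be checked on a handful of numbers. In the sketch below, the coefficient vector and the intercept are made up (three teams, so three offensive coefficients followed by three defensive ones). It shows why the function can compute aNET before adding the intercept: the intercept is added to both aOFF and aDEF, so it cancels in the difference.

```python
import numpy as np

# Hypothetical fitted values for 3 teams: offense first, then defense
coef = np.array([2.0, -1.0, -1.0, -1.5, 0.5, 1.0])
intercept = 114.0
n = 3

off_coef = coef[:n]
def_coef = coef[n : 2 * n]

a_off = off_coef + intercept  # adjusted offensive ratings
a_def = def_coef + intercept  # adjusted defensive ratings
a_net = off_coef - def_coef  # adjusted net ratings, intercept not needed

print(a_off.tolist(), a_def.tolist(), a_net.tolist())
```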

Estimating Adjusted Ratings

Next, we run the functions defined above to generate the adjusted ratings

train_x, train_y = convert_to_matricies(data, "ORtg1", teams_list, scale=0.5)
lambdas_net = [0.015, 0.075, 0.15]
results_adj, model, intercept = calculate_netrtg(
    train_x, train_y, lambdas_net, teams_list
)
print(f"Intercept = {intercept}")

Intercept = 114.2197043446658

The intercept here can be interpreted as the league average offensive/defensive rating. Here are the adjusted ratings.

results_adj

	tId	Team	aOFF	aDEF	aNET
0	1.610613e+09	Philadelphia 76ers	121.065207	110.772873	10.292335
1	1.610613e+09	Boston Celtics	118.828331	108.764236	10.064095
2	1.610613e+09	Oklahoma City Thunder	117.702690	110.814878	6.887812
3	1.610613e+09	Minnesota Timberwolves	113.207243	106.628440	6.578803
4	1.610613e+09	Denver Nuggets	118.395144	113.090987	5.304157
5	1.610613e+09	LA Clippers	115.366218	111.064890	4.301329
6	1.610613e+09	Orlando Magic	113.446035	109.345141	4.100894
7	1.610613e+09	New York Knicks	117.214095	113.291210	3.922885
8	1.610613e+09	Houston Rockets	111.781191	107.967128	3.814063
9	1.610613e+09	Milwaukee Bucks	118.657846	115.338466	3.319381
10	1.610613e+09	Brooklyn Nets	116.937071	114.575413	2.361658
11	1.610613e+09	Indiana Pacers	122.514626	120.553872	1.960754
12	1.610613e+09	Dallas Mavericks	118.932015	117.355888	1.576127
13	1.610613e+09	New Orleans Pelicans	114.101714	113.092424	1.009290
14	1.610613e+09	Golden State Warriors	114.940190	114.182474	0.757716
15	1.610613e+09	Miami Heat	114.132399	113.518409	0.613991
16	1.610613e+09	Atlanta Hawks	118.941745	118.485097	0.456648
17	1.610613e+09	Phoenix Suns	116.960966	116.528575	0.432391
18	1.610613e+09	Cleveland Cavaliers	111.023001	110.911526	0.111475
19	1.610613e+09	Los Angeles Lakers	112.332601	112.222641	0.109959
20	1.610613e+09	Sacramento Kings	115.306403	115.211154	0.095249
21	1.610613e+09	Toronto Raptors	112.389538	114.521311	-2.131772
22	1.610613e+09	Chicago Bulls	111.271702	115.736148	-4.464446
23	1.610613e+09	Memphis Grizzlies	106.322251	113.173911	-6.851659
24	1.610613e+09	Portland Trail Blazers	106.938770	114.757369	-7.818600
25	1.610613e+09	Charlotte Hornets	112.456339	120.395203	-7.938864
26	1.610613e+09	Utah Jazz	110.533878	118.914809	-8.380931
27	1.610613e+09	Washington Wizards	111.230457	120.661742	-9.431285
28	1.610613e+09	San Antonio Spurs	107.330475	117.157667	-9.827192
29	1.610613e+09	Detroit Pistons	106.330988	117.557249	-11.226261

Finishing Touches

We're not done yet. Now we need to compare the adjusted ratings with the unadjusted ones. But, we haven't calculated the unadjusted ratings yet. Let's do it now.

For a single game: $$ PTS_{OFF}*100 = ORtg^1 \times poss^1 $$ $$ PTS_{DEF}*100 = DRtg^1 \times poss^1 $$

Applying these operations on the data dataframe:

data["pts_off"] = data["ORtg1"] * data["poss1"]
data["pts_def"] = data["DRtg1"] * data["poss1"]

We have to use the groupby operation again, now on the tId1 column. After the groupby operation, we chain an agg (aggregate) operation, which applies a function to all rows of the group. The function we choose here is sum, which adds up all the pts and poss for a team.

off_p = data.groupby(["tId1"])[["poss1", "pts_off"]].agg("sum").reset_index()
def_p = data.groupby(["tId1"])[["poss1", "pts_def"]].agg("sum").reset_index()

The unadjusted team ratings would then be: $$ OFF = \frac{PTS_{OFF}^{Total}}{poss^{Total}} $$ $$ DEF = \frac{PTS_{DEF}^{Total}}{poss^{Total}} $$

off_p["OFF"] = off_p["pts_off"] / off_p["poss1"]
off_p = off_p[["tId1", "OFF"]]
def_p["DEF"] = def_p["pts_def"] / def_p["poss1"]
def_p = def_p[["tId1", "DEF"]]
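Summing points and possessions before dividing gives a possession-weighted rating, which is not the same as a plain average of per-game ratings. The toy example below (one hypothetical team, two made-up games) shows the difference.

```python
import pandas as pd

# One hypothetical team, two games with different pace
demo = pd.DataFrame(
    {
        "tId1": [10, 10],
        "ORtg1": [120.0, 100.0],
        "poss1": [110.0, 90.0],
    }
)
demo["pts_off"] = demo["ORtg1"] * demo["poss1"]

totals = demo.groupby("tId1")[["poss1", "pts_off"]].sum()
off_weighted = totals["pts_off"] / totals["poss1"]
off_simple = demo["ORtg1"].mean()

print(off_weighted.loc[10], off_simple)  # 111.0 110.0
```

The 120-rating game had more possessions, so it pulls the season rating above the simple mean.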

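We then merge these ratings with the results_adj dataframe and compute the SoS columns as the differences between the adjusted and unadjusted ratings. The signs are chosen so that a positive oSOS or dSOS means the adjustment improved the team's rating (higher offense, lower defense):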

results_net = pd.merge(off_p, def_p, on=["tId1"])
results_net["NET"] = results_net["OFF"] - results_net["DEF"]
results_net.rename(columns={"tId1": "tId"}, inplace=True)
results_net = results_net.astype(float).round(2)
results_net["tId"] = results_net["tId"].astype(int)
results_adj["tId"] = results_adj["tId"].astype(int)
results_comb = pd.merge(results_net, results_adj, on=["tId"])
results_comb["oSOS"] = results_comb["aOFF"] - results_comb["OFF"]
results_comb["dSOS"] = results_comb["DEF"] - results_comb["aDEF"]
results_comb["SOS"] = results_comb["oSOS"] + results_comb["dSOS"]
results_comb.iloc[:, 1:] = results_comb.iloc[:, 1:].round(1)
results = results_comb[
    ["Team", "OFF", "oSOS", "aOFF", "DEF", "dSOS", "aDEF", "NET", "SOS", "aNET"]
]
results = results.sort_values(by="aNET", ascending=False).reset_index(drop=True)
results.index = results.index + 1

Reminder

You can find the notebook version of this tutorial on my github: https://github.com/sravanpannala/NBA-Tutorials/blob/main/sos_adjusted_ratings/how_to_adjust_nba_team_ratings_for_sos.ipynb

Final Combined Data table:

You can save it as a csv file and then use some fancy visualization tool to create a pretty looking table and/or efficiency landscape graph

results

	Team	OFF	oSOS	aOFF	DEF	dSOS	aDEF	NET	SOS	aNET
1	Philadelphia 76ers	121.2	-0.1	121.1	110.9	0.2	110.8	10.3	0.0	10.3
2	Boston Celtics	118.3	0.5	118.8	109.6	0.9	108.8	8.7	1.4	10.1
3	Oklahoma City Thunder	117.6	0.1	117.7	110.6	-0.2	110.8	7.0	-0.1	6.9
4	Minnesota Timberwolves	113.3	-0.1	113.2	106.6	-0.1	106.6	6.7	-0.2	6.6
5	Denver Nuggets	117.3	1.1	118.4	112.6	-0.5	113.1	4.7	0.6	5.3
6	LA Clippers	115.4	-0.1	115.4	110.6	-0.5	111.1	4.8	-0.5	4.3
7	Orlando Magic	113.8	-0.4	113.4	109.5	0.2	109.3	4.3	-0.2	4.1
8	New York Knicks	117.3	-0.1	117.2	113.3	0.0	113.3	4.0	-0.1	3.9
9	Houston Rockets	111.4	0.3	111.8	107.4	-0.5	108.0	4.0	-0.2	3.8
10	Milwaukee Bucks	119.3	-0.7	118.7	115.7	0.4	115.3	3.6	-0.3	3.3
11	Brooklyn Nets	116.9	0.0	116.9	115.0	0.4	114.6	2.0	0.4	2.4
12	Indiana Pacers	122.4	0.1	122.5	120.1	-0.4	120.6	2.3	-0.3	2.0
13	Dallas Mavericks	118.5	0.4	118.9	116.4	-1.0	117.4	2.2	-0.6	1.6
14	New Orleans Pelicans	114.1	0.0	114.1	113.1	-0.0	113.1	1.0	-0.0	1.0
15	Golden State Warriors	114.0	1.0	114.9	114.2	0.0	114.2	-0.2	1.0	0.8
16	Miami Heat	114.7	-0.6	114.1	113.5	-0.0	113.5	1.2	-0.6	0.6
17	Atlanta Hawks	118.7	0.2	118.9	118.8	0.4	118.5	-0.1	0.6	0.5
18	Phoenix Suns	116.6	0.4	117.0	115.4	-1.1	116.5	1.2	-0.7	0.4
19	Los Angeles Lakers	112.2	0.1	112.3	111.8	-0.5	112.2	0.4	-0.3	0.1
20	Sacramento Kings	114.6	0.7	115.3	114.9	-0.3	115.2	-0.2	0.3	0.1
21	Cleveland Cavaliers	111.1	-0.0	111.0	111.5	0.6	110.9	-0.5	0.6	0.1
22	Toronto Raptors	112.7	-0.3	112.4	114.9	0.4	114.5	-2.2	0.1	-2.1
23	Chicago Bulls	111.6	-0.3	111.3	115.7	-0.1	115.7	-4.1	-0.4	-4.5
24	Memphis Grizzlies	106.5	-0.2	106.3	112.7	-0.5	113.2	-6.2	-0.7	-6.9
25	Portland Trail Blazers	107.2	-0.3	106.9	114.2	-0.5	114.8	-7.0	-0.8	-7.8
26	Charlotte Hornets	112.6	-0.1	112.5	120.4	-0.0	120.4	-7.8	-0.2	-7.9
27	Utah Jazz	110.4	0.1	110.5	118.1	-0.8	118.9	-7.6	-0.7	-8.4
28	Washington Wizards	111.8	-0.6	111.2	121.4	0.7	120.7	-9.6	0.1	-9.4
29	San Antonio Spurs	107.0	0.3	107.3	117.3	0.1	117.2	-10.3	0.5	-9.8
30	Detroit Pistons	106.8	-0.5	106.3	117.8	0.2	117.6	-11.0	-0.2	-11.2
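As a sanity check on the sign conventions, the sketch below recomputes the SoS columns for the Denver Nuggets row using the rounded values from the table above. Because the inputs are already rounded to one decimal, this kind of check can be off by 0.1 for some other rows; for Denver the numbers line up.

```python
# Denver Nuggets, rounded values from the final table
off, a_off = 117.3, 118.4
deff, a_def = 112.6, 113.1

o_sos = round(a_off - off, 1)  # offense went up after adjustment
d_sos = round(deff - a_def, 1)  # defense got worse after adjustment
sos = round(o_sos + d_sos, 1)

# SOS should equal the change from NET to aNET
net_shift = round((a_off - a_def) - (off - deff), 1)

print(o_sos, d_sos, sos, net_shift)  # 1.1 -0.5 0.6 0.6
```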