Using real game data to simulate the results of a game between two NBA teams
Introduction
Let's start with why I'm doing this. I've always enjoyed articles that show all the steps in a project, from the initial idea all the way through to the finished product. So I wanted to create a guide along those lines. This is that guide. I'm going to walk you, step by step, through how I approached this project; basically, it's a real-time look at creating it. And if you make it to the end, I'll even talk about some enhancements that could be made to improve it. That's it for the intro. Let's get going!
The Idea
First, some truth: I love me some sports. Not as much as my dad, mind you. He'll turn on bowling or tennis or even (shudders involuntarily) golf. I stick to the major US sports: MLB, NFL, NBA, some NHL. But my first love has always been the NBA. So I decide I want to create an end-to-end guide doing something with NBA data.
Now that I know the general direction, I need to come up with something I want to explore. I thought about doing shot charts, or designing a new advanced stat. I even thought about doing some similarity scores for players/teams (which may happen in the future). But as I thought about it, I realized there was something I have always really wanted to do: simulate game matchups.
Okay, now that I know that, I need to figure out exactly where I want to go with the project. Do I just want to build a notebook to run the simulations, or do I want something more interactive? In the end I decide something interactive and more user-friendly is the way to go. I want to design a simple web app/dashboard that can be used to simulate the outcome of a single game.
Roadmap
Now that I know what I want to do, I need to figure out the roadmap to get there. I approach this by first laying out the steps needed for the project to work. This project's step layout is pretty straightforward:
- Collect Data
- Wrangle Data
- Simulate Game
- Display Results
This gives me a roadmap of the pieces or modules that I need to build out. As I look at the project layout, I see that I'm probably going to want to split the functionality into two files. One will be in charge of the data: collecting it, wrangling it, and saving it. The second will be in charge of taking user inputs, simulating the matchup, and showing the results.
Tools
The next step is to figure out what tools I want to use. My Python experience is much greater than my R experience, so Python is the obvious choice. I know that I want to display the results visually, so I decide to use Flask/Dash/Plotly. I also want to save the data once I collect it. I could just save it to a .csv, or pickle the dataframes, but I want something more 'elegant', so I'll go with sqlite as a simple database back end.
Now I need something to collect the data. I could just write a web scraper with something like Beautiful Soup, but that seems less than optimal. So I head on over to GitHub and find an excellent project from [Swar Patel called nba-api]. This package lets me query the NBA stats API for any data I want, and it has great documentation, including a couple of notebooks with examples. It will be perfect for what I want to do.
Simulation
This is where I run into my first roadblock. I know I want to simulate games, but I'm not really sure how I want to go about it. So I head to Google and look for some ideas. One of the most popular ways to simulate games is by using Elo ratings. [Nate Silver at Fivethirtyeight] took ratings that had been used for chess and applied them to other sports. This seemed interesting at first blush, but it wasn't really what I wanted.
As I read about Elo, it got me thinking about what I really wanted to do: build an engine that looks at past data and then uses it to predict outcomes. In other words, I want a [Monte Carlo Simulation].
This leads me to need the following data:
- Game scores
- Pace (possessions/game)
I can use this to calculate a normal distribution of points/possession and possession/game. I can then sample these x times to calculate the winning percentages. This gives me a baseline for my simulation. Later on, when I build my model, I’ll go more in depth into how this all fits together.
The build
At this point I feel pretty confident that I know where I'm going and have a general idea of how to get there. So I decide it's time to start writing some code. The code is all posted on my GitHub if you just want to check it out, but I recommend reading through the rest of this so you can see how I build up the codebase and make everything work together.
1 — Gathering the data
So first off I need to get the data for the project. As I mentioned before we’ll be using the [nba-api package from Swar Patel].
First we need to install the package. The easiest way to do that is to use pip:
pip install nba_api
Now, this won't be an in-depth tutorial on how to use nba-api. As I said earlier, Swar's project has excellent documentation on how to use its functionality, so I won't get into it here.
First we need to do some imports.
Imports
from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
The first two imports are for endpoints. The `leaguegamefinder` endpoint allows us to get game info such as teams, scores, and game IDs. The `boxscoreadvancedv2` endpoint has a ton of data, but what we want most is the pace data for each game. The `SeasonType` parameter allows us to get only regular season games (which is what I plan for my first go-round). Finally, we need `pandas` because the data we get back will come as dataframes.
Now that we have our imports, let's build some functions that let us collect all the data and then combine it into a single dataframe.
The first part of the `get_game_data` function will use the `leaguegamefinder` endpoint to collect the basic game data. This endpoint is quick because it queries the whole season at once.
def get_game_data(season, league_id='00'):
'''Query NBA stats via the nba-api
Parameters
----------
season: format is yyyy-yy ie. 2020-21
league_id = '00' => NBA
Returns
-------
result: list of 2 DataFrames
'''
#first we need to get all of the basic game data
gamefinder = leaguegamefinder.LeagueGameFinder(league_id_nullable=league_id,
season_nullable=season,
season_type_nullable=SeasonType.regular)
game_basic_df = gamefinder.get_data_frames()[0]
The second half of the function will get the advanced data we need, which at this point is really just the pace. First we need a few more imports: `time`, `random`, and `IPython.display`. So our imports will look like this:
from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
Then we'll use the `boxscoreadvancedv2` endpoint. This is much slower because it has to be queried for each game. So we'll update the `get_game_data` function like this:
def get_game_data(season, league_id='00', test=False):
'''Query NBA stats via the nba-api
Parameters
----------
season: format is yyyy-yy ie. 2020-21
league_id = '00' => NBA
Returns
-------
result: dict of 2 DataFrames {"games_basic": , "games_adv": }
'''
#first we need to get all of the basic game data
gamefinder = leaguegamefinder.LeagueGameFinder(league_id_nullable=league_id,
season_nullable=season,
season_type_nullable=SeasonType.regular)
games_basic_df = gamefinder.get_data_frames()[0]
#then we need to get the advanced game data. Slow, game by game.
#we need to get a list of the games
game_ids = list(games_basic_df['GAME_ID'].unique())
games_adv_df = pd.DataFrame()
if test:
c = 0 # this is here for testing only
#now we iterate over the list of game_ids
while len(game_ids) > 0:
for i in game_ids:
cooldown = rand.gammavariate(alpha=9, beta=0.4)
clear_output(wait=True)
for attempt in range(10):
try:
time.sleep(cooldown)
data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
end_range=0,
game_id=i,
range_type=0,
start_period=1,
start_range=0)
except:
time.sleep(cooldown)
data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
end_range=0,
game_id=i,
range_type=0,
start_period=1,
start_range=0)
else:
break
else:
print('Connection Error')
break
data_df = data.get_data_frames()[1]
if games_adv_df.empty:
games_adv_df = data_df
else:
games_adv_df = pd.concat([games_adv_df, data_df], ignore_index=True) # pd.concat instead of the removed DataFrame.append
game_ids.remove(i)
print (i , "completed", len(game_ids), "games left", sep="---")
if test:
c += 1
if c > 3: #again for testing only
games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
print('DONE!')
return games_df_dict
games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
print('DONE!')
return games_df_dict
A couple of notes on the above function:
- First off, note the new `test=False` parameter. This lets us flag the function for testing: if we set `test=True`, it only loops through a handful of games (four, as the counter is written) instead of the whole season, so we can try out the function without waiting for every game to be fetched (see the quick example after this list).
- Notice the `while len(game_ids) > 0:` and the `for attempt in range(10):`. These are part of a retry loop in case there is a connection error.
- We also have the lines `cooldown = rand.gammavariate(alpha=9, beta=0.4)` and `time.sleep(cooldown)`. This is a random timer that pauses before each connection attempt. The NBA stats API has a rate-limiting feature that will lock your IP out if you make too many requests in too short a time.
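For example, before committing to the full (multi-hour) run, a quick smoke test could look like this; the `sample` variable name is just for illustration:
# Smoke test: the advanced box score loop stops after only a handful of games.
sample = get_game_data(season='2020-21', test=True)
print(sample['games_basic'].shape)
print(sample['games_adv'].shape)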
We can then call the function for the full season like so:
games = get_game_data(season='2020-21')
This will give us a dict with two entries for the two dataframes. We can see the details on each dataframe using `.info()`. For the first (basic) dataframe we use:
games['games_basic'].info()
Which gives us the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2084 entries, 0 to 2083
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SEASON_ID 2084 non-null object
1 TEAM_ID 2084 non-null int64
2 TEAM_ABBREVIATION 2084 non-null object
3 TEAM_NAME 2084 non-null object
4 GAME_ID 2084 non-null object
5 GAME_DATE 2084 non-null object
6 MATCHUP 2084 non-null object
7 WL 2084 non-null object
8 MIN 2084 non-null int64
9 PTS 2084 non-null int64
10 FGM 2084 non-null int64
11 FGA 2084 non-null int64
12 FG_PCT 2084 non-null float64
13 FG3M 2084 non-null int64
14 FG3A 2084 non-null int64
15 FG3_PCT 2084 non-null float64
16 FTM 2084 non-null int64
17 FTA 2084 non-null int64
18 FT_PCT 2084 non-null float64
19 OREB 2084 non-null int64
20 DREB 2084 non-null int64
21 REB 2084 non-null int64
22 AST 2084 non-null int64
23 STL 2084 non-null int64
24 BLK 2084 non-null int64
25 TOV 2084 non-null int64
26 PF 2084 non-null int64
27 PLUS_MINUS 2084 non-null float64
dtypes: float64(4), int64(17), object(7)
memory usage: 456.0+ KB
And for the second (advanced) dataframe we use:
games['games_adv'].info()
Which gives us:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GAME_ID 8 non-null object
1 TEAM_ID 8 non-null int64
2 TEAM_NAME 8 non-null object
3 TEAM_ABBREVIATION 8 non-null object
4 TEAM_CITY 8 non-null object
5 MIN 8 non-null object
6 E_OFF_RATING 8 non-null float64
7 OFF_RATING 8 non-null float64
8 E_DEF_RATING 8 non-null float64
9 DEF_RATING 8 non-null float64
10 E_NET_RATING 8 non-null float64
11 NET_RATING 8 non-null float64
12 AST_PCT 8 non-null float64
13 AST_TOV 8 non-null float64
14 AST_RATIO 8 non-null float64
15 OREB_PCT 8 non-null float64
16 DREB_PCT 8 non-null float64
17 REB_PCT 8 non-null float64
18 E_TM_TOV_PCT 8 non-null float64
19 TM_TOV_PCT 8 non-null float64
20 EFG_PCT 8 non-null float64
21 TS_PCT 8 non-null float64
22 USG_PCT 8 non-null float64
23 E_USG_PCT 8 non-null float64
24 E_PACE 8 non-null float64
25 PACE 8 non-null float64
26 PACE_PER40 8 non-null float64
27 POSS 8 non-null int64
28 PIE 8 non-null float64
dtypes: float64(22), int64(2), object(5)
memory usage: 1.9+ KB
That’s it for the data gathering. So if you need a break you can set this to run and then come back in a couple hours for the next section.
2 — Data Wrangling
Now that we have all our data collected, we need to get it into the format we need for the simulator. We're also going to set up sqlite so we can persist our data.
First, let's take a closer look at our dataframes by running this:
print(games['games_basic'])
This will give us the following:
SEASON_ID TEAM_ID TEAM_ABBREVIATION TEAM_NAME \
0 22020 1610612746 LAC LA Clippers
1 22020 1610612748 MIA Miami Heat
2 22020 1610612741 CHI Chicago Bulls
3 22020 1610612755 PHI Philadelphia 76ers
4 22020 1610612750 MIN Minnesota Timberwolves
... ... ... ... ...
2097 22020 1610612758 SAC Sacramento Kings
2098 22020 1610612744 GSW Golden State Warriors
2099 22020 1610612747 LAL Los Angeles Lakers
2100 22020 1610612746 LAC LA Clippers
2101 22020 1610612751 BKN Brooklyn Nets
GAME_ID GAME_DATE MATCHUP WL MIN PTS ... FT_PCT OREB \
0 0022001047 2021-05-13 LAC @ CHA W 239 113 ... 0.750 6
1 0022001050 2021-05-13 MIA vs. PHI W 241 106 ... 1.000 15
2 0022001051 2021-05-13 CHI vs. TOR W 240 114 ... 0.700 7
3 0022001050 2021-05-13 PHI @ MIA L 238 94 ... 0.750 8
4 0022001052 2021-05-13 MIN vs. DEN L 240 103 ... 0.594 11
... ... ... ... .. ... ... ... ... ...
2097 0022000019 2020-12-23 SAC @ DEN W 265 124 ... 0.774 18
2098 0022000001 2020-12-22 GSW @ BKN L 240 99 ... 0.652 13
2099 0022000002 2020-12-22 LAL vs. LAC L 240 109 ... 0.774 8
2100 0022000002 2020-12-22 LAC @ LAL W 241 116 ... 0.737 11
2101 0022000001 2020-12-22 BKN vs. GSW W 242 125 ... 0.813 13
DREB REB AST STL BLK TOV PF PLUS_MINUS
0 38 44 28 6 6 17 17 23.0
1 32 47 26 5 6 19 15 12.0
2 43 50 31 4 4 17 15 12.0
3 22 30 20 8 1 12 18 -12.0
4 32 43 20 3 5 15 17 -11.0
... ... ... ... ... ... ... .. ...
2097 36 54 27 5 7 14 29 2.0
2098 34 47 26 6 6 18 24 -26.0
2099 37 45 22 4 2 19 20 -7.0
2100 29 40 22 10 3 16 29 7.0
2101 44 57 24 11 7 20 22 26.0
[2102 rows x 28 columns]
Now let's take the first `GAME_ID` and use it to filter the dataframe, like so:
print(games['games_basic'].loc[(games['games_basic']['GAME_ID'] == '0022001047')])
That will give us this:
SEASON_ID TEAM_ID TEAM_ABBREVIATION TEAM_NAME GAME_ID \
0 22020 1610612746 LAC LA Clippers 0022001047
6 22020 1610612766 CHA Charlotte Hornets 0022001047
GAME_DATE MATCHUP WL MIN PTS ... FT_PCT OREB DREB REB AST \
0 2021-05-13 LAC @ CHA W 239 113 ... 0.750 6 38 44 28
6 2021-05-13 CHA vs. LAC L 239 90 ... 0.789 8 27 35 24
STL BLK TOV PF PLUS_MINUS
0 6 6 17 17 23.0
6 9 1 10 15 -23.0
[2 rows x 28 columns]
So that means we have two records with the same `GAME_ID`, which makes sense when you see how the data is laid out. There is a row for each team in the game: one row corresponds to the home team, the other to the away team. If we did the same thing for the `games_adv` dataframe we would see the same structure. This illustrates a couple of things for us:
- It's important to examine data thoroughly so you know how it's structured. Don't just make assumptions.
- This structure isn't great for our simulator. It will be much easier if we can somehow merge these rows into one row containing both teams' results.
That second point could be a pain, but luckily for us Swar Patel has saved the day again. In his [examples] he recognized that this is something users would want to do, and he wrote a function to take care of it.
import pandas as pd
def combine_team_games(df, keep_method='home'):
'''Combine a TEAM_ID-GAME_ID unique table into rows by game. Slow.
Parameters
----------
df : Input DataFrame.
keep_method : {'home', 'away', 'winner', 'loser', ``None``}, default 'home'
- 'home' : Keep rows where TEAM_A is the home team.
- 'away' : Keep rows where TEAM_A is the away team.
- 'winner' : Keep rows where TEAM_A is the winning team.
- 'loser' : Keep rows where TEAM_A is the losing team.
- ``None`` : Keep all rows. Will result in an output DataFrame the same
length as the input DataFrame.
Returns
-------
result : DataFrame
'''
# Join every row to all others with the same game ID.
joined = pd.merge(df, df, suffixes=['_A', '_B'],
on=['SEASON_ID', 'GAME_ID', 'GAME_DATE'])
# Filter out any row that is joined to itself.
result = joined[joined.TEAM_ID_A != joined.TEAM_ID_B]
# Take action based on the keep_method flag.
if keep_method is None:
# Return all the rows.
pass
elif keep_method.lower() == 'home':
# Keep rows where TEAM_A is the home team.
result = result[result.MATCHUP_A.str.contains(' vs. ')]
elif keep_method.lower() == 'away':
# Keep rows where TEAM_A is the away team.
result = result[result.MATCHUP_A.str.contains(' @ ')]
elif keep_method.lower() == 'winner':
result = result[result.WL_A == 'W']
elif keep_method.lower() == 'loser':
result = result[result.WL_A == 'L']
else:
raise ValueError(f'Invalid keep_method: {keep_method}')
return result
This is going to make our job much easier, but we still have a little bit of work to do, mostly because we are using two dataframes instead of one. There are a couple of ways we can approach this. One is to just do it piece by piece: joining, filtering, etc. Or we can write some functions. I like functions, so I decide to go that way.
So we’re going to write one function. It’s going to take our two dataframes, join them and then combine rows. When I’m writing something like this I like to build it in pieces, testing along the way. So first we’ll write the join and see how it goes. Here’s the join portion:
def prep_games(dictionary, left_key, right_key, left, right):
'''join two dataframes
Parameters
----------
dictionary: Input dict
left_key: dictionary key for left dataframe
right_key: dictionary key for right dataframe
left: list of columns for left join
right: list of columns for right join
Returns
-------
result: DataFrame
'''
#join the two DataFrames using the paramaters
joined = dictionary[left_key].merge(dictionary[right_key],
left_on=left,
right_on=right)
return joined
If we then use it as so:
preped_game_df = prep_games(games,
left_key='games_basic',
right_key='games_adv',
left=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'],
right=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'])
If we then print the dataframe, we get:
SEASON_ID TEAM_ID TEAM_ABBREVIATION TEAM_NAME_x \
0 22020 1610612746 LAC LA Clippers
1 22020 1610612748 MIA Miami Heat
2 22020 1610612741 CHI Chicago Bulls
3 22020 1610612755 PHI Philadelphia 76ers
4 22020 1610612750 MIN Minnesota Timberwolves
... ... ... ... ...
2097 22020 1610612758 SAC Sacramento Kings
2098 22020 1610612744 GSW Golden State Warriors
2099 22020 1610612747 LAL Los Angeles Lakers
2100 22020 1610612746 LAC LA Clippers
2101 22020 1610612751 BKN Brooklyn Nets
GAME_ID GAME_DATE MATCHUP WL MIN_x PTS ... TM_TOV_PCT \
0 0022001047 2021-05-13 LAC @ CHA W 239 113 ... 17.9
1 0022001050 2021-05-13 MIA vs. PHI W 241 106 ... 22.5
2 0022001051 2021-05-13 CHI vs. TOR W 240 114 ... 18.0
3 0022001050 2021-05-13 PHI @ MIA L 238 94 ... 14.4
4 0022001052 2021-05-13 MIN vs. DEN L 240 103 ... 15.3
... ... ... ... .. ... ... ... ...
2097 0022000019 2020-12-23 SAC @ DEN W 265 124 ... 13.5
2098 0022000001 2020-12-22 GSW @ BKN L 240 99 ... 16.1
2099 0022000002 2020-12-22 LAL vs. LAC L 240 109 ... 18.3
2100 0022000002 2020-12-22 LAC @ LAL W 241 116 ... 15.4
2101 0022000001 2020-12-22 BKN vs. GSW W 242 125 ... 17.9
EFG_PCT TS_PCT USG_PCT E_USG_PCT E_PACE PACE PACE_PER40 POSS \
0 0.673 0.689 1.0 0.200 94.70 94.00 78.33 95
1 0.586 0.617 1.0 0.197 91.32 89.50 74.58 89
2 0.645 0.652 1.0 0.199 100.16 99.00 82.50 100
3 0.506 0.541 1.0 0.194 91.32 89.50 74.58 90
4 0.525 0.547 1.0 0.200 99.46 98.50 82.08 98
... ... ... ... ... ... ... ... ...
2097 0.490 0.536 1.0 0.197 102.43 100.98 84.15 111
2098 0.424 0.454 1.0 0.201 113.60 112.00 93.33 112
2099 0.525 0.576 1.0 0.199 106.00 104.00 86.67 104
2100 0.548 0.572 1.0 0.198 106.00 104.00 86.67 104
2101 0.538 0.589 1.0 0.201 113.60 112.00 93.33 112
PIE
0 0.619
1 0.597
2 0.584
3 0.403
4 0.411
... ...
2097 0.473
2098 0.350
2099 0.509
2100 0.491
2101 0.650
[2102 rows x 54 columns]
You can see that we have joined our dataframes. Now, a quick note before we go on: there are quite a few columns that are redundant or unneeded for what we are doing here, so we could clear them out. But, seeing as we went through all the work to collect them, I want to keep them in case I need them in the future.
Now let’s get back to the second half of our function, which will basically use what Swar has already written for us. I’m only going to make some minor changes for what we’re doing. Our final function will look like this:
def prep_games(dictionary, left_key, right_key, left, right):
'''join two dataframes
Parameters
----------
dictionary: Input dict
left_key: dictionary key for left dataframe
right_key: dictionary key for right dataframe
left: list of columns for left join
right: list of columns for right join
Returns
-------
result: DataFrame
'''
#join the two DataFrames using the paramaters
joined = dictionary[left_key].merge(dictionary[right_key],
left_on=left,
right_on=right)
# Join every row to all others with the same game ID.
joined = pd.merge(joined, joined, suffixes=['_HOME', '_AWAY'],
on=['SEASON_ID', 'GAME_ID', 'GAME_DATE'])
# Filter out any row that is joined to itself.
result = joined[joined.TEAM_ID_HOME != joined.TEAM_ID_AWAY]
result = result[result.MATCHUP_HOME.str.contains(' vs. ')]
#lowercase the column names for easier access
result.columns = result.columns.str.lower()
return result
If we use the function and print the result again we will get something that looks like this:
SEASON_ID TEAM_ID_HOME TEAM_ABBREVIATION_HOME TEAM_NAME_x_HOME \
2 22020 1610612766 CHA Charlotte Hornets
5 22020 1610612748 MIA Miami Heat
9 22020 1610612741 CHI Chicago Bulls
13 22020 1610612750 MIN Minnesota Timberwolves
18 22020 1610612752 NYK New York Knicks
... ... ... ... ...
4185 22020 1610612739 CLE Cleveland Cavaliers
4190 22020 1610612741 CHI Chicago Bulls
4193 22020 1610612743 DEN Denver Nuggets
4198 22020 1610612751 BKN Brooklyn Nets
4201 22020 1610612747 LAL Los Angeles Lakers
GAME_ID GAME_DATE MATCHUP_HOME WL_HOME MIN_x_HOME PTS_HOME ... \
2 0022001047 2021-05-13 CHA vs. LAC L 239 90 ...
5 0022001050 2021-05-13 MIA vs. PHI W 241 106 ...
9 0022001051 2021-05-13 CHI vs. TOR W 240 114 ...
13 0022001052 2021-05-13 MIN vs. DEN L 240 103 ...
18 0022000458 2021-05-13 NYK vs. SAS W 240 102 ...
... ... ... ... ... ... ... ...
4185 0022000010 2020-12-23 CLE vs. CHA W 241 121 ...
4190 0022000015 2020-12-23 CHI vs. ATL L 240 104 ...
4193 0022000019 2020-12-23 DEN vs. SAC L 265 122 ...
4198 0022000001 2020-12-22 BKN vs. GSW W 242 125 ...
4201 0022000002 2020-12-22 LAL vs. LAC L 240 109 ...
TM_TOV_PCT_AWAY EFG_PCT_AWAY TS_PCT_AWAY USG_PCT_AWAY \
2 17.9 0.673 0.689 1.0
5 14.4 0.506 0.541 1.0
9 11.2 0.511 0.521 1.0
13 14.1 0.542 0.565 1.0
18 11.6 0.451 0.493 1.0
... ... ... ... ...
4185 14.7 0.589 0.590 1.0
4190 21.7 0.625 0.668 1.0
4193 13.5 0.490 0.536 1.0
4198 16.1 0.424 0.454 1.0
4201 15.4 0.548 0.572 1.0
E_USG_PCT_AWAY E_PACE_AWAY PACE_AWAY PACE_PER40_AWAY POSS_AWAY \
2 0.200 94.70 94.00 78.33 95
5 0.194 91.32 89.50 74.58 90
9 0.195 100.16 99.00 82.50 98
13 0.195 99.46 98.50 82.08 99
18 0.192 97.64 94.50 78.75 95
... ... ... ... ... ...
4185 0.197 105.20 102.00 85.00 102
4190 0.194 108.04 106.50 88.75 106
4193 0.197 102.43 100.98 84.15 111
4198 0.201 113.60 112.00 93.33 112
4201 0.198 106.00 104.00 86.67 104
PIE_AWAY
2 0.619
5 0.403
9 0.416
13 0.589
18 0.466
... ...
4185 0.443
4190 0.581
4193 0.473
4198 0.350
4201 0.491
[1051 rows x 105 columns]
We can even check the game ID column for duplicates in the following way:
are_dupes = not preped_game_df["game_id"].is_unique
print(are_dupes)
If we get `False`, then there are no duplicates.
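Another quick sanity check, as a sketch using the dataframes we already have in memory: since the two rows per game collapse into one, the prepped dataframe should have exactly half as many rows as the basic games dataframe.
# Each game contributed two rows before the merge and one row after.
print(len(preped_game_df), len(games['games_basic']) // 2)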
As I look through my data I realize that there are a couple of columns I want to have: home and away points per possession. Now, I could just calculate these on the fly when I run my simulations, but I'd like to keep the calculations inside the sim to a minimum so I can streamline it. So I want to take care of it now. Let's write a quick function that takes care of the calculations for us:
def calc_points_per_possession(df):
'''Calculate the home/away points per possession and update the dataframe'''
df['points_per_possession_home'] = df['pts_home'] / df['pace_home']
df['points_per_possession_away'] = df['pts_away'] / df['pace_home']
return df
The reason we can use `pace_home` for both calculations is that in any one game both teams play at the same pace. For example, in the CHA vs. LAC game above both teams played at a pace of 94 possessions, so the home team's 90 points works out to 90 / 94 ≈ 0.957 points per possession and the away team's 113 points to 113 / 94 ≈ 1.202, matching the new columns below.
We then call the function:
preped_game_df = calc_points_per_possession(preped_game_df)
Then if we look at the updated dataframe we will see:
season_id team_id_home team_abbreviation_home team_name_x_home \
2 22020 1610612766 CHA Charlotte Hornets
5 22020 1610612748 MIA Miami Heat
9 22020 1610612741 CHI Chicago Bulls
13 22020 1610612750 MIN Minnesota Timberwolves
18 22020 1610612752 NYK New York Knicks
... ... ... ... ...
4185 22020 1610612739 CLE Cleveland Cavaliers
4190 22020 1610612741 CHI Chicago Bulls
4193 22020 1610612743 DEN Denver Nuggets
4198 22020 1610612751 BKN Brooklyn Nets
4201 22020 1610612747 LAL Los Angeles Lakers
game_id game_date matchup_home wl_home min_x_home pts_home ... \
2 0022001047 2021-05-13 CHA vs. LAC L 239 90 ...
5 0022001050 2021-05-13 MIA vs. PHI W 241 106 ...
9 0022001051 2021-05-13 CHI vs. TOR W 240 114 ...
13 0022001052 2021-05-13 MIN vs. DEN L 240 103 ...
18 0022000458 2021-05-13 NYK vs. SAS W 240 102 ...
... ... ... ... ... ... ... ...
4185 0022000010 2020-12-23 CLE vs. CHA W 241 121 ...
4190 0022000015 2020-12-23 CHI vs. ATL L 240 104 ...
4193 0022000019 2020-12-23 DEN vs. SAC L 265 122 ...
4198 0022000001 2020-12-22 BKN vs. GSW W 242 125 ...
4201 0022000002 2020-12-22 LAL vs. LAC L 240 109 ...
ts_pct_away usg_pct_away e_usg_pct_away e_pace_away pace_away \
2 0.689 1.0 0.200 94.70 94.00
5 0.541 1.0 0.194 91.32 89.50
9 0.521 1.0 0.195 100.16 99.00
13 0.565 1.0 0.195 99.46 98.50
18 0.493 1.0 0.192 97.64 94.50
... ... ... ... ... ...
4185 0.590 1.0 0.197 105.20 102.00
4190 0.668 1.0 0.194 108.04 106.50
4193 0.536 1.0 0.197 102.43 100.98
4198 0.454 1.0 0.201 113.60 112.00
4201 0.572 1.0 0.198 106.00 104.00
pace_per40_away poss_away pie_away points_per_possession_home \
2 78.33 95 0.619 0.957447
5 74.58 90 0.403 1.184358
9 82.50 98 0.416 1.151515
13 82.08 99 0.589 1.045685
18 78.75 95 0.466 1.079365
... ... ... ... ...
4185 85.00 102 0.443 1.186275
4190 88.75 106 0.581 0.976526
4193 84.15 111 0.473 1.208160
4198 93.33 112 0.350 1.116071
4201 86.67 104 0.491 1.048077
points_per_possession_away
2 1.202128
5 1.050279
9 1.030303
13 1.157360
18 1.037037
... ...
4185 1.117647
4190 1.164319
4193 1.227966
4198 0.883929
4201 1.115385
[1051 rows x 107 columns]
Now that we have our data, we probably want somewhere to store it so we don't have to spend hours re-pulling everything from the API each time we spin things up. As I mentioned earlier, there are a few options:
- .csv files
- pickle
- sqlite
- MySql or PostgreSQL
- Other DB tech, ie. MongoDB
I decided to go with sqlite for a few reasons. While .csv files and pickle are easy to use, they don't offer much flexibility for data retrieval: basically, you get out what you put in. A database offers us the flexibility to grab only the data we need for our calculations. Also, I have a ton of experience with SQL, so that's a plus. I chose sqlite over the other DBs because it's super simple and doesn't need a service to run, so it just works.
So now we’re going to write a super simple function to save our dataframe to a sqlite database. We could use an ORM like SQLAlchemy for this, but I think that for something this simple it’s overkill.
First we need to update our imports to include sqlite:
from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
Now let’s write our function. This one’s short and sweet:
def save_df_to_db(df, db_name, table_name, exists='replace'):
'''Saves a dataframe to a table in a sqlite database
Parameters
----------
df: DataFrame to save to table
db_name: name of database
table_name: name of table
exists: what to do if table exists, default='replace'
Return
------
return:
'''
conn = None
try:
conn = sql.connect(db_name)
df.to_sql(table_name, conn, if_exists=exists)
except Exception as e:
print(e)
finally:
if conn:
conn.close()
A few notes on this. We're using some features that make this super easy: sqlite will automatically create the database file if it doesn't exist, and `to_sql` will create the table for us, so we don't have to create either explicitly. This is great for a rapid dev project like this. The `if_exists` argument tells `to_sql` what to do if the table already exists; our default of `'replace'` will drop the table and replace it with the current data.
We can call our function like this:
save_df_to_db(preped_game_df, 'nba_data.db', 'games')
Now there are a couple of ways we could check that this worked. One would be to query the database and see if the data is there (which we'll obviously be doing later). Or we could use a tool. My weapon of choice for this is [DB Browser for SQLite]. It's super simple to install, and then you can see whether your new table exists and whether it's full of data. I won't go over using it here, but you can check it out and use it.
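If you would rather verify from code right away, a minimal check might look like this (just a sketch, assuming the database and table names we used above):
import sqlite3 as sql
import pandas as pd
# Count the rows we just wrote to the 'games' table.
conn = sql.connect('nba_data.db')
print(pd.read_sql_query("SELECT COUNT(*) AS row_count FROM games", conn))
conn.close()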
Wow, that was a lot of work. But now we have all the data set up and ready to go. You might want to take a break and get something to drink (my current poison is MTN DEW Major Melon Zero). Then come back so we can get into the real guts of this, the simulator.
3 — Simulation
Hopefully you’re fully rejuvenated now and ready for the next part of our project: the simulation engine. There’s a lot going on here so first I’m going to start with an overview of what we are doing. First let’s talk about Monte Carlo, and I’m not talking about the city.
There are lots of good resources on what a Monte Carlo simulation is, but I find this one on [Towards Data Science] to be a good primer. Basically, the idea is that we take a number of known occurrences of a variable, then use those knowns to create a probability distribution. We can then randomly sample the distribution to predict an outcome.
So for our simulator we could take the number of points a team scores per game and use those results to create a normal distribution. We could then sample that distribution for a score, thus ‘simulating’ the score for a game.
As you may have noticed, I said 'normal distribution' in the preceding paragraph. Technically we could use other types of distribution (uniform, for example), but for simplicity we're going to stick with a normal/Gaussian distribution (i.e. the bell curve) in our simulation.
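To make that concrete, here's a tiny sketch of the idea; the mean of 110 points and standard deviation of 12 are made-up numbers, purely for illustration:
import random as rand
# Hypothetical team that averages 110 points with a standard deviation of 12.
simulated_scores = [round(rand.gauss(110, 12)) for _ in range(5)]
print(simulated_scores)  # five simulated game scores drawn from the bell curve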
Now that that's out of the way, we're ready to start building our simulation. But first we're going to need to get our data back out of our database and put it into a dataframe. So let's write a quick function for that:
def get_df_from_db(db_name, table_name, select=['*'],
order_by='', descending=False):
'''Runs a select from a database and returns a DataFrame
Parameters
----------
db_name: name of database
table_name: name of table
select: list of columns to select, default = ['*']
order_by: column name to order by
descending: whether to sort the order_by column in descending order
Returns
-------
result: DataFrame
'''
select_string = ', '.join(select)
order_by_string = ''
if order_by:
order_by_string = " ORDER BY " + order_by
if descending:
order_by_string = order_by_string + " DESC"
query_string = "SELECT " + select_string + " FROM " + table_name + order_by_string
conn = None
try:
conn = sql.connect(db_name)
df = pd.read_sql_query(query_string , conn)
except Exception as e:
print(e)
return
finally:
if conn:
conn.close()
return df
Now we can call this function like this:
games_df = get_df_from_db('nba_data.db','games', order_by='game_date')
Which will give us all the columns. Or we can just get the columns we want (`team_abbreviation_home`, `team_abbreviation_away`, `points_per_possession_home`, `points_per_possession_away`, and `pace_home`), like this:
games_df = get_df_from_db('nba_data.db','games',
select=['team_abbreviation_home', 'team_abbreviation_away',
'points_per_possession_home', 'points_per_possession_away',
'pace_home'],
order_by='game_date')
Now that we have our data, it's time to get to work on our simulation engine. I know this is going to be a more complicated function, so I want to map out the steps of how it works. Mostly I want to do this because I want the minimum number of calculations inside the simulation loop itself: if I want to run 10k simulations, I want them to run as efficiently as possible.
These are the steps for how the simulation engine will work:
- Filter dataframe into teams we want to simulate
- Calculate standard deviation and mean for each statistic
- Sample distributions for each statistic
- Determine winner of simulation
From this I can see that I want the sampling and the winner determination inside the loop, but I don't want steps 1 or 2 inside the loop. This helps me figure out the structure of my simulation.
Now that I have a basic layout, I can start putting the pieces together. The first step is the basic function shape and filtering the dataframes (I set the default teams to PHX and LAL, since I'm a Suns fan and hate the Lakers):
def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
'''This will run x number of game simulations'''
home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
Now we need to approach gathering all of the standard deviations and means for the calculation. Originally I was going to just put all the calculations into the function, but that gets messy. So instead I'm going to write a function that calculates the standard deviation and mean for a single column of a dataframe. Then I can just pass in the columns to get the stats.
First we need to update our imports with the stats package:
from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
import statistics as stats
Then we can write our function:
def calc_std_dev_and_mean(df_column):
'''calculates the standard deviation and mean of a DataFrame column, return dict'''
std_dev = df_column.std()
mean = df_column.mean()
stats_dict = {'std_dev': std_dev, 'mean': mean}
return stats_dict
Now that I have a simple helper function to get the standard deviation and mean, I'm going to build another function that uses it to calculate all the stats for a team. I'll pass it a team's dataframe and it will build a dictionary with all the stats needed for the simulation. That function looks like this:
def get_team_stats(df, home=True):
'''gets all the needed stats for a team for the simulator, return dict'''
pace = calc_std_dev_and_mean(df['pace_home'])
home_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_home'])
away_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_away'])
if home:
stats_dict = {"pace": pace,
"points_scored": home_pts_per_possession,
"points_given_up": away_pts_per_possession}
else:
stats_dict = {"pace": pace,
"points_scored": away_pts_per_possession,
"points_given_up": home_pts_per_possession}
return stats_dict
Now we can update our original function to:
def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
'''This will run x number of game simulations'''
home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
home_stats = get_team_stats(home_df)
away_stats = get_team_stats(away_df)
Now we're on the home stretch. We need to write a function that takes our two dictionaries of statistics and uses them to simulate a single game and choose a winner. I'll put it here, then we can talk about how it works:
def simulate_game(home_stats, away_stats):
'''uses stats from home and away teams to simulate a single game'''
home_pace = rand.gauss(home_stats['pace']['mean'],
home_stats['pace']['std_dev'])
home_pts_scored = rand.gauss(home_stats['points_scored']['mean'],
home_stats['points_scored']['std_dev'])
home_pts_given_up = rand.gauss(home_stats['points_given_up']['mean'],
home_stats['points_given_up']['std_dev'])
away_pace = rand.gauss(away_stats['pace']['mean'],
away_stats['pace']['std_dev'])
away_pts_scored = rand.gauss(away_stats['points_scored']['mean'],
away_stats['points_scored']['std_dev'])
away_pts_given_up = rand.gauss(away_stats['points_given_up']['mean'],
away_stats['points_given_up']['std_dev'])
pace = round((home_pace + away_pace) / 2)
home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)
away_pts = round(((away_pts_scored + home_pts_given_up) / 2) * pace)
results = {"home_pts": home_pts, "away_pts": away_pts, "pace": pace}
return results
The first part of the function uses the `.gauss()` function from the random package, which draws a random value from a Gaussian distribution defined by the given mean and standard deviation. We do that for each of the stats we need random values for. Then we use those to calculate what happens in our simulation: for each stat we take the home team's sampled value and the away team's sampled value and average them to get the simulated value. We then return the points for each team and the pace for the game.
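As a worked example of that averaging, with made-up sampled values just to show the arithmetic:
# Illustrative numbers only; in the real function these come from rand.gauss().
home_pace, away_pace = 98, 100
home_pts_scored, away_pts_given_up = 1.12, 1.05
pace = round((home_pace + away_pace) / 2)                              # 99
home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)   # 107
print(pace, home_pts)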
So now I can use this to test our `nba_game_simulator` so far. Here's our updated function:
def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
'''This will run x number of game simulations'''
home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
home_stats = get_team_stats(home_df)
away_stats = get_team_stats(away_df, home=False)
for i in range(sims):
game_result = simulate_game(home_stats, away_stats)
print(game_result)
Notice that I'm just using a simple `print` to show the result of a single simulation. So when we call the function like so:
nba_game_simulator(games_df)
We get something like the following:
{'home_pts': 101, 'away_pts': 109, 'pace': 95}
Now all we need to do is modify the interior of our loop to keep count of the winners, like so:
def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
'''This will run x number of game simulations'''
#containers for counts
home_wins = 0
away_wins = 0
home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
home_stats = get_team_stats(home_df)
away_stats = get_team_stats(away_df, home=False)
for i in range(sims):
game_result = simulate_game(home_stats, away_stats)
if (game_result['home_pts'] > game_result['away_pts']):
home_wins += 1
else:
away_wins += 1
results = {"home_wins": home_wins, "away_wins": away_wins}
return results
When I updated this, I realized that there is a problem in our game simulator: we can have ties, which are impossible in an NBA game. The easiest way to remedy this is to add a while loop to our `simulate_game` function. This is pretty simple:
def simulate_game(home_stats, away_stats):
'''uses stats from home and away teams to simulate a single game'''
# set the margin to 0 for our while loop
margin = 0
while(margin == 0):
home_pace = rand.gauss(home_stats['pace']['mean'],
home_stats['pace']['std_dev'])
home_pts_scored = rand.gauss(home_stats['points_scored']['mean'],
home_stats['points_scored']['std_dev'])
home_pts_given_up = rand.gauss(home_stats['points_given_up']['mean'],
home_stats['points_given_up']['std_dev'])
away_pace = rand.gauss(away_stats['pace']['mean'],
away_stats['pace']['std_dev'])
away_pts_scored = rand.gauss(away_stats['points_scored']['mean'],
away_stats['points_scored']['std_dev'])
away_pts_given_up = rand.gauss(away_stats['points_given_up']['mean'],
away_stats['points_given_up']['std_dev'])
pace = round((home_pace + away_pace) / 2)
home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)
away_pts = round(((away_pts_scored + home_pts_given_up) / 2) * pace)
margin = home_pts - away_pts
results = {"home_pts": home_pts, "away_pts": away_pts, "pace": pace}
return results
So now we can run a 10k-game simulation of Suns vs. Lakers and print the results, like so:
suns_v_lakers = nba_game_simulator(games_df, sims=10000)
suns_win_pct = round((suns_v_lakers['home_wins'] / 10000) * 100)
lakers_win_pct = 100 - suns_win_pct
print("SUNS: " + str(suns_win_pct) + "% | LAKERS: " + str(lakers_win_pct) + "%")
Which will give us something like this:
SUNS: 64% | LAKERS: 36%
Once I set this up, I realized I wanted more information than just the win counts: things like average points, average margin of victory, etc. So we need to make just a couple more changes to our `nba_game_simulator` and `simulate_game` functions:
def simulate_game(home_stats, away_stats):
'''uses stats from home and away teams to simulate a single game'''
# set the margin to 0 for our while loop
margin = 0
while(margin == 0):
home_pace = rand.gauss(home_stats['pace']['mean'],
home_stats['pace']['std_dev'])
home_pts_scored = rand.gauss(home_stats['points_scored']['mean'],
home_stats['points_scored']['std_dev'])
home_pts_given_up = rand.gauss(home_stats['points_given_up']['mean'],
home_stats['points_given_up']['std_dev'])
away_pace = rand.gauss(away_stats['pace']['mean'],
away_stats['pace']['std_dev'])
away_pts_scored = rand.gauss(away_stats['points_scored']['mean'],
away_stats['points_scored']['std_dev'])
away_pts_given_up = rand.gauss(away_stats['points_given_up']['mean'],
away_stats['points_given_up']['std_dev'])
pace = round((home_pace + away_pace) / 2)
home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)
away_pts = round(((away_pts_scored + home_pts_given_up) / 2) * pace)
margin = home_pts - away_pts
results = {"home_pts": home_pts, "away_pts": away_pts, "pace": pace, "margin": margin}
return results
def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
'''This will run x number of game simulations'''
#containers for counts
home_wins = 0
away_wins = 0
#containers for stats
home_margins = []
away_margins = []
home_pts = []
away_pts = []
home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
home_stats = get_team_stats(home_df)
away_stats = get_team_stats(away_df, home=False)
for i in range(sims):
game_result = simulate_game(home_stats, away_stats)
home_pts.append(game_result['home_pts'])
away_pts.append(game_result['away_pts'])
if (game_result['margin'] > 0):
home_wins += 1
home_margins.append(game_result['margin'])
else:
away_wins += 1
away_margins.append(-game_result['margin'])
results = {"home_wins": home_wins, "away_wins": away_wins,
"home_margins": home_margins, "away_margins": away_margins,
"home_pts": home_pts, "away_pts": away_pts}
return results
Then we can re-run the simulation (the same call as before) and print the expanded results like so:
suns_win_pct = round((suns_v_lakers['home_wins'] / 10000) * 100)
lakers_win_pct = 100 - suns_win_pct
suns_avg_margin = round(stats.mean(suns_v_lakers['home_margins']))
lakers_avg_margin = round(stats.mean(suns_v_lakers['away_margins']))
suns_avg_pts = round(stats.mean(suns_v_lakers['home_pts']))
lakers_avg_pts = round(stats.mean(suns_v_lakers['away_pts']))
print("SUNS: " + str(suns_win_pct) + "% | LAKERS: " + str(lakers_win_pct) + "%")
print("SUNS: Avg. Margin of Victory = " + str(suns_avg_margin))
print("LAKERS: Avg. Margin of Victory = " + str(lakers_avg_margin))
print("SUNS: Avg. Points = " + str(suns_avg_pts))
print("LAKERS: Avg. Points = " + str(lakers_avg_pts))
Which gives us something like this:
SUNS: 64% | LAKERS: 36%
SUNS: Avg. Margin of Victory = 10
LAKERS: Avg. Margin of Victory = 8
SUNS: Avg. Points = 111
LAKERS: Avg. Points = 107
Wow! That was a lot. This was definitely the hardest/longest part of this guide. All that we have left to do is roll this up in a simple Flask/Dash app shell and then we'll be done. So go grab a snack and then we'll get started on the final piece of the puzzle: putting this into a nice web app.
4 — Dash
Hopefully you replenished your strength after that grueling section on building our simulation. Now we're on to the last section, which is putting together a web-based app to show the results of our hard work.
A couple of notes before we get started on this last section. We have some options on how to lay out our codebase. We could just have one large file with all the functions inside it, including the Flask/Dash app. This is by far the least efficient, as it would re-collect the data and rebuild the database every time it runs. That would be bad. We could do two files: one that contains all of the data collection/wrangling, and one that has our web app and the simulation functions. There's nothing wrong with this, and it would work just fine. But I think our third option is the best: one file for the data work, one for the simulation, and then a file for the app that calls the simulation code. So this is how we could set up the codebase:
nba-game-sim
|-- app.py
|-- generatedata.py
|-- simulategames.py
|-- nba_data.db
If we assume this structure, then `generatedata.py` would look like this:
from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
import statistics as stats
def get_game_data(season, league_id='00', test=False):
'''Query NBA stats via the nba-api
Parameters
----------
season: format is yyyy-yy ie. 2020-21
league_id = '00' => NBA
Returns
-------
result: dict of 2 DataFrames {"games_basic": , "games_adv": }
'''
#first we need to get all of the basic game data
gamefinder = leaguegamefinder.LeagueGameFinder(league_id_nullable=league_id,
season_nullable=season,
season_type_nullable=SeasonType.regular)
games_basic_df = gamefinder.get_data_frames()[0]
#then we need to get the advanced game data. Slow, game by game.
#we need to get a list of the games
game_ids = list(games_basic_df['GAME_ID'].unique())
games_adv_df = pd.DataFrame()
if test:
c = 0 # this is here for testing only
#now we iterate over the list of game_ids
while len(game_ids) > 0:
for i in game_ids:
cooldown = rand.gammavariate(alpha=9, beta=0.4)
clear_output(wait=True)
for attempt in range(10):
try:
time.sleep(cooldown)
data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
end_range=0,
game_id=i,
range_type=0,
start_period=1,
start_range=0)
except:
time.sleep(cooldown)
data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
end_range=0,
game_id=i,
range_type=0,
start_period=1,
start_range=0)
else:
break
else:
print('Connection Error')
break
data_df = data.get_data_frames()[1]
if games_adv_df.empty:
games_adv_df = data_df
else:
games_adv_df = pd.concat([games_adv_df, data_df], ignore_index=True) # pd.concat instead of the removed DataFrame.append
game_ids.remove(i)
print (i , "completed", len(game_ids), "games left", sep="---")
if test:
c += 1
if c > 3: #again for testing only
games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
print('DONE!')
return games_df_dict
games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
print('DONE!')
return games_df_dict
def prep_games(dictionary, left_key, right_key, left, right):
'''join two dataframes
Parameters
----------
dictionary: Input dict
left_key: dictionary key for left dataframe
right_key: dictionary key for right dataframe
left: list of columns for left join
right: list of columns for right join
Returns
-------
result: DataFrame
'''
#join the two DataFrames using the paramaters
joined = dictionary[left_key].merge(dictionary[right_key],
left_on=left,
right_on=right)
# Join every row to all others with the same game ID.
joined = pd.merge(joined, joined, suffixes=['_HOME', '_AWAY'],
on=['SEASON_ID', 'GAME_ID', 'GAME_DATE'])
# Filter out any row that is joined to itself.
result = joined[joined.TEAM_ID_HOME != joined.TEAM_ID_AWAY]
result = result[result.MATCHUP_HOME.str.contains(' vs. ')]
#lowercase the column names for easier access
result.columns = result.columns.str.lower()
return result
def calc_points_per_possession(df):
'''Calculate the home/away points per possession and update the dataframe'''
df['points_per_possession_home'] = df['pts_home'] / df['pace_home']
df['points_per_possession_away'] = df['pts_away'] / df['pace_home']
return df
def save_df_to_db(df, db_name, table_name, exists='replace'):
'''Saves a dataframe to a table in a sqlite database
Parameters
----------
df: DataFrame to save to table
db_name: name of database
table_name: name of table
exists: what to do if table exists, default='replace'
Return
------
return:
'''
conn = None
try:
conn = sql.connect(db_name)
df.to_sql(table_name, conn, if_exists=exists)
except Exception as e:
print(e)
finally:
if conn:
conn.close()
def generate_season_data(season):
'''this will call all the functions needed to get and save a season of data'''
df = get_game_data(season)
df = prep_games(df,
left_key='games_basic',
right_key='games_adv',
left=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'],
right=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'])
df = calc_points_per_possession(df)
save_df_to_db(df, 'nba_data.db', 'games')
return True
season_data = generate_season_data('2020-21')
if season_data:
print(season_data, "SEASON GENERATED AND SAVED SUCCESFULLY")
You may have noticed a new function at the end, `generate_season_data`. It calls all the other processing functions in order, so we only have to make a single call at the bottom of the file.
Our other file, `simulategames.py`, will look like this:
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
import statistics as stats
def get_df_from_db(db_name, table_name, select=['*'],
order_by='', descending=False):
'''Runs a select from a database and returns a DataFrame
Parameters
----------
db_name: name of database
table_name: name of table
select: list of columns to select, default = ['*']
order_by: column name to order by
descending: whether to sort the order_by column in descending order
Returns
-------
result: DataFrame
'''
select_string = ', '.join(select)
order_by_string = ''
if order_by:
order_by_string = " ORDER BY " + order_by
if descending:
order_by_string = order_by_string + " DESC"
query_string = "SELECT " + select_string + " FROM " + table_name + order_by_string
conn = None
try:
conn = sql.connect(db_name)
df = pd.read_sql_query(query_string , conn)
except Exception as e:
print(e)
return
finally:
if conn:
conn.close()
return df
def calc_std_dev_and_mean(df_column):
'''calculates the standard deviation and mean of a DataFrame column, return dict'''
std_dev = df_column.std()
mean = df_column.mean()
stats_dict = {'std_dev': std_dev, 'mean': mean}
return stats_dict
def get_team_stats(df, home=True):
'''gets all the needed stats for a team for the simulator, return dict'''
pace = calc_std_dev_and_mean(df['pace_home'])
home_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_home'])
away_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_away'])
if home:
stats_dict = {"pace": pace,
"points_scored": home_pts_per_possession,
"points_given_up": away_pts_per_possession}
else:
stats_dict = {"pace": pace,
"points_scored": away_pts_per_possession,
"points_given_up": home_pts_per_possession}
return stats_dict
def simulate_game(home_stats, away_stats):
'''uses stats from home and away teams to simulate a single game'''
# set the margin to 0 for our while loop
margin = 0
while(margin == 0):
home_pace = rand.gauss(home_stats['pace']['mean'],
home_stats['pace']['std_dev'])
home_pts_scored = rand.gauss(home_stats['points_scored']['mean'],
home_stats['points_scored']['std_dev'])
home_pts_given_up = rand.gauss(home_stats['points_given_up']['mean'],
home_stats['points_given_up']['std_dev'])
away_pace = rand.gauss(away_stats['pace']['mean'],
away_stats['pace']['std_dev'])
away_pts_scored = rand.gauss(away_stats['points_scored']['mean'],
away_stats['points_scored']['std_dev'])
away_pts_given_up = rand.gauss(away_stats['points_given_up']['mean'],
away_stats['points_given_up']['std_dev'])
pace = round((home_pace + away_pace) / 2)
home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)
away_pts = round(((away_pts_scored + home_pts_given_up) / 2) * pace)
margin = home_pts - away_pts
results = {"home_pts": home_pts, "away_pts": away_pts, "pace": pace, "margin": margin}
return results
def retrieve_data():
'''retrieves data_frame'''
games_df = get_df_from_db('nba_data.db','games',
select=['team_abbreviation_home', 'team_abbreviation_away',
'points_per_possession_home', 'points_per_possession_away',
'pace_home'],
order_by='game_date')
return games_df
We added a new function here, `retrieve_data`: just a simple wrapper around the `get_df_from_db` call.
Now before starting the web app I want to think about what it’s going to look like. I do a couple sketches until I get one I like. This is what I designed:
On the left are the controls, which are simple: just two dropdowns, one for the home team and one for the away team. Below that I want some numerical results like in our tests, only in a table layout. The right side will be a large histogram of the results. I think that will be cool, because on tools like this you usually just get the percentages, not a histogram.
Now that I have my layout, I realize that I'm going to need some more data from my database: the abbreviations for each team. That means we need to update `simulategames.py`. Luckily we don't need to create a new function; we can just use our `get_df_from_db` function to grab what we need. We just need to update `retrieve_data` like this:
def retrieve_data():
'''retrieves data_frame'''
games_df = get_df_from_db('nba_data.db','games',
select=['team_abbreviation_home', 'team_abbreviation_away',
'points_per_possession_home', 'points_per_possession_away',
'pace_home'],
order_by='game_date')
teams_df = get_df_from_db('nba_data.db', 'games',
select=['DISTINCT team_abbreviation_home'],
order_by='team_abbreviation_home')
dfs_dict = {"games":games_df, "teams":teams_df}
return dfs_dict
Now I’ll get a dictionary back with two dataframes: one for the games and one with the team abbreviations.
Now that the housekeeping is out of the way, it's time to build our web app. This isn't a full-blown tutorial on how to build an app with Dash; there are tons of those out there, and I found the [docs] to be the most useful. We are also going to use [bootstrap components], so you may want to take a look at their docs too.
All right. First we need to install some dependencies. This can be done via pip:
pip install dash
pip install dash-renderer
pip install dash-html-components
pip install dash-core-components
pip install dash-bootstrap-components
pip install plotly
Next we open up `app.py`. Let's start with our imports. We'll be importing dash and also our `simulategames.py` file.
import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_bootstrap_components as dbc
from dash.dependencies import Input, Output
import simulategames
First let’s get the data we need.
nba_data = simulategames.retrieve_data()
Now we can start building our site. First we need to initialize our app. We pass in a stylesheet for Bootstrap, which makes styling easier.
external_stylesheets = [dbc.themes.BOOTSTRAP]
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
Now that our app is started we can start building the structure. The easiest way to do this is to start at the component level, then build outwards until we have our main structure. So first we’ll make our controls.
controls = dbc.Card(
[
dbc.FormGroup(
[
dbc.Label("Home Team"),
dcc.Dropdown(
id='hteam',
options=get_options(nba_data['teams'], 'team_abbreviation_home'),
value='PHX'
),
]
),
dbc.FormGroup(
[
dbc.Label('Away Team'),
dcc.Dropdown(
id='ateam',
options=get_options(nba_data['teams'], 'team_abbreviation_home'),
value='LAL'
),
]
)
],
body=True
)
These are form groups, one for the home team and one for the away team. As you can see, the dropdowns use a new function that takes our teams dataframe and makes an options list out of it. Here is the function:
def get_options(df, colname):
'''take a column from a dataframe and create and options list'''
new_list = df[colname].tolist()
options = []
for i in new_list:
options.append({'label': i, 'value': i})
return options
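For reference, calling it on the teams dataframe we pulled back in `retrieve_data` produces the list of dicts that `dcc.Dropdown` expects; `team_options` here is just an illustrative name:
team_options = get_options(nba_data['teams'], 'team_abbreviation_home')
print(team_options[:3])  # e.g. [{'label': 'ATL', 'value': 'ATL'}, ...]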
Now we need to make a table to display our data:
table_header = [
html.Thead(html.Tr([html.Th() , html.Th("Home Team") , html.Th("Away Team")]))
]
row1 = html.Tr([html.Th("Win %"),
html.Td(id='htwinper'),
html.Td(id='atwinper')])
row2 = html.Tr([html.Th("Avg. Margin of Victory"),
html.Td(id='htmvict'),
html.Td(id='atmvict')])
row3 = html.Tr([html.Th("Avg. Points Scored"),
html.Td(id='htpts'),
html.Td(id='atpts')])
table_body = [html.Tbody([row1 , row2, row3])]
table = dbc.Table(table_header + table_body)
simResults = dbc.Card(
[
html.H4('Simulation Results'),
table
],
body=True
)
Here we first make the table header. Then we make each row and put them together in the table body. Then we take that table and put it into a card.
Now that we have our components we can put the page together like so:
app.layout = dbc.Container(
[
html.H1(children='Choose two teams to see the results of 10k sims', style={'textAlign':'center'}),
html.Hr(),
dbc.Row(
[
dbc.Col([controls, html.Br(), simResults], md=4),
dbc.Col([html.H3('All margins of victory | Home - Away'), dcc.Graph(id='histogram')] , md=8)
],
align='center',
),
],
fluid=True
)
Now if we want to test our layout we can add this at the end of our file:
if __name__ == '__main__':
app.run_server(debug=True)
Save the file, then navigate to the project folder and run app.py to test it out. The site will be available at [http://127.0.0.1:8050]. It should look something like this:
As you can see we have a layout but no data, so now we need to wire up the data. The way we populate our page is via a callback function. There are two parts to a callback: a decorator and the function itself. Our decorator will look like this:
@app.callback(
Output('histogram' , 'figure'),
Output('htwinper' , 'children'),
Output('atwinper' , 'children'),
Output('htmvict' , 'children'),
Output('atmvict' , 'children'),
Output('htpts' , 'children'),
Output('atpts', 'children'),
Input('hteam' , 'value' ),
Input('ateam' , 'value'))
The first part of the decorator lists the outputs the callback will update. The last part lists the inputs (the dropdown values) that feed the callback. The values the callback returns must line up, in order, with the Output declarations.
Now we need our callback function:
def callback( home , away):
results = simulategames.nba_game_simulator(nba_data['games'],
home, away, sims=10000)
return px.histogram(results['margins']),\
results['home_win_pct'],\
results['away_win_pct'],\
results['home_avg_margin'],\
results['away_avg_margin'],\
results['home_avg_pts'],\
results['away_avg_pts']
For the histogram we are going to need Plotly Express, so we need to add that to our imports:
import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_bootstrap_components as dbc
from dash.dependencies import Input, Output
import plotly.express as px
import simulategames
We also need to edit our nba_game_simulator function to return all the calculations like so:
def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
'''This will run x number of game simulations'''
#containers for counts
home_wins = 0
away_wins = 0
#containers for stats
margins = []
home_margins = []
away_margins = []
home_pts = []
away_pts = []
home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
home_stats = get_team_stats(home_df)
away_stats = get_team_stats(away_df, home=False)
for i in range(sims):
game_result = simulate_game(home_stats, away_stats)
home_pts.append(game_result['home_pts'])
away_pts.append(game_result['away_pts'])
margins.append(game_result['margin'])
if (game_result['margin'] > 0):
home_wins += 1
home_margins.append(game_result['margin'])
else:
away_wins += 1
away_margins.append(-game_result['margin'])
#win counts as a percentage of the number of sims
home_win_pct = round(home_wins / sims * 100)
away_win_pct = round(away_wins / sims * 100)
home_avg_pts = round(stats.mean(home_pts))
away_avg_pts = round(stats.mean(away_pts))
home_avg_margin = round(stats.mean(home_margins))
away_avg_margin = round(stats.mean(away_margins))
results = {"home_avg_margin": home_avg_margin, "away_avg_margin": away_avg_margin,
"home_win_pct": home_win_pct, "away_win_pct": away_win_pct,
"home_avg_pts": home_avg_pts, "away_avg_pts": away_avg_pts,
"margins": margins,}
return results
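Before wiring this into the page, it’s worth running the simulator once outside of Dash to make sure the new return shape looks right. A minimal check (again assuming nba_data.db exists) could look like this:
data = retrieve_data()
summary = nba_game_simulator(data['games'], home='PHX', away='LAL', sims=10000)
print(summary['home_win_pct'], summary['away_win_pct'])        # win percentages
print(summary['home_avg_margin'], summary['away_avg_margin'])  # average margins of victory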
Now, for the moment of truth. If you head back to where you have your app spun up at [http://127.0.0.1:8050/] you should see something like this:
When you change the teams it should live update the results.
Congratulations! We’re done.
Here is the final code for our three .py files. You can also view the [repo on my github].
generatedata.py
from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
import statistics as stats
def get_game_data(season, league_id='00', test=False):
'''Query NBA stats via the nba-api
Parameters
----------
season: format is yyyy-yy ie. 2020-21
league_id = '00' => NBA
Returns
-------
result: dict of 2 DataFrames {"games_basic": , "games_adv": }
'''
#first we need to get all of the basic game data
gamefinder = leaguegamefinder.LeagueGameFinder(league_id_nullable=league_id,
season_nullable=season,
season_type_nullable=SeasonType.regular)
games_basic_df = gamefinder.get_data_frames()[0]
#then we need to get the advanced game data. Slow, game by game.
#we need to get a list of the games
game_ids = list(games_basic_df['GAME_ID'].unique())
games_adv_df = pd.DataFrame()
if test:
c = 0 # this is here for testing only
#now we iterate over the list of game_ids
while len(game_ids) > 0:
for i in game_ids:
cooldown = rand.gammavariate(alpha=9, beta=0.4)
clear_output(wait=True)
for attempt in range(10):
try:
time.sleep(cooldown)
data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
end_range=0,
game_id=i,
range_type=0,
start_period=1,
start_range=0)
except:
time.sleep(cooldown)
data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
end_range=0,
game_id=i,
range_type=0,
start_period=1,
start_range=0)
else:
break
else:
print('Connection Error')
break
data_df = data.get_data_frames()[1]
if games_adv_df.empty:
games_adv_df = data_df
else:
games_adv_df = games_adv_df.append(data_df, ignore_index=True)
game_ids.remove(i)
print (i , "completed", len(game_ids), "games left", sep="---")
if test:
c += 1
if c > 3: #again for testing only
games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
print('DONE!')
return games_df_dict
games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
print('DONE!')
return games_df_dict
def prep_games(dictionary, left_key, right_key, left, right):
'''join two dataframes
Parameters
----------
dictionary: Input dict
left_key: dictionary key for left dataframe
right_key: dictionary key for right dataframe
left: list of columns for left join
right: list of columns for right join
Returns
-------
result: DataFrame
'''
#join the two DataFrames using the paramaters
joined = dictionary[left_key].merge(dictionary[right_key],
left_on=left,
right_on=right)
# Join every row to all others with the same game ID.
joined = pd.merge(joined, joined, suffixes=['_HOME', '_AWAY'],
on=['SEASON_ID', 'GAME_ID', 'GAME_DATE'])
# Filter out any row that is joined to itself.
result = joined[joined.TEAM_ID_HOME != joined.TEAM_ID_AWAY]
result = result[result.MATCHUP_HOME.str.contains(' vs. ')]
#lowercase the column names for easier access
result.columns = result.columns.str.lower()
return result
def calc_points_per_possession(df):
'''Calculate the home/away points per possession and update the dataframe'''
df['points_per_possession_home'] = df['pts_home'] / df['pace_home']
df['points_per_possession_away'] = df['pts_away'] / df['pace_home']
return df
def save_df_to_db(df, db_name, table_name, exists='replace'):
'''Saves a dataframe to a table in a sqlite database
Parameters
----------
df: DataFrame to save to table
db_name: name of database
table_name: name of table
exists: what to do if table exists, default='replace'
Return
------
return:
'''
conn = None
try:
conn = sql.connect(db_name)
df.to_sql(table_name, conn, if_exists=exists)
except Exception as e:
print(e)
finally:
if conn:
conn.close()
def generate_season_data(season):
'''this will call all the functions needed to get and save a season of data'''
df = get_game_data(season)
df = prep_games(df,
left_key='games_basic',
right_key='games_adv',
left=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'],
right=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'])
df = calc_points_per_possession(df)
save_df_to_db(df, 'nba_data.db', 'games')
return True
season_data = generate_season_data('2020-21')
if season_data:
print(season_data, "SEASON GENERATED AND SAVED SUCCESSFULLY")
simulategames.py
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
import statistics as stats
def get_df_from_db(db_name, table_name, select=['*'],
order_by='', descending=False):
'''Runs a select from a database and returns a DataFrame
Parameters
----------
db_name: name of database
table_name: name of table
select: list of columns to select, default = ['*']
order_by: column name to order by
descending: if order by is to be set as ascending
Returns
-------
result: DataFrame
'''
select_string = ', '.join(select)
order_by_string = ''
if order_by:
order_by_string = " ORDER BY " + order_by
if descending:
order_by_string = order_by_string + " DESC"
query_string = "SELECT " + select_string + " FROM " + table_name + order_by_string
conn = None
try:
conn = sql.connect(db_name)
df = pd.read_sql_query(query_string , conn)
except Exception as e:
print(e)
return
finally:
if conn:
conn.close()
return df
def calc_std_dev_and_mean(df_column):
'''calculates the standard deviation and mean of a DataFrame column, return dict'''
std_dev = df_column.std()
mean = df_column.mean()
stats_dict = {'std_dev': std_dev, 'mean': mean}
return stats_dict
def get_team_stats(df, home=True):
'''gets all the needed stats for a team for the simulator, return dict'''
pace = calc_std_dev_and_mean(df['pace_home'])
home_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_home'])
away_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_away'])
if home:
stats_dict = {"pace": pace,
"points_scored": home_pts_per_possession,
"points_given_up": away_pts_per_possession}
else:
stats_dict = {"pace": pace,
"points_scored": away_pts_per_possession,
"points_given_up": home_pts_per_possession}
return stats_dict
def simulate_game(home_stats, away_stats):
'''uses stats from home and away teams to simulate a single game'''
# set the margin to 0 for our while loop
margin = 0
while(margin == 0):
home_pace = rand.gauss(home_stats['pace']['mean'],
home_stats['pace']['std_dev'])
home_pts_scored = rand.gauss(home_stats['points_scored']['mean'],
home_stats['points_scored']['std_dev'])
home_pts_given_up = rand.gauss(home_stats['points_given_up']['mean'],
home_stats['points_given_up']['std_dev'])
away_pace = rand.gauss(away_stats['pace']['mean'],
away_stats['pace']['std_dev'])
away_pts_scored = rand.gauss(away_stats['points_scored']['mean'],
away_stats['points_scored']['std_dev'])
away_pts_given_up = rand.gauss(away_stats['points_given_up']['mean'],
away_stats['points_given_up']['std_dev'])
pace = round((home_pace + away_pace) / 2)
home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)
away_pts = round(((away_pts_scored + home_pts_given_up) / 2) * pace)
margin = home_pts - away_pts
results = {"home_pts": home_pts, "away_pts": away_pts, "pace": pace, "margin": margin}
return results
def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
'''This will run x number of game simulations'''
#containers for counts
home_wins = 0
away_wins = 0
#containers for stats
margins = []
home_margins = []
away_margins = []
home_pts = []
away_pts = []
home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
home_stats = get_team_stats(home_df)
away_stats = get_team_stats(away_df, home=False)
for i in range(sims):
game_result = simulate_game(home_stats, away_stats)
home_pts.append(game_result['home_pts'])
away_pts.append(game_result['away_pts'])
margins.append(game_result['margin'])
if (game_result['margin'] > 0):
home_wins += 1
home_margins.append(game_result['margin'])
else:
away_wins += 1
away_margins.append(-game_result['margin'])
#win counts as a percentage of the number of sims
home_win_pct = round(home_wins / sims * 100)
away_win_pct = round(away_wins / sims * 100)
home_avg_pts = round(stats.mean(home_pts))
away_avg_pts = round(stats.mean(away_pts))
home_avg_margin = round(stats.mean(home_margins))
away_avg_margin = round(stats.mean(away_margins))
results = {"home_avg_margin": home_avg_margin, "away_avg_margin": away_avg_margin,
"home_win_pct": home_win_pct, "away_win_pct": away_win_pct,
"home_avg_pts": home_avg_pts, "away_avg_pts": away_avg_pts,
"margins": margins,}
return results
def retrieve_data():
'''retrieves data_frame'''
games_df = get_df_from_db('nba_data.db','games',
select=['team_abbreviation_home', 'team_abbreviation_away',
'points_per_possession_home', 'points_per_possession_away',
'pace_home'],
order_by='game_date')
teams_df = get_df_from_db('nba_data.db', 'games',
select=['DISTINCT team_abbreviation_home'],
order_by='team_abbreviation_home')
dfs_dict = {"games":games_df, "teams":teams_df}
return dfs_dict
app.py
import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_bootstrap_components as dbc
from dash.dependencies import Input, Output
import plotly.express as px
import simulategames
nba_data = simulategames.retrieve_data()
external_stylesheets = [dbc.themes.BOOTSTRAP]
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
def get_options(df, colname):
'''take a column from a dataframe and create an options list'''
new_list = df[colname].tolist()
options_list = []
for i in new_list:
options_list.append({'label': i, 'value': i})
return options_list
controls = dbc.Card(
[
dbc.FormGroup(
[
dbc.Label("Home Team"),
dcc.Dropdown(
id='hteam',
options=get_options(nba_data['teams'], 'team_abbreviation_home'),
value='PHX'
),
]
),
dbc.FormGroup(
[
dbc.Label('Away Team'),
dcc.Dropdown(
id='ateam',
options=get_options(nba_data['teams'], 'team_abbreviation_home'),
value='LAL'
),
]
)
],
body=True
)
table_header = [
html.Thead(html.Tr([html.Th() , html.Th("Home Team") , html.Th("Away Team")]))
]
row1 = html.Tr([html.Th("Win %"),
html.Td(id='htwinper'),
html.Td(id='atwinper')])
row2 = html.Tr([html.Th("Avg. Margin of Victory"),
html.Td(id='htmvict'),
html.Td(id='atmvict')])
row3 = html.Tr([html.Th("Avg. Points Scored"),
html.Td(id='htpts'),
html.Td(id='atpts')])
table_body = [html.Tbody([row1 , row2, row3])]
table = dbc.Table(table_header + table_body)
simResults = dbc.Card(
[
html.H4('Simulation Results'),
table
],
body=True
)
app.layout = dbc.Container(
[
html.H1(children='Choose two teams to see the results of 10k sims', style={'textAlign':'center'}),
html.Hr(),
dbc.Row(
[
dbc.Col([controls, html.Br(), simResults], md=4),
dbc.Col([html.H3('All margins of victory | Home - Away'), dcc.Graph(id='histogram')] , md=8)
],
align='center',
),
],
fluid=True
)
@app.callback(
Output('histogram' , 'figure'),
Output('htwinper' , 'children'),
Output('atwinper' , 'children'),
Output('htmvict' , 'children'),
Output('atmvict' , 'children'),
Output('htpts' , 'children'),
Output('atpts', 'children'),
Input('hteam' , 'value' ),
Input('ateam' , 'value'))
def callback( home , away):
results = simulategames.nba_game_simulator(nba_data['games'],
home, away, sims=10000)
return px.histogram(results['margins']),\
results['home_win_pct'],\
results['away_win_pct'],\
results['home_avg_margin'],\
results['away_avg_margin'],\
results['home_avg_pts'],\
results['away_avg_pts']
if __name__ == '__main__':
app.run_server(debug=True)
Next steps
This app obviously has a lot of room for more work. Here are some ideas I thought of that could enhance it.
- You could add some more stats, such as:
  - Top margins of victory along with counts
  - Highest scores
  - Lowest scores
  - etc.
- You could set it up on Heroku or Pythonanywhere ([mine is up on Pythonanywhere, with margin counts]).
- You could rewrite the data collection so it only grabs the missing games instead of re-pulling everything, and then use a cron job to run it (see the sketch after this list). *A quick note on this: if you go that route, make sure not to run the data collection from AWS or Google Cloud; their IPs are all blocked by the API.*
- You could build a season simulator on top of this that runs the entire season/playoffs.
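For the incremental-update idea, here is a rough sketch of one way it could work, reusing the helpers from generatedata.py. The update_season_data name is mine, and this isn’t optimized: it still pulls the full season from the API and then only appends rows newer than what’s already stored.
import sqlite3 as sql
import pandas as pd

def update_season_data(season, db_name='nba_data.db', table_name='games'):
    '''Fetch a season, keep only games newer than what is already saved, and append them.'''
    # find the most recent game date already in the table
    conn = sql.connect(db_name)
    last_date = pd.read_sql_query(
        "SELECT MAX(game_date) AS last_date FROM " + table_name, conn)['last_date'].iloc[0]
    conn.close()
    # pull and prep the season the same way generate_season_data does
    raw = get_game_data(season)
    df = prep_games(raw,
                    left_key='games_basic', right_key='games_adv',
                    left=['GAME_ID', 'TEAM_ID', 'TEAM_ABBREVIATION'],
                    right=['GAME_ID', 'TEAM_ID', 'TEAM_ABBREVIATION'])
    df = calc_points_per_possession(df)
    # game_date comes back as a 'YYYY-MM-DD' string, so a string comparison is enough
    if last_date is None or pd.isna(last_date):
        new_rows = df
    else:
        new_rows = df[df['game_date'] > last_date]
    save_df_to_db(new_rows, db_name, table_name, exists='append')
    return len(new_rows)
You could then drop a call to update_season_data into a small script and point a cron job at it.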
Final thoughts
So there you have it. I really enjoyed designing this app and writing the guide. And I hope you enjoyed it as well.
My goal is to write one of these a week, or more realistically one every couple of weeks, since they are a lot of work. For my next guide I am thinking of building a financial simulation that will tell you how much money you ‘might’ have when you retire. But if you would like to see something else, or if you have questions about what I did here, please post in the comments or email me @ jd [at] carbonlab [dot] io.