NBA Monte Carlo Game Simulator

Using real game data to simulate the results of a game between two NBA teams

Introduction

Let’s start with why I’m doing this. I’ve always enjoyed articles that show all the steps in a project. From idea all the way through the final project. So, I wanted to create a guide along those lines. This is that guide. I’m going to walk you, step by step, through how I approached this project. Basically its like a real-time look at creating the project. And, if you make it to the end, I’ll even talk about some enhancements that can be made to improve it. That’s it for the intro. Let’s get going!

The Idea

First some truth. I love me some sports. Not as much as my dad, mind you. He’ll turn on bowling or tennis or even (shudders involuntarily) golf. I stick to the major US sports: MLB, NFL, NBA, some NHL. But my first love has always been the NBA. So, I decide I want to create a end-to-end guide, doing something with NBA data.

Now that I know the general direction, I need to come up with something I want to explore. I thought about doing shot charts, or designing a new advanced stat. I even thought about doing some similarity scores for players/teams (which may happen in the future). But as I thought about it I decided there was something I have always really wanted to do. I want to simulate game matchups.

Okay now that I know that I need to figure out exactly where I want to go with the project. Do I just want to build a notebook to run the simulations, or do I want something more interactive? In the end I decide something interactive and more user friendly is the way to go. I want to design a simple web app/dashboard that can be used to simulate the outcome of a single game.

Roadmap

Now that I know what I want to do I need to figure out the roadmap to get there. I approach this first by making a layout of the steps that are needed for the project to work. For example this project’s step layout is pretty straightforward.

  1. Collect Data
  2. Wrangle Data
  3. Simulate Game
  4. Display Results

This gives me a roadmap of pieces or modules that I need to build out. As I look at the project layout I see that I am probably going to want to split the functionality into two files. One would be in charge of the data, collecting it, then wrangling it, and then saving it. The second file will be in charge of taking user inputs, simulating the matchup, and then showing the results.

Tools

The next step is to figure out what tools I want to use. My python experience is much > than my R experience. So python is the obvious choice. I know that I want to display the results visually, so I decide to use Flask/Dash/Plotly. I also want to save the data once I collect it. I could just save it to a .csv, or pickle the dataframes. But I want something more ‘elegant’ so I’ll go with sqlite as a simple db back end.

Now I need something to collect the data I need. I could just use a web-scraper like beautiful soup, but again this seems less than optimal. So I head on over to github and find an excellent project from [Swar Patel called nba-api]. This package allows me to query the nba-api for any data I want. And the package has great documentation, including a couple of notebooks with examples. It will be perfect for what I want to do.

Simulation

This is where I run into my first roadblock. I know I want to simulate games, but I’m not really sure how I want to go about it. So I head to google and look for some ideas. One of the most popular ways to simulate games is by using Elo ratings. [Nate Silver at Fivethirtyeight] took ratings that had been used for chess and applied them to other sports. This seemed interesting on first blush, but wasn’t really what I wanted.

As I read about Elo it got me thinking about what I really wanted to do. I wanted to build an engine that would look at past data and then use that to predict outcomes. In a word I want a [Monte Carlo Simulation].

This leads me to need the following data:

  • Game scores
  • Pace (possessions/game)

I can use this to calculate a normal distribution of points/possession and possession/game. I can then sample these x times to calculate the winning percentages. This gives me a baseline for my simulation. Later on, when I build my model, I’ll go more in depth into how this all fits together.

The build

At this point I feel pretty confident that I know where I’m going and have a general idea of how to get there. So I decide its time to start writing some code. The code is all posted on my github if you just want to check it out. But I recommend you read through the rest of this so you can see how I build up the codebase and make everything work together.

1 — Gathering the data

So first off I need to get the data for the project. As I mentioned before we’ll be using the [nba-api package from Swar Patel].

First we need to install the package. The easiest way to do that is to use pip:

pip install nba_api

Now this won’t be an in depth tutorial on how to use the nba-api. As I said earlier Swar’s project has excellent documentation on how to use its functionality. So I won’t get into it.

First we need to do some imports.

Imports

from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd

The first two imports are for endpoints. The leaguegamefinder endpoint allows us to get game info such as teams, scores, and game_id. The boxscoreadvancedv2 endpoint has a ton of data. But what we want most is the pace data for each game. The SeasonType parameter allows us to get only regular season games (which is what I plan for my first go round). Finally we need `pandas` because the data we will be getting back will be as data frames.

Now that we have our imports lets build some functions that lets us collect all the data and then combine the data into a single data frame.

The first part of the get_game_data function will use the leaguegamefinder endpoint to collect the basic game data. This endpoint is quick because it queries the season at once.

def get_game_data(season, league_id='00'):
    '''Query NBA stats via the nba-api
    
    Parameters
    ----------
    season: format is yyyy-yy ie. 2020-21
    league_id = '00' => NBA
    Returns
    -------
    result: list of 2 DataFrames
    '''
    #first we need to get all of the basic game data
    gamefinder = leaguegamefinder.LeagueGameFinder(league_id_nullable=league_id,
                                                   season_nullable=season,
                                                   season_type_nullable=SeasonType.regular)
    
    game_basic_df = gamefinder.get_data_frames()[0]

The second half of the function will get the advanced data we need, which at this point is really just the pace. First we need a few more imports time, random, and IPython.display. So our imports will look like this:

from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
import time
import random as rand
from IPython.display import clear_output

Then we’ll use the boxscoreadvancedv2 endpoint. This is much slower because it has to be queried for each game. So we’ll update the get_game_data function like this:

def get_game_data(season, league_id='00', test=False):
    '''Query NBA stats via the nba-api
    
    Parameters
    ----------
    season: format is yyyy-yy ie. 2020-21
    league_id = '00' => NBA
    Returns
    -------
    result: dict of 2 DataFrames {"games_basic": , "games_adv": }
    '''
    #first we need to get all of the basic game data
    gamefinder = leaguegamefinder.LeagueGameFinder(league_id_nullable=league_id,
                                                   season_nullable=season,
                                                   season_type_nullable=SeasonType.regular)
    
    games_basic_df = gamefinder.get_data_frames()[0]
    
    #then we need to get the advanced game data. Slow, game by game.
    #we need to get a list of the games
    game_ids = list(games_basic_df['GAME_ID'].unique())
    
    games_adv_df = pd.DataFrame()
    if test:
        c = 0 # this is here for testing only
    
    #now we iterate over the list of game_ids
    while len(game_ids) > 0:
        for i in game_ids:
            cooldown = rand.gammavariate(alpha=9, beta=0.4)
            clear_output(wait=True)

            for attempt in range(10):
                try:
                    time.sleep(cooldown)
                    data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
                                                                 end_range=0,
                                                                 game_id=i,
                                                                 range_type=0,
                                                                 start_period=1,
                                                                 start_range=0)
                except:
                    time.sleep(cooldown)
                    data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
                                                                 end_range=0,
                                                                 game_id=i,
                                                                 range_type=0,
                                                                 start_period=1,
                                                                 start_range=0)
                else:
                    break
            else:
                print('Connection Error')
                break
            data_df = data.get_data_frames()[1]

            if games_adv_df.empty:
                games_adv_df = data_df
            else:
                games_adv_df = games_adv_df.append(data_df, ignore_index=True)

            game_ids.remove(i)
            print (i , "completed", len(game_ids), "games left", sep="---")    

            if test:
                c += 1
                if c > 3: #again for testing only
                    games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
                    print('DONE!')
                    return games_df_dict
    games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
    print('DONE!')
    return games_df_dict

A couple notes on the above function:

  1. First off note the new parameter for test=False. This allows us to flag for testing the function. If we set test=True then it will only loop through the `boxscoreadvancedv2` three times. This allows simple testing of the function without waiting for all of the games to be accessed.
  2. Notice the while len(game_ids_ > 0 and the for attempt in range(10):. This is are part of a retry loop if there is a connection error.
    3. We also have the lines cooldown = rand.gammavariate(alpha=9, beta=0.4) and time.sleep(cooldown). This is a random timer that pauses on each connection attempt. The NBA stats api has a limiting feature that will lock your ip out, if you make too many requests in too short of a time.

We can then call the function like so:

games_df = get_game_data(season='2020-21')

This will give us a dict with two entries for the two dataframes. We can see the details on each dataframe using .info(). For the first (basic) dataframe we use:

games['games_basic'].info()

Which gives us the following:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2084 entries, 0 to 2083
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SEASON\_ID         2084 non-null   object 
 1   TEAM\_ID           2084 non-null   int64  
 2   TEAM\_ABBREVIATION 2084 non-null   object 
 3   TEAM\_NAME         2084 non-null   object 
 4   GAME\_ID           2084 non-null   object 
 5   GAME\_DATE         2084 non-null   object 
 6   MATCHUP            2084 non-null   object 
 7   WL                 2084 non-null   object 
 8   MIN                2084 non-null   int64  
 9   PTS                2084 non-null   int64  
 10  FGM                2084 non-null   int64  
 11  FGA                2084 non-null   int64  
 12  FG\_PCT            2084 non-null   float64
 13  FG3M               2084 non-null   int64  
 14  FG3A               2084 non-null   int64  
 15  FG3\_PCT           2084 non-null   float64
 16  FTM                2084 non-null   int64  
 17  FTA                2084 non-null   int64  
 18  FT\_PCT            2084 non-null   float64
 19  OREB               2084 non-null   int64  
 20  DREB               2084 non-null   int64  
 21  REB                2084 non-null   int64  
 22  AST                2084 non-null   int64  
 23  STL                2084 non-null   int64  
 24  BLK                2084 non-null   int64  
 25  TOV                2084 non-null   int64  
 26  PF                 2084 non-null   int64  
 27  PLUS\_MINUS        2084 non-null   float64
dtypes: float64(4), int64(17), object(7)
memory usage: 456.0+ KB

And for the second (advanced) dataframe we use:

games['games_adv'].info()

Which gives us:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 29 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   GAME_ID            8 non-null      object 
 1   TEAM_ID            8 non-null      int64  
 2   TEAM_NAME          8 non-null      object 
 3   TEAM_ABBREVIATION  8 non-null      object 
 4   TEAM_CITY          8 non-null      object 
 5   MIN                8 non-null      object 
 6   E_OFF_RATING       8 non-null      float64
 7   OFF_RATING         8 non-null      float64
 8   E_DEF_RATING       8 non-null      float64
 9   DEF_RATING         8 non-null      float64
 10  E_NET_RATING       8 non-null      float64
 11  NET_RATING         8 non-null      float64
 12  AST_PCT            8 non-null      float64
 13  AST_TOV            8 non-null      float64
 14  AST_RATIO          8 non-null      float64
 15  OREB_PCT           8 non-null      float64
 16  DREB_PCT           8 non-null      float64
 17  REB_PCT            8 non-null      float64
 18  E_TM_TOV_PCT       8 non-null      float64
 19  TM_TOV_PCT         8 non-null      float64
 20  EFG_PCT            8 non-null      float64
 21  TS_PCT             8 non-null      float64
 22  USG_PCT            8 non-null      float64
 23  E_USG_PCT          8 non-null      float64
 24  E_PACE             8 non-null      float64
 25  PACE               8 non-null      float64
 26  PACE_PER40         8 non-null      float64
 27  POSS               8 non-null      int64  
 28  PIE                8 non-null      float64
dtypes: float64(22), int64(2), object(5)
memory usage: 1.9+ KB

That’s it for the data gathering. So if you need a break you can set this to run and then come back in a couple hours for the next section.

2 — Data Wrangling

Now that we have all our data collected we need to get into the format we need for the simulator. We’re also going to set up sqlite so we can persist our data.

First let’s take a closer look at our dataframes. First let’s run this:

print(games['games_basic'])

This will give us the following:

SEASON_ID     TEAM_ID TEAM_ABBREVIATION               TEAM_NAME  \
0        22020  1610612746               LAC             LA Clippers   
1        22020  1610612748               MIA              Miami Heat   
2        22020  1610612741               CHI           Chicago Bulls   
3        22020  1610612755               PHI      Philadelphia 76ers   
4        22020  1610612750               MIN  Minnesota Timberwolves   
...        ...         ...               ...                     ...   
2097     22020  1610612758               SAC        Sacramento Kings   
2098     22020  1610612744               GSW   Golden State Warriors   
2099     22020  1610612747               LAL      Los Angeles Lakers   
2100     22020  1610612746               LAC             LA Clippers   
2101     22020  1610612751               BKN           Brooklyn Nets   

         GAME_ID   GAME_DATE      MATCHUP WL  MIN  PTS  ...  FT_PCT  OREB  \
0     0022001047  2021-05-13    LAC @ CHA  W  239  113  ...   0.750     6   
1     0022001050  2021-05-13  MIA vs. PHI  W  241  106  ...   1.000    15   
2     0022001051  2021-05-13  CHI vs. TOR  W  240  114  ...   0.700     7   
3     0022001050  2021-05-13    PHI @ MIA  L  238   94  ...   0.750     8   
4     0022001052  2021-05-13  MIN vs. DEN  L  240  103  ...   0.594    11   
...          ...         ...          ... ..  ...  ...  ...     ...   ...   
2097  0022000019  2020-12-23    SAC @ DEN  W  265  124  ...   0.774    18   
2098  0022000001  2020-12-22    GSW @ BKN  L  240   99  ...   0.652    13   
2099  0022000002  2020-12-22  LAL vs. LAC  L  240  109  ...   0.774     8   
2100  0022000002  2020-12-22    LAC @ LAL  W  241  116  ...   0.737    11   
2101  0022000001  2020-12-22  BKN vs. GSW  W  242  125  ...   0.813    13   

      DREB  REB  AST  STL  BLK  TOV  PF  PLUS_MINUS  
0       38   44   28    6    6   17  17        23.0  
1       32   47   26    5    6   19  15        12.0  
2       43   50   31    4    4   17  15        12.0  
3       22   30   20    8    1   12  18       -12.0  
4       32   43   20    3    5   15  17       -11.0  
...    ...  ...  ...  ...  ...  ...  ..         ...  
2097    36   54   27    5    7   14  29         2.0  
2098    34   47   26    6    6   18  24       -26.0  
2099    37   45   22    4    2   19  20        -7.0  
2100    29   40   22   10    3   16  29         7.0  
2101    44   57   24   11    7   20  22        26.0  

[2102 rows x 28 columns]

Now lets take the first GAME_ID and use it to filter the dataframe. Like so:

print(games['games_basic'].loc[(games['games_basic']['GAME_ID'] == '0022001047')])

That will give us this:

SEASON_ID     TEAM_ID TEAM_ABBREVIATION          TEAM_NAME     GAME_ID  \
0     22020  1610612746               LAC        LA Clippers  0022001047   
6     22020  1610612766               CHA  Charlotte Hornets  0022001047   

    GAME_DATE      MATCHUP WL  MIN  PTS  ...  FT_PCT  OREB  DREB  REB  AST  \
0  2021-05-13    LAC @ CHA  W  239  113  ...   0.750     6    38   44   28   
6  2021-05-13  CHA vs. LAC  L  239   90  ...   0.789     8    27   35   24   

   STL  BLK  TOV  PF  PLUS_MINUS  
0    6    6   17  17        23.0  
6    9    1   10  15       -23.0  

[2 rows x 28 columns]

So that means we have two records with the same GAME_ID, which makes sense when you see how the data is laid out. There is a row for each team in the game. One row corresponds to the home team, the other to the away team. If we did the same thing for the [‘games_adv’] dataframe we would see the same structure. This illustrates a couple things for us:

  1. It’s important to examine data thoroughly so you know how its structured. Don’t just make assumptions.
  2. This structure isn’t great for our simulator. It will be much easier if we can somehow merge these rows into one row with both teams results in it.

That second point could be a pain, but luckily for us Swar Patel has saved the day for us again. In his [examples] he recognizes the fact that this is something that users would want to do. So he wrote a function to take care of it.

import pandas as pd

def combine_team_games(df, keep_method='home'):
    '''Combine a TEAM_ID-GAME_ID unique table into rows by game. Slow.
        Parameters
        ----------
        df : Input DataFrame.
        keep_method : {'home', 'away', 'winner', 'loser', ``None``}, default 'home'
            - 'home' : Keep rows where TEAM_A is the home team.
            - 'away' : Keep rows where TEAM_A is the away team.
            - 'winner' : Keep rows where TEAM_A is the losing team.
            - 'loser' : Keep rows where TEAM_A is the winning team.
            - ``None`` : Keep all rows. Will result in an output DataFrame the same
                length as the input DataFrame.
                
        Returns
        -------
        result : DataFrame
    '''
    # Join every row to all others with the same game ID.
    joined = pd.merge(df, df, suffixes=['_A', '_B'],
                      on=['SEASON_ID', 'GAME_ID', 'GAME_DATE'])
    # Filter out any row that is joined to itself.
    result = joined[joined.TEAM_ID_A != joined.TEAM_ID_B]
    # Take action based on the keep_method flag.
    if keep_method is None:
        # Return all the rows.
        pass
    elif keep_method.lower() == 'home':
        # Keep rows where TEAM_A is the home team.
        result = result[result.MATCHUP_A.str.contains(' vs. ')]
    elif keep_method.lower() == 'away':
        # Keep rows where TEAM_A is the away team.
        result = result[result.MATCHUP_A.str.contains(' @ ')]
    elif keep_method.lower() == 'winner':
        result = result[result.WL_A == 'W']
    elif keep_method.lower() == 'loser':
        result = result[result.WL_A == 'L']
    else:
        raise ValueError(f'Invalid keep_method: {keep_method}')
    return result

This is going to make our job much easier. But we still have a little bit of work to do. Mostly because we are using two dataframes, instead of one. So there are a couple of ways we can approach this. One is to just do it piece by piece, joining, filtering, etc. Or we can write some functions. I like functions so I decide to go that way.

So we’re going to write one function. It’s going to take our two dataframes, join them and then combine rows. When I’m writing something like this I like to build it in pieces, testing along the way. So first we’ll write the join and see how it goes. Here’s the join portion:

def prep_games(dictionary, left_key, right_key, left, right):
    '''join two dataframes
    
    Paramaters
    ----------
    dictionary: Input dict
    left_key: dictionary key for left dataframe
    right_key: dictionary key for right dataframe
    left: list of columns for left join
    right: list of columns for right join
    
    Returns
    -------
    result: DataFrame
    '''
    #join the two DataFrames using the paramaters
    joined = dictionary[left_key].merge(dictionary[right_key],
                                 left_on=left,
                                 right_on=right)
    return joined

If we then use it as so:

preped_game_df = prep_games(games,
                            left_key='games_basic',
                            right_key='games_adv',
                            left=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'], 
                            right=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'])

And then print the dataframe we get:

SEASON_ID     TEAM_ID TEAM_ABBREVIATION             TEAM_NAME_x  \
0        22020  1610612746               LAC             LA Clippers   
1        22020  1610612748               MIA              Miami Heat   
2        22020  1610612741               CHI           Chicago Bulls   
3        22020  1610612755               PHI      Philadelphia 76ers   
4        22020  1610612750               MIN  Minnesota Timberwolves   
...        ...         ...               ...                     ...   
2097     22020  1610612758               SAC        Sacramento Kings   
2098     22020  1610612744               GSW   Golden State Warriors   
2099     22020  1610612747               LAL      Los Angeles Lakers   
2100     22020  1610612746               LAC             LA Clippers   
2101     22020  1610612751               BKN           Brooklyn Nets   

         GAME_ID   GAME_DATE      MATCHUP WL  MIN_x  PTS  ...  TM_TOV_PCT  \
0     0022001047  2021-05-13    LAC @ CHA  W    239  113  ...        17.9   
1     0022001050  2021-05-13  MIA vs. PHI  W    241  106  ...        22.5   
2     0022001051  2021-05-13  CHI vs. TOR  W    240  114  ...        18.0   
3     0022001050  2021-05-13    PHI @ MIA  L    238   94  ...        14.4   
4     0022001052  2021-05-13  MIN vs. DEN  L    240  103  ...        15.3   
...          ...         ...          ... ..    ...  ...  ...         ...   
2097  0022000019  2020-12-23    SAC @ DEN  W    265  124  ...        13.5   
2098  0022000001  2020-12-22    GSW @ BKN  L    240   99  ...        16.1   
2099  0022000002  2020-12-22  LAL vs. LAC  L    240  109  ...        18.3   
2100  0022000002  2020-12-22    LAC @ LAL  W    241  116  ...        15.4   
2101  0022000001  2020-12-22  BKN vs. GSW  W    242  125  ...        17.9   

      EFG_PCT  TS_PCT  USG_PCT  E_USG_PCT  E_PACE    PACE  PACE_PER40  POSS  \
0       0.673   0.689      1.0      0.200   94.70   94.00       78.33    95   
1       0.586   0.617      1.0      0.197   91.32   89.50       74.58    89   
2       0.645   0.652      1.0      0.199  100.16   99.00       82.50   100   
3       0.506   0.541      1.0      0.194   91.32   89.50       74.58    90   
4       0.525   0.547      1.0      0.200   99.46   98.50       82.08    98   
...       ...     ...      ...        ...     ...     ...         ...   ...   
2097    0.490   0.536      1.0      0.197  102.43  100.98       84.15   111   
2098    0.424   0.454      1.0      0.201  113.60  112.00       93.33   112   
2099    0.525   0.576      1.0      0.199  106.00  104.00       86.67   104   
2100    0.548   0.572      1.0      0.198  106.00  104.00       86.67   104   
2101    0.538   0.589      1.0      0.201  113.60  112.00       93.33   112   

        PIE  
0     0.619  
1     0.597  
2     0.584  
3     0.403  
4     0.411  
...     ...  
2097  0.473  
2098  0.350  
2099  0.509  
2100  0.491  
2101  0.650  

[2102 rows x 54 columns]

You can see that we have joined our dataframes. Now, a quick note before we go on. There are quite a few columns that are redundant or unneeded for what we are doing here. So we could clear them out. But, seeing as how we went through all the work to collect this, I want to keep them. In case I need them in the future.

Now let’s get back to the second half of our function, which will basically use what Swar has already written for us. I’m only going to make some minor changes for what we’re doing. Our final function will look like this:

def prep_games(dictionary, left_key, right_key, left, right):
    '''join two dataframes
    
    Paramaters
    ----------
    dictionary: Input dict
    left_key: dictionary key for left dataframe
    right_key: dictionary key for right dataframe
    left: list of columns for left join
    right: list of columns for right join
    
    Returns
    -------
    result: DataFrame
    '''
    #join the two DataFrames using the paramaters
    joined = dictionary[left_key].merge(dictionary[right_key],
                                 left_on=left,
                                 right_on=right)
     # Join every row to all others with the same game ID.
    joined = pd.merge(joined, joined, suffixes=['_HOME', '_AWAY'],
                      on=['SEASON_ID', 'GAME_ID', 'GAME_DATE'])
    # Filter out any row that is joined to itself.
    result = joined[joined.TEAM_ID_HOME != joined.TEAM_ID_AWAY]
    result = result[result.MATCHUP_HOME.str.contains(' vs. ')]
    #lowercase the columnn names for easier access
    result.columns = result.columns.str.lower() 
    return result

If we use the function and print the result again we will get something that looks like this:

SEASON_ID  TEAM_ID_HOME TEAM_ABBREVIATION_HOME        TEAM_NAME_x_HOME  \
2        22020    1610612766                    CHA       Charlotte Hornets   
5        22020    1610612748                    MIA              Miami Heat   
9        22020    1610612741                    CHI           Chicago Bulls   
13       22020    1610612750                    MIN  Minnesota Timberwolves   
18       22020    1610612752                    NYK         New York Knicks   
...        ...           ...                    ...                     ...   
4185     22020    1610612739                    CLE     Cleveland Cavaliers   
4190     22020    1610612741                    CHI           Chicago Bulls   
4193     22020    1610612743                    DEN          Denver Nuggets   
4198     22020    1610612751                    BKN           Brooklyn Nets   
4201     22020    1610612747                    LAL      Los Angeles Lakers   

         GAME_ID   GAME_DATE MATCHUP_HOME WL_HOME  MIN_x_HOME  PTS_HOME  ...  \
2     0022001047  2021-05-13  CHA vs. LAC       L         239        90  ...   
5     0022001050  2021-05-13  MIA vs. PHI       W         241       106  ...   
9     0022001051  2021-05-13  CHI vs. TOR       W         240       114  ...   
13    0022001052  2021-05-13  MIN vs. DEN       L         240       103  ...   
18    0022000458  2021-05-13  NYK vs. SAS       W         240       102  ...   
...          ...         ...          ...     ...         ...       ...  ...   
4185  0022000010  2020-12-23  CLE vs. CHA       W         241       121  ...   
4190  0022000015  2020-12-23  CHI vs. ATL       L         240       104  ...   
4193  0022000019  2020-12-23  DEN vs. SAC       L         265       122  ...   
4198  0022000001  2020-12-22  BKN vs. GSW       W         242       125  ...   
4201  0022000002  2020-12-22  LAL vs. LAC       L         240       109  ...   

      TM_TOV_PCT_AWAY  EFG_PCT_AWAY  TS_PCT_AWAY  USG_PCT_AWAY  \
2                17.9         0.673        0.689           1.0   
5                14.4         0.506        0.541           1.0   
9                11.2         0.511        0.521           1.0   
13               14.1         0.542        0.565           1.0   
18               11.6         0.451        0.493           1.0   
...               ...           ...          ...           ...   
4185             14.7         0.589        0.590           1.0   
4190             21.7         0.625        0.668           1.0   
4193             13.5         0.490        0.536           1.0   
4198             16.1         0.424        0.454           1.0   
4201             15.4         0.548        0.572           1.0   

      E_USG_PCT_AWAY  E_PACE_AWAY  PACE_AWAY  PACE_PER40_AWAY  POSS_AWAY  \
2              0.200        94.70      94.00            78.33         95   
5              0.194        91.32      89.50            74.58         90   
9              0.195       100.16      99.00            82.50         98   
13             0.195        99.46      98.50            82.08         99   
18             0.192        97.64      94.50            78.75         95   
...              ...          ...        ...              ...        ...   
4185           0.197       105.20     102.00            85.00        102   
4190           0.194       108.04     106.50            88.75        106   
4193           0.197       102.43     100.98            84.15        111   
4198           0.201       113.60     112.00            93.33        112   
4201           0.198       106.00     104.00            86.67        104   

      PIE_AWAY  
2        0.619  
5        0.403  
9        0.416  
13       0.589  
18       0.466  
...        ...  
4185     0.443  
4190     0.581  
4193     0.473  
4198     0.350  
4201     0.491  

[1051 rows x 105 columns]

We can even check the [‘GAME_ID’] column for duplicates in the following way:

are_dupes = not preped_game_df["GAME_ID"].is_unique
print(are_dupes)

If we get False then there are no duplicates.

As I look through my data I realize that there’s a couple columns that I want to have. I want home/away points per possession. Now I could just calculate these on the fly when I run my simulations, but I’d like to keep the calculations inside the sim to a minimum, so I can streamline it. So I want to take care of it now. So let’s write a quick function that takes care of the calculations for us:

def calc_points_per_possession(df):
    '''Calculate the home/away points per possesion and update the dataframe'''
    
    df['points_per_possession_home'] = df['pts_home'] / df['pace_home']
    df['points_per_possession_away'] = df['pts_away'] / df['pace_home']
    
    return df

The reason we can use pace_home for both calculations is because in any one game both teams play at the same pace.

We then call the function:

preped_game_df = calc_points_per_possession(preped_game_df)

Then if we look at the updated dataframe we will see:

season_id  team_id_home team_abbreviation_home        team_name_x_home  \
2        22020    1610612766                    CHA       Charlotte Hornets   
5        22020    1610612748                    MIA              Miami Heat   
9        22020    1610612741                    CHI           Chicago Bulls   
13       22020    1610612750                    MIN  Minnesota Timberwolves   
18       22020    1610612752                    NYK         New York Knicks   
...        ...           ...                    ...                     ...   
4185     22020    1610612739                    CLE     Cleveland Cavaliers   
4190     22020    1610612741                    CHI           Chicago Bulls   
4193     22020    1610612743                    DEN          Denver Nuggets   
4198     22020    1610612751                    BKN           Brooklyn Nets   
4201     22020    1610612747                    LAL      Los Angeles Lakers   

         game_id   game_date matchup_home wl_home  min_x_home  pts_home  ...  \
2     0022001047  2021-05-13  CHA vs. LAC       L         239        90  ...   
5     0022001050  2021-05-13  MIA vs. PHI       W         241       106  ...   
9     0022001051  2021-05-13  CHI vs. TOR       W         240       114  ...   
13    0022001052  2021-05-13  MIN vs. DEN       L         240       103  ...   
18    0022000458  2021-05-13  NYK vs. SAS       W         240       102  ...   
...          ...         ...          ...     ...         ...       ...  ...   
4185  0022000010  2020-12-23  CLE vs. CHA       W         241       121  ...   
4190  0022000015  2020-12-23  CHI vs. ATL       L         240       104  ...   
4193  0022000019  2020-12-23  DEN vs. SAC       L         265       122  ...   
4198  0022000001  2020-12-22  BKN vs. GSW       W         242       125  ...   
4201  0022000002  2020-12-22  LAL vs. LAC       L         240       109  ...   

      ts_pct_away  usg_pct_away  e_usg_pct_away  e_pace_away  pace_away  \
2           0.689           1.0           0.200        94.70      94.00   
5           0.541           1.0           0.194        91.32      89.50   
9           0.521           1.0           0.195       100.16      99.00   
13          0.565           1.0           0.195        99.46      98.50   
18          0.493           1.0           0.192        97.64      94.50   
...           ...           ...             ...          ...        ...   
4185        0.590           1.0           0.197       105.20     102.00   
4190        0.668           1.0           0.194       108.04     106.50   
4193        0.536           1.0           0.197       102.43     100.98   
4198        0.454           1.0           0.201       113.60     112.00   
4201        0.572           1.0           0.198       106.00     104.00   

      pace_per40_away  poss_away  pie_away  points_per_possession_home  \
2               78.33         95     0.619                    0.957447   
5               74.58         90     0.403                    1.184358   
9               82.50         98     0.416                    1.151515   
13              82.08         99     0.589                    1.045685   
18              78.75         95     0.466                    1.079365   
...               ...        ...       ...                         ...   
4185            85.00        102     0.443                    1.186275   
4190            88.75        106     0.581                    0.976526   
4193            84.15        111     0.473                    1.208160   
4198            93.33        112     0.350                    1.116071   
4201            86.67        104     0.491                    1.048077   

      points_per_possession_away  
2                       1.202128  
5                       1.050279  
9                       1.030303  
13                      1.157360  
18                      1.037037  
...                          ...  
4185                    1.117647  
4190                    1.164319  
4193                    1.227966  
4198                    0.883929  
4201                    1.115385  

[1051 rows x 107 columns]

Now that we have our data we probably want somewhere to store it so we don’t have to spend hours re-polling all the data from API each time we spin things up. As I mentioned earlier there are a few options:

  1. .csv files
  2. pickle
  3. sqlite
  4. MySql or PostgreSQL
  5. Other DB tech, ie. MongoDB

I decided to go with sqllite for a few reasons. While .csv files and pickle are easy to use they don’t offer much flexibility for data retrieval. Basically you get out what you put in. A db offers us the flexibility to only grab the data we need for our calculations. Also I have a ton of experience with SQL, so that’s a plus. I chose sqlite over the other DBs because it’s super simple, doesn’t need a service to run, so it just works.

So now we’re going to write a super simple function to save our dataframe to a sqlite database. We could use an ORM like SQLAlchemy for this, but I think that for something this simple it’s overkill.

First we need to update our imports to include sqlite:

from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql

Now let’s write our function. This one’s short and sweet:

def save_df_to_db(df, db_name, table_name, exists='replace'):
    '''Saves a dataframe to a table in a sqlite database
    
    Paramaters
    ----------
    df: DataFrame to save to table
    db_name: name of database
    table_name: name of table
    exists: what to do if table exists, default='replace'
    
    Return
    ------
    return:
    '''
    conn = None
    try:
        conn = sql.connect(db_name)
        df.to_sql(table_name, conn, if_exists=exists)
    except Exception as e:
        print(e)
    finally:
        if conn:
            conn.close()

A few notes on this. We’re using some features of sqlite that make this super easy. Sqlite will automatically create the database if it doesen’t exist, and the same with the table. So we don’t have to create either specifically. This is great for a rapid dev project like this. The if_exists flag tells sqlite what to do if the table exists. The default=replace will drop the table and replace it with the current data.

We can call our function like this:

save_df_to_db(preped_game_df, 'nba_data.db', 'games')

Now there’s a couple ways we could see if what we did worked. One would be to query the database and see if the data is there (which we’ll obviously be doing later). Or we could use a tool. My weapon of choice for this is [DB Browser for SQLite]. It’s super simple to install and then you can see if your new table exists and if its full of data. I won’t go over using it here, but you can check it out and use it.

Wow, that was a lot of work. But now we have all the data set up and ready to go. You might want to take a break and get something to drink (my current poison is MTN DEW Major Melon Zero). Then come back so we can get into the real guts of this, the simulator.

3 — Simulation

Hopefully you’re fully rejuvenated now and ready for the next part of our project: the simulation engine. There’s a lot going on here so first I’m going to start with an overview of what we are doing. First let’s talk about Monte Carlo, and I’m not talking about the city.

There’s lots of good resources on what a Monte Carlo simulation is, but I find this one on[Towards Data Science] to be a good primer. Basically the idea is that we take a number of known occurrences of a variable, then we use those knowns to create a probability distribution. We can then randomly sample the distribution to predict an outcome.

So for our simulator we could take the number of points a team scores per game and use those results to create a normal distribution. We could then sample that distribution for a score, thus ‘simulating’ the score for a game.

As you may have noticed I said ‘normal distribution’ in the preceding paragraph. Technically we could use any type of distribution (normal, gaussian, uniform, etc.) but we’re going to stick to a normal/gaussian distribution (ie. the bell curve) for our simulation for simplicity.

Now that that’s out of the way we’re ready to start building our simulation. But first we’re going to need get our data back out of our database and put it into a dataframe. So let’s write a quick function for that:

def get_df_from_db(db_name, table_name, select=['*'],
                   order_by='', descending=False):
    '''Runs a select from a database and returns a DataFrame
    
    Paramaters
    ----------
    db_name: name of database
    table_name: name of table
    select: list of columns to select, default = ['*']
    order_by: column name to order by
    descending: if order by is to be set as ascending
    
    Returns
    -------
    result: DataFrame
    '''
    select_string = ', '.join(select)
    order_by_string = ''
    if order_by:
        order_by_string = " ORDER BY " + order_by
        if descending:
            order_by_string = order_by_string + " DESC"
    
    query_string = "SELECT " + select_string + " FROM " + table_name + order_by_string
    conn = None
    try:
        conn = sql.connect(db_name)
        df = pd.read_sql_query(query_string , conn)
    except Exception as e:
        print(e)
        return
    finally:
        if conn:
            conn.close
            
    return df

Now we can call this function like this:

games_df = get_df_from_db('nba_data.db','games', order_by='game_date')

Which will give us all the columns. Or we can just get the columns we want — team_abbreviation_home, team_abbreviation_away, points_per_possession_home, points_per_possession_away, pace_home). Like this:

games_df = get_df_from_db('nba_data.db','games',
                          select=['team_abbreviation_home', 'team_abbreviation_away',
                                  'points_per_possession_home', 'points_per_possession_away',
                                  'pace_home'],
                          order_by='game_date')

Now that we have our data it’s time to get to work on our simulation engine. Now, I know this is going to be a more complicated function. So I want to map out the steps of how it works. Mostly I want to do this because I want the minimum number of calculations within the simulation loop itself. If I want to run 10k simulations, I want them to run as efficiently as possible.

These are the steps for how the simulation engine will work:

  1. Filter dataframe into teams we want to simulate
  2. Calculate standard deviation and mean for each statistic
  3. Sample distributions for each statistic
  4. Determine winner of simulation

From this I can see that I want the sampling and picking a winner inside the loop but I don’t want step 1 or 2 inside the loop. This helps me figure out the structure for my simulation.

Now that I have a basic layout I can start putting the pieces together. First step is basic function shape and filtering the dataframes (I set the default teams as PHX and LAL, since I’m a Suns fan and hate the Lakers):

def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
    '''This will run x number of game simulations'''
    
    home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
    away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]

Now we need to approach gathering all of the standard deviations and means for the calculation. Originally I was going to just put all the calculations into the function, but that gets messy. So instead I’m going to write a function that calculates the standard deviation and mean for a column of a dataframe. Then I can just pass in the columns to get the data.

First we need to update our imports with the stats package:

from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
import statistics as stats

Then we can write our function:

def calc_std_dev_and_mean(df_column):
    '''calculates the standard deviation and mean of a DataFrame column, return dict'''
    
	std_dev = df_column.std()
    mean = df_column.mean()
    stats_dict = {'std_dev': std_dev, 'mean': mean}
    return stats_dict

Now that I have a simple helper function to get standard deviation and mean I’m going to build another function that will use that to calculate all the stats for a team. I’m going to pass a team dataframe to it and it will build it into a dictionary with all the stats needed for the simulation. That function looks like this:

def get_team_stats(df, home=True):
    '''gets all the needed stats for a team for the simulator, return dict'''
    
    pace = calc_std_dev_and_mean(df['pace_home'])
    home_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_home'])
    away_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_away'])
    if home:
        stats_dict = {"pace": pace, 
                      "points_scored": home_pts_per_possession,
                      "points_given_up": away_pts_per_possession}
    else:
        stats_dict = {"pace": pace,
                      "points_scored": away_pts_per_possession,
                      "points_given_up": home_pts_per_possession}
        
    return stats_dict

Now we can update our original function to:

def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
    '''This will run x number of game simulations'''
 
    home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
    away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
    home_stats = get_team_stats(home_df)
    away_stats = get_team_stats(away_df)

Now we’re on the home stretch. We now need to write a function that will take our two dictionaries of statistics and uses those to simulate a game and choose a winner. I’ll put it here then we can talk about how it works after:

def simulate_game(home_stats, away_stats):
    '''uses stats from home and away teams to simulate a single game'''
    home_pace = rand.gauss(home_stats['pace']['mean'],
                           home_stats['pace']['std_dev'])
    home_pts_scored = rand.gauss(home_stats['points_scored']['mean'],
                                 home_stats['points_scored']['std_dev'])
    home_pts_given_up = rand.gauss(home_stats['points_given_up']['mean'],
                                   home_stats['points_given_up']['std_dev'])
    away_pace = rand.gauss(away_stats['pace']['mean'],
                           away_stats['pace']['std_dev'])
    away_pts_scored = rand.gauss(away_stats['points_scored']['mean'],
                                 away_stats['points_scored']['std_dev'])
    away_pts_given_up = rand.gauss(away_stats['points_given_up']['mean'],
                                   away_stats['points_given_up']['std_dev'])
    pace = round((home_pace + away_pace) / 2)
    home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)
    away_pts = round(((away_pts_scored + home_pts_given_up) / 2) * pace)
    results = {"home_pts": home_pts, "away_pts": away_pts, "pace": pace}
    return results

The first part of the function uses the .gauss() function of the random package to create a gaussian distribution based on the parameters (mean, standard deviation) and then select a random value from that distribution. We do that for the various stats we need random values for. Then we use those to calculate what happens in our simulation. We take the random value for each team and then average them for the actual simulated value. We then return the points for each team and the pace for the game.

So now I can use this to test our nba_game_simulator so far. Here’s our updated function:

def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
    '''This will run x number of game simulations'''
 
    home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
    away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
    home_stats = get_team_stats(home_df)
    away_stats = get_team_stats(away_df, home=False)
    for i in range(sims):
        game_result = simulate_game(home_stats, away_stats)
        print(game_result)

Notice that I’m just using a simple print to show the result of a single simulation. So when we call the function like so:

nba_game_simulator(games_df)

We get something like the following:

{'home_pts': 101, 'away_pts': 109, 'pace': 95}

Now all we need to do is modify the interior of our loop to keep count of the winners, like so:

def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
    '''This will run x number of game simulations'''
    #containers for counts
    home_wins = 0
    away_wins = 0
    
    home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
    away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
    home_stats = get_team_stats(home_df)
    away_stats = get_team_stats(away_df, home=False)
    for i in range(sims):
        game_result = simulate_game(home_stats, away_stats)
        if (game_result['home_pts'] > game_result['away_pts']):
            home_wins += 1
        else:
            away_wins += 1
    
    results = {"home_wins": home_wins, "away_wins": away_wins}
    return results

When I updated this I realized that there is a problem in our game simulator. We can have ties, which is impossible in an NBA game. The easiest way to remedy this is to add a while loop in our simulate_game function. This is pretty simple:

def simulate_game(home_stats, away_stats):
    '''uses stats from home and away teams to simulate a single game'''
    # set the margin to 0 for our while loop
    margin = 0
    while(margin == 0):
        home_pace = rand.gauss(home_stats['pace']['mean'],
                               home_stats['pace']['std_dev'])
        home_pts_scored = rand.gauss(home_stats['points_scored']['mean'],
                                     home_stats['points_scored']['std_dev'])
        home_pts_given_up = rand.gauss(home_stats['points_given_up']['mean'],
                                       home_stats['points_given_up']['std_dev'])
        away_pace = rand.gauss(away_stats['pace']['mean'],
                               away_stats['pace']['std_dev'])
        away_pts_scored = rand.gauss(away_stats['points_scored']['mean'],
                                     away_stats['points_scored']['std_dev'])
        away_pts_given_up = rand.gauss(away_stats['points_given_up']['mean'],
                                       away_stats['points_given_up']['std_dev'])
        pace = round((home_pace + away_pace) / 2)
        home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)
        away_pts = round(((away_pts_scored + home_pts_given_up) / 2) * pace)
        margin = home_pts - away_pts
        
    results = {"home_pts": home_pts, "away_pts": away_pts, "pace": pace}
    return results

So now we can run a simulation for 10k games for Suns v. Lakers like so, and print the results:

suns_v_lakers = nba_game_simulator(games_df, sims=10000)
suns_win_pct = round((suns_v_lakers['home_wins'] / 10000) * 100)
lakers_win_pct = 100 - suns_win_pct
print("SUNS: " + str(suns_win_pct) + "% | LAKERS: " + str(lakers_win_pct) + "%")

Which will give us something like this:

SUNS: 64% | LAKERS: 36%

Once I set this up I realized that I wanted to have some more information than just how many wins. I wanted some other info like avg points, avg margin, etc. So we need to make just a couple more changes to our nba_game_simulator and to our simulate_game functions:

def simulate_game(home_stats, away_stats):
    '''uses stats from home and away teams to simulate a single game'''
    # set the margin to 0 for our while loop
    margin = 0
    while(margin == 0):
        home_pace = rand.gauss(home_stats['pace']['mean'],
                               home_stats['pace']['std_dev'])
        home_pts_scored = rand.gauss(home_stats['points_scored']['mean'],
                                     home_stats['points_scored']['std_dev'])
        home_pts_given_up = rand.gauss(home_stats['points_given_up']['mean'],
                                       home_stats['points_given_up']['std_dev'])
        away_pace = rand.gauss(away_stats['pace']['mean'],
                               away_stats['pace']['std_dev'])
        away_pts_scored = rand.gauss(away_stats['points_scored']['mean'],
                                     away_stats['points_scored']['std_dev'])
        away_pts_given_up = rand.gauss(away_stats['points_given_up']['mean'],
                                       away_stats['points_given_up']['std_dev'])
        pace = round((home_pace + away_pace) / 2)
        home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)
        away_pts = round(((away_pts_scored + home_pts_given_up) / 2) * pace)
        margin = home_pts - away_pts
        
    results = {"home_pts": home_pts, "away_pts": away_pts, "pace": pace, "margin": margin}
    return results

def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
    '''This will run x number of game simulations'''
    #containers for counts
    home_wins = 0
    away_wins = 0
    
    #containers for stats
    home_margins = []
    away_margins = []
    home_pts = []
    away_pts = []
    
    home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
    away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
    home_stats = get_team_stats(home_df)
    away_stats = get_team_stats(away_df, home=False)
    for i in range(sims):
        game_result = simulate_game(home_stats, away_stats)
        home_pts.append(game_result['home_pts'])
        away_pts.append(game_result['away_pts'])
        if (game_result['margin'] > 0):
            home_wins += 1
            home_margins.append(game_result['margin'])
        else:
            away_wins += 1
            away_margins.append(-game_result['margin'])
    
    results = {"home_wins": home_wins, "away_wins": away_wins,
               "home_margins": home_margins, "away_margins": away_margins,
               "home_pts": home_pts, "away_pts": away_pts}
    return results

Then we can call and print the results like so:

suns_win_pct = round((suns_v_lakers['home_wins'] / 10000) * 100)
lakers_win_pct = 100 - suns_win_pct
suns_avg_margin = round(stats.mean(suns_v_lakers['home_margins']))
lakers_avg_margin = round(stats.mean(suns_v_lakers['away_margins']))
suns_avg_pts = round(stats.mean(suns_v_lakers['home_pts']))
lakers_avg_pts = round(stats.mean(suns_v_lakers['away_pts']))
print("SUNS: " + str(suns_win_pct) + "% | LAKERS: " + str(lakers_win_pct) + "%")
print("SUNS: Avg. Margin of Victory = " + str(suns_avg_margin))
print("LAKERS: Avg. Margin of Victory = " + str(lakers_avg_margin))
print("SUNS: Avg. Points = " + str(suns_avg_pts))
print("LAKERS: Avg. Points = " + str(lakers_avg_pts))

Which gives us something like this:

SUNS: 64% | LAKERS: 36%
SUNS: Avg. Margin of Victory = 10
LAKERS: Avg. Margin of Victory = 8
SUNS: Avg. Points = 111
LAKERS: Avg. Points = 107

Wow! That was a lot. This was definitely the hardest/longest part of this guide. All that we have left to do is roll this up in a simple Flask/Dash app shell and the we’ll be done. So go grab a snack and then we’ll get started on the final piece of the puzzle, putting this into a nice webapp.

4 — Dash

Hopefully you replenished your strength after that grueling section on building our simulation. Now we’re on to the last section which is to put together a web based app to show the results of our hard work.

A couple of notes here before we get started on this last section. We have some options here on how to lay out our codebase. We could just have one large file with all the functions inside it, including the flask/dash app. This is by far the least efficient as it will re-update the database every time it runs. That would be bad. We could do two files; one that contains all of the data collection/wrangling and then one that has our webapp and the simulation functions. There’s nothing wrong with this as it will work just fine. I think our third option is the best: one file for the data work, one for the simulation, and then a file for the app that calls the simulation app. So this is how we could set up the codebase:

nba-game-sim
|-- app.py
|-- generatedata.py
|-- simulategames.py
|-- nba_data.db

If we assume this structure then generatedata.py would look like this:

from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
import statistics as stats

def get_game_data(season, league_id='00', test=False):
    '''Query NBA stats via the nba-api
    
    Parameters
    ----------
    season: format is yyyy-yy ie. 2020-21
    league_id = '00' => NBA
    Returns
    -------
    result: dict of 2 DataFrames {"games_basic": , "games_adv": }
    '''
    #first we need to get all of the basic game data
    gamefinder = leaguegamefinder.LeagueGameFinder(league_id_nullable=league_id,
                                                   season_nullable=season,
                                                   season_type_nullable=SeasonType.regular)
    
    games_basic_df = gamefinder.get_data_frames()[0]
    
    #then we need to get the advanced game data. Slow, game by game.
    #we need to get a list of the games
    game_ids = list(games_basic_df['GAME_ID'].unique())
    
    games_adv_df = pd.DataFrame()
    if test:
        c = 0 # this is here for testing only
    
    #now we iterate over the list of game_ids
    while len(game_ids) > 0:
        for i in game_ids:
            cooldown = rand.gammavariate(alpha=9, beta=0.4)
            clear_output(wait=True)

            for attempt in range(10):
                try:
                    time.sleep(cooldown)
                    data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
                                                                 end_range=0,
                                                                 game_id=i,
                                                                 range_type=0,
                                                                 start_period=1,
                                                                 start_range=0)
                except:
                    time.sleep(cooldown)
                    data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
                                                                 end_range=0,
                                                                 game_id=i,
                                                                 range_type=0,
                                                                 start_period=1,
                                                                 start_range=0)
                else:
                    break
            else:
                print('Connection Error')
                break
            data_df = data.get_data_frames()[1]

            if games_adv_df.empty:
                games_adv_df = data_df
            else:
                games_adv_df = games_adv_df.append(data_df, ignore_index=True)

            game_ids.remove(i)
            print (i , "completed", len(game_ids), "games left", sep="---")    

            if test:
                c += 1
                if c > 3: #again for testing only
                    games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
                    print('DONE!')
                    return games_df_dict
    games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
    print('DONE!')
    return games_df_dict
        
def prep_games(dictionary, left_key, right_key, left, right):
    '''join two dataframes
    
    Paramaters
    ----------
    dictionary: Input dict
    left_key: dictionary key for left dataframe
    right_key: dictionary key for right dataframe
    left: list of columns for left join
    right: list of columns for right join
    
    Returns
    -------
    result: DataFrame
    '''
    #join the two DataFrames using the paramaters
    joined = dictionary[left_key].merge(dictionary[right_key],
                                 left_on=left,
                                 right_on=right)
     # Join every row to all others with the same game ID.
    joined = pd.merge(joined, joined, suffixes=['_HOME', '_AWAY'],
                      on=['SEASON_ID', 'GAME_ID', 'GAME_DATE'])
    # Filter out any row that is joined to itself.
    result = joined[joined.TEAM_ID_HOME != joined.TEAM_ID_AWAY]
    result = result[result.MATCHUP_HOME.str.contains(' vs. ')]
    #lowercase the columnn names for easier access
    result.columns = result.columns.str.lower() 
    return result

def calc_points_per_possession(df):
    '''Calculate the home/away points per possesion and update the dataframe'''
    
    df['points_per_possession_home'] = df['pts_home'] / df['pace_home']
    df['points_per_possession_away'] = df['pts_away'] / df['pace_home']
    
    return df
    
def save_df_to_db(df, db_name, table_name, exists='replace'):
    '''Saves a dataframe to a table in a sqlite database
    
    Paramaters
    ----------
    df: DataFrame to save to table
    db_name: name of database
    table_name: name of table
    exists: what to do if table exists, default='replace'
    
    Return
    ------
    return:
    '''
    conn = None
    try:
        conn = sql.connect(db_name)
        df.to_sql(table_name, conn, if_exists=exists)
    except Exception as e:
        print(e)
    finally:
        if conn:
            conn.close()
    

def generate_season_data(season):
  '''this will call all the functions needed to get and save a season of data'''
  
  df = get_game_data(season)
  df = prep_games(df,
                  left_key='games_basic',
                  right_key='games_adv',
                  left=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'], 
                  right=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'])
  df = calc_points_per_possession(df)
  save_df_to_db(df, 'nba_data.db', 'games')
  
  return True

season_data = generate_season_data('2020-21')
if season_data:
  print(season_data, "SEASON GENERATED AND SAVED SUCCESFULLY")

You may have noticed a new function at the end generate_season_data. This function calls all the other processing functions so we just have to pass a single function call at the end.

Our other file simulategames.py will look like this:

import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
import statistics as stats

def get_df_from_db(db_name, table_name, select=['*'],
                   order_by='', descending=False):
    '''Runs a select from a database and returns a DataFrame
    
    Paramaters
    ----------
    db_name: name of database
    table_name: name of table
    select: list of columns to select, default = ['*']
    order_by: column name to order by
    descending: if order by is to be set as ascending
    
    Returns
    -------
    result: DataFrame
    '''
    select_string = ', '.join(select)
    order_by_string = ''
    if order_by:
        order_by_string = " ORDER BY " + order_by
        if descending:
            order_by_string = order_by_string + " DESC"
    
    query_string = "SELECT " + select_string + " FROM " + table_name + order_by_string
    conn = None
    try:
        conn = sql.connect(db_name)
        df = pd.read_sql_query(query_string , conn)
    except Exception as e:
        print(e)
        return
    finally:
        if conn:
            conn.close
            
    return df

def calc_std_dev_and_mean(df_column):
    '''calculates the standard deviation and mean of a DataFrame column, return dict'''
    
    std_dev = df_column.std()
    mean = df_column.mean()
    stats_dict = {'std_dev': std_dev, 'mean': mean}
    return stats_dict

def get_team_stats(df, home=True):
    '''gets all the needed stats for a team for the simulator, return dict'''
    
    pace = calc_std_dev_and_mean(df['pace_home'])
    home_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_home'])
    away_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_away'])
    if home:
        stats_dict = {"pace": pace, 
                      "points_scored": home_pts_per_possession,
                      "points_given_up": away_pts_per_possession}
    else:
        stats_dict = {"pace": pace,
                      "points_scored": away_pts_per_possession,
                      "points_given_up": home_pts_per_possession}
        
    return stats_dict

def simulate_game(home_stats, away_stats):
    '''uses stats from home and away teams to simulate a single game'''
    # set the margin to 0 for our while loop
    margin = 0
    while(margin == 0):
        home_pace = rand.gauss(home_stats['pace']['mean'],
                               home_stats['pace']['std_dev'])
        home_pts_scored = rand.gauss(home_stats['points_scored']['mean'],
                                     home_stats['points_scored']['std_dev'])
        home_pts_given_up = rand.gauss(home_stats['points_given_up']['mean'],
                                       home_stats['points_given_up']['std_dev'])
        away_pace = rand.gauss(away_stats['pace']['mean'],
                               away_stats['pace']['std_dev'])
        away_pts_scored = rand.gauss(away_stats['points_scored']['mean'],
                                     away_stats['points_scored']['std_dev'])
        away_pts_given_up = rand.gauss(away_stats['points_given_up']['mean'],
                                       away_stats['points_given_up']['std_dev'])
        pace = round((home_pace + away_pace) / 2)
        home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)
        away_pts = round(((away_pts_scored + home_pts_given_up) / 2) * pace)
        margin = home_pts - away_pts
        
    results = {"home_pts": home_pts, "away_pts": away_pts, "pace": pace, "margin": margin}
    return results

def retrieve_data():
  '''retrieves data_frame'''
  
  games_df = get_df_from_db('nba_data.db','games',
                          select=['team_abbreviation_home', 'team_abbreviation_away',
                                  'points_per_possession_home', 'points_per_possession_away',
                                  'pace_home'],
                          order_by='game_date')

  return games_df

We added a new function here. Just a simple one to call the get_df_from_db function.

Now before starting the web app I want to think about what it’s going to look like. I do a couple sketches until I get one I like. This is what I designed:

sketch of web app design

On the left are the controls, which are simple. Just two drop downs; one for home, one for away. Below that I want some numerical results like our tests, only in a table layout. The right side will be a large histogram of the results. I think that will be cool because you usually don’t get to see a histogram on these, just the percentages.

Now that I have my layout I realize that I’m going to need some more data from my database. I’m going to need the abbreviations for each team. That means we need to update our simulategames.py. Luckily we don’t need to create a new function, we can just use our get_df_from_db function to grab what we need. We just need to update retrieve_data like this:

def retrieve_data():
  '''retrieves data_frame'''
  
  games_df = get_df_from_db('nba_data.db','games',
                          select=['team_abbreviation_home', 'team_abbreviation_away',
                                  'points_per_possession_home', 'points_per_possession_away',
                                  'pace_home'],
                          order_by='game_date')
  
  teams_df = get_df_from_db('nba_data.db', 'games',
                          select=['DISTINCT team_abbreviation_home'],
                          order_by='team_abbreviation_home')

  dfs_dict = {"games":games_df, "teams":teams_df}

Now I’ll get a dictionary back with two dataframes: one for the games and one with the team abbreviations.

Now that the housekeeping is out of the way it’s time to build our web app. Now this isn’t a full blown tutorial on how to build an app via Dash. That being said there are tons out there. I found the [docs] to be the most useful though. We are going to use [bootstrap components] as well if you want to take a look at their docs.

All right. First we need to install some dependencies. This can be done via pip:

pip install dash
pip install dash-renderer
pip install dash-html-components
pip install dash-core-components
pip install dash-bootstrap-components
pip install plotly

Next we open up app.py. Let’s start with our imports first. We’ll be importing dash and then also our simulategames.py file.

import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_bootstrap_components as dbc
from dash.dependencies import Input, Output
import simulategames

First let’s get the data we need.

nba_data = simulategames.retrieve_data()

Now we can start building our site. First we need to initiate our app. We will pass through a stylesheet for Bootstrap which makes styling easier.

external_stylesheets = [dbc.themes.BOOTSTRAP]
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

Now that our app is started we can start building the structure. The easiest way to do this is to start at the component level, then build outwards until we have our main structure. So first we’ll make our controls.

controls = dbc.Card(
[

  dbc.FormGroup(
    [
      dbc.Label("Home Team"),
      dcc.Dropdown(
        id='hteam',
        options=get_options(teamslist),
        value='PHX'
        ),
    ]
    ),
  dbc.FormGroup(
    [
      dbc.Label('Away Team'),
      dcc.Dropdown(
        id='ateam',
        options=get_options(teamslist),
        value='LAL'
        ),
    ]
    )
],
body=True
)

These are form groups. One for the home team. And one for the away team. As you can see we have a new function that will take our dataframe and make an options list out of it. Here is the function.

def get_options(df, colname):
    '''take a column from a dataframe and create and options list'''
    
    new_list = df[colname].tolist()
    options = []
    for i in new_list:
        options.append({'label': i, 'value': i})

    return options

Now we need to make a table to display our data

table_header = [
  html.Thead(html.Tr([html.Th() , html.Th("Home Team") , html.Th("Away Team")]))
]
row1 = html.Tr([html.Th("Win %"),
                html.Td(id='htwinper'),
                html.Td(id='atwinper')])
row2 = html.Tr([html.Th("Avg. Margin of Victory"),
                html.Td(id='htmvict'),
                html.Td(id='atmvict')])
row3 = html.Tr([html.Th("Avg. Points Scored"),
                html.Td(id='htpts'),
                html.Td(id='atpts')])
table_body = [html.Tbody([row1 , row2, row3])]
table = dbc.Table(table_header + table_body)
simResults = dbc.Card(
	[

		
	html.H4('Simulation Results'),
	table
	],
	body=True
	)

Here we first made our table header. Then we make each row and then put them together in the table body. The we take that table and put it into a card.

Now that we have our components we can put the page together like so:

app.layout = dbc.Container(
		[
			html.H1(children='Choose two teams to see the results of 10k sims', style={'textAlign':'center'}),
			html.Hr(),
			dbc.Row(
				[
					dbc.Col([controls, html.Br(), ], table, md=4),
					dbc.Col([html.H3('All margins of victory | Home - Away'), dcc.Graph(id='histogram')] , md=8)
				],
				align='center',
				),
		],
		fluid=True
	)

Now if we want to test our layout we can add this at the end of our file:

if __name__ == '__main__':
    app.run_server(debug=True)

Then we just save. Then we can navigate to the folder and run our app.py file to test it out. The site will be available at [http://127.0.0.1:8050]. It should look something like this:

web app without data

As you can see we have a layout but no data. So now we need our data. The way we populate our page is via a callback function. There are two parts to the callback: a decorator and the function. Our decorator will look like this:

@app.callback(
	Output('histogram' , 'figure'),
	Output('htwinper' , 'children'),
	Output('atwinper' , 'children'),
	Output('htmvict' , 'children'),
	Output('atmvict' , 'children'),
	Output('htpts' , 'children'),
	Output('atpts', 'children'),
	Input('hteam' , 'value' ),
	Input('ateam' , 'value'))

The first part of the decorator is our outputs from the callback function. The last part is the inputs (the fields) where we’re getting our data for our callback function.

Now we need our callback function:

def callback( home , away):
  results = simulategames.nba_game_simulator(nba_data['games'],
                                              home, away, sims=10000)

  return px.histogram(results['margins']),\
         results['home_win_pct'],\
         results['away_win_pct'],\
         results['home_wins'],\
         results['away_wins'],\
         results['home_avg_pts'],\
         results['away_avg_pts']

For the histogram we are going to need plotly express. So we need to add that to our imports:

import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_bootstrap_components as dbc
from dash.dependencies import Input, Output
import plotly.express as px
import simulategames

We also need to edit our nba_game_simulator function to return all the calculations like so:

def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
    '''This will run x number of game simulations'''
    #containers for counts
    home_wins = 0
    away_wins = 0
    
    #containers for stats
    margins = []
    home_margins = []
    away_margins = []
    home_pts = []
    away_pts = []
    
    home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
    away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
    home_stats = get_team_stats(home_df)
    away_stats = get_team_stats(away_df, home=False)
    for i in range(sims):
        game_result = simulate_game(home_stats, away_stats)
        home_pts.append(game_result['home_pts'])
        away_pts.append(game_result['away_pts'])
        margins.append(game_result['margin'])
        if (game_result['margin'] > 0):
            home_wins += 1
            home_margins.append(game_result['margin'])
        else:
            away_wins += 1
            away_margins.append(-game_result['margin'])
    
    home_win_pct = round(home_wins / 100)
    away_win_pct = round(away_wins / 100)
    home_avg_pts = round(stats.mean(home_pts))
    away_avg_pts = round(stats.mean(away_pts))
    home_avg_margin = round(stats.mean(home_margins))
    away_avg_margin = round(stats.mean(away_margins))
    results = {"home_avg_margin": home_avg_margin, "away_avg_margin": away_avg_margin,
               "home_win_pct": home_win_pct, "away_win_pct": away_win_pct,
               "home_avg_pts": home_avg_pts, "away_avg_pts": away_avg_pts,
               "margins": margins,}
    return results

Now, for the moment of truth. If you head back to where you have your app spun up at [http://127.0.0.1:8050/]you should see something like this:

web app with data

When you change the teams it should live update the results.

Congratulations! We’re done.

Here is the final code for our three .py files. You can also view the [repo on my github].

generatedata.py

from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.endpoints import boxscoreadvancedv2
from nba_api.stats.library.parameters import SeasonType
import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
import statistics as stats

def get_game_data(season, league_id='00', test=False):
    '''Query NBA stats via the nba-api
    
    Parameters
    ----------
    season: format is yyyy-yy ie. 2020-21
    league_id = '00' => NBA
    Returns
    -------
    result: dict of 2 DataFrames {"games_basic": , "games_adv": }
    '''
    #first we need to get all of the basic game data
    gamefinder = leaguegamefinder.LeagueGameFinder(league_id_nullable=league_id,
                                                   season_nullable=season,
                                                   season_type_nullable=SeasonType.regular)
    
    games_basic_df = gamefinder.get_data_frames()[0]
    
    #then we need to get the advanced game data. Slow, game by game.
    #we need to get a list of the games
    game_ids = list(games_basic_df['GAME_ID'].unique())
    
    games_adv_df = pd.DataFrame()
    if test:
        c = 0 # this is here for testing only
    
    #now we iterate over the list of game_ids
    while len(game_ids) > 0:
        for i in game_ids:
            cooldown = rand.gammavariate(alpha=9, beta=0.4)
            clear_output(wait=True)

            for attempt in range(10):
                try:
                    time.sleep(cooldown)
                    data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
                                                                 end_range=0,
                                                                 game_id=i,
                                                                 range_type=0,
                                                                 start_period=1,
                                                                 start_range=0)
                except:
                    time.sleep(cooldown)
                    data = boxscoreadvancedv2.BoxScoreAdvancedV2(end_period=4,
                                                                 end_range=0,
                                                                 game_id=i,
                                                                 range_type=0,
                                                                 start_period=1,
                                                                 start_range=0)
                else:
                    break
            else:
                print('Connection Error')
                break
            data_df = data.get_data_frames()[1]

            if games_adv_df.empty:
                games_adv_df = data_df
            else:
                games_adv_df = games_adv_df.append(data_df, ignore_index=True)

            game_ids.remove(i)
            print (i , "completed", len(game_ids), "games left", sep="---")    

            if test:
                c += 1
                if c > 3: #again for testing only
                    games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
                    print('DONE!')
                    return games_df_dict
    games_df_dict = {"games_basic": games_basic_df, "games_adv": games_adv_df}
    print('DONE!')
    return games_df_dict
        
def prep_games(dictionary, left_key, right_key, left, right):
    '''join two dataframes
    
    Paramaters
    ----------
    dictionary: Input dict
    left_key: dictionary key for left dataframe
    right_key: dictionary key for right dataframe
    left: list of columns for left join
    right: list of columns for right join
    
    Returns
    -------
    result: DataFrame
    '''
    #join the two DataFrames using the paramaters
    joined = dictionary[left_key].merge(dictionary[right_key],
                                 left_on=left,
                                 right_on=right)
     # Join every row to all others with the same game ID.
    joined = pd.merge(joined, joined, suffixes=['_HOME', '_AWAY'],
                      on=['SEASON_ID', 'GAME_ID', 'GAME_DATE'])
    # Filter out any row that is joined to itself.
    result = joined[joined.TEAM_ID_HOME != joined.TEAM_ID_AWAY]
    result = result[result.MATCHUP_HOME.str.contains(' vs. ')]
    #lowercase the columnn names for easier access
    result.columns = result.columns.str.lower() 
    return result

def calc_points_per_possession(df):
    '''Calculate the home/away points per possesion and update the dataframe'''
    
    df['points_per_possession_home'] = df['pts_home'] / df['pace_home']
    df['points_per_possession_away'] = df['pts_away'] / df['pace_home']
    
    return df
    
def save_df_to_db(df, db_name, table_name, exists='replace'):
    '''Saves a dataframe to a table in a sqlite database
    
    Paramaters
    ----------
    df: DataFrame to save to table
    db_name: name of database
    table_name: name of table
    exists: what to do if table exists, default='replace'
    
    Return
    ------
    return:
    '''
    conn = None
    try:
        conn = sql.connect(db_name)
        df.to_sql(table_name, conn, if_exists=exists)
    except Exception as e:
        print(e)
    finally:
        if conn:
            conn.close()
    

def generate_season_data(season):
  '''this will call all the functions needed to get and save a season of data'''
  
  df = get_game_data(season)
  df = prep_games(df,
                  left_key='games_basic',
                  right_key='games_adv',
                  left=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'], 
                  right=['GAME_ID' , 'TEAM_ID' , 'TEAM_ABBREVIATION'])
  df = calc_points_per_possession(df)
  save_df_to_db(df, 'nba_data.db', 'games')
  
  return True

season_data = generate_season_data('2020-21')
if season_data:
  print(season_data, "SEASON GENERATED AND SAVED SUCCESFULLY")

simulategames.py

import pandas as pd
import time
import random as rand
from IPython.display import clear_output
import sqlite3 as sql
import statistics as stats

def get_df_from_db(db_name, table_name, select=['*'],
                   order_by='', descending=False):
    '''Runs a select from a database and returns a DataFrame
    
    Paramaters
    ----------
    db_name: name of database
    table_name: name of table
    select: list of columns to select, default = ['*']
    order_by: column name to order by
    descending: if order by is to be set as ascending
    
    Returns
    -------
    result: DataFrame
    '''
    select_string = ', '.join(select)
    order_by_string = ''
    if order_by:
        order_by_string = " ORDER BY " + order_by
        if descending:
            order_by_string = order_by_string + " DESC"
    
    query_string = "SELECT " + select_string + " FROM " + table_name + order_by_string
    conn = None
    try:
        conn = sql.connect(db_name)
        df = pd.read_sql_query(query_string , conn)
    except Exception as e:
        print(e)
        return
    finally:
        if conn:
            conn.close
            
    return df

def calc_std_dev_and_mean(df_column):
    '''calculates the standard deviation and mean of a DataFrame column, return dict'''
    
    std_dev = df_column.std()
    mean = df_column.mean()
    stats_dict = {'std_dev': std_dev, 'mean': mean}
    return stats_dict

def get_team_stats(df, home=True):
    '''gets all the needed stats for a team for the simulator, return dict'''
    
    pace = calc_std_dev_and_mean(df['pace_home'])
    home_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_home'])
    away_pts_per_possession = calc_std_dev_and_mean(df['points_per_possession_away'])
    if home:
        stats_dict = {"pace": pace, 
                      "points_scored": home_pts_per_possession,
                      "points_given_up": away_pts_per_possession}
    else:
        stats_dict = {"pace": pace,
                      "points_scored": away_pts_per_possession,
                      "points_given_up": home_pts_per_possession}
        
    return stats_dict

def simulate_game(home_stats, away_stats):
    '''uses stats from home and away teams to simulate a single game'''
    # set the margin to 0 for our while loop
    margin = 0
    while(margin == 0):
        home_pace = rand.gauss(home_stats['pace']['mean'],
                               home_stats['pace']['std_dev'])
        home_pts_scored = rand.gauss(home_stats['points_scored']['mean'],
                                     home_stats['points_scored']['std_dev'])
        home_pts_given_up = rand.gauss(home_stats['points_given_up']['mean'],
                                       home_stats['points_given_up']['std_dev'])
        away_pace = rand.gauss(away_stats['pace']['mean'],
                               away_stats['pace']['std_dev'])
        away_pts_scored = rand.gauss(away_stats['points_scored']['mean'],
                                     away_stats['points_scored']['std_dev'])
        away_pts_given_up = rand.gauss(away_stats['points_given_up']['mean'],
                                       away_stats['points_given_up']['std_dev'])
        pace = round((home_pace + away_pace) / 2)
        home_pts = round(((home_pts_scored + away_pts_given_up) / 2) * pace)
        away_pts = round(((away_pts_scored + home_pts_given_up) / 2) * pace)
        margin = home_pts - away_pts
        
    results = {"home_pts": home_pts, "away_pts": away_pts, "pace": pace, "margin": margin}
    return results

def nba_game_simulator(dframe, home='PHX', away='LAL', sims=1):
    '''This will run x number of game simulations'''
    #containers for counts
    home_wins = 0
    away_wins = 0
    
    #containers for stats
    margins = []
    home_margins = []
    away_margins = []
    home_pts = []
    away_pts = []
    
    home_df = dframe.loc[(dframe['team_abbreviation_home'] == home)]
    away_df = dframe.loc[(dframe['team_abbreviation_away'] == away)]
    home_stats = get_team_stats(home_df)
    away_stats = get_team_stats(away_df, home=False)
    for i in range(sims):
        game_result = simulate_game(home_stats, away_stats)
        home_pts.append(game_result['home_pts'])
        away_pts.append(game_result['away_pts'])
        margins.append(game_result['margin'])
        if (game_result['margin'] > 0):
            home_wins += 1
            home_margins.append(game_result['margin'])
        else:
            away_wins += 1
            away_margins.append(-game_result['margin'])
    
    home_win_pct = round(home_wins / 100)
    away_win_pct = round(away_wins / 100)
    home_avg_pts = round(stats.mean(home_pts))
    away_avg_pts = round(stats.mean(away_pts))
    home_avg_margin = round(stats.mean(home_margins))
    away_avg_margin = round(stats.mean(away_margins))
    results = {"home_avg_margin": home_avg_margin, "away_avg_margin": away_avg_margin,
               "home_win_pct": home_win_pct, "away_win_pct": away_win_pct,
               "home_avg_pts": home_avg_pts, "away_avg_pts": away_avg_pts,
               "margins": margins,}
    return results

def retrieve_data():
  '''retrieves data_frame'''
  
  games_df = get_df_from_db('nba_data.db','games',
                          select=['team_abbreviation_home', 'team_abbreviation_away',
                                  'points_per_possession_home', 'points_per_possession_away',
                                  'pace_home'],
                          order_by='game_date')
  
  teams_df = get_df_from_db('nba_data.db', 'games',
                          select=['DISTINCT team_abbreviation_home'],
                          order_by='team_abbreviation_home')

  dfs_dict = {"games":games_df, "teams":teams_df}

  return dfs_dict

app.py

import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_bootstrap_components as dbc
from dash.dependencies import Input, Output
import plotly.express as px
import simulategames

nba_data = simulategames.retrieve_data()
external_stylesheets = [dbc.themes.BOOTSTRAP]
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

def get_options(df, colname):
    '''take a column from a dataframe and create and options list'''

    new_list = df[colname].tolist()
    options_list = []
    for i in new_list:
        options_list.append({'label': i, 'value': i})

    return options_list


controls = dbc.Card(
[

  dbc.FormGroup(
    [
      dbc.Label("Home Team"),
      dcc.Dropdown(
        id='hteam',
        options=get_options(nba_data['teams'], 'team_abbreviation_home'),
        value='PHX'
        ),
    ]
    ),
  dbc.FormGroup(
    [
      dbc.Label('Away Team'),
      dcc.Dropdown(
        id='ateam',
        options=get_options(nba_data['teams'], 'team_abbreviation_home'),
        value='LAL'
        ),
    ]
    )
],
body=True
)

table_header = [
  html.Thead(html.Tr([html.Th() , html.Th("Home Team") , html.Th("Away Team")]))
]
row1 = html.Tr([html.Th("Win %"),
                html.Td(id='htwinper'),
                html.Td(id='atwinper')])
row2 = html.Tr([html.Th("Avg. Margin of Victory"),
                html.Td(id='htmvict'),
                html.Td(id='atmvict')])
row3 = html.Tr([html.Th("Avg. Points Scored"),
                html.Td(id='htpts'),
                html.Td(id='atpts')])
table_body = [html.Tbody([row1 , row2, row3])]
table = dbc.Table(table_header + table_body)
simResults = dbc.Card(
	[

		
	html.H4('Simulation Results'),
	table
	],
	body=True
	)

app.layout = dbc.Container(
		[
			html.H1(children='Choose two teams to see the results of 10k sims', style={'textAlign':'center'}),
			html.Hr(),
			dbc.Row(
				[
					dbc.Col([controls, html.Br(), simResults], md=4),
					dbc.Col([html.H3('All margins of victory | Home - Away'), dcc.Graph(id='histogram')] , md=8)
				],
				align='center',
				),
		],
		fluid=True
	)

@app.callback(
Output('histogram' , 'figure'),
Output('htwinper' , 'children'),
Output('atwinper' , 'children'),
Output('htmvict' , 'children'),
Output('atmvict' , 'children'),
Output('htpts' , 'children'),
Output('atpts', 'children'),
Input('hteam' , 'value' ),
Input('ateam' , 'value'))

def callback( home , away):
  results = simulategames.nba_game_simulator(nba_data['games'],
                                              home, away, sims=10000)

  return px.histogram(results['margins']),\
         results['home_win_pct'],\
         results['away_win_pct'],\
         results['home_avg_margin'],\
         results['away_avg_margin'],\
         results['home_avg_pts'],\
         results['away_avg_pts']

if __name__ == '__main__':
    app.run_server(debug=True)

Next steps

This app obviously has a lot of room for more work. Here are some ideas I thought of that could enhance it.

  • You could add some more stats. Such as:
     — Top margins of victory along with counts
     — Highest scores
     — Lowest scores
     — etc.
  • You could set it up on Heroku or Pythonanywhere ([mine is up on Pythonanywhere, with margin counts])
  • You could rewrite the data collection so it updates with missing data instead of just getting all the data, and then use a cron job to run it. 
     — *A quick note on this — if you want to do this make sure not to run the data collection from AWS or Google Cloud, their IPs are all blocked by the API*
  • You could build a season simulator based on this that would run the entire season/playoffs.

Final thoughts

So there you have it. I really enjoyed designing this app and writing the guide. And I hope you enjoyed it as well.

My goal is to write one of these a week — or more likely every couple, since they are a lot of work. For my next guide I am thinking of building a financial simulation that will tell you how much money you ‘might’ have when you retire. But if you would like to see something else — or if you have questions about what I did here. Please post in the comments or email me @ jd [at] carbonlab [dot] io.