Okay, so here’s the deal. I messed around with some NBA stats today – specifically, the Miami Heat vs. Boston Celtics game. Figured I’d jot down how it went down, ya know, for posterity…and maybe for anyone else trying to do the same thing.

First off, I started by trying to find a decent data source. I scraped a few different sports websites: ESPN and a handful of others. Some were a total pain to parse, either because the HTML was a mess or because they had some anti-scraping stuff going on. I ended up settling on one that wasn’t perfect but gave me the core stats I needed: points, rebounds, assists, etc. for each player.
Then I grabbed all that raw HTML and dove in to clean up the data. This was the grunt work part. I used Python with Beautiful Soup to sift through the HTML tables, and I had to write a bunch of little functions to extract the data from each row and column. It took lots of trial and error to get the table structure right. Like, one table had the player names in one format, and another had them in a different one. I swear, websites do this stuff on purpose.
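For anyone who wants a starting point, here is a minimal sketch of that kind of row-and-column extraction with Beautiful Soup. The HTML snippet, the `box-score` table id, the column names, and the `parse_stat_table` helper are all made up for illustration; a real page will have its own markup and will need its own tweaks.

```python
from bs4 import BeautifulSoup

# Hypothetical box-score fragment; real pages will look different.
SAMPLE_HTML = """
<table id="box-score">
  <tr><th>Player</th><th>PTS</th><th>REB</th><th>AST</th></tr>
  <tr><td>Player A</td><td>27</td><td>8</td><td>6</td></tr>
  <tr><td>Player B</td><td>14</td><td>11</td><td></td></tr>
</table>
"""


def parse_stat_table(html, table_id):
    """Return one dict per data row, keyed by the header cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []

    rows = table.find_all("tr")
    if not rows:
        return []

    # Treat the first row as the header, whether it uses <th> or <td>.
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]

    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) != len(headers):
            continue  # skip separator or malformed rows
        records.append(dict(zip(headers, cells)))
    return records


rows = parse_stat_table(SAMPLE_HTML, "box-score")
```

Everything comes back as strings at this point (including the empty assist cell for Player B), which is why the type cleanup in the next step matters.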
Next, I put everything into Pandas DataFrames. This made things way easier to work with. I cleaned the data types, converting strings to integers or floats where necessary. There were some missing values (NaNs) here and there, so I filled them in with zeros – figured that was better than just dropping the rows entirely.
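The type cleanup can be sketched roughly like this. The column names, the sample values, and the `clean_stats` helper are assumptions for the example, not the actual data from the game.

```python
import pandas as pd

# Hypothetical raw values as they might come out of the HTML (all strings).
raw = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "PTS": ["27", "14", ""],
    "REB": ["8", "11", "3"],
    "FG%": ["52.4", "-", "40.0"],
})


def clean_stats(df, int_cols, float_cols):
    """Convert stat columns to numbers, filling unparseable cells with zero."""
    out = df.copy()
    for col in int_cols:
        # errors="coerce" turns anything that is not a number into NaN.
        numeric = pd.to_numeric(out[col], errors="coerce")
        out[col] = numeric.fillna(0).astype(int)
    for col in float_cols:
        numeric = pd.to_numeric(out[col], errors="coerce")
        out[col] = numeric.fillna(0.0)
    return out


clean = clean_stats(raw, int_cols=["PTS", "REB"], float_cols=["FG%"])
```

One thing to keep in mind with the zero-fill approach: a zero that came from a missing cell looks exactly like a real zero afterwards, so averages can get dragged down a bit.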
After that, I combined the data into a single DataFrame. I had to make sure the player names matched up correctly between the different tables. I ended up writing a little function to normalize the names (e.g., removing middle initials, standardizing nicknames). It wasn’t perfect, but it got me 90% of the way there.
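Here is a rough sketch of what a name normalizer plus merge could look like. The player names, the `NICKNAMES` table, and the `normalize_name` helper are invented for the example; the real rules depend on how each site spells things.

```python
import re

import pandas as pd

# Hypothetical nickname table: maps a known variant to one canonical spelling.
NICKNAMES = {"al smith": "alex smith"}


def normalize_name(name):
    """Lowercase, collapse whitespace, drop middle initials, map nicknames."""
    cleaned = " ".join(name.lower().split())
    tokens = cleaned.split(" ")
    if len(tokens) > 2:
        # Drop single-letter middle tokens such as "j" or "j.".
        middle = [t for t in tokens[1:-1] if not re.fullmatch(r"[a-z]\.?", t)]
        tokens = [tokens[0], *middle, tokens[-1]]
    cleaned = " ".join(tokens)
    return NICKNAMES.get(cleaned, cleaned)


# Two made-up tables that spell the same players differently.
scoring = pd.DataFrame({"Player": ["Alex J. Smith", "Sam Jones"], "PTS": [27, 14]})
rebounding = pd.DataFrame({"Player": ["al smith", "SAM  JONES"], "REB": [8, 11]})

for df in (scoring, rebounding):
    df["player_key"] = df["Player"].map(normalize_name)

merged = scoring.merge(
    rebounding[["player_key", "REB"]],
    on="player_key",
    how="outer",
    indicator=True,  # adds a "_merge" column showing where each row came from
)

# Rows that did not match on both sides are the names still needing attention.
unmatched = merged[merged["_merge"] != "both"]
```

Doing an outer merge with `indicator=True` is handy here because the leftover rows show exactly which names the normalizer missed, instead of silently dropping them.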
Now for the fun part! I started doing some basic analysis. Calculated the total points scored by each team, the average rebounds per player, that kind of stuff. I used Matplotlib and Seaborn to create some basic visualizations – bar charts, scatter plots, histograms. Nothing fancy, just enough to get a quick overview of the data.
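The team totals and averages boil down to a couple of `groupby` calls. The sketch below uses made-up numbers and plain Matplotlib only (Seaborn is left out to keep it short); the `Team`, `PTS`, and `REB` column names are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so this runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up stats, not the real box score.
stats = pd.DataFrame({
    "Team": ["MIA", "MIA", "BOS", "BOS"],
    "Player": ["P1", "P2", "P3", "P4"],
    "PTS": [20, 10, 25, 15],
    "REB": [5, 7, 9, 3],
})

# Total points per team and average rebounds per player on each team.
team_points = stats.groupby("Team")["PTS"].sum()
avg_rebounds = stats.groupby("Team")["REB"].mean()

# Simple bar chart of the team totals.
fig, ax = plt.subplots()
ax.bar(team_points.index, team_points.values)
ax.set_xlabel("Team")
ax.set_ylabel("Total points")
ax.set_title("Points by team")
```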

I wanted to see who the top performers were for each team. So, I sorted the data by points, rebounds, and assists, and printed out the top 5 players in each category. Turns out, the usual suspects were at the top of the list – you know, Butler for the Heat, Tatum for the Celtics. No big surprises there.
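A top-N-per-team lookup can be sketched as a sort followed by a grouped `head`. The data and the `top_performers` helper below are hypothetical, and the demo uses `n=2` just to keep the sample small.

```python
import pandas as pd

# Made-up stats for six placeholder players.
stats = pd.DataFrame({
    "Team": ["MIA", "MIA", "MIA", "BOS", "BOS", "BOS"],
    "Player": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "PTS": [20, 10, 15, 25, 5, 30],
    "REB": [4, 9, 6, 7, 2, 8],
    "AST": [6, 3, 1, 5, 7, 2],
})


def top_performers(df, stat, n=5):
    """Return the top n players per team for one stat, highest first."""
    ranked = df.sort_values(stat, ascending=False)
    # head(n) on the grouped frame keeps the first n rows of each team.
    return ranked.groupby("Team").head(n)[["Team", "Player", stat]]


for stat in ("PTS", "REB", "AST"):
    print(f"Top players by {stat}:")
    print(top_performers(stats, stat, n=2).to_string(index=False))

top_scorers = top_performers(stats, "PTS", n=2)
```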
Finally, I saved the cleaned data and the visualizations to disk: CSV for the data and PNG for the charts, so I could easily share them later if I wanted to.
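Saving both outputs takes only a few lines. In this sketch the file names, the `save_outputs` helper, and the sample data are placeholders, and the demo writes to a temporary directory rather than a real project folder.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # render off-screen so this runs without a display

import matplotlib.pyplot as plt
import pandas as pd


def save_outputs(df, fig, out_dir):
    """Write the DataFrame as CSV and the figure as PNG; return both paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    csv_file = out_path / "cleaned_stats.csv"
    png_file = out_path / "points_by_player.png"

    df.to_csv(csv_file, index=False)  # index=False keeps the row numbers out
    fig.savefig(png_file, dpi=150, bbox_inches="tight")
    plt.close(fig)  # free the figure once it is on disk
    return csv_file, png_file


# Made-up data and chart for the demo.
stats = pd.DataFrame({"Player": ["P1", "P2"], "PTS": [27, 14]})
fig, ax = plt.subplots()
ax.bar(stats["Player"], stats["PTS"])
ax.set_ylabel("Points")

csv_file, png_file = save_outputs(stats, fig, tempfile.mkdtemp())
```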
Lessons Learned:
- Data scraping can be a real pain in the butt. Be prepared to spend a lot of time cleaning and formatting the data.
- Pandas is your friend. Learn it, love it, use it.
- Don’t be afraid to use visualizations. They can help you understand the data much better than just looking at numbers.
All in all, it was a pretty fun little project. Got to brush up on my Python skills and learn a bit more about NBA stats. Might try doing something similar with other sports data in the future.