Pygameday scrapes Major League Baseball (MLB) GameDay data, parses it, and inserts it into a database of your choosing for later analysis.
Games, Players, At-Bats, Hits In Play, and Pitches are the data types captured.
One of the motivations behind creating pygameday was to make the database backend transparent; you should not have to worry about whether you're using SQLite, Postgres, or some other implementation. All you do is specify a URI to the database, and you're up and running.
Pygameday is built on SQLAlchemy, and should be compatible with any database that SQLAlchemy supports. The following dialects are supported:
- SQLite
- PostgreSQL
- MySQL
- Oracle
- Microsoft SQL Server
- Firebird
- Sybase
It should be noted that I've tested only SQLite and Postgres.
The examples directory contains a documented Jupyter notebook and a more stripped-down example script. You can also refer to the Quickstart section below.
Install pygameday using pip:
pip install pygameday
Pygameday depends on tqdm, sqlalchemy, lxml, requests, and python-dateutil. If those packages are not automatically installed, run
pip install tqdm sqlalchemy lxml requests python-dateutil
Pygameday was developed and tested using Python 3.
Run pygameday by instantiating a GameDayClient.
First, instantiate the database client by specifying a database URI.
This example creates an SQLite database
named gameday.db in the current directory, but you can substitute
a URI for your database flavor of choice, as long as the database is supported
by SQLAlchemy.
The n_workers parameter determines how many parallel processes are
used to insert game data. By default, n_workers is set to 4. To handle database
inserts serially, set n_workers=1. Serial processing tends to work better for SQLite
databases, which is why it's used in this example, but a server-based
database implementation should be able to handle parallel processes.
from pygameday import GameDayClient
database_uri = "sqlite:///gameday.db"
client = GameDayClient(database_uri, n_workers=1)Ingest games that occurred on a single day by specifying a standard Python datetime.
from datetime import datetime
date_to_process = datetime(2015, 5, 1) # Ingest games on May 1, 2015
client.process_date(date_to_process)You can also ingest games within a date range.
# Ingest games between May 1, 2015 and May 3, 2015
start_date = datetime(2015, 5, 1)
end_date = datetime(2015, 5, 3)
client.process_date_range(start_date, end_date)After ingesting data, use any tool you like to verify that the data is in the database. Here's an example using pandas.
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine(database_uri)
# Execute SQL queries against the database we just created
data = pd.read_sql("SELECT * FROM games LIMIT 5", engine)
data.head()You need to specify a valid database URI for pygameday to work. Here are some example URIs.
SQLite:
"sqlite:///example.db" # File in the current directory"sqlite:////absolute/path/to/example.db" # Absolute path to file (Unix/Mac)"sqlite:///C:\absolute\path\to\example.db" # Absolute path to file (Windows)
PostgreSQL:
"postgresql://user:password@host/database_name" # Standard Postgres dialect"psycopg2+postgresql://user:password@host/database_name" # with psycopg2 driver
SQLAlchemy's engine documentation has additional details about the dialects it supports.