Inconsistencies in Parquet download files #73

@JRice15

Describe the bug
I downloaded the parquet files directly from the OneDrive link included in the repo's README, and have been reading them with pyarrow and pandas. In digging through the data (2022 only, so far), I discovered two problems, one of which led me to the other.

  1. First (and maybe more of a feature request than a bug) is the fact that, as best I can tell, there is no easy way to tell whether games are regular season or postseason/All-Star/other from any of the parquet files (I would particularly expect game.parquet, gamelog.parquet, or schedule.parquet to have an indicator column for this, but I do not see one).

  2. Second appears to be more of a bug. In contrast to no. 1 above, schedule.parquet seems to include only regular season games. Great, we can simply filter game.parquet and other files by whether or not the game exists in schedule.parquet, right? Nope: schedule.parquet seems to list games that never actually occurred. As an example, schedule.parquet includes a game MIL @ CHN on 2022-04-08, not part of a doubleheader. However, game.parquet (as well as Baseball-Reference and other sources) tells us that no such game exists! MIL @ CHN games occurred on 2022-04-07 and 2022-04-09, but not 2022-04-08. I find 88 of these 'phantom games' in schedule.parquet for 2022. And it's not a byproduct of the pandemic or lockout: I found these inconsistencies in every year I've looked at, as far back as 2000.

To Reproduce
Steps to reproduce the behavior:

  1. Compare schedule.parquet and game.parquet as described above.
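
For anyone who wants to repeat the comparison, here is a minimal sketch of the kind of check described above: a left anti-join that keeps schedule rows with no matching game. The key column names (`date`, `away_team`, `home_team`) and the helper `find_phantom_games` are hypothetical placeholders, not the actual column names in the parquet files, so they would need to be swapped for the real ones. The toy frames stand in for the real files and just mirror the MIL @ CHN example.

```python
import pandas as pd

# Hypothetical key columns: the real names in schedule.parquet and
# game.parquet may differ and should be checked before running this.
KEYS = ["date", "away_team", "home_team"]


def find_phantom_games(schedule, games, keys=KEYS):
    """Return schedule rows that have no matching row in games (left anti-join)."""
    merged = schedule.merge(
        # De-duplicate the keys so doubleheaders do not multiply schedule rows.
        games[keys].drop_duplicates(),
        on=keys,
        how="left",
        indicator=True,  # adds a "_merge" column: "both" or "left_only"
    )
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


# Toy data standing in for the real files (which could be loaded with
# pd.read_parquet("schedule.parquet") and pd.read_parquet("game.parquet")).
schedule = pd.DataFrame({
    "date": ["2022-04-07", "2022-04-08", "2022-04-09"],
    "away_team": ["MIL", "MIL", "MIL"],
    "home_team": ["CHN", "CHN", "CHN"],
})
games = pd.DataFrame({
    "date": ["2022-04-07", "2022-04-09"],
    "away_team": ["MIL", "MIL"],
    "home_team": ["CHN", "CHN"],
})

phantoms = find_phantom_games(schedule, games)
print(phantoms)
```

On the toy data this leaves only the 2022-04-08 row. If the two files store dates in different types or formats, the keys would need to be normalized before merging.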

Expected behavior
Games will be consistent across files, and a column easily delineates whether games are regular season or not.

I am entirely open to the possibility that the files are exactly as intended and I am just missing something that explains the discrepancies; let me know if that is the case. Thanks!

Metadata

Labels

fix on upcoming release: Bug that is expected to be resolved in conjunction with Retrosheet's next release (2x per year)
