
Parser quality checklist #37

@longobucco

Description


Create a lightweight but consistent checklist to evaluate the quality of parsed output on representative course documents.

The goal is to make parser validation repeatable across samples and to quickly identify the most common failure modes before they affect chunking, retrieval, and source-grounded generation.

Scope

For each sample document, evaluate the parsed output against the following dimensions:

  • text extraction quality
  • slide/page segmentation
  • heading/section preservation
  • tables
  • formulas
  • image-heavy slides
  • broken reading order
  • missing content

Evaluation dimensions

Each document review should include brief notes and a simple rating for each of the following:

  • Text extraction quality
    Is the extracted text complete, readable, and reasonably clean?

  • Slide/page segmentation
    Are page and slide boundaries preserved correctly?

  • Heading/section preservation
    Are titles, section headers, and hierarchical structure retained?

  • Tables
    Are tables captured in a usable form, or are they broken/lost?

  • Formulas
    Are formulas preserved, partially degraded, or missing?

  • Image-heavy slides
    Does the parser still produce useful output when slides contain little text and mostly visuals?

  • Broken reading order
    Does the extracted content follow the correct logical reading sequence?

  • Missing content
    Is any obvious content missing from the parsed output?
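
To keep ratings consistent across reviewers, a minimal data-model sketch for a single checklist entry is shown below; the three-level rating scale and all names are illustrative assumptions, not something prescribed by this issue.

```python
# Illustrative sketch only: one possible data model for a single checklist entry.
# The rating scale (GOOD / DEGRADED / BROKEN) is an assumption, not part of the issue.
from dataclasses import dataclass
from enum import Enum


class Rating(Enum):
    GOOD = "good"          # complete and usable as-is
    DEGRADED = "degraded"  # usable, but with partial loss or cleanup needed
    BROKEN = "broken"      # unusable or missing for this dimension


@dataclass
class DimensionResult:
    dimension: str   # e.g. "tables", "formulas", "reading order"
    rating: Rating
    notes: str = ""  # brief free-text observations, ideally with a concrete example


# Example: rating the "tables" dimension for one document.
tables = DimensionResult(
    dimension="tables",
    rating=Rating.DEGRADED,
    notes="Column alignment lost on multi-page tables; header rows preserved.",
)
```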

Suggested output format

For each sample document, produce:

  • document name
  • file type
  • parser used
  • short overall quality summary
  • checklist evaluation by category
  • examples of major issues
  • recommendation:
    • acceptable for retrieval
    • acceptable with cleanup
    • not acceptable yet
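
The fields above map naturally onto a per-document record. Below is a hedged sketch of how such a record could be assembled and rendered as a short report; the `Recommendation` values mirror the three options listed above, while every other name (class, field, and function names) is an assumption made only for illustration.

```python
# Illustrative sketch only: turning one completed evaluation into a short
# markdown report matching the suggested output format. Names are assumptions.
from dataclasses import dataclass, field
from enum import Enum


class Recommendation(Enum):
    ACCEPTABLE = "acceptable for retrieval"
    NEEDS_CLEANUP = "acceptable with cleanup"
    NOT_ACCEPTABLE = "not acceptable yet"


@dataclass
class DocumentEvaluation:
    document_name: str
    file_type: str
    parser_used: str
    summary: str
    checklist: dict[str, str] = field(default_factory=dict)  # dimension -> rating / notes
    major_issues: list[str] = field(default_factory=list)
    recommendation: Recommendation = Recommendation.NOT_ACCEPTABLE

    def to_markdown(self) -> str:
        lines = [
            f"### {self.document_name} ({self.file_type}, parser: {self.parser_used})",
            f"**Summary:** {self.summary}",
            "",
            "| Dimension | Rating / notes |",
            "| --- | --- |",
        ]
        lines += [f"| {dim} | {note} |" for dim, note in self.checklist.items()]
        if self.major_issues:
            lines += ["", "**Major issues:**"]
            lines += [f"- {issue}" for issue in self.major_issues]
        lines += ["", f"**Recommendation:** {self.recommendation.value}"]
        return "\n".join(lines)
```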

Deliverables

  • parser quality checklist template
  • completed evaluations for a small set of representative sample documents
  • short summary of recurring parser weaknesses

Acceptance criteria

  • the checklist is clear and reusable
  • at least 3 (ideally 5) representative documents are evaluated
  • major parser failure modes are explicitly documented
  • the output is useful for both parsing and retrieval work
