Task: Implement File Parsing and Data Persistence in SQLite
Scope
1. Parsing Files
Implement the parsing of uploaded course files (PDF, Markdown, DOCX, TXT) and extract relevant content (titles, sections, text) to populate the study path. This includes the following steps:
- File Validation: Ensure that uploaded files are of a valid type (PDF, MD/Markdown, DOCX, TXT). Check that the MIME type and the file extension agree with each other.
- Parsing: Use a parsing service or tool to process the files and extract structured data:
  - For PDF, use tools like PyPDF or pdfminer.
  - For Markdown and DOCX, use libraries like markdown-it-py and python-docx.
  - For TXT files, read and process plain text.
- Chunking: Break the parsed content into meaningful chunks (e.g., sections, paragraphs) for easier retrieval later.
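The validation and chunking steps above can be sketched as follows. This is a minimal illustration and not a settled design: the `ALLOWED_TYPES` mapping, the function names `is_supported_upload` and `chunk_paragraphs`, and the `max_chars` limit are all assumptions made for the example, and the accepted MIME types should be adjusted to whatever the upload client actually sends.

```python
import re
from pathlib import Path

# Assumed mapping from file extension to acceptable MIME types (hypothetical;
# browsers often report Markdown as text/plain, so both are accepted here).
ALLOWED_TYPES = {
    ".pdf": {"application/pdf"},
    ".md": {"text/markdown", "text/plain"},
    ".markdown": {"text/markdown", "text/plain"},
    ".docx": {
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    },
    ".txt": {"text/plain"},
}


def is_supported_upload(filename: str, content_type: str) -> bool:
    """Return True when the extension is supported and the MIME type matches it."""
    ext = Path(filename).suffix.lower()
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ALLOWED_TYPES.get(ext, set())


def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Split text on blank lines, then pack paragraphs into chunks.

    A chunk is closed once adding the next paragraph would exceed max_chars.
    A single paragraph longer than max_chars is kept whole, not split.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Packing by paragraph keeps each chunk readable on its own; splitting by section headings instead would fit the same interface if the parsers expose heading boundaries.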
2. Persist Data in SQLite
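Once the file has been parsed, save the extracted data in an SQLite database to allow efficient querying and navigation of the study materials:
- Database Schema: Define tables for courses, lessons, and chunks.
- Database Interaction: Use SQLite to persist parsed content and make it queryable.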
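One possible shape for the persistence layer is sketched below with Python's built-in `sqlite3` module. The column names, the `position` ordering columns, the indexes, and the helper names (`open_db`, `save_course`, `lessons_for_course`) are assumptions for illustration only; the task does not prescribe them.

```python
import sqlite3

# Hypothetical schema: one course has many lessons, one lesson has many chunks.
SCHEMA = """
CREATE TABLE IF NOT EXISTS courses (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS lessons (
    id INTEGER PRIMARY KEY,
    course_id INTEGER NOT NULL REFERENCES courses(id) ON DELETE CASCADE,
    title TEXT NOT NULL,
    position INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS chunks (
    id INTEGER PRIMARY KEY,
    lesson_id INTEGER NOT NULL REFERENCES lessons(id) ON DELETE CASCADE,
    position INTEGER NOT NULL,
    body TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_lessons_course ON lessons(course_id, position);
CREATE INDEX IF NOT EXISTS idx_chunks_lesson ON chunks(lesson_id, position);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection, enable foreign keys, and create the tables."""
    conn = sqlite3.connect(path)
    # SQLite does not enforce foreign keys unless this is set per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def save_course(conn, title, lessons):
    """Store a course; lessons is a list of (lesson_title, [chunk_text, ...])."""
    with conn:  # single transaction: commit on success, roll back on error
        course_id = conn.execute(
            "INSERT INTO courses (title) VALUES (?)", (title,)
        ).lastrowid
        for lesson_pos, (lesson_title, chunks) in enumerate(lessons):
            lesson_id = conn.execute(
                "INSERT INTO lessons (course_id, title, position) VALUES (?, ?, ?)",
                (course_id, lesson_title, lesson_pos),
            ).lastrowid
            conn.executemany(
                "INSERT INTO chunks (lesson_id, position, body) VALUES (?, ?, ?)",
                [(lesson_id, pos, body) for pos, body in enumerate(chunks)],
            )
    return course_id


def lessons_for_course(conn, course_id):
    """Return lesson titles for a course in their stored order."""
    rows = conn.execute(
        "SELECT title FROM lessons WHERE course_id = ? ORDER BY position",
        (course_id,),
    )
    return [row[0] for row in rows]
```

Writing the whole course inside one transaction means a failure partway through a file leaves no half-saved lessons behind.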
3. Integration with Existing Workflow
- Ingestion Portal: Use the file upload endpoint to process and store the files. Once a file is uploaded, it should trigger the parsing process and populate the database with the extracted data.
- Link Parsing to UI: Ensure that the data persists in SQLite and is available for use in the study path navigation UI (courses, lessons, and content chunks).
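The upload-to-database flow could be wired roughly as below. This sketch is deliberately framework-agnostic, since the task does not name the web framework: `ingest_upload`, `ParseError`, and the `parse` and `store` callables are hypothetical names, and the 201 and 500 status codes are assumptions. Only the 415 response comes from the task itself.

```python
import logging
import sqlite3

logger = logging.getLogger("ingestion")

SUPPORTED_EXTENSIONS = (".pdf", ".md", ".markdown", ".docx", ".txt")


class ParseError(Exception):
    """Hypothetical error a parser raises when file content cannot be read."""


def ingest_upload(filename, data, parse, store):
    """Run validate -> parse -> persist and return (status_code, body).

    parse(filename, data) returns parsed content or raises ParseError.
    store(parsed) writes to SQLite and returns the new course id.
    """
    if not filename.lower().endswith(SUPPORTED_EXTENSIONS):
        return 415, {"error": f"Unsupported file type: {filename}"}

    try:
        parsed = parse(filename, data)
    except ParseError as exc:
        # The acceptance criteria treat corrupted files as 415 as well.
        logger.warning("parse failed for %s: %s", filename, exc)
        return 415, {"error": f"Could not parse {filename}"}

    try:
        course_id = store(parsed)
    except sqlite3.Error:
        # Assumed mapping: database failures surface as a server error.
        logger.exception("database write failed for %s", filename)
        return 500, {"error": "Could not save parsed content"}

    return 201, {"course_id": course_id}
```

Passing the parser and the store in as callables keeps the endpoint thin and lets each stage be unit tested with simple fakes.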
Deliverables
- Parsing Service: Implemented file parsing logic for PDF, Markdown, DOCX, and TXT formats.
- SQLite Database Schema: Designed schema for courses, lessons, and chunks.
- Ingestion API Integration: Updated API endpoints to handle file upload, parsing, and persistence in SQLite.
- Unit Tests: Tests for parsing logic (e.g., correct extraction of content) and database operations (e.g., saving and querying parsed content).
- Error Handling: Handle invalid files (415 Unsupported Media Type), parsing errors, and database issues gracefully.
Acceptance Criteria
- File Upload:
  - Uploading a supported file triggers the parsing process and stores the parsed content in the SQLite database.
  - Invalid files (wrong format or corrupted) return 415 with a clear error message.
- Database Storage:
  - Data extracted from the files is correctly saved in SQLite (courses, lessons, chunks).
  - Data is available for querying via the system's APIs (e.g., retrieving lessons for a specific course).
- Parsing Validation:
  - Parsed content (e.g., sections, titles, body text) is stored with correct associations to courses and lessons.
  - All text and metadata are stored in a way that supports efficient querying (e.g., for navigation and search).
- Error Handling:
  - Parsing and database failures (e.g., corrupted content, database connection issues) should be handled and logged.
- Testing:
  - Parsing logic is fully unit tested, with coverage for each supported file type.
  - SQLite interactions are tested to ensure data integrity and correct associations.
  - All tests pass in CI.
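As an illustration of the data-integrity criterion, a test along these lines could run against an in-memory database. The table layout and the class name `ChunkAssociationTest` are assumptions for the example, trimmed to two tables to keep it short; the real tests would target the project's actual schema.

```python
import sqlite3
import unittest


class ChunkAssociationTest(unittest.TestCase):
    def setUp(self):
        # A fresh in-memory database per test keeps the tests independent.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("PRAGMA foreign_keys = ON")
        self.conn.executescript(
            """
            CREATE TABLE lessons (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
            CREATE TABLE chunks (
                id INTEGER PRIMARY KEY,
                lesson_id INTEGER NOT NULL REFERENCES lessons(id),
                body TEXT NOT NULL
            );
            """
        )

    def tearDown(self):
        self.conn.close()

    def test_chunk_is_linked_to_its_lesson(self):
        lesson_id = self.conn.execute(
            "INSERT INTO lessons (title) VALUES ('Intro')"
        ).lastrowid
        self.conn.execute(
            "INSERT INTO chunks (lesson_id, body) VALUES (?, ?)",
            (lesson_id, "text"),
        )
        row = self.conn.execute(
            "SELECT l.title FROM chunks c JOIN lessons l ON l.id = c.lesson_id"
        ).fetchone()
        self.assertEqual(row[0], "Intro")

    def test_orphan_chunk_is_rejected(self):
        # With foreign keys enabled, a chunk without a lesson must fail.
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute(
                "INSERT INTO chunks (lesson_id, body) VALUES (999, 'text')"
            )
```

The class can be collected by `python -m unittest` or by pytest, so it should fit whichever runner CI already uses.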