Implement File Parsing and Data Persistence in SQLite #21

### **Task: Implement File Parsing and Data Persistence in SQLite**

#### **Scope**

**1. Parsing Files**
Implement the parsing of uploaded course files (PDF, Markdown, DOCX, TXT) and extract relevant content (titles, sections, text) to populate the study path. This includes the following steps:

* **File Validation**: Ensure that the uploaded files are of a supported type (PDF, MD/Markdown, DOCX, TXT). Check that the declared MIME type and the file extension agree with each other.
* **Parsing**: Use a parsing service or tool to process the files and extract structured data:

  * For PDF, use libraries like `pypdf` or `pdfminer.six`.
  * For Markdown and DOCX, use libraries like `markdown-it-py` and `python-docx`.
  * For TXT files, read and process plain text.
* **Chunking**: Break the parsed content into meaningful chunks (e.g., sections, paragraphs) for easier retrieval later.
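
The validation and parsing steps above could look roughly like the following. This is a minimal sketch, not a settled design: `ALLOWED_TYPES`, `UnsupportedFileType`, `validate_upload`, and `extract_text` are hypothetical names, and the accepted MIME types per extension are an assumption. The `pypdf` and `python-docx` imports are deferred so that plain-text handling works without them installed.

```python
from pathlib import Path

# Hypothetical allow-list: each extension maps to the MIME types accepted for it.
ALLOWED_TYPES = {
    ".pdf": {"application/pdf"},
    ".md": {"text/markdown", "text/x-markdown", "text/plain"},
    ".markdown": {"text/markdown", "text/x-markdown", "text/plain"},
    ".docx": {
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    },
    ".txt": {"text/plain"},
}


class UnsupportedFileType(ValueError):
    """Raised when the extension or MIME type is not accepted."""


def validate_upload(filename: str, mime_type: str) -> str:
    """Return the normalized extension, or raise UnsupportedFileType."""
    ext = Path(filename).suffix.lower()
    allowed = ALLOWED_TYPES.get(ext)
    if allowed is None:
        raise UnsupportedFileType(f"unsupported extension: {ext or '(none)'}")
    # Drop parameters such as "; charset=utf-8" before comparing.
    base_mime = mime_type.split(";", 1)[0].strip().lower()
    if base_mime not in allowed:
        raise UnsupportedFileType(f"MIME type {base_mime!r} does not match {ext}")
    return ext


def extract_text(path: str, ext: str) -> str:
    """Extract plain text; third-party parsers are imported only when needed."""
    if ext == ".pdf":
        from pypdf import PdfReader  # pip install pypdf

        pages = PdfReader(path).pages
        return "\n\n".join(page.extract_text() or "" for page in pages)
    if ext == ".docx":
        from docx import Document  # pip install python-docx

        return "\n\n".join(p.text for p in Document(path).paragraphs)
    # Markdown and TXT are read as-is in this sketch; markdown-it-py could be
    # layered on top when heading-level structure is needed.
    return Path(path).read_text(encoding="utf-8", errors="replace")
```

A caller would run `validate_upload` first and pass the returned extension to `extract_text`, so that unsupported files are rejected before any parser is loaded.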

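For the chunking step, one possible approach is to split the extracted text on blank lines and carry the most recent Markdown-style heading along as the section label. The `Chunk` dataclass and `chunk_text` function below are illustrative assumptions; PDF and DOCX input would need their own heading detection.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    section: str   # heading the paragraph appeared under ("" if none)
    text: str      # paragraph body
    position: int  # zero-based order within the document


# Matches ATX headings such as "# Title" or "### Sub-section".
HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def chunk_text(text: str, default_section: str = "") -> list[Chunk]:
    """Split text into paragraph chunks, tagging each with its section."""
    chunks: list[Chunk] = []
    section = default_section
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if not lines:
            continue
        match = HEADING.match(lines[0])
        if match:
            # A heading starts a new section; any text on the lines directly
            # below it becomes the first chunk of that section.
            section = match.group(1)
            lines = lines[1:]
        body = "\n".join(lines).strip()
        if body:
            chunks.append(Chunk(section, body, len(chunks)))
    return chunks
```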
**2. Persist Data in SQLite**
Once the file has been parsed, save the extracted data in an SQLite database to allow efficient querying and navigation of the study materials:

* **Database Schema**:

  * **Courses**: Store metadata for each course (e.g., title, description, theme).
  * **Lessons**: Store lesson information (e.g., lesson title, content, order).
  * **Chunks**: Store the parsed content chunks (text, section, page number) and link them to lessons.
* **Database Interaction**: Use SQLite to persist parsed content and make it queryable:

  * Store files' metadata, content, and parsing results.
  * Associate parsed content with its respective courses and lessons.
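
As a starting point for discussion, the three tables described above might be declared as follows. Column names beyond those listed in this issue (`id`, `course_id`, `lesson_id`, and `position`, used instead of the reserved word `order`) are assumptions, as are the indexes and the `init_db` helper.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS courses (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    description TEXT,
    theme       TEXT
);
CREATE TABLE IF NOT EXISTS lessons (
    id        INTEGER PRIMARY KEY,
    course_id INTEGER NOT NULL REFERENCES courses(id) ON DELETE CASCADE,
    title     TEXT NOT NULL,
    content   TEXT,
    position  INTEGER NOT NULL  -- lesson order within the course
);
CREATE TABLE IF NOT EXISTS chunks (
    id          INTEGER PRIMARY KEY,
    lesson_id   INTEGER NOT NULL REFERENCES lessons(id) ON DELETE CASCADE,
    section     TEXT,
    page_number INTEGER,
    text        TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_lessons_course ON lessons(course_id, position);
CREATE INDEX IF NOT EXISTS idx_chunks_lesson ON chunks(lesson_id);
"""


def init_db(conn: sqlite3.Connection) -> None:
    """Create the tables and turn on foreign-key enforcement."""
    # SQLite leaves foreign keys unenforced unless enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
```

Calling `init_db(sqlite3.connect("study.db"))` at startup would be enough to bootstrap an empty database; the file name is only an example.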

**3. Integration with Existing Workflow**

* **Ingestion Portal**: Use the file upload endpoint to process and store the files. Once a file is uploaded, it should trigger the parsing process and populate the database with the extracted data.
* **Link Parsing to UI**: Ensure that the data persists in SQLite and is available for use in the study path navigation UI (courses, lessons, and content chunks).
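
To illustrate how the upload handler might hand parsed content to the database, the sketch below writes a course, a lesson, and its chunks in a single transaction, so a failure partway through leaves no partial rows behind. The function names and the reduced table definitions are assumptions made to keep the example self-contained; the real endpoint and schema may differ.

```python
import sqlite3


def create_tables(conn: sqlite3.Connection) -> None:
    """Minimal stand-in tables so the sketch runs on its own."""
    conn.executescript(
        """
        CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE lessons (
            id INTEGER PRIMARY KEY,
            course_id INTEGER NOT NULL REFERENCES courses(id),
            title TEXT NOT NULL,
            position INTEGER NOT NULL
        );
        CREATE TABLE chunks (
            id INTEGER PRIMARY KEY,
            lesson_id INTEGER NOT NULL REFERENCES lessons(id),
            section TEXT,
            text TEXT NOT NULL
        );
        """
    )


def store_parsed_file(conn, course_title, lesson_title, chunks):
    """Persist one parsed file; chunks is a list of (section, text) pairs."""
    with conn:  # commits on success, rolls back if any statement raises
        course_id = conn.execute(
            "INSERT INTO courses (title) VALUES (?)", (course_title,)
        ).lastrowid
        lesson_id = conn.execute(
            "INSERT INTO lessons (course_id, title, position) VALUES (?, ?, ?)",
            (course_id, lesson_title, 0),
        ).lastrowid
        conn.executemany(
            "INSERT INTO chunks (lesson_id, section, text) VALUES (?, ?, ?)",
            [(lesson_id, section, text) for section, text in chunks],
        )
    return course_id, lesson_id


def lessons_for_course(conn, course_id):
    """Query the navigation UI could use: lessons of a course, in order."""
    return conn.execute(
        "SELECT id, title FROM lessons WHERE course_id = ? ORDER BY position",
        (course_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
create_tables(conn)
course_id, lesson_id = store_parsed_file(
    conn,
    "Algebra",
    "Lesson 1",
    [("Intro", "Numbers and sets."), ("Intro", "Basic operations.")],
)
print(lessons_for_course(conn, course_id))
```

Using the connection as a context manager keeps the commit and rollback logic out of the endpoint code.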

---

#### **Deliverables**

* **Parsing Service**: Implemented file parsing logic for PDF, Markdown, DOCX, and TXT formats.
* **SQLite Database Schema**: Designed schema for courses, lessons, and chunks.
* **Ingestion API Integration**: Updated API endpoints to handle file upload, parsing, and persistence in SQLite.
* **Unit Tests**: Tests for parsing logic (e.g., correct extraction of content) and database operations (e.g., saving and querying parsed content).
* **Error Handling**: Handle invalid files (415 Unsupported Media Type), parsing errors, and database issues gracefully.
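
One way to keep error handling consistent across endpoints is a single function that maps exceptions to a status code and a JSON-ready body. The sketch below is framework-agnostic; `UnsupportedFileType`, `ParsingError`, and `error_response` are hypothetical names, and returning `500` for database failures is an assumption, since this issue only specifies `415` for invalid files.

```python
import logging
import sqlite3

logger = logging.getLogger("ingestion")


class UnsupportedFileType(Exception):
    """The upload has an extension or MIME type that is not supported."""


class ParsingError(Exception):
    """The file has a supported type but could not be parsed (e.g. corrupted)."""


def error_response(exc: Exception) -> tuple[int, dict]:
    """Translate an exception into (status_code, response_body) and log it."""
    if isinstance(exc, (UnsupportedFileType, ParsingError)):
        # This issue asks for 415 on both wrong-format and corrupted files.
        logger.warning("rejected upload: %s", exc)
        return 415, {"error": str(exc)}
    if isinstance(exc, sqlite3.Error):
        # Assumption: database failures surface as 500 with a generic message,
        # while the details go to the log only.
        logger.error("database failure during ingestion", exc_info=exc)
        return 500, {"error": "could not store the parsed content"}
    logger.error("unexpected ingestion failure", exc_info=exc)
    return 500, {"error": "unexpected error"}
```

The upload endpoint would wrap its validate, parse, and persist calls in one `try` block and return whatever `error_response` produces.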

---

#### **Acceptance Criteria**

1. **File Upload**:

   * Uploading a supported file triggers the parsing process and stores the parsed content in the SQLite database.
   * Invalid files (wrong format or corrupted) return `415 Unsupported Media Type` with a clear error message.
2. **Database Storage**:

   * Data extracted from the files is correctly saved in SQLite (courses, lessons, chunks).
   * Data is available for querying via the system's APIs (e.g., retrieving lessons for a specific course).
3. **Parsing Validation**:

   * Parsed content (e.g., sections, titles, body text) is stored with correct associations to courses and lessons.
   * All text and metadata are stored in a way that supports efficient querying (e.g., for navigation and search).
4. **Error Handling**:

   * Parsing and database failures (e.g., corrupted content, database connection issues) are handled gracefully and logged.
5. **Testing**:

   * Parsing logic is fully unit tested, with coverage for each supported file type.
   * SQLite interactions are tested to ensure data integrity and correct associations.
   * All tests pass in CI.
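
As an illustration of the testing criteria, the tests could follow the pattern below, using the standard `unittest` module and an in-memory SQLite database so that no files are left behind in CI. `parse_txt` and the single `chunks` table are simplified stand-ins, not the project's actual parser or schema.

```python
import sqlite3
import unittest


def parse_txt(raw: bytes) -> list[str]:
    """Hypothetical TXT parser: decode the bytes and split into paragraphs."""
    text = raw.decode("utf-8", errors="replace")
    return [p.strip() for p in text.split("\n\n") if p.strip()]


class ParseTxtTests(unittest.TestCase):
    def test_splits_paragraphs(self):
        self.assertEqual(parse_txt(b"One.\n\nTwo.\n"), ["One.", "Two."])

    def test_empty_input_yields_no_chunks(self):
        self.assertEqual(parse_txt(b""), [])


class ChunkStorageTests(unittest.TestCase):
    def setUp(self):
        # A fresh in-memory database per test keeps the tests independent.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE chunks ("
            "id INTEGER PRIMARY KEY, lesson_id INTEGER NOT NULL, text TEXT NOT NULL)"
        )

    def tearDown(self):
        self.conn.close()

    def test_chunks_round_trip_in_order(self):
        paragraphs = parse_txt(b"One.\n\nTwo.")
        with self.conn:
            self.conn.executemany(
                "INSERT INTO chunks (lesson_id, text) VALUES (?, ?)",
                [(1, p) for p in paragraphs],
            )
        rows = self.conn.execute(
            "SELECT text FROM chunks WHERE lesson_id = 1 ORDER BY id"
        ).fetchall()
        self.assertEqual([r[0] for r in rows], paragraphs)

    def test_missing_text_is_rejected(self):
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute("INSERT INTO chunks (lesson_id, text) VALUES (1, NULL)")
```

Saved in a test module, these cases run with `python -m unittest`; equivalent cases for PDF, Markdown, and DOCX would need small fixture files.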
