This repository contains coursework developed for the Database Systems Implementation course at the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens (NKUA).
The project focuses on implementing an external merge sort algorithm for large datasets stored in heap files, using the infrastructure developed in previous assignments.
The goal of this project is to understand how database systems perform sorting when the dataset does not fit entirely in main memory.
The implementation follows the external merge sort algorithm, which consists of:
-
Sorting phase
The input file is divided into smaller chunks (runs), each of which is sorted independently. -
Merge phase
Sorted chunks are merged iteratively using a multi-way merge process until a single fully sorted file is produced.
The sorting is performed based on:
- primary key: name
- secondary key: surname
The implementation builds upon the heap file abstraction developed in previous assignments and operates at the block level using the BF layer.
The project includes:
- sorting records within chunks (
sort_functions) - merging sorted chunks (
merge_functions) - iteration over chunks using custom iterators (
CHUNK_structures) - support for multi-way merge strategies
- The algorithm operates on fixed-size blocks, simulating disk-based processing.
- Chunk-based processing ensures that only a limited portion of data is loaded into memory at any time.
- The merge phase supports configurable b-way merging, affecting the number of passes required.
The project includes a Makefile for compilation and execution.
make sort
make run
The execution demonstrates:
- chunk-based sorting
- multi-pass merging
- correct ordering of records
This project was developed using a hybrid approach, combining manual implementation with assistance from large language models (ChatGPT).
The workflow included:
- generating initial implementations of functions
- refining and debugging the generated code
- integrating the code with the existing system (heap file & BF layer)
- correcting issues related to memory management (e.g., proper use of unpin operations)
Relevant conversations:
- ChatGPT 4.0: https://chatgpt.com/share/677be019-963c-8007-bf9a-585c9c53c7ac
- ChatGPT 3.5: https://chatgpt.com/share/677be033-f3c0-8007-a9e7-fcce767231f4
- C
- External sorting algorithms
- Block-based storage management
- Heap file abstraction
- File I/O