Skip to content

PoiLson/external-merge-sort-db

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

External Merge Sort for Database Systems

About

This repository contains coursework developed for the Database Systems Implementation course at the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens (NKUA).

The project focuses on implementing an external merge sort algorithm for large datasets stored in heap files, using the infrastructure developed in previous assignments.

Authors

Main Idea

The goal of this project is to understand how database systems perform sorting when the dataset does not fit entirely in main memory.

The implementation follows the external merge sort algorithm, which consists of:

  1. Sorting phase
    The input file is divided into smaller chunks (runs), each of which is sorted independently.

  2. Merge phase
    Sorted chunks are merged iteratively using a multi-way merge process until a single fully sorted file is produced.

The sorting is performed based on:

  • primary key: name
  • secondary key: surname

The implementation builds upon the heap file abstraction developed in previous assignments and operates at the block level using the BF layer.

Functionality

The project includes:

  • sorting records within chunks (sort_ functions)
  • merging sorted chunks (merge_ functions)
  • iteration over chunks using custom iterators (CHUNK_ structures)
  • support for multi-way merge strategies

Design Notes

  • The algorithm operates on fixed-size blocks, simulating disk-based processing.
  • Chunk-based processing ensures that only a limited portion of data is loaded into memory at any time.
  • The merge phase supports configurable b-way merging, affecting the number of passes required.

Usage

The project includes a Makefile for compilation and execution.

Compile sorting executable

make sort

Run the program

make run

The execution demonstrates:

  • chunk-based sorting
  • multi-pass merging
  • correct ordering of records

LLM-Assisted Development

This project was developed using a hybrid approach, combining manual implementation with assistance from large language models (ChatGPT).

The workflow included:

  • generating initial implementations of functions
  • refining and debugging the generated code
  • integrating the code with the existing system (heap file & BF layer)
  • correcting issues related to memory management (e.g., proper use of unpin operations)

Relevant conversations:

Technologies

  • C
  • External sorting algorithms
  • Block-based storage management
  • Heap file abstraction
  • File I/O

About

External merge sort implementation for large datasets using block-based storage and multi-way merging techniques

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors