fhash is a high-performance command-line tool designed to recursively scan directories, calculate MD5 hashes of files, and store the metadata in a SQLite database.
A key feature of fhash is its ability to calculate the MD5 hash of audio streams only, ignoring container-level metadata and tags (like ID3). This allows you to identify duplicate audio content even if the files have different filenames or tags. It is primarily intended for finding the exact same stream data in different containers; re-encoded audio (for example, at a different bitrate) should not be expected to produce the same hash.
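To illustrate why hashing only the audio payload makes differently tagged files compare equal, here is a minimal Python sketch. It is not how fhash works internally (fhash relies on FFmpeg to extract the audio stream); as a simplified stand-in, it only strips ID3v2 and ID3v1 tags from an MP3-like byte string before hashing. The function names and the fake file contents are hypothetical.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag or a trailing ID3v1 tag."""
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, then a 4-byte
    # synchsafe size (only the low 7 bits of each byte are used).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        footer = 10 if data[5] & 0x10 else 0  # optional 10-byte footer
        data = data[10 + size + footer:]
    # ID3v1: fixed 128-byte block at the end of the file, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]
    return data


def payload_md5(data: bytes) -> str:
    """MD5 of the bytes that remain after the tags are stripped."""
    return hashlib.md5(strip_id3(data)).hexdigest()


def file_md5(data: bytes) -> str:
    """MD5 of the whole file, tags included."""
    return hashlib.md5(data).hexdigest()


# Two fake "files": identical payload, different tags.
payload = b"\xff\xfb\x90\x00" + b"audio-frames" * 4
tag_a = b"ID3\x04\x00\x00" + bytes([0, 0, 0, 5]) + b"AAAAA"
tag_b = b"ID3\x04\x00\x00" + bytes([0, 0, 0, 3]) + b"BBB"
file_a = tag_a + payload
file_b = tag_b + payload + b"TAG" + b" " * 125

same_file_hash = file_md5(file_a) == file_md5(file_b)
same_payload_hash = payload_md5(file_a) == payload_md5(file_b)
print(same_file_hash, same_payload_hash)
```

The full-file hashes differ because the tags differ, while the payload hashes agree. This corresponds to the difference between the md5 and audio_md5 values that fhash stores.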
- Recursive Scanning: Efficiently traverses directory trees.
- SQLite Storage: Saves file paths, sizes, timestamps, and hashes for easy querying.
- File Hashing: Calculates standard MD5 hashes for the entire file.
- Audio Hashing: Uses FFmpeg to extract and hash only the audio data, bypassing metadata.
- Audio Stream Validation: The check command decodes embedded audio streams to detect missing data/corruption.
- Batch Processing: Uses SQLite transactions for high-speed indexing.
- Incremental Updates: Compares file size and mtime against the stored row, skipping unchanged files and re-indexing changed ones; -f forces re-indexing regardless.
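The batch-processing and incremental-update features above can be sketched as follows. This is an illustration under stated assumptions rather than fhash's actual code: the table is reduced to three columns, and needs_scan and index_batch are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (filepath TEXT UNIQUE, filesize INTEGER, "
    "modified_timestamp INTEGER)"
)


def needs_scan(conn, path, size, mtime, force=False):
    """True if the file is new, its size or mtime changed, or force is set."""
    if force:
        return True
    row = conn.execute(
        "SELECT filesize, modified_timestamp FROM files WHERE filepath = ?",
        (path,),
    ).fetchone()
    return row is None or row != (size, mtime)


def index_batch(conn, entries, force=False):
    """Index (path, size, mtime) tuples inside a single transaction."""
    indexed = 0
    with conn:  # one commit for the whole batch, rollback on error
        for path, size, mtime in entries:
            if not needs_scan(conn, path, size, mtime, force):
                continue  # unchanged: skip the (expensive) hashing step
            conn.execute(
                "INSERT OR REPLACE INTO files "
                "(filepath, filesize, modified_timestamp) VALUES (?, ?, ?)",
                (path, size, mtime),
            )
            indexed += 1
    return indexed


first = index_batch(conn, [("/m/a.mp3", 100, 1), ("/m/b.mp3", 200, 2)])
second = index_batch(conn, [("/m/a.mp3", 100, 1), ("/m/b.mp3", 250, 3)])
forced = index_batch(conn, [("/m/a.mp3", 100, 1)], force=True)
print(first, second, forced)
```

Committing once per batch instead of once per file is what makes SQLite indexing fast, since each commit otherwise has to be flushed to disk.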
To build fhash, you need a C compiler (gcc) and the following libraries installed on your system:
- SQLite3: For database storage.
- OpenSSL: For MD5 calculation.
- FFmpeg (libavformat, libavcodec, libavutil): For audio stream processing.
On Debian/Ubuntu-based systems, you can install these with:

    sudo apt-get install build-essential libsqlite3-dev libssl-dev libavformat-dev libavcodec-dev libavutil-dev

Use the provided makefile to build the application:

    make

This will produce an executable named fhash.
To install fhash system-wide on a *nix system:

    sudo make install

This will install the binary to /usr/local/bin/fhash. To uninstall:

    sudo make uninstall

Usage:

    ./fhash scan [options]
    ./fhash check [options]
    ./fhash dupe (-xa<n> | -xh<n>) [options]
    ./fhash link (-xa<n> | -xh<n>) -l{mode} [options]

- -help: Show help text.
- -v: Verbose output (default: OFF).
- scan options: -s <startpath> (default .), -e <extlist>, -r, -h, -a, -f.
- check options: -s <startpath> (default .), -e <extlist>, -r, -f. Validates embedded audio streams and stores integer results in files.audio_check_result. -f forces re-check even when previously checked.
- dupe options: -xa<n> (audio hash) or -xh<n> (file hash), optional min group size n (default 2).
- link options: same as dupe plus -l{mode} to replace duplicates with hard-links to a master selected by mode (s = shallowest path, d = deepest path, m = most metadata, o = oldest, n = newest).
- Shared options: -d <dbpath> (default ./file_hashes.db), -v verbose.
- Path filters (scan/check/dupe/link): -s <startpath>, -r, -e <extlist>.
- -dry applies to link (and is accepted globally).
- -r: recurse into subdirectories for scan/check; for dupe/link, recurse within the -s path filter instead of matching only immediate children.
- -f: force processing. In scan, re-index even if file size/mtime is unchanged. In check, force re-validation even if audio_check_result is already set.
- -a: scan only. Calculate and store audio_md5 (audio-stream MD5).
- -h: scan only. Calculate and store md5 (full-file MD5).
- -xa<n>: dupe/link only. Use audio_md5 to find duplicate groups, with optional minimum group size n (default 2).
- -xh<n>: dupe/link only. Use md5 to find duplicate groups, with optional minimum group size n (default 2).
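As a rough illustration of how the -s, -r, and -e path filters narrow a dupe or link query, the following Python sketch applies the documented rules to stored absolute paths. The function matches_filter is hypothetical, and the exact, case-sensitive extension comparison is an assumption, not something the option descriptions specify.

```python
import posixpath


def matches_filter(filepath, start, recursive=False, exts=None):
    """Apply -s (start), -r (recursive) and -e (exts) to one stored path."""
    start = posixpath.normpath(start)
    parent = posixpath.dirname(filepath)
    if recursive:
        # -r: anything below the start path matches.
        in_scope = parent == start or parent.startswith(start.rstrip("/") + "/")
    else:
        # Without -r: only immediate children of the start path match.
        in_scope = parent == start
    if not in_scope:
        return False
    if exts is None:
        return True
    # Assumption: extensions are compared exactly, without the leading dot.
    ext = posixpath.splitext(filepath)[1].lstrip(".")
    return ext in exts


paths = ["/music/a.mp3", "/music/rock/b.flac", "/musical/c.mp3", "/music/d.txt"]
flat = [p for p in paths if matches_filter(p, "/music")]
deep = [p for p in paths if matches_filter(p, "/music", recursive=True)]
audio = [p for p in paths if matches_filter(p, "/music", True, {"mp3", "flac"})]
print(flat, deep, audio)
```

Note that the prefix test appends a path separator before comparing, so a sibling directory such as /musical is not mistaken for a child of /music.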
Duplicate/Link notes
- dupe and link commands use existing DB contents; they respect -s/-r/-e as filters on the query. Without -r, filtering by -s is limited to that directory only.
- -xa and -xh are mutually exclusive.
- -l is only valid with the link command.
- -dry is global; in link mode it prints planned links without changing files or DB rows.
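The master selection performed by link, together with a -dry style plan, might look roughly like the sketch below. It rests on several assumptions: path depth is counted as the number of "/" separators, oldest/newest is judged by modified_timestamp, ties are broken by path, and mode m (most metadata) is left out because its ranking is not defined here. Entry, pick_master, and plan_links are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    filepath: str
    modified_timestamp: int


def _depth(entry):
    # Assumption: depth is the number of path separators.
    return entry.filepath.count("/")


# Sort keys per mode; the lowest key wins. Ties fall back to the path.
_KEYS = {
    "s": lambda e: (_depth(e), e.filepath),             # shallowest path
    "d": lambda e: (-_depth(e), e.filepath),            # deepest path
    "o": lambda e: (e.modified_timestamp, e.filepath),  # oldest
    "n": lambda e: (-e.modified_timestamp, e.filepath), # newest
}


def pick_master(group, mode):
    """Choose the file to keep from one duplicate group."""
    if mode not in _KEYS:
        raise ValueError(f"unsupported mode in this sketch: {mode!r}")
    return min(group, key=_KEYS[mode])


def plan_links(group, mode):
    """Dry run: list (duplicate, master) pairs without touching any file."""
    master = pick_master(group, mode)
    return [(e.filepath, master.filepath) for e in group if e is not master]


group = [
    Entry("/m/a.mp3", 30),
    Entry("/m/x/y/a.mp3", 10),
    Entry("/m/x/b.mp3", 20),
]
for dup, master in plan_links(group, "s"):
    print(f"would link {dup} -> {master}")
```

Separating the plan from the action keeps the dry run and the real pass on the same code path, so what -dry prints is what a real run would link.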
Examples:

Scan and hash a music folder:

    ./fhash scan -s ~/Music -e mp3,flac -h -a -r

Validate embedded audio streams only:

    ./fhash check -s ~/Music -e mp3,flac -r

List file-hash duplicates (min group 3) under a path:

    ./fhash dupe -xh3 -s ~/Music -r

Dry-run a link pass using the shallowest path as the keeper, limited to txt files:

    ./fhash link -xh2 -ls -s ./docs -r -e txt -dry

fhash stores results in a SQLite database with two tables:
- files: Indexed items and their metadata.
  - id (INTEGER PRIMARY KEY AUTOINCREMENT)
  - md5 (TEXT): Full-file MD5 hash (Not calculated if skipped, 0-byte-file if size was zero).
  - audio_md5 (TEXT): Audio-only MD5 hash (Not calculated if skipped, 0-byte-file if size was zero, Bad audio on FFmpeg/audio errors).
  - filepath (TEXT UNIQUE): Absolute path.
  - filename (TEXT): Basename of the file.
  - extension (TEXT): Extension without dot.
  - filesize (INTEGER): Size in bytes.
  - last_check_timestamp (TIMESTAMP): Last time fhash scanned/linked this entry.
  - modified_timestamp (INTEGER): File modification time (st_mtime) seen during last scan.
  - filetype (TEXT, 1 char): F = regular file, L = hard link, D = directory.
  - audio_check_result (INTEGER): Audio stream validation result enum.
    - 0 = good
    - 1 = no audio data
    - 2 = missing chunks
    - 3 = corrupted audio stream
    - 4 = not checked
- sys: Key/value metadata for the database.
  - version: Application version recorded in the DB.
  - db_version: Schema version recorded in the DB.
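For readers who want to query the database from a script, the two tables can be approximated as shown below. The DDL is reconstructed from the column descriptions and is not copied from fhash's source; in particular, the key/value column names of sys and the DEFAULT 4 on audio_check_result are assumptions, as is the AudioCheck enum name.

```python
import sqlite3
from enum import IntEnum


class AudioCheck(IntEnum):
    """Values documented for files.audio_check_result."""
    GOOD = 0
    NO_AUDIO_DATA = 1
    MISSING_CHUNKS = 2
    CORRUPTED_STREAM = 3
    NOT_CHECKED = 4


# Assumed schema, rebuilt from the documented columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    md5 TEXT,
    audio_md5 TEXT,
    filepath TEXT UNIQUE,
    filename TEXT,
    extension TEXT,
    filesize INTEGER,
    last_check_timestamp TIMESTAMP,
    modified_timestamp INTEGER,
    filetype TEXT,
    audio_check_result INTEGER DEFAULT 4
);
CREATE TABLE IF NOT EXISTS sys (
    key TEXT PRIMARY KEY,   -- assumed column name
    value TEXT              -- assumed column name
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO files (filepath, filename, extension, filesize, filetype) "
    "VALUES (?, ?, ?, ?, ?)",
    ("/music/a.mp3", "a.mp3", "mp3", 1234, "F"),
)
raw = conn.execute("SELECT audio_check_result FROM files").fetchone()[0]
status = AudioCheck(raw)
print(status.name)
```

Mapping the stored integer to a named enum member keeps report scripts readable without changing what is stored.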
fhash initializes sys on first run and validates version/db_version on startup before scan, check, dupe, or link.
When opening a legacy 1.0 DB, fhash 1.01 migrates it in-place by adding the audio_check_result column (default 4 = not checked), then backfills it from the legacy sentinels: rows with a 0-byte-file hash get audio_check_result 1 (no audio data), and rows with a Bad audio hash get 3 (corrupted audio stream).
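The migration described above can be approximated with the following Python sketch. It is a hedged reconstruction, not fhash's migration code: the function name migrate_legacy is hypothetical, the legacy table is reduced to three columns, and checking both md5 and audio_md5 for the sentinels is an assumption about what "any hash" covers.

```python
import sqlite3


def migrate_legacy(conn):
    """Add audio_check_result and backfill it; return False if already done."""
    cols = [row[1] for row in conn.execute("PRAGMA table_info(files)")]
    if "audio_check_result" in cols:
        return False
    with conn:
        conn.execute(
            "ALTER TABLE files ADD COLUMN audio_check_result INTEGER DEFAULT 4"
        )
        # Legacy sentinel for empty files -> 1 (no audio data).
        conn.execute(
            "UPDATE files SET audio_check_result = 1 "
            "WHERE md5 = '0-byte-file' OR audio_md5 = '0-byte-file'"
        )
        # Legacy sentinel for FFmpeg/audio errors -> 3 (corrupted stream).
        conn.execute(
            "UPDATE files SET audio_check_result = 3 "
            "WHERE md5 = 'Bad audio' OR audio_md5 = 'Bad audio'"
        )
    return True


# A reduced stand-in for a legacy 1.0 database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (filepath TEXT UNIQUE, md5 TEXT, audio_md5 TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("/m/ok.mp3", "abc", "def"),
        ("/m/empty.mp3", "0-byte-file", "0-byte-file"),
        ("/m/broken.mp3", "123", "Bad audio"),
    ],
)
migrated = migrate_legacy(conn)
migrated_again = migrate_legacy(conn)
results = dict(conn.execute("SELECT filepath, audio_check_result FROM files"))
print(migrated, migrated_again, results)
```

Checking for the column first makes the migration safe to run on every startup, since a database that has already been upgraded is left untouched.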
Index all MP3 and FLAC files in a folder, calculating both file and audio hashes:

    ./fhash scan -s ~/Music -e mp3,flac -h -a -r -v

Find duplicate audio content using the database (sentinel values are excluded so that they do not show up as duplicate groups):

    sqlite3 file_hashes.db "SELECT audio_md5, COUNT(*) c FROM files WHERE audio_md5 NOT IN ('Not calculated', '0-byte-file', 'Bad audio') GROUP BY audio_md5 HAVING c > 1;"

This project is intended for personal or educational use.