Evidence for hierarchical representations of written and spoken words from an open-science human neuroimaging dataset

Tables and text files

SUBTLEX-NL with pos and Zipf.xlsx Contains word frequency measures of Dutch words in the SUBTLEX-NL database, a more recent source of Dutch word frequencies than CELEX. Zipf contains log-transformed values of FREQCOUNT, the number of word occurrences in the corpus. For more information, visit OSF.
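As a rough illustration of the log transform behind the Zipf column, the sketch below applies the standard Zipf scale (log10 of frequency per million words, plus 3). The function name and the `corpus_tokens` argument are assumptions made for this example; the exact smoothing used to produce the spreadsheet values may differ.

```python
import math


def zipf_from_count(freq_count, corpus_tokens):
    """Convert a raw occurrence count to the Zipf scale.

    Zipf = log10(frequency per million words) + 3.
    `corpus_tokens` (total word tokens in the corpus) is an assumed
    input for this sketch; it is not taken from the spreadsheet.
    """
    if freq_count <= 0 or corpus_tokens <= 0:
        raise ValueError("counts must be positive")
    per_million = freq_count / (corpus_tokens / 1_000_000)
    return math.log10(per_million) + 3


if __name__ == "__main__":
    # A word seen 1,000 times in a hypothetical 1-billion-token corpus.
    print(zipf_from_count(1_000, 1_000_000_000))
```

On this scale a value of 3 corresponds to one occurrence per million words, so each unit step is a tenfold change in frequency.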

subtlex_v2_cleaned_no_drop3.xlsx The raw SUBTLEX database screened for entries that are not real words, with a column suggesting whether to keep or drop each entry.

MOUS_audio_onset_offsets.xlsx Onset times of words in each audio file played in the speech listening part of the experiment.

MOUS_audio_onset_offsets_with_duration.csv Includes the duration of each spoken word in seconds.

stimuli.txt The sentences and word lists used in both the reading and speech listening experiments of the MOUS study.

bigram_counts.csv Cumulative bigram occurrences (per million) in the SUBTLEX text corpus.
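One way cumulative bigram counts of this kind could be tabulated is sketched below: each word contributes its corpus count to every letter bigram it contains. The function names, the toy input, and the per-million normalisation by total corpus tokens are assumptions for illustration, not the repository's actual procedure.

```python
from collections import Counter


def cumulative_bigram_counts(word_counts):
    """Sum word occurrence counts over every letter bigram in each word.

    `word_counts` maps a word to its occurrence count in the corpus.
    A bigram that appears twice in one word is counted twice here,
    which is an assumption of this sketch.
    """
    totals = Counter()
    for word, count in word_counts.items():
        word = word.lower()
        for i in range(len(word) - 1):
            totals[word[i:i + 2]] += count
    return totals


def per_million(totals, corpus_tokens):
    """Rescale raw totals to occurrences per million corpus tokens."""
    scale = 1_000_000 / corpus_tokens
    return {bigram: n * scale for bigram, n in totals.items()}


if __name__ == "__main__":
    toy = {"de": 10, "den": 5}  # hypothetical counts
    print(cumulative_bigram_counts(toy))
```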

CELEX (used to confirm validity of syllabified IPA transcriptions)

dutch_celex_database_updatedv2.csv Contains phonetic pronunciations of Dutch words in the CELEX database. For more information, see supplement to Sun & Poeppel 2023.

subtlex_phonetics.xlsx The intersection of the CELEX and SUBTLEX databases; contains phonetics and occurrence counts of most words in Dutch.

syllable_counts.csv Cumulative CELEX syllable occurrences (per million) in the SUBTLEX text corpus.

eSpeakNG (used to generate IPA syllabifications)

subtlex_v3_IPA_syllables_ijfix2.csv The cleaned SUBTLEX database, now including IPA syllabifications.

IPA_individual_syllable_frequencies_ijfix2.csv Tabulates frequency counts of every unique Dutch syllable generated from running eSpeakNG on the SUBTLEX database.

merged-IPA_CELEX.csv Merges the above with CELEX to yield a side-by-side comparison of the two syllabification schemes.

MOUS_IPA_transcriptions_ijfix2.csv Includes syllabified IPA transcriptions of all study words.

MOUS_IPA_SyllableFrequencies_ijfix2.csv Includes syllabified IPA transcriptions and frequency statistics (mean, min, max) of all study words.
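The per-word summary statistics (mean, min, max) can be sketched as a lookup of each syllable in a frequency table followed by simple aggregation. The function name, the toy table, and the choice to treat unseen syllables as frequency 0 are assumptions for this example.

```python
from statistics import mean


def syllable_frequency_stats(syllables, frequency_table):
    """Summarise the frequencies of a word's syllables.

    `syllables` is the word's syllabified form as a list of strings;
    `frequency_table` maps a syllable to its frequency. Syllables
    missing from the table count as 0 (an assumption of this sketch).
    """
    if not syllables:
        raise ValueError("word has no syllables")
    freqs = [frequency_table.get(s, 0) for s in syllables]
    return {"mean": mean(freqs), "min": min(freqs), "max": max(freqs)}


if __name__ == "__main__":
    table = {"wa": 40, "ter": 100}  # hypothetical frequencies
    print(syllable_frequency_stats(["wa", "ter"], table))
```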

n_syllable_conflict_ijfix2.csv Lists words whose IPA-syllabified forms differ from CELEX in the number of syllables.

Code

master_table IPA.ipynb Generates bigram, syllable, and word frequency statistics for every word presented in the MOUS experiments.

Auditory - Word Frequency

source_auditory_trancription.py Takes in an auditory subject's events.tsv file and an output filename and tabulates the onset times and words played during that subject's scan. Generates transcription files that are saved in each subject's source subdirectory, e.g. sub-A2002_transcription.csv.
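A minimal sketch of this kind of conversion, using only the standard library, is given below. BIDS events files are tab-separated with an `onset` column and use `n/a` for missing values; the `word` column name and the output layout are hypothetical and may not match the actual script or dataset.

```python
import csv
import io


def events_to_transcription(events_file, onset_col="onset", word_col="word"):
    """Collect (onset, word) rows from a tab-separated events file.

    `word_col` is a hypothetical column name; rows with an empty or
    'n/a' entry in that column are skipped.
    """
    rows = []
    for row in csv.DictReader(events_file, delimiter="\t"):
        word = (row.get(word_col) or "").strip()
        if word and word != "n/a":
            rows.append({"onset": float(row[onset_col]), "word": word})
    return rows


def write_transcription(rows, out_file):
    """Write the collected rows as CSV with an onset,word header."""
    writer = csv.DictWriter(out_file, fieldnames=["onset", "word"])
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    demo = io.StringIO("onset\tduration\tword\n0.5\t0.3\thet\n1.4\t0.4\thuis\n")
    out = io.StringIO()
    write_transcription(events_to_transcription(demo), out)
    print(out.getvalue())
```

With real files, the same functions would be called on handles from `open(path, newline="")`.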

source_auditory_transcription_loop.ipynb Runs the above over all auditory subjects.

SPM_auditory_word_frequency_1st_level.m Runs SPM12 first-level analysis for Word Frequency across all auditory subjects. For a primer on this technique, see Andy's Brain Book.

SPM_auditory_word_frequency_2nd_level.m Runs SPM12 group-level analysis for word frequency.

SPM_auditory_word_frequency_1st_level_Positive.m Tests for a positive correlation with word frequency.

Auditory - Syllable Frequency

eSpeakNG_IPA.py Functions to generate and parallelize command-line calls to the eSpeakNG text-to-speech engine.
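For orientation, a command-line call of this kind might be generated and parallelized as in the sketch below. It assumes the `espeak-ng` executable is on the PATH and uses its `-q` (no audio), `--ipa` (print IPA phonemes) and `-v` (voice) options; the function names and the thread-pool approach are illustrative and not necessarily what eSpeakNG_IPA.py does.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(word, voice="nl"):
    """Build the argument list for one espeak-ng IPA transcription."""
    # -q: suppress audio output; --ipa: print phonemes as IPA; -v: voice
    return ["espeak-ng", "-q", "--ipa", "-v", voice, word]


def transcribe(word, voice="nl"):
    """Run espeak-ng for a single word and return its IPA output."""
    result = subprocess.run(
        build_command(word, voice),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def transcribe_many(words, voice="nl", workers=4):
    """Transcribe several words concurrently; returns {word: ipa}."""
    words = list(words)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        ipa = list(pool.map(lambda w: transcribe(w, voice), words))
    return dict(zip(words, ipa))
```

Threads are sufficient here because the work happens in the external process; each call simply waits on `espeak-ng`.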

run_subtlex_IPA_syllables_chunks.py Runs eSpeakNG on the SUBTLEX database.

syllabify_ipa_nl.py Functions called upon by the above script that split an IPA transcription into its constituent syllables on the basis of syllabification rules.

celex_vs_IPA_script.py Generates regressor files for each subject.

celex_vs_IPA.ipynb Compares the eSpeak-generated syllabifications with CELEX, finding that 99% of the IPA-syllabified SUBTLEX words agree with CELEX in number of syllables.
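The syllable-count comparison between the two syllabification schemes can be sketched as follows. The inputs are assumed to be dictionaries mapping each word to a list of syllables; the function name and the toy entries are placeholders, not real transcriptions from either database.

```python
def syllable_count_agreement(espeak_syllables, celex_syllables):
    """Compare syllable counts for words present in both sources.

    Returns (agreement_rate, conflicts), where `conflicts` lists the
    shared words whose number of syllables differs.
    """
    shared = sorted(set(espeak_syllables) & set(celex_syllables))
    conflicts = [
        w for w in shared
        if len(espeak_syllables[w]) != len(celex_syllables[w])
    ]
    if not shared:
        return float("nan"), conflicts
    return 1 - len(conflicts) / len(shared), conflicts


if __name__ == "__main__":
    # Toy placeholder syllabifications, not real database entries.
    espeak = {"aaa": ["a", "aa"], "bbb": ["bbb"], "ccc": ["c", "cc"]}
    celex = {"aaa": ["aa", "a"], "bbb": ["bbb"], "ccc": ["c", "c", "c"]}
    print(syllable_count_agreement(espeak, celex))
```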

SPM_auditory_syllable_frequency_1st_level_IPA.m Runs SPM12 first-level analysis for Syllable Frequency across all auditory subjects.

SPM_auditory_syllable_frequency_2nd_level.m Runs group-level analysis for syllable frequency.

SPM_auditory_syllable_max_mean_frequency_1st_level_IPA.m Tests max/mean syllable frequency as an alternate parameter.

Visual - Word Frequency

source_visual_transcription.m Converts an events.tsv file to a cleaned CSV containing onset time and word presented.

source_visual_transcription_loop.ipynb Runs the above function in a loop over all visual subjects.

calculate_word_frequencies_visual.ipynb Generates CSV files containing both word frequency and minimum bigram frequency info for all words in the study.

SPM_visual_word_frequency_1st_level.m Runs SPM12 first-level analysis for Word Frequency across all visual subjects.

SPM_visual_word_frequency_2nd_level.m Runs SPM12 group-level analysis for Word Frequency.

SPM_visual_word_frequency_1st_level_Positive.m Tests for a positive correlation with word frequency.

Visual - Bigram Frequency

SPM_visual_bigram_frequency_1st_level.m Runs SPM12 first-level analysis for Bigram Frequency across all visual subjects.

SPM_visual_bigram_frequency_2nd_level.m Runs SPM12 group-level analysis for bigram frequency.

SPM_visual_max_bigram_frequency_1st_level.m and SPM_visual_mean_bigram_frequency_1st_level.m test for correlations with max/mean bigram frequency respectively.

/figures

`cluster_separation.ipynb` Thresholds a T-map, reports cluster peaks and calculates locations of cluster centers of mass.

`min_sublexical_vs_zipf_frequency.ipynb` Generates scatterplots (reported in supplement) comparing different frequency statistics.

`peak-scale_tmap.ipynb` Divides every voxel in a T-map by the map's peak T-statistic, turning the T-map into a 'percent of peak activation' (PPA) map.
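The peak-scaling step itself is a single division, sketched here on a plain NumPy array. Loading and saving the NIfTI image is left out, and the function name is an assumption; the result is a fraction of the peak, which can be multiplied by 100 to read as a percentage.

```python
import numpy as np


def peak_scale(tmap):
    """Divide every voxel by the largest T value in the map."""
    tmap = np.asarray(tmap, dtype=float)
    peak = np.nanmax(tmap)
    if not np.isfinite(peak) or peak <= 0:
        raise ValueError("T-map has no positive finite peak")
    return tmap / peak


if __name__ == "__main__":
    demo = np.array([[1.0, 2.0], [4.0, -2.0]])  # toy values
    print(peak_scale(demo))
```

Because each map is scaled by its own peak, two contrasts with different T ranges end up on a common scale, which is what makes side-by-side comparison practical.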

`LR_tmap_split_Ke.ipynb` Splits a T-map down the middle into left and right halves and applies extent thresholding. Makes for easier visualizations.
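A hedged sketch of the splitting step on a NumPy volume is shown below. It assumes the first array axis runs left to right and that the midline sits at the middle index; which half is anatomically left depends on the image orientation, and the extent thresholding step is not shown.

```python
import numpy as np


def split_along_first_axis(volume):
    """Return two copies of `volume`, each zeroed on one side.

    The first copy keeps indices below the midpoint of axis 0 and the
    second keeps the rest. With an odd-sized axis, the middle slice
    goes to the second copy.
    """
    volume = np.asarray(volume, dtype=float)
    mid = volume.shape[0] // 2
    low_half = volume.copy()
    low_half[mid:, ...] = 0
    high_half = volume.copy()
    high_half[:mid, ...] = 0
    return low_half, high_half


if __name__ == "__main__":
    demo = np.arange(1, 9, dtype=float).reshape(4, 2, 1)  # toy volume
    low, high = split_along_first_axis(demo)
    print(low[..., 0])
    print(high[..., 0])
```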

`mricrogl_renderings.py` Script that generates 3D renderings of T-maps. To run this script and those that follow, paste the code into the scripting interface of [MRIcroGL](https://www.nitrc.org/projects/mricrogl).

`mricrogl_LR_renderings.py` Fetches left/right views of a T-map and renders them separately.

`mricrogl_dual-contrast_mosaic_axial.py` Juxtaposes two PPA maps in axial slices.

`mricrogl_dual-contrast_mosaic_sagittal.py` Does the above in sagittal slices.
