A competitive leaderboard for curated seed datasets of IS630/Tc1/mariner (ITm) transposable elements.
This is Stage 1 of a project to train a DNA language model on the ITm transposon superfamily. The goal is to assemble a high-quality, phylogenetically diverse, well-annotated seed collection of autonomous ITm transposons with strong evidence of functionality.
Teams submit curated entries. A CI pipeline validates each submission and assigns a score. The leaderboard ranks all submissions.
The easiest way to submit is with the included submit.py script, which validates locally, creates a PR, and monitors CI:

    # Install dependencies
    pip install -r ci/requirements.txt

    # Validate and submit (auto-detects entries/{your-github-username}/)
    python submit.py

    # Or specify your entry explicitly
    python submit.py entries/my-team

The script will:
- Run all validation checks locally and show a detailed score report
- Create a branch and PR for your entry
- Wait for CI checks to pass and report the result
Additional options:

    python submit.py --no-pr          # Validate locally only, don't create a PR
    python submit.py --no-wait        # Create PR but don't wait for CI
    python submit.py --timeout 600    # Wait up to 10 minutes for CI (default: 5 min)

If you prefer to submit manually:
- Fork this repo.
- Create a directory `entries/{your-github-username}/` containing:
  - `protein.sto`: Stockholm protein alignment with catalytic triad annotation
  - `dna.sto`: Multi-Stockholm DNA sequences with element structure annotation
  - `provenance.tsv`: Tab-separated metadata table
  - (Optional) PDB structure files
- Open a pull request to `main`.
- CI runs validation on your entry and reports pass/fail on the PR.
- If the only files you changed are inside `entries/{your-github-username}/`, the PR is automatically approved and merged. No maintainer action needed.
- After merge, the leaderboard updates with your score.
Important: Your entry directory name must match your GitHub username (lowercase). PRs that modify files outside your own entry directory require manual review.
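
The auto-merge rule above amounts to checking that every changed path lies inside your own entry directory. The following is a minimal sketch of such a check, not the CI's actual logic; the function name `outside_entry` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def outside_entry(changed_paths, username):
    """Return the changed paths that fall outside entries/{username}/.

    Hypothetical helper: the real CI check may differ.
    """
    # Entry directory names are lowercase, so normalize the username.
    entry_dir = PurePosixPath("entries") / username.lower()
    outside = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Treat any ".." component as suspicious, and require entry_dir
        # to be one of the path's parent directories.
        if ".." in path.parts or entry_dir not in path.parents:
            outside.append(raw)
    return outside


changed = [
    "entries/my-team/protein.sto",
    "entries/my-team/dna.sto",
    "ci/requirements.txt",
]
print(outside_entry(changed, "My-Team"))
```

An empty result would correspond to a PR that qualifies for automatic merge; any listed path would send the PR to manual review.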
To update your entry, open a new PR that modifies files in your directory. To withdraw, open a PR that deletes your entry directory.
See CLAUDE.md for the full format specification and POLICY.md for scoring details.
Entries are scored on a 0–100 scale based on four components:
| Component | Weight | What it rewards |
|---|---|---|
| Sequence diversity | 40% | Phylogenetic breadth across ITm families |
| Annotation quality | 25% | Correct and complete annotations |
| Evidence of functionality | 20% | Near-identical paralogs, structures, literature |
| Collection size | 15% | Enough sequences to seed a model |
Hard-fail checks (format errors, annotation inconsistencies) result in a score of 0.
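
To illustrate how the weights combine, here is a sketch that assumes each component is first normalized to a value between 0 and 1 and that the total is a simple weighted sum. Both are assumptions made for illustration; POLICY.md defines the actual scoring. The names `WEIGHTS` and `total_score` are hypothetical.

```python
# Weights from the table above, expressed as fractions of the 0-100 total.
WEIGHTS = {
    "diversity": 0.40,
    "annotation": 0.25,
    "functionality": 0.20,
    "size": 0.15,
}


def total_score(components, hard_fail=False):
    """Combine per-component scores (assumed to lie in [0, 1]) into a 0-100 total."""
    if hard_fail:
        # Any hard-fail check zeroes the whole entry.
        return 0.0
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


example = {"diversity": 0.5, "annotation": 1.0, "functionality": 0.5, "size": 1.0}
print(round(total_score(example), 2))                  # 70.0
print(total_score(example, hard_fail=True))            # 0.0
```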
Scores are also available as JSON:
- `leaderboard.json`: ranked summary
- `scores/{team}.json`: detailed per-team results

File formats at a glance:

- `protein.sto`: Stockholm 1.0 alignment with `#=GC catalytic_triad` marking D, d, E (or D) positions.
- `dna.sto`: One Stockholm block per element. Each block has one sequence and a `#=GC element_structure` line using characters: `5 3 < > A B 0 1 2 n t .`
- `provenance.tsv`: Required columns: `id`, `family`, `host_species`, `host_taxid`, `assembly`, `chrom`, `start`, `end`, `strand`, `source`, `reference`.
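
As a rough illustration of the `dna.sto` rules above, the sketch below checks a single Stockholm block for exactly one sequence, an `element_structure` line that uses only the listed characters, and matching lengths (per-column annotation in Stockholm is aligned to the sequence columns). It is a simplified parser written for this example, not the CI validator; `check_block` is a hypothetical name, and the annotation string in the sample is a placeholder that does not reflect the real meaning of each character.

```python
ALLOWED = set("53<>AB012nt.")


def check_block(block_text):
    """Return a list of problems found in one Stockholm block (empty if none)."""
    sequences = {}
    structure = None
    problems = []
    for line in block_text.splitlines():
        line = line.strip()
        if not line or line == "//" or line.startswith("# STOCKHOLM"):
            continue
        if line.startswith("#=GC"):
            parts = line.split()
            if len(parts) == 3 and parts[1] == "element_structure":
                # Concatenate in case the annotation is split over chunks.
                structure = (structure or "") + parts[2]
            continue
        if line.startswith("#"):
            continue  # other markup (#=GF, #=GS, #=GR) is ignored here
        parts = line.split()
        if len(parts) != 2:
            problems.append(f"unparseable sequence line: {line!r}")
            continue
        name, seq = parts
        sequences[name] = sequences.get(name, "") + seq

    if len(sequences) != 1:
        problems.append(f"expected 1 sequence, found {len(sequences)}")
    if structure is None:
        problems.append("missing #=GC element_structure line")
    else:
        bad = sorted(set(structure) - ALLOWED)
        if bad:
            problems.append(f"disallowed characters: {bad}")
        for name, seq in sequences.items():
            if len(seq) != len(structure):
                problems.append(f"{name}: sequence and annotation lengths differ")
    return problems


# Placeholder block: the annotation characters are arbitrary picks from
# the allowed set, chosen only to exercise the checks.
sample = """# STOCKHOLM 1.0
elem1 ACGTACGTAC
#=GC element_structure 55..AA..33
//
"""
print(check_block(sample))
```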
See CLAUDE.md for the complete specification.
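
Before submitting, it can help to confirm that `provenance.tsv` carries every required column. A minimal sketch using only the standard library follows; `missing_columns` is a hypothetical helper and the sample header is made up for the example. It checks the header row only, not the values.

```python
import csv
import io

# Required columns as listed in the format summary.
REQUIRED = [
    "id", "family", "host_species", "host_taxid", "assembly",
    "chrom", "start", "end", "strand", "source", "reference",
]


def missing_columns(tsv_text):
    """Return the required column names absent from the TSV header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    return [name for name in REQUIRED if name not in header]


# Made-up header that omits two of the required columns.
sample = "id\tfamily\thost_species\thost_taxid\tassembly\tchrom\tstart\tend\tstrand\n"
print(missing_columns(sample))  # ['source', 'reference']
```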