dataset: DISEASES by tristan-f-r · Pull Request #66 · Reed-CompBio/spras-benchmarking

tristan-f-r · 2026-03-18T05:07:48Z

From #39.

Depends on feat: scaffolding, caching, EGFR #65.

tristan-f-r · 2026-05-04T20:45:06Z

OUTSTANDING TODO: The Snakemake file is up to date, but three files are no longer needed (ensg-ensp.tsv, HumanDO.tsv, and HumanDO.tsv.metadata). I removed them from the Snakemake fetch() commands, but are they also referenced somewhere else? This is another use case for local vs. global files: how do I know whether these files are used by other data collections, and how do I remove existing files left over from outdated pipelines?

I discuss this in more detail in a comment under #65. Local files can be removed outright. Global files are harder to track, but you can quickly check whether one is used elsewhere by searching the codebase for its 'query tuple' (for lack of a better term), e.g. ("BioMart", "ensg-ensp.tsv").
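As a rough illustration of that search, here is a minimal sketch in Python. The function name `find_query_tuple_uses`, the set of file suffixes, and the assumption that both parts of the tuple sit on the same line are all hypothetical; this is not taken from the repository.

```python
from pathlib import Path


def find_query_tuple_uses(root, query, suffixes=(".py", ".smk"), extra_names=("Snakefile",)):
    """Return sorted (relative_path, line_number) pairs for lines that mention every part of `query`.

    Assumption: the whole tuple is written on a single line of the file.
    """
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Only scan Python/Snakemake sources (hypothetical choice of file types).
        if path.suffix not in suffixes and path.name not in extra_names:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if all(part in line for part in query):
                hits.append((path.relative_to(root).as_posix(), number))
    return sorted(hits)
```

Calling `find_query_tuple_uses(".", ("BioMart", "ensg-ensp.tsv"))` from the repository root would list every matching line; an empty list suggests the global file has no remaining users, subject to the single-line assumption above.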

QUESTION: The Snakefile currently requires only two prize and GS files (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to depend on the outputs that its own rules generate; that seems circular. If one of those two files is missing, it should be fine to regenerate all disease input files.

Yes, the Snakefile should create all of the 40-ish disease files, so it should depend on all of them rather than on the two arbitrarily chosen ones.

We'll have to think about how to do this. I don't think we can use Snakemake to generate the files and then have the last rule be "re-generate this Snakemake file."

We'll have to use a Snakemake checkpoint. I can add this, but I'm wary of breaking anything in this pipeline, so, as @ntalluri suggested some time ago, I'll upload the processed files to Google Drive and make sure that my new changes don't alter them.
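One way to check that the processed files stay unchanged is to compare checksums before and after the refactor. The sketch below is an assumption about how that check could look; the helper names `checksum_manifest` and `diff_manifests` are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path


def checksum_manifest(directory):
    """Map each file path (relative to `directory`) to its SHA-256 hex digest."""
    directory = Path(directory)
    return {
        path.relative_to(directory).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


def diff_manifests(before, after):
    """Return the files that were removed, added, or changed between two manifests."""
    shared = set(before) & set(after)
    return {
        "removed": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }
```

Running `checksum_manifest` on the downloaded reference copy and on the freshly generated `processed` directory, then passing both to `diff_manifests`, gives three lists that should all be empty if the checkpoint change is behavior-preserving.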

annaritz · 2026-05-21T21:15:32Z

The committed change now uses the full DISEASES datasets. We now have 121 disease datasets, though we could filter these further (either by raising the minimum number of high-confidence gold standard genes per disease set, or by requiring a certain number of prize nodes).
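The two filtering options mentioned here can be sketched as a single function with two thresholds. The data layout (a dict from disease name to a set of gene IDs), the function name `filter_diseases`, and the default thresholds are assumptions for illustration only.

```python
def filter_diseases(gold_standard, prizes, min_gold=10, min_prizes=1):
    """Return the sorted diseases that have enough gold standard genes and enough prize nodes.

    `gold_standard` and `prizes` map a disease name to a collection of gene IDs
    (hypothetical layout; the pipeline's real data structures may differ).
    """
    return sorted(
        disease
        for disease, genes in gold_standard.items()
        if len(genes) >= min_gold and len(prizes.get(disease, ())) >= min_prizes
    )
```

Raising `min_gold` or `min_prizes` can only shrink the resulting list, so either knob could be used to reduce the 121 datasets.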

annaritz · 2026-05-21T21:17:05Z

QUESTION: We currently don't require that the gold standard disease-gene pairs have ENSP IDs in the interactome (STRING), although we do require this for the TIGA GWAS inputs. Should we also ensure that all gold standard genes are in STRING, or is that caught in a downstream process?

We should also be trimming the gold standard to the genes that are in the interactome. I think there is code for that in #65.

Does that mean that the DISEASES code should do the trimming, or does that happen elsewhere? Requiring at least 10 gold standard disease-gene pairs that appear in STRING might change the new numbers.
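To see how trimming could change the counts, here is a minimal sketch that intersects each gold standard set with the interactome nodes and then applies the minimum-pairs threshold. It is not the trimming code referenced in #65; `trim_gold_standard` and its inputs are hypothetical.

```python
def trim_gold_standard(gold_standard, interactome_nodes, min_pairs=10):
    """Trim each disease's gold standard genes to the interactome, then apply the threshold.

    Returns (kept, dropped): `kept` maps disease -> trimmed gene set, and
    `dropped` lists the diseases that fall below `min_pairs` after trimming.
    """
    nodes = set(interactome_nodes)
    kept = {}
    dropped = []
    for disease, genes in gold_standard.items():
        in_network = set(genes) & nodes
        if len(in_network) >= min_pairs:
            kept[disease] = in_network
        else:
            dropped.append(disease)
    return kept, sorted(dropped)
```

Any disease that passes the threshold before trimming but lands in `dropped` afterwards is one whose count depended on genes missing from the interactome.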

ntalluri · 2026-06-24T16:50:04Z

+ There are 84972 high-confidence disease-gene pairs from the 643 diseases
+```
+
+(Note: if you use the filtered datasets, you end with 134 diseases with at least 10 high-confidence disease-gene pairs).


The Snakemake file only mentions 121 diseases.

There are 134 diseases that have at least 10 high-confidence disease-gene pairs from DISEASES. Some of these don't have enough TIGA evidence to be used; that is checked in the next step. At this point, there are 134 diseases that COULD be used.

ntalluri · 2026-06-24T16:59:35Z

+diseases_path = Path(dir_path, "..")
+(diseases_path / "processed").mkdir(exist_ok=True, parents=True)
+
+OXO_URL = "https://www.ebi.ac.uk/spot/oxo/api/search?size=500"


This site is down right now, so I can't run the code that processes the DISEASES data. Is there a way to download this data and keep a static version that we use?

I put in a help ticket and will wait to hear back. Once it's back up, we should retain a static version.

If it doesn't come back online, I will look up the original data sources (MONDO and EFO) and get cross-reference files from those databases.
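Retaining a static version could look roughly like the sketch below: try the live service, refresh a local copy on success, and fall back to that copy when the service is unreachable. The function name `load_with_static_fallback`, the JSON cache format, and the injected `fetch` callable are assumptions; the actual OxO request and response handling are left to the caller.

```python
import json
from pathlib import Path


def load_with_static_fallback(fetch, cache_path):
    """Call `fetch()`; on success refresh the static copy, on failure read the static copy.

    `fetch` is any zero-argument callable returning JSON-serializable data.
    Network failures raised by urllib are OSError subclasses, so OSError is caught here.
    """
    cache_path = Path(cache_path)
    try:
        data = fetch()
    except OSError:
        if not cache_path.exists():
            raise  # no static copy to fall back on
        return json.loads(cache_path.read_text(encoding="utf-8"))
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(data, indent=2, sort_keys=True), encoding="utf-8")
    return data
```

Passing the fetcher in as a callable keeps the fallback logic independent of the HTTP client, which also makes it easy to exercise without network access.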

tristan-f-r added 6 commits March 18, 2026 03:16

chore: drop other datasets

b49439e

Merge branch 'main' into egfr-and-infrastructure

2018a13

chore: re-include

136e5ff

chore: drop tools

472468d

not needed just yet

chore: re-add tools

a5de971

feat: diseases

aba68bd

tristan-f-r added the labels dataset ("Mutating datasets in any way.") and blocked-by-other-pr ("For PRs that depend on other PRs.") Mar 18, 2026

tristan-f-r added 3 commits March 18, 2026 05:53

docs: cache

8ddccb4

style: fmt

90cc277

docs: on caching

eb23b8f

tristan-f-r mentioned this pull request Mar 18, 2026

feat: scaffolding, caching, EGFR #65

Open

tristan-f-r and others added 2 commits March 18, 2026 16:54

docs: suggestions from review

4b524bc

Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>

docs: more comments, refactor: mv function out of Snakefile

69fda05

ntalluri reviewed Mar 19, 2026

Comment thread datasets/diseases/README.md

tristan-f-r and others added 15 commits March 19, 2026 18:58

docs(datasets): mention responsenet and egfr

15c7ecb

docs(datasets): add old synthetic data branch

729a51b

chore: mv to scores instead of dmmm

922be5d

docs: drop expiration docs

f3d6d41

this is only in github actions

docs: clarify snakemake importing

a802fde

docs: more clarification on data and files

85ba6e9

refactor: move irefindex back to egfr

8139f14

chore: revert web

5a8c02e

docs: clarify on cache

041f998

chore: apply suggestions

bd3bcab

Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>

docs: apply suggestions

c08a0b8

chore: apply suggestions

b3cf691

drop extensions

193f75a

to move to web

refactor(egfr): correctly specify output dirs

0581bea

feat(egfr): add target nodes, fmt

320aa87

tristan-f-r and others added 13 commits May 5, 2026 17:06

ci: add uv environment tests

1ec407e

update the deduplicate tool function, remove normalize function and b…

5276687

…e direct

remove concept of latest

3006983

precommit

3f69499

fix tools test

7595652

cleanup

1f0ca12

precommit

78c0adf

add todo

e682f09

update trim_input_nodes.py, it was an & and it needed to be an |

230a765

update comment and precommit

b0a3165

add in physical links

311de92

todo

5d62559

added full DISEASES files for gold standard dataset generation. Updat…

19bf337

…ed numbers in the README.

annaritz and others added 11 commits May 21, 2026 14:25

fixed typos for pre-commit checks

81f6759

add targets to the ensp data

77a1d50

add a TODO

d668016

update to actual SPRAS format

5fa0e59

update weights

6e3e621

make a single trim file, add todos for trimming

30bb20f

precommit

04ba202

update cli trim code

e73403b

merge with egfr-infrastructure

ac9cb56

pre-commit

1c1f64a

update the link to data, update the interactome code, change file names

0cba496

ntalluri reviewed Jun 24, 2026

clean up organization of the files and add comments

fec1041

tristan-f-r wants to merge 99 commits into main from diseases-dataset

3 participants
