dataset: DISEASES #66

Open: tristan-f-r wants to merge 99 commits into main from diseases-dataset

@tristan-f-r

Contributor

tristan-f-r added the dataset (Mutating datasets in any way.) and blocked-by-other-pr (For PRs that depend on other PRs.) labels on Mar 18, 2026
tristan-f-r and others added 2 commits March 18, 2026 16:54
Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>
Comment thread datasets/diseases/README.md
@tristan-f-r

Contributor Author

OUTSTANDING TODO: The Snakemake file is up-to-date, but there are three files that are no longer needed (ensg-ensp.tsv, HumanDO.tsv, and HumanDO.tsv.metadata). I removed them from the Snakemake fetch() commands, but (a) are these also somewhere else? This is another use case for local vs. global files - how do I know whether these files are being used by other data collections? How do I remove existing files from outdated pipelines? Etc.

I talk about this in more detail in a comment under #65, but local files can be removed outright. Global files are harder to track, but you can quickly check whether they are used elsewhere by searching the codebase for the (for lack of a better word) 'query tuple' ("BioMart", "ensg-ensp.tsv").
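A minimal sketch of that codebase search, assuming the two strings of the 'query tuple' appear literally in Snakefiles or Python sources. The function name `find_tuple_uses` and the suffix filter are hypothetical and not part of this PR.

```python
from pathlib import Path


def find_tuple_uses(root, parts, suffixes=(".py", ".smk")):
    """Return files under `root` whose text contains every string in `parts`.

    Only files named "Snakefile" or ending in one of `suffixes` are scanned
    (an assumption about where fetch() calls live).
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.name != "Snakefile" and path.suffix not in suffixes:
            continue
        # Ignore undecodable bytes so a stray binary file cannot abort the scan.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if all(part in text for part in parts):
            hits.append(path)
    return hits
```

Calling `find_tuple_uses(".", ("BioMart", "ensg-ensp.tsv"))` from the repository root would list every scanned file that mentions both strings. An empty list suggests the global file is safe to remove, though a plain substring match will miss references that are built dynamically.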

QUESTION: The Snakefile currently requires only two prize and GS files (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to depend on the outputs that its own rules generate, which seems very circular. If one of those two files is missing, then it should be fine to regenerate all disease input files.

Yes, the Snakefile should create all 40-ish disease files, so it should depend on all 40-ish diseases, not the two arbitrarily chosen ones.

We'll have to think about how to do this. I don't think we can use snakemake to generate the files and then have the last rule be "re-generate this Snakemake file."

We'll have to use a Snakemake checkpoint. I can add this, but I'm also wary of breaking anything in this pipeline, so as @ntalluri suggested some time ago, I'll upload the processed files to Google Drive, and make sure that they don't change with my new changes.
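One way to confirm that the processed files do not change is to compare checksums before and after the refactor. This is a sketch under assumptions: the helper names `checksum_manifest` and `diff_manifests` are hypothetical, and it presumes the processed outputs are deterministic files in a single directory.

```python
import hashlib
from pathlib import Path


def checksum_manifest(directory):
    """Map each file's path (relative to `directory`) to its SHA-256 digest."""
    directory = Path(directory)
    return {
        path.relative_to(directory).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


def diff_manifests(before, after):
    """Report files that were added, removed, or changed between two manifests."""
    shared = set(before) & set(after)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }
```

Building one manifest from the uploaded reference copy and one from a fresh pipeline run, then checking that all three lists in `diff_manifests(...)` are empty, would show that the checkpoint change left the outputs byte-identical.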

@annaritz

Contributor

The committed change now includes using the full DISEASES datasets. We now have 121 disease datasets, though we could filter this (either by increasing the minimum number of high-confidence gold standard genes in the disease set, or by requiring a certain number of prize nodes).

@annaritz

Contributor

QUESTION: We currently don't require that the gold standard disease-gene pairs have ENSP IDs in the interactome (STRING); we do require this for the TIGA GWAS inputs. Should we also ensure that all genes are in STRINGDB in the gold standard, or is that caught in a downstream process?

We should be trimming the gold standard to be the ones in the interactome as well. I think there is code for that in #65.

Does that mean that the diseases code should be trimming, or does that happen elsewhere? That might change the new numbers, if we require at least 10 gold standard disease-gene pairs that appear in STRING.
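To make the effect on the counts concrete, here is a sketch of trimming followed by the minimum-size filter. It is not the code from #65. The function name `trim_and_filter`, the pair representation, and the default threshold of 10 are assumptions for illustration.

```python
from collections import defaultdict


def trim_and_filter(pairs, interactome_nodes, min_genes=10):
    """Keep gold standard pairs whose protein is in the interactome, then
    drop diseases left with fewer than `min_genes` distinct proteins.

    `pairs` is an iterable of (disease_id, ensp_id) tuples and
    `interactome_nodes` is a set of ENSP IDs present in STRING.
    """
    by_disease = defaultdict(set)
    for disease, protein in pairs:
        if protein in interactome_nodes:
            by_disease[disease].add(protein)
    return {
        disease: proteins
        for disease, proteins in by_disease.items()
        if len(proteins) >= min_genes
    }
```

Because trimming happens before the count is taken, a disease with exactly 10 high-confidence pairs drops out as soon as one of its proteins is missing from STRING, which is why the number of usable diseases could shrink.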

There are 84972 high-confidence disease-gene pairs from the 643 diseases

(Note: if you use the filtered datasets, you end up with 134 diseases with at least 10 high-confidence disease-gene pairs).

Collaborator

The snakemake file only mentions 121 diseases

Contributor

There are 134 diseases that have at least 10 high-confidence disease-gene pairs from DISEASES. Some of these don't have enough TIGA evidence to be used - that's in the next step. At this point, there are 134 diseases that COULD be used.

diseases_path = Path(dir_path, "..")
(diseases_path / "processed").mkdir(exist_ok=True, parents=True)

OXO_URL = "https://www.ebi.ac.uk/spot/oxo/api/search?size=500"

Collaborator

This site is down right now and so I can't run the code to process the diseases data. Is there a way to download this data and have a static version we use?

Contributor

I put in a help ticket and will wait to hear back. Once it's back up, we should retain a static version.

If it doesn't come back online, I will look up the original data sources (MONDO and EFO) and get cross-reference files from those databases.
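Retaining a static version could look roughly like the sketch below: read a local snapshot when it exists and only call the remote service when it does not. The helper name `load_or_fetch` and the JSON snapshot format are assumptions. The actual request to the OxO endpoint is deliberately left to a caller-supplied function, since its payload is not shown in this thread.

```python
import json
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return the cached JSON at `cache_path`, calling `fetch()` only on a miss.

    `fetch` is any zero-argument callable returning JSON-serializable data,
    for example a function that queries the cross-reference service.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    data = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys keep the snapshot stable across runs, so diffs stay readable.
    cache_path.write_text(json.dumps(data, indent=2, sort_keys=True), encoding="utf-8")
    return data
```

With the snapshot committed or stored alongside the other raw inputs, the processing step would keep working during an outage, and refreshing the mappings becomes an explicit act of deleting the snapshot file.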


3 participants