dataset: DISEASES#66
Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>
this is only in github actions
Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>
to move to web
I go into more detail in a comment under #65. In short, local files can be removed outright. Global files are harder to track, but you can quickly check whether they are still in use by searching for their (for lack of a better word) 'query tuple'.
We'll have to use a Snakemake checkpoint. I can add this, but I'm wary of breaking anything in this pipeline, so, as @ntalluri suggested some time ago, I'll upload the processed files to Google Drive and make sure that my new changes don't alter them.
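One way to check that the processed files are unchanged is to compare checksums of a saved copy against a fresh run. The sketch below is only an illustration: `file_digests` and `changed_files` are hypothetical helper names, and it assumes the two sets of outputs are available as local directories (for example, a downloaded copy of the uploaded files next to the newly generated ones).

```python
import hashlib
from pathlib import Path


def file_digests(directory: Path) -> dict[str, str]:
    """Map each file's path (relative to `directory`) to its SHA-256 digest."""
    digests = {}
    for path in sorted(directory.rglob("*")):
        if path.is_file():
            relative = path.relative_to(directory).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(before: Path, after: Path) -> list[str]:
    """Return relative paths that were added, removed, or modified."""
    old, new = file_digests(before), file_digests(after)
    # A path missing on one side yields None, which never equals a digest.
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))
```

An empty list from `changed_files(old_dir, new_dir)` would indicate that the pipeline changes left the processed outputs byte-for-byte identical.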
…ed numbers in the README.
The committed change now uses the full DISEASES datasets. We now have 121 disease datasets, though we could filter this further (either by raising the minimum number of high-confidence gold standard genes in the disease set, or by requiring a certain number of prize nodes).
Does that mean that the diseases code should do the trimming, or does that happen elsewhere? That might change the new numbers, if we require at least 10 gold standard disease-gene pairs that appear in STRING.
There are 84972 high-confidence disease-gene pairs from the 643 diseases

(Note: if you use the filtered datasets, you end with 134 diseases with at least 10 high-confidence disease-gene pairs).
The Snakemake file only mentions 121 diseases.
There are 134 diseases that have at least 10 high-confidence disease-gene pairs from DISEASES. Some of these don't have enough TIGA evidence to be used - that's in the next step. At this point, there are 134 diseases that COULD be used.
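The threshold step described here can be sketched as follows. This is an illustration rather than the pipeline's actual code: the function name, the `(disease, gene)` tuple input, and the optional `allowed_genes` argument (standing in for a restriction such as "genes that appear in STRING") are all assumptions.

```python
from collections import Counter

MIN_PAIRS = 10  # threshold mentioned in this discussion


def diseases_with_enough_pairs(pairs, min_pairs=MIN_PAIRS, allowed_genes=None):
    """Return sorted disease names with at least `min_pairs` distinct genes.

    `pairs` is an iterable of (disease, gene) tuples; duplicates count once.
    If `allowed_genes` is given, pairs whose gene is not in it are ignored.
    """
    unique = {
        (disease, gene)
        for disease, gene in pairs
        if allowed_genes is None or gene in allowed_genes
    }
    counts = Counter(disease for disease, _gene in unique)
    return sorted(d for d, n in counts.items() if n >= min_pairs)
```

Passing a gene set as `allowed_genes` shows how the count of usable diseases could drop once pairs are trimmed to a reference network before the threshold is applied.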
diseases_path = Path(dir_path, "..")
(diseases_path / "processed").mkdir(exist_ok=True, parents=True)

OXO_URL = "https://www.ebi.ac.uk/spot/oxo/api/search?size=500"
This site is down right now, so I can't run the code to process the diseases data. Is there a way to download this data and keep a static version that we use?
I put in a help ticket and will wait to hear back. Once it's back up, we should retain a static version.
If it doesn't come back online, I will look up the original data sources (MONDO and EFO) and get cross-reference files from those databases.
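A minimal sketch of the "retain a static version" idea, assuming the cross-reference data can be stored as JSON. The name `load_with_snapshot` and the `fetch` callable are hypothetical; `fetch` stands in for whatever code queries the live service, and nothing here assumes a particular request or response format.

```python
import json
from pathlib import Path
from typing import Callable


def load_with_snapshot(snapshot: Path, fetch: Callable[[], dict]) -> dict:
    """Return the mapping, preferring a saved snapshot over the live service.

    `fetch` is only called when no snapshot exists; its result is written to
    `snapshot` so later runs do not depend on the service being reachable.
    """
    if snapshot.exists():
        return json.loads(snapshot.read_text())
    data = fetch()
    snapshot.parent.mkdir(parents=True, exist_ok=True)
    snapshot.write_text(json.dumps(data, indent=2, sort_keys=True))
    return data
```

With the snapshot file committed or hosted alongside the other raw inputs, the processing step would keep working during an outage, and deleting the file would force a refresh from the service.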
From #39.