Make sure the Docker service is running before executing this pipeline.
Clone the repository
    git clone https://github.com/glygener/glygen-data-pipeline.git
    cd glygen-data-pipeline
    ./run.sh generate-data
- Glygen Data Generation
If you need to regenerate the Makefile, adjust the configuration in the glygen.yaml file as needed and run:
    ./run.sh j2 Makefile.j2 glygen.yaml > Makefile
This project uses a Makefile to manage the tasks related to data generation and setup. Below is an overview of the main targets and their dependencies.
Running the pipeline through a Makefile automates the individual steps. Because Make tracks the dependencies between targets, it can also skip work whose outputs already exist, which reduces unnecessary computation.
To use this Makefile, ensure you have the necessary dependencies installed:
- Docker
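A minimal preflight sketch for the Docker requirement above. It assumes only that the `docker` CLI is on `PATH` and relies on `docker info` exiting with a non-zero status when the daemon cannot be reached. The helper names `have_cmd`, `docker_ready`, and `preflight` are hypothetical and not part of this repository.

```shell
#!/bin/sh
# Preflight sketch: check that a command exists and that the Docker daemon answers.
# All function names here are illustrative, not part of the repository.

# have_cmd NAME: succeeds if NAME is found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# docker_ready: succeeds only if the docker CLI exists and the daemon responds.
# `docker info` exits non-zero when it cannot reach the daemon.
docker_ready() {
    have_cmd docker && docker info >/dev/null 2>&1
}

# preflight: prints a short status line and returns non-zero if Docker is unavailable.
preflight() {
    if docker_ready; then
        echo "docker: ok"
    else
        echo "docker: not available (is the daemon running?)" >&2
        return 1
    fi
}
```

Calling `preflight` before `./run.sh generate-data` gives an early, readable failure instead of an error from deep inside the pipeline.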
Before running any target, ensure that the configuration variables in the Makefile, which are filled in from glygen.yaml, are set correctly.
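One way to make sure the Makefile reflects the latest configuration is to compare file modification times before running a target. The sketch below is only an illustration: `makefile_is_stale` is a hypothetical helper, and it assumes that Makefile.j2 and glygen.yaml are the only inputs to the generated Makefile.

```shell
#!/bin/sh
# makefile_is_stale MAKEFILE SOURCE...
# Succeeds (exit 0) if MAKEFILE is missing or older than any SOURCE file.
# Hypothetical helper, not part of the repository.
makefile_is_stale() {
    makefile=$1
    shift
    # A missing Makefile always needs to be generated.
    [ -f "$makefile" ] || return 0
    for src in "$@"; do
        # -nt ("newer than") is supported by bash, dash and most other shells.
        if [ "$src" -nt "$makefile" ]; then
            return 0
        fi
    done
    return 1
}

# Intended use from the repository root (regeneration command taken from this README):
#   if makefile_is_stale Makefile Makefile.j2 glygen.yaml; then
#       ./run.sh j2 Makefile.j2 glygen.yaml > Makefile
#   fi
```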
generate-glygenjar: This target builds the Glygen JAR using Maven. It depends on the Glygen JAR file.
download-files: This target is responsible for downloading files using the GlyGen tool. It depends on $(GLYGEN_DIRECTORY)/target/$(GLYGEN_JAR), the JAR file required by the GlyGen tool.
import-triplets: Depends on the successful generation of the Glygen JAR and the presence of specific RDF files. It downloads and imports triplets.
setup-reactome: Downloads the Reactome graph database, which is then loaded into Neo4J.
generate-data: Depends on all previous setups and generates data using the Glygen JAR.
generate-other-data: Depends on all previous setups and uses the Glygen JAR to generate the other data required for the Glygen data release.
all: Sequentially invokes the targets required for Glygen data generation.
clean: Removes generated files and directories.
- make generate-glygenjar: Builds the Glygen JAR.
- make download-files: Downloads RDF files.
- make import-triplets: Imports triplets.
- make setup-reactome: Downloads the Reactome data and copies it to Neo4J.
- make generate-data: Generates data.
- make generate-other-data: Generates the other data required for the Glygen data release.
- make all: Runs all required steps.
- make clean: Cleans up generated files.
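The targets listed above can also be run one at a time, which makes it easier to see which step fails. The following is a sketch only: `run_targets` and `GLYGEN_TARGETS` are hypothetical names, and the order is taken from the list in this README on the assumption that it matches the order used by the all target.

```shell
#!/bin/sh
# Targets in the order listed in this README ("all" and "clean" are left out).
# Assumption: this is also the order in which "all" runs them.
GLYGEN_TARGETS="generate-glygenjar download-files import-triplets setup-reactome generate-data generate-other-data"

# run_targets RUNNER...
# Calls "RUNNER... <target>" for each target and stops at the first failure.
run_targets() {
    for target in $GLYGEN_TARGETS; do
        echo ">>> $target"
        if ! "$@" "$target"; then
            echo "failed at target: $target" >&2
            return 1
        fi
    done
    echo "all targets finished"
}

# Real run from the repository root:   run_targets ./run.sh make
# Dry run that only prints the names:  run_targets echo
```

Passing the runner as arguments keeps the loop independent of Docker, so the dry run can be used to confirm the order before starting a long job.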
- glygen_directory: Sets the Glygen path. It must point to where the Glygen Java code base is located (Default ./glygen).
- input_directory: The directory from which Glygen consumes input files. It must be set to the value of glygen_directory + /in, e.g. ./glygen/in (Default glygen/in).
- output_directory: Folder that will be created to store generated files (Default ./releases).
- release_date: This property is concatenated with output_directory, e.g. 2025_06 will result in the folder ./releases/2025_06.
- glygen_jar: Glygen JAR name (Default glygen-2024.6-SNAPSHOT.jar).
- reactome_directory: Directory where Reactome will be downloaded and extracted (Default ./reactome).
- java_xms: Initial Java heap size for the JVM (Default 4g).
- java_xmx: Maximum Java heap size for the JVM (Default 16g).
- data:
  - triplets: List of triplets that Glygen will use for generating data. Each triplet should be in RDF format. These files are treated as dependencies of the download-files Makefile target.
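The two path rules above (input_directory is glygen_directory + /in, and the release folder is output_directory joined with release_date) can be illustrated with a small sketch. The helpers `expected_input_dir` and `release_dir` are hypothetical and only mirror the rules as described here; they do not read glygen.yaml.

```shell
#!/bin/sh
# expected_input_dir GLYGEN_DIRECTORY
# Prints the value input_directory is expected to have: glygen_directory + /in.
# A single trailing slash on the argument is dropped first.
expected_input_dir() {
    printf '%s/in\n' "${1%/}"
}

# release_dir OUTPUT_DIRECTORY RELEASE_DATE
# Prints the folder that results from joining output_directory and release_date.
release_dir() {
    printf '%s/%s\n' "${1%/}" "$2"
}

expected_input_dir ./glygen        # prints ./glygen/in
release_dir ./releases 2025_06     # prints ./releases/2025_06
```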
After the release is complete, you can generate the epitope data using the scripts/epitope.py file.
Simply call the script using the provided wrapper shell script, passing the release folder as the argument. In this case, the release folder is releases/2025_06/.
    ./run.sh python3 scripts/epitope.py releases/2025_06/
The epitope TSV output files will be generated in releases/2025_06/epitope.
After the release is complete, you can generate the COSMIC data using the scripts/cosmic.py file.
Simply call the script using the provided wrapper shell script, passing the release folder as the argument. In this case, the release folder is releases/2025_06/.
You can also pass the force flag to force the download and parsing process.
    ./run.sh python3 scripts/cosmic.py releases/2025_06/cosmic
The COSMIC TSV output files will be generated in releases/2025_06/cosmic.
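After running both scripts, a quick way to confirm that output was produced is to count the TSV files in the two subfolders. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the generated files carry a `.tsv` extension and sit directly inside the epitope and cosmic folders.

```shell
#!/bin/sh
# count_tsv DIR: prints how many *.tsv files sit directly inside DIR (0 if DIR is missing).
# Assumption: the generated files use the .tsv extension.
count_tsv() {
    if [ ! -d "$1" ]; then
        echo 0
        return 0
    fi
    # tr strips the padding some wc implementations add around the number.
    find "$1" -maxdepth 1 -type f -name '*.tsv' | wc -l | tr -d ' '
}

# check_release_outputs RELEASE_DIR
# Reports the TSV count for the epitope and cosmic subfolders named in this README.
check_release_outputs() {
    for sub in epitope cosmic; do
        echo "$sub: $(count_tsv "${1%/}/$sub") TSV file(s)"
    done
}

# Example: check_release_outputs releases/2025_06/
```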
If any Python dependency is missing, try rebuilding the Docker container by typing:
    docker compose build glygen
- If your data generation process is being killed, you may need to raise the Java memory limits. Take a look at java_xms and java_xmx in the glygen.yaml file.
- Don't forget to regenerate the Makefile using ./run.sh j2 Makefile.j2 glygen.yaml > Makefile after adjusting the memory limits.
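When changing the memory limits, it can help to check that the two values are consistent before regenerating the Makefile. The sketch below assumes that java_xms and java_xmx are passed to the JVM as its initial and maximum heap sizes, in which case the initial size must not exceed the maximum. The helpers `to_mb` and `heap_ok` are hypothetical and accept only values ending in m or g.

```shell
#!/bin/sh
# to_mb SIZE: converts a JVM-style size such as 512m or 4g to megabytes.
# Only the m/M and g/G suffixes are handled in this sketch.
to_mb() {
    num=${1%[gGmM]}
    case $1 in
        *[gG]) echo $((num * 1024)) ;;
        *[mM]) echo "$num" ;;
        *) echo "unsupported size: $1" >&2; return 1 ;;
    esac
}

# heap_ok XMS XMX: succeeds if the initial heap does not exceed the maximum heap.
heap_ok() {
    xms_mb=$(to_mb "$1") || return 1
    xmx_mb=$(to_mb "$2") || return 1
    [ "$xms_mb" -le "$xmx_mb" ]
}

# Example with the defaults from this README:
#   heap_ok 4g 16g && echo "heap settings look consistent"
```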
Step by step run:
If the data-generation process fails, try executing the steps one by one to identify the issue.
- Make sure the Docker daemon is running.
- Remove previous Docker services using docker compose down -v
- Rebuild the Docker services with docker compose up -d --build
- Check that the Makefile was generated correctly and is in the right place.
- Remove any previously downloaded content with ./run.sh make clean
- Type ./run.sh make all or ./generate-data.sh.
