Input files for Sarek can be specified using a TSV file passed to the --input option.
The TSV file is a Tab Separated Value file with columns:
subject gender status sample lane fastq1 fastq2 (for step mapping with paired-end FASTQs)
subject gender status sample lane bam (for step mapping with unmapped BAMs)
subject gender status sample bam bai recaltable (for step recalibrate with BAMs)
subject gender status sample bam bai (for step variantcalling with BAMs)
The content of these columns is quite straightforward:
subject designates the subject. It should be the ID of the patient, and it must designate only one patient.
gender is the gender of the patient (XX or XY).
status is the status of the sample (0 for Normal or 1 for Tumor).
sample designates the sample. It should be the ID of the sample, and it must designate only one sample. A patient can have more than one tumor sample, e.g. a tumor and a relapse.
lane is used when the sample is multiplexed on several lanes. It must be unique for each lane in the same sample.
fastq1 is the path to the FASTQ file containing the first read of each pair.
fastq2 is the path to the FASTQ file containing the second read of each pair.
bam is the BAM file.
bai is the BAM index file.
recaltable is the recalibration table.
Absolute file paths are recommended, but relative paths should also work.
Note that the delimiter is the tab (\t) character.
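Because the delimiter must be a real tab character, it can be safer to generate the file than to type it by hand. The following is a minimal Python sketch, assuming the layout for step mapping with paired-end FASTQs and no header line (the examples in this document show none). The function name to_tsv is hypothetical and not part of Sarek.

```python
import csv
import io

# Two rows in the layout: subject gender status sample lane fastq1 fastq2.
rows = [
    ["G15511", "XX", "0", "C09DFN", "C09DF_1",
     "pathToFiles/C09DFACXX111207.1_1.fastq.gz",
     "pathToFiles/C09DFACXX111207.1_2.fastq.gz"],
    ["G15511", "XX", "1", "D0ENMT", "D0ENM_1",
     "pathToFiles/D0ENMACXX111207.1_1.fastq.gz",
     "pathToFiles/D0ENMACXX111207.1_2.fastq.gz"],
]


def to_tsv(rows):
    """Serialize rows as tab-separated text, one row per line, no header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv(rows)
print(tsv_text, end="")
```

Writing tsv_text to a file gives something that can be passed to --input.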
All examples are given for a normal/tumor pair. If no tumors are listed in the TSV file, then the workflow will proceed as if it is a normal sample instead of a normal/tumor pair.
Sarek will output results in a different directory for each sample. If multiple samples are specified in the TSV file, Sarek will consider all files to be from different samples. Multiple TSV files can be specified if the path is enclosed in quotes.
Somatic variant calling output will be in a specific directory for each normal/tumor pair.
In this example, there are 3 read groups for the normal sample and 2 for the tumor sample.
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFACXX111207.1_1.fastq.gz pathToFiles/C09DFACXX111207.1_2.fastq.gz
G15511 XX 0 C09DFN C09DF_2 pathToFiles/C09DFACXX111207.2_1.fastq.gz pathToFiles/C09DFACXX111207.2_2.fastq.gz
G15511 XX 0 C09DFN C09DF_3 pathToFiles/C09DFACXX111207.3_1.fastq.gz pathToFiles/C09DFACXX111207.3_2.fastq.gz
G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMACXX111207.1_1.fastq.gz pathToFiles/D0ENMACXX111207.1_2.fastq.gz
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMACXX111207.2_1.fastq.gz pathToFiles/D0ENMACXX111207.2_2.fastq.gz
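The read-group count per sample can be checked before launching a run. The sketch below is an illustration only, not Sarek's own parser: it rebuilds the five example rows with real tab characters, groups lanes by sample and status, and rejects a lane that appears twice in the same sample. The name lanes_per_sample is hypothetical.

```python
import csv
import io
from collections import defaultdict

# The five example rows: subject gender status sample lane fastq1 fastq2.
ROWS = [
    ("G15511", "XX", "0", "C09DFN", "C09DF_1",
     "pathToFiles/C09DFACXX111207.1_1.fastq.gz", "pathToFiles/C09DFACXX111207.1_2.fastq.gz"),
    ("G15511", "XX", "0", "C09DFN", "C09DF_2",
     "pathToFiles/C09DFACXX111207.2_1.fastq.gz", "pathToFiles/C09DFACXX111207.2_2.fastq.gz"),
    ("G15511", "XX", "0", "C09DFN", "C09DF_3",
     "pathToFiles/C09DFACXX111207.3_1.fastq.gz", "pathToFiles/C09DFACXX111207.3_2.fastq.gz"),
    ("G15511", "XX", "1", "D0ENMT", "D0ENM_1",
     "pathToFiles/D0ENMACXX111207.1_1.fastq.gz", "pathToFiles/D0ENMACXX111207.1_2.fastq.gz"),
    ("G15511", "XX", "1", "D0ENMT", "D0ENM_2",
     "pathToFiles/D0ENMACXX111207.2_1.fastq.gz", "pathToFiles/D0ENMACXX111207.2_2.fastq.gz"),
]
TSV_TEXT = "".join("\t".join(row) + "\n" for row in ROWS)


def lanes_per_sample(tsv_text):
    """Group lane IDs by (sample, status); reject a lane repeated within a sample."""
    lanes = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        subject, gender, status, sample, lane, fastq1, fastq2 = row
        if lane in lanes[(sample, status)]:
            raise ValueError(f"lane {lane!r} is repeated in sample {sample!r}")
        lanes[(sample, status)].append(lane)
    return dict(lanes)


read_groups = {key: len(value) for key, value in lanes_per_sample(TSV_TEXT).items()}
print(read_groups)
```

For the rows above, this reports three lanes for the normal sample C09DFN and two for the tumor sample D0ENMT.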
Input files for Sarek can also be specified using the path to a FASTQ directory passed to the --input option. This works only with the mapping step.
nextflow run nf-core/sarek --input pathToDirectory ...
The input folder, containing the FASTQ files for one individual (ID), should be organized into one subfolder for every sample. All FASTQ files for that sample should be collected there.
ID
+--sample1
+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample2
+------sample2_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample2_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample3
+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+------sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
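A folder laid out this way can be checked for missing mates before it is handed to the pipeline. The following Python sketch is not how Sarek itself scans the directory. It assumes that the two files of a pair differ only by "_R1_" versus "_R2_" in the file name, and the function name find_fastq_pairs is hypothetical. The demo builds a throwaway folder with empty files so the block runs on its own.

```python
import tempfile
from pathlib import Path


def find_fastq_pairs(individual_dir):
    """Return {sample: [(r1_path, r2_path), ...]} for one individual's folder."""
    pairs = {}
    for sample_dir in sorted(Path(individual_dir).iterdir()):
        if not sample_dir.is_dir():
            continue
        sample_pairs = []
        for r1 in sorted(sample_dir.glob("*_R1_*.fastq.gz")):
            # Assumption: the mate's name differs only by "_R1_" vs "_R2_".
            r2 = r1.with_name(r1.name.replace("_R1_", "_R2_"))
            if not r2.exists():
                raise FileNotFoundError(f"missing mate for {r1}")
            sample_pairs.append((r1, r2))
        pairs[sample_dir.name] = sample_pairs
    return pairs


# Demo on a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("sample1", "sample2"):
        sample_dir = Path(tmp) / "ID" / sample
        sample_dir.mkdir(parents=True)
        for read in ("R1", "R2"):
            (sample_dir / f"{sample}_lib_flowcell-index_lane_{read}_1000.fastq.gz").touch()
    found = find_fastq_pairs(Path(tmp) / "ID")
    pair_counts = {sample: len(sample_pairs) for sample, sample_pairs in found.items()}

print(pair_counts)
```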
FASTQ filename structure:
sample_lib_flowcell-index_lane_R1_1000.fastq.gz and sample_lib_flowcell-index_lane_R2_1000.fastq.gz
Where:
sample = sample ID
lib = identifier of the library preparation
flowcell = identifier of the flow cell for the sequencing run
lane = identifier of the lane of the sequencing run
Read group information will be parsed from fastq file names according to this:
RGID = "sample_lib_flowcell_index_lane"
RGPL = "Illumina"
PU = sample
RGLB = lib
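The mapping from file name to read group fields can be sketched in Python as follows. This is an illustration under assumptions, not Sarek's implementation: it assumes that none of the name components contains an underscore, and it keeps the flowcell-index token exactly as it appears in the file name, since the text does not say whether the hyphen is rewritten. The function name read_group_from_fastq_name is hypothetical.

```python
from pathlib import PurePath


def read_group_from_fastq_name(filename):
    """Derive read group fields from sample_lib_flowcell-index_lane_R1_1000.fastq.gz."""
    name = PurePath(filename).name
    parts = name.split("_")
    # Assumption: exactly six underscore-separated tokens, as in the structure above.
    if len(parts) != 6:
        raise ValueError(f"unexpected FASTQ file name: {name}")
    sample, lib, flowcell_index, lane, _read, _suffix = parts
    return {
        "RGID": "_".join([sample, lib, flowcell_index, lane]),
        "RGPL": "Illumina",
        "PU": sample,
        "RGLB": lib,
    }


read_group = read_group_from_fastq_name(
    "ID/sample1/sample1_lib_flowcell-index_lane_R1_1000.fastq.gz"
)
print(read_group)
```

Both files of a pair yield the same fields, because the R1/R2 token is not part of the read group.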
In this example with unmapped BAMs, there are 3 read groups for the normal sample and 2 for the tumor sample.
G15511 XX 0 C09DFN C09DF_1 pathToFiles/C09DFAC_1.bam
G15511 XX 0 C09DFN C09DF_2 pathToFiles/C09DFAC_2.bam
G15511 XX 0 C09DFN C09DF_3 pathToFiles/C09DFAC_3.bam
G15511 XX 1 D0ENMT D0ENM_1 pathToFiles/D0ENMAC_1.bam
G15511 XX 1 D0ENMT D0ENM_2 pathToFiles/D0ENMAC_2.bam
In the same way, if you have non-recalibrated BAMs, their indexes, and their recalibration tables, you should use a structure like:
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.bam pathToFiles/G15511.C09DFN.md.bai pathToFiles/G15511.C09DFN.md.recal.table
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.bam pathToFiles/G15511.D0ENMT.md.bai pathToFiles/G15511.D0ENMT.md.recal.table
In the same way, if you have recalibrated BAMs and their indexes, you should use a structure like:
G15511 XX 0 C09DFN pathToFiles/G15511.C09DFN.md.recal.bam pathToFiles/G15511.C09DFN.md.recal.bai
G15511 XX 1 D0ENMT pathToFiles/G15511.D0ENMT.md.recal.bam pathToFiles/G15511.D0ENMT.md.recal.bai
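Since each step expects a different set of columns, a row can be checked against the intended layout before a run. The sketch below is a hedged illustration rather than Sarek's own validation. The layout keys (mapping_fastq, mapping_bam, recalibrate, variantcalling) are labels made up for this example, and only the checks stated in this document are applied: column count, gender in XX or XY, and status in 0 or 1.

```python
# Column layouts as listed at the top of this section; the keys are hypothetical labels.
LAYOUTS = {
    "mapping_fastq": ["subject", "gender", "status", "sample", "lane", "fastq1", "fastq2"],
    "mapping_bam": ["subject", "gender", "status", "sample", "lane", "bam"],
    "recalibrate": ["subject", "gender", "status", "sample", "bam", "bai", "recaltable"],
    "variantcalling": ["subject", "gender", "status", "sample", "bam", "bai"],
}


def parse_row(line, layout):
    """Split one tab-separated line and label its fields for the given layout."""
    columns = LAYOUTS[layout]
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(f"{layout} expects {len(columns)} columns, got {len(fields)}")
    record = dict(zip(columns, fields))
    if record["gender"] not in ("XX", "XY"):
        raise ValueError(f"gender must be XX or XY, got {record['gender']!r}")
    if record["status"] not in ("0", "1"):
        raise ValueError(f"status must be 0 or 1, got {record['status']!r}")
    return record


example = parse_row(
    "G15511\tXX\t0\tC09DFN\t"
    "pathToFiles/G15511.C09DFN.md.recal.bam\t"
    "pathToFiles/G15511.C09DFN.md.recal.bai",
    "variantcalling",
)
print(example["sample"], example["bam"])
```

Note that the column count alone cannot tell the paired-end FASTQ layout from the recalibrate layout, since both have seven columns, so the intended step has to be given explicitly.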
Input files for Sarek can also be specified using the path to a VCF directory passed to the --input option. This works only with the annotate step.
Multiple VCF files can be specified if the path is enclosed in quotes.
Because Sarek uses bgzip and tabix to compress and index the annotated VCF files, it expects the input VCF files to be sorted.
nextflow run nf-core/sarek --step annotate --input "results/VariantCalling/*/.vcf.gz" ...
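Whether a VCF is sorted can be checked before annotation. The Python sketch below assumes that "sorted" means what tabix indexing needs: records of the same contig are contiguous, and positions do not decrease within a contig. Whether Sarek applies any further ordering rule is not stated here. The function name vcf_is_sorted and the sample records are hypothetical, and the function takes plain text lines, so a compressed file would need to be opened with gzip first.

```python
def vcf_is_sorted(lines):
    """Return True if contigs are contiguous and positions never decrease within one."""
    seen_contigs = set()
    current_contig = None
    last_pos = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        if chrom != current_contig:
            if chrom in seen_contigs:
                return False  # contig appears again after another contig
            seen_contigs.add(chrom)
            current_contig = chrom
        elif pos < last_pos:
            return False  # position went backwards within the contig
        last_pos = pos
    return True


# Hypothetical records, for illustration only.
sorted_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t250\t.\tC\tT\t.\t.\t.",
    "chr2\t50\t.\tG\tA\t.\t.\t.",
]
unsorted_vcf = sorted_vcf[:2] + [sorted_vcf[3], sorted_vcf[2], sorted_vcf[4]]

print(vcf_is_sorted(sorted_vcf), vcf_is_sorted(unsorted_vcf))
```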