WGS
- Illumina Paired-End Short Read Sequencing: Use this workflow for paired-end short read data from Illumina platforms. It provides comprehensive analysis including alignment, variant calling, and annotation.
- Illumina Single-End Short Read Sequencing: This workflow is similar to bwa_srp but offers more configuration options and flexibility. Choose this if you need finer control over parameters or want to process BAM/CRAM files directly.
- Long Reads (Oxford Nanopore): This workflow is specifically designed for Oxford Nanopore long-read sequencing data. It leverages Minimap2 for alignment and PEPPER-Margin DeepVariant for accurate variant calling.
- Long Reads (PacBio): Choose this workflow for PacBio long-read sequencing data. It provides efficient alignment with Minimap2 and variant calling with PEPPER-Margin DeepVariant, optimized for PacBio data characteristics.
Illumina Paired-end Short Reads (BWA-MEM Aligner)
This workflow analyzes paired-end short-read data. It takes plain or gzipped FASTQ files, pre-processes them, aligns the reads to a reference genome with BWA-MEM, calls variants, and annotates them.
Workflow Inputs
- FASTQ Folder: This specifies the folder containing the FASTQ files with reads to be processed. Files should follow a specific naming convention (e.g., sample1_R1.fastq.gz). Accepted formats include both regular and gzipped FASTQ files. The location must be within the designated FASTQ bucket.
- Reference File: This refers to the reference file used by the pipeline. You can use the default options (GRCh37.75 or GRCh38.99) or provide a custom reference FASTA file by specifying its cloud storage location (gs:// or s3://).
- Minimum Q-score for FASTQ: This sets the minimum quality score for FASTQ reads. The default value is 30.
- Minimum mapping quality Q-score: This sets the minimum mapping quality score for alignments. The default value is 30.
- DeepVariant model to use: Choose between "WGS" or "WES" depending on your sequencing data.
- Multi-lane mode: Enable this option if your data originates from multiple lanes. Note that this mode processes all files as a single sample, without parallelization.
- The human genome assembly to use: Select the appropriate human genome assembly ("GRCh37.75", "GRCh38.99") or provide a custom reference FASTA. Alternatively, choose "noAnnon" to skip variant annotation.
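The naming convention mentioned for the FASTQ folder can be checked before a run is submitted. The following is a minimal sketch, assuming mates are named `<sample>_R1.fastq(.gz)` and `<sample>_R2.fastq(.gz)` as in the `sample1_R1.fastq.gz` example; the function names and the exact pattern are illustrative assumptions, not the pipeline's own validation logic.

```python
import re
from collections import defaultdict

# Assumed pattern, based on the "sample1_R1.fastq.gz" example above.
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq(\.gz)?$")


def pair_fastq_files(filenames):
    """Group FASTQ file names into {sample: {"R1": name, "R2": name}}."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            continue  # ignore files that do not follow the assumed convention
        pairs[match.group("sample")]["R" + match.group("read")] = name
    return dict(pairs)


def incomplete_samples(pairs):
    """Return the sorted names of samples that are missing one mate."""
    return sorted(s for s, mates in pairs.items() if set(mates) != {"R1", "R2"})
```

Running `incomplete_samples(pair_fastq_files(names))` on a folder listing gives a quick way to spot samples with a missing R1 or R2 file before the workflow starts.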
Workflow Steps
1. Sequence Alignment w/ BWA-MEM
This step pre-processes data using FASTP, aligns reads to the reference genome using BWA-MEM, and performs post-processing with SAMTools.
- Additional configurable options:
- align_ncpu: Number of CPU cores for alignment.
- align_ramGB: Amount of memory allocated for alignment.
- align_disk: Disk space allocated for alignment.
- align_docker: Docker image used for the alignment step.
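To make the alignment step more concrete, the sketch below builds the kind of command lines that a FASTP, BWA-MEM, and SAMtools chain typically uses. It is an illustration only: the actual commands run inside the workflow are not documented here, and the mapping of "Minimum Q-score for FASTQ" to fastp's `-q` option and of the minimum mapping quality to `samtools view -q` is an assumption. The function name, output file names, and defaults are hypothetical.

```python
def alignment_commands(r1, r2, reference, sample,
                       min_qscore=30, min_mapq=30, ncpu=4):
    """Return argument lists for a hypothetical fastp -> bwa mem -> samtools chain.

    Nothing is executed; the lists can be passed to subprocess or printed.
    """
    trimmed_r1 = f"{sample}_R1.trimmed.fastq.gz"
    trimmed_r2 = f"{sample}_R2.trimmed.fastq.gz"
    bam = f"{sample}.bam"
    return {
        # Pre-processing: -q sets the per-base quality considered "qualified".
        "fastp": ["fastp", "-i", r1, "-I", r2,
                  "-o", trimmed_r1, "-O", trimmed_r2,
                  "-q", str(min_qscore)],
        # Alignment: -t sets the number of threads (cf. align_ncpu).
        "align": ["bwa", "mem", "-t", str(ncpu),
                  reference, trimmed_r1, trimmed_r2],
        # Post-processing: drop alignments below the MAPQ threshold, then sort.
        "filter": ["samtools", "view", "-b", "-q", str(min_mapq), "-"],
        "sort": ["samtools", "sort", "-@", str(ncpu), "-o", bam, "-"],
        "index": ["samtools", "index", bam],
    }
```

In a real run the `align`, `filter`, and `sort` commands would be connected with pipes; they are kept as separate lists here so each stage is easy to read.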
2. DeepVariant
This step runs the DeepVariant variant caller on the aligned BAM file.
- Additional configurable options:
- deepvariant_ramGB: Amount of memory for DeepVariant.
- deepvariant_disk: Disk space for DeepVariant.
- deepvariant_ncpu: Number of CPU cores for DeepVariant.
- deepvariant_docker: Docker image for DeepVariant.
3. SnpEff
This step utilizes SnpEff to annotate and predict the effects of identified variants.
- Additional configurable options:
- vardig_dbsnp: dbSNP VCF file used for annotation (optional).
- vardig_dbsnp_tbi: The corresponding .tbi index file for the dbSNP VCF (optional).
- vardig_ncpu: Number of CPU cores for SnpEff.
- vardig_ramGB: Amount of memory for SnpEff.
- vardig_disk: Disk space for SnpEff.
- vardig_docker: Docker image for SnpEff.
Workflow Outputs
- BAM file: Aligned reads in BAM format.
- BAI file: Index file for the BAM file.
- VCF file: Variant Call Format file containing identified variants.
- Annotated VCF file: VCF file with annotations from SnpEff.
- Summary reports and plots: Various reports and plots summarizing alignment and variant calling results.
Illumina Single-end Short Reads (BWA-MEM Aligner)
This workflow analyzes single-end short-read data. It aligns FASTQ files to a reference genome using BWA-MEM, or takes BAM/CRAM files directly, and calls variants using DeepVariant. Optionally, it can annotate those variants using SnpEff.
Workflow Inputs
- Workflow Name (Optional): Associate a name with your workflow. This name will be used to identify the workflow in the list of processed workflows.
- Input File Folder: Specify the name of the folder containing the FASTQ/BAM/CRAM files to be processed by the pipeline. This folder must be located in the designated FASTQ bucket.
- File Type: Choose the format of the input files: "BAM/CRAM" or "FastQ". The "FastQ" option includes both gzipped (.fastq.gz) and non-gzipped (.fastq) FASTQ files.
- Reference File: Specify the name of the reference file to be used by the pipeline. You can either use the OmicsPipelines-provided reference data (GRCh37.75 or GRCh38.99) or override it with your own reference file. If you choose to override, ensure that the reference file is a valid cloud storage location (full "gs://..." path) accessible to the server.
- Minimum Q-score for FASTQ (Default: 30): Set the minimum quality score for FASTQ files. Reads with a Q-score below this threshold will be filtered out during the alignment process.
- Minimum Mapping Quality Q-score (Default: 30): Specify the minimum mapping quality score. Alignments with a mapping quality below this threshold will be discarded.
- DeepVariant Model: Select the DeepVariant model to be used for variant calling ("WGS" or "WES").
- Genome Assembly: Choose the human genome assembly to use for variant annotation: "GRCh37.75", "GRCh38.99", a custom reference FASTA, or "noAnnon" for no annotation. If using a custom reference FASTA, provide a valid .vcf.gz cloud storage location (full "gs://..." path) accessible to the server. If this field is left empty or "noAnnon" is selected, the pipeline will skip variant annotation.
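Both Q-score thresholds above are on the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). The short sketch below shows the conversion, so the default of 30 can be read as roughly a 1-in-1000 chance that a base call (or, for mapping quality, the reported alignment position) is wrong. The function names are illustrative.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled score."""
    return -10 * math.log10(p)
```

Lowering a threshold from 30 to 20 therefore accepts bases or alignments with up to a 1-in-100 error probability, which keeps more data at the cost of more noise.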
Workflow Steps
1. Sequence Alignment (BWA-MEM)
This step pre-processes FASTQ data using FASTP, aligns reads to the reference genome using BWA-MEM, and performs post-processing with SAMTools. It generates a BAM file containing aligned reads and a BAM index file.
- Additional configurable options:
- align_ncpu: Number of CPU cores to use for the alignment process. Increasing this value can speed up the alignment but requires more computational resources.
- align_ramGB: Amount of memory (in gigabytes) to allocate for the alignment step.
- align_disk: Disk space (in gigabytes) to allocate for the alignment step.
- align_mapq: Minimum mapping quality threshold. Alignments with a mapping quality below this value will be filtered out.
- align_qscore: Minimum quality score for FASTQ reads. Reads with a Q-score below this threshold will be discarded.
- align_docker: Specify the Docker image to use for the alignment step.
2. Index BAM/FASTQ File
This step is only performed if the input file type is BAM/CRAM. It indexes the input BAM/CRAM file to enable efficient random access.
- Additional configurable options:
- align_ncpu: Number of CPU cores to use for indexing the BAM/CRAM file.
- align_ramGB: Amount of memory (in gigabytes) to allocate for the indexing process.
- align_disk: Disk space (in gigabytes) to allocate for the indexing step.
- align_docker: Specify the Docker image to use for the indexing step.
3. DeepVariant
This step performs variant calling on the aligned BAM file using the DeepVariant tool. It generates a VCF file containing the identified variants.
- Additional configurable options:
- deepvariant_ncpu: Number of CPU cores to use for variant calling with DeepVariant.
- deepvariant_ramGB: Amount of memory (in gigabytes) to allocate for DeepVariant.
- deepvariant_disk: Disk space (in gigabytes) to allocate for DeepVariant.
- deepvariant_docker: Specify the Docker image to use for DeepVariant.
4. SnpEff Variant Annotation
This step is only performed if the "Genome Assembly" input is not set to "noAnnon". It annotates the variants identified by DeepVariant using SnpEff and generates an annotated VCF file.
- Additional configurable options:
- vardig_ncpu: Number of CPU cores to use for variant annotation with SnpEff.
- vardig_ramGB: Amount of memory (in gigabytes) to allocate for SnpEff.
- vardig_disk: Disk space (in gigabytes) to allocate for SnpEff.
- vardig_docker: Specify the Docker image to use for SnpEff.
- vardig_dbsnp: Reference database in VCF format for annotation.
- vardig_dbsnp_tbi: Indexed reference database file for annotation.
Workflow Outputs
- Aligned BAM File: This file contains the reads aligned to the reference genome.
- Aligned BAM Index File: This file is used for efficient random access to the aligned BAM file.
- FASTP Report: This HTML report provides quality control metrics for the input FASTQ data.
- QualiMap Report: This report provides detailed statistics and visualizations of the alignment results.
- DeepVariant VCF File: This file contains the variants identified by DeepVariant.
- SnpEff Annotated VCF File (Optional): This file contains the variants identified by DeepVariant with annotations added by SnpEff.
- VCF Summary Report (Optional): This PDF report provides a summary of the variants identified in the VCF file.
- VCF with Variants not in Database (Optional): This VCF file contains variants that were not found in the specified reference database.
- VCF with Rare Variants (Optional): This VCF file contains variants with a TOPMED frequency of less than 0.1, indicating they are relatively rare.
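The rare-variant output can be pictured as a simple filter on the annotated VCF. The sketch below assumes the TOPMed frequency is stored in an INFO field named `TOPMED`, laid out as in dbSNP VCFs (a comma-separated list with the reference allele first, then the alternate alleles, and `.` for missing values). Both the field name and the rule that a variant is rare when every alternate allele is below the threshold are assumptions; the pipeline's actual filter may differ.

```python
def topmed_alt_frequencies(info):
    """Return alternate-allele TOPMED frequencies from a VCF INFO string."""
    for field in info.split(";"):
        if field.startswith("TOPMED="):
            values = field[len("TOPMED="):].split(",")
            # The first value is assumed to be the reference allele frequency.
            return [float(v) for v in values[1:] if v != "."]
    return []


def is_rare(info, threshold=0.1):
    """True if the variant has TOPMED data and all alt frequencies are below threshold."""
    freqs = topmed_alt_frequencies(info)
    return bool(freqs) and max(freqs) < threshold
```

In this sketch, variants without any TOPMED annotation are not reported as rare; they would more naturally belong to the "not in database" output.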
Oxford Nanopore (Minimap2 Aligner)
This pipeline processes Oxford Nanopore long-read sequencing data using Minimap2 for alignment and PEPPER-Margin DeepVariant for variant calling. It also includes optional variant annotation using SnpEff.
Workflow Inputs
- Sample Name: A name to associate with your sample (optional).
- Input File Folder: The folder containing your FASTQ, BAM, or CRAM files.
- File Type: Specify whether your input files are FASTQ or BAM/CRAM format. For FASTQ, both gzipped (.fastq.gz) and unzipped (.fastq) files are supported.
- Reference File: Choose a pre-loaded reference genome (GRCh37.75 or GRCh38.99) or provide a custom FASTA file from a cloud storage location (gs:// or s3://).
- Minimum Q-score for FASTQ (Default: 30): Set the minimum quality score for FASTQ reads.
- Minimum Mapping Quality Q-score (Default: 30): Set the minimum mapping quality score for alignments.
- DeepVariant model to use: Select the appropriate DeepVariant model based on your sequencing data (e.g., WES for whole-exome sequencing).
- The human genome assembly to use: Choose the human genome assembly for variant annotation. Options include GRCh37.75, GRCh38.99, a custom reference FASTA, or "noAnnon" for no annotation. For custom references, provide a .vcf.gz file from a cloud storage location.
- ONT chemistry version: Select the appropriate ONT chemistry version (v9.4 or v10).
Workflow Steps
1. Sequence Alignment w/ Minimap2
This step pre-processes FASTQ data using FASTP, aligns reads to the reference genome with Minimap2, and post-processes alignments using SAMTools.
- Additional configurable options:
- align_ncpu: Number of CPU cores for alignment.
- align_ramGB: Amount of memory (in GB) for alignment.
- align_disk: Disk space (in GB) for alignment.
- align_docker: Docker image for the alignment step.
2. Index BAM/FASTQ File
This step indexes BAM/CRAM files or the BAM file generated from FASTQ alignment.
- Additional configurable options:
- align_ncpu: Number of CPU cores for indexing.
- align_ramGB: Amount of memory (in GB) for indexing.
- align_disk: Disk space (in GB) for indexing.
- align_docker: Docker image for the indexing step.
3. PEPPER-Margin DeepVariant
This step performs variant calling using PEPPER-Margin DeepVariant on the aligned and indexed BAM file.
- Additional configurable options:
- pmdv_ncpu: Number of CPU cores for variant calling.
- pmdv_ramGB: Amount of memory (in GB) for variant calling.
- pmdv_disk: Disk space (in GB) for variant calling.
- pmdv_docker: Docker image for the variant calling step.
4. SnpEff/Variant Annotation (Optional)
This step annotates variants using SnpEff based on the chosen genome assembly.
- Additional configurable options:
- vardig_ncpu: Number of CPU cores for variant annotation.
- vardig_ramGB: Amount of memory (in GB) for variant annotation.
- vardig_disk: Disk space (in GB) for variant annotation.
- vardig_docker: Docker image for the variant annotation step.
Workflow Outputs
- Aligned BAM file: The aligned reads in BAM format.
- Aligned BAM index file: Index file for the aligned BAM.
- Alignment report: Quality control report generated by Qualimap.
- Variant calling output: VCF file containing identified variants.
- Annotated VCF file (optional): VCF file with annotated variants based on the selected genome assembly.
- Variant annotation reports (optional): Various reports and summaries generated by SnpEff and related tools.
PacBio (Minimap2 Aligner)
This workflow aligns PacBio reads to a reference genome using minimap2 and calls variants using PEPPER-Margin DeepVariant. Optionally, it can also annotate the called variants using SnpEff.
Workflow Inputs
- Workflow name: (Optional) A name to associate with your workflow for easy identification.
- Input File Folder: The name of the folder containing the FASTQ/BAM/CRAM files to be processed.
- The format of the files: Select either "FastQ" (including .fastq.gz and .fastq) or "BAM/CRAM". The default is FastQ.
- Reference file: The reference genome to use for alignment. You can choose between "GRCh37.75" and "GRCh38.99" or provide a custom reference FASTA file in a cloud storage location accessible to the server. Make sure the corresponding .fai index file is also available.
- Minimum Q-score for FASTQ: (Optional) The minimum quality score for FASTQ reads. The default is 30.
- Minimum mapping quality Q-score: (Optional) The minimum mapping quality score for alignments. The default is 30.
- DeepVariant model to use: (Optional) The DeepVariant model to use for variant calling. The default is "WES".
- The human genome assembly to use: (Optional) Select the human genome assembly for variant annotation. Choose from "GRCh37.75", "GRCh38.99", or "noAnnon" for no annotation. For custom annotations, provide a .vcf.gz file in a cloud storage location accessible to the server, along with its corresponding .tbi index file.
- PacBio instrument: Select the PacBio instrument used to generate the data: "hifi" or "pb".
Workflow Steps
1. Sequence Alignment w/ Minimap2
This step pre-processes FASTQ data using FASTP, aligns reads to the reference genome using minimap2, and performs post-processing with SAMtools. For BAM/CRAM inputs, this step performs indexing instead of alignment.
- Additional configurable options:
- align_ramGB: Memory allocation for the alignment step.
- align_disk: Disk space allocation for the alignment step.
- align_ncpu: Number of CPU cores to use for alignment.
- align_docker: Docker image to use for the alignment step.
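As an illustration of how the "PacBio instrument" input could influence alignment, the sketch below maps the two instrument values to minimap2's documented presets (`map-hifi` for HiFi reads, `map-pb` for older continuous long reads) and builds a command line. Whether the workflow selects presets in exactly this way is an assumption, and the function and dictionary names are hypothetical.

```python
# Assumed mapping from the "PacBio instrument" input to minimap2 presets.
MINIMAP2_PRESETS = {"hifi": "map-hifi", "pb": "map-pb"}


def minimap2_command(reference, reads, instrument, ncpu=4):
    """Build a minimap2 argument list that writes SAM output to stdout."""
    if instrument not in MINIMAP2_PRESETS:
        raise ValueError(f"unknown PacBio instrument: {instrument!r}")
    return ["minimap2", "-a", "-x", MINIMAP2_PRESETS[instrument],
            "-t", str(ncpu), reference, reads]
```

For example, `minimap2_command("ref.fa", "reads.fastq.gz", "hifi", ncpu=8)` yields a command using the `map-hifi` preset with eight threads, in the same spirit as the align_ncpu option.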
2. PEPPER-Margin DeepVariant
This step performs variant calling on the aligned BAM file using PEPPER-Margin DeepVariant.
- Additional configurable options:
- pmdv_ramGB: Memory allocation for variant calling.
- pmdv_disk: Disk space allocation for variant calling.
- pmdv_ncpu: Number of CPU cores to use for variant calling.
- pmdv_docker: Docker image to use for variant calling.
3. SnpEff/Variant Annotation (Optional)
This step annotates the called variants and predicts their effects using SnpEff. This step is only performed if a genome assembly is specified for annotation.
- Additional configurable options:
- vardig_dbsnp: (Optional) A dbSNP VCF file for annotation.
- vardig_dbsnp_tbi: (Optional) The corresponding .tbi index file for the dbSNP VCF.
- vardig_ncpu: Number of CPU cores to use for annotation.
- vardig_ramGB: Memory allocation for annotation.
- vardig_disk: Disk space allocation for annotation.
- vardig_docker: Docker image to use for annotation.
Workflow Outputs
- Aligned BAM file: The aligned reads in BAM format.
- Aligned BAM index file: The index file for the aligned BAM.
- Alignment report: A report summarizing the alignment results.
- Variant calling output: A compressed archive containing the variant calling results.
- Annotated VCF file: (Optional) The VCF file with annotations, if enabled.
- Filtered VCF files: (Optional) VCF files filtered for variants not in dbSNP and variants with low allele frequency, if enabled.
- VCF statistics and plots: (Optional) Summary statistics and plots of the called variants, if enabled.