RNA-seq
This workflow processes RNA sequencing data, aligning reads to a reference genome, quantifying gene expression, and generating quality control reports.
Workflow Inputs
- Workflow Name (Optional): A name to associate with your workflow for easy identification.
- FASTQ Folder: The cloud storage location (e.g., gs://bucket/folder) containing the FASTQ files for your samples. The folder should contain paired-end reads with filenames ending in "_R1.fastq.gz" and "_R2.fastq.gz".
- UMI Read Index Files (Optional): Enable this option and provide the cloud storage location of UMI index files if your data includes them. These files should have filenames ending in "_I1.fastq.gz".
- Reference Files: The workflow uses pre-defined reference files for alignment and quantification. You have the option to override these defaults with your own files.
- STAR Index: The cloud storage location of the STAR index file for read alignment.
- GTF File: The cloud storage location of the GTF annotation file for gene quantification.
- RSEM Reference: The cloud storage location of the RSEM reference file for transcript quantification.
- Globin Genome Index: The cloud storage location of the Bowtie2 index for globin sequences, used to measure globin contamination.
- rRNA Genome Index: The cloud storage location of the Bowtie2 index for rRNA sequence to quantify rRNA reads.
- PhiX Genome Index: The cloud storage location of the Bowtie2 index for PhiX sequence to measure PhiX contamination.
- RefFlat File: The cloud storage location of the RefFlat file generated from the GTF file using gtfToGenePred.
- Output Report Name: The prefix used for the final merged report files.
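The FASTQ naming convention described above can be checked before launching a run. The following is a minimal sketch, assuming the folder listing is already available as a list of filenames; the helper names `group_fastqs` and `check_pairs` are hypothetical and not part of the workflow.

```python
from collections import defaultdict

# Suffixes taken from the input descriptions above.
SUFFIXES = {"R1": "_R1.fastq.gz", "R2": "_R2.fastq.gz", "I1": "_I1.fastq.gz"}


def group_fastqs(filenames):
    """Group FASTQ filenames by sample prefix (the text before the suffix)."""
    samples = defaultdict(dict)
    for name in filenames:
        for role, suffix in SUFFIXES.items():
            if name.endswith(suffix):
                samples[name[: -len(suffix)]][role] = name
                break
    return dict(samples)


def check_pairs(samples, expect_umi=False):
    """Return a list of problems; an empty list means every sample looks complete."""
    required = ["R1", "R2"] + (["I1"] if expect_umi else [])
    problems = []
    for sample, files in sorted(samples.items()):
        for role in required:
            if role not in files:
                problems.append(f"{sample}: missing {SUFFIXES[role]}")
    return problems


# Illustrative listing; files that match no suffix are ignored.
listing = ["s1_R1.fastq.gz", "s1_R2.fastq.gz", "s2_R1.fastq.gz", "notes.txt"]
samples = group_fastqs(listing)
print(check_pairs(samples))
```

Passing `expect_umi=True` additionally requires an `_I1.fastq.gz` file for every sample, which mirrors enabling the UMI option.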
Workflow Steps
The RNA-seq pipeline consists of 16 steps, each performing a specific analysis:
1. Pre-Trim FASTQC:
Quality control analysis of the raw reads before adapter trimming.
- Additional configurable options:
- pretrim_fastqc_ncpu: Number of CPU cores allocated for pre-trim FASTQC analysis.
- pretrim_fastqc_ramGB: Amount of RAM (in GB) dedicated to pre-trim FASTQC.
- pretrim_fastqc_disk: Disk space (in GB) allocated for pre-trim FASTQC outputs.
- fastqc_docker: Specify the Docker image used for running FASTQC.
2. Attach UMI (Optional):
Appends UMI information to read names if UMI index files are provided.
- Additional configurable options:
- attach_umi_ncpu: Number of CPU cores used for attaching UMI information to read names.
- attach_umi_ramGB: RAM allocated for the UMI attachment process.
- attach_umi_disk: Disk space designated for UMI attachment outputs.
- attach_umi_docker: Docker image used for the UMI attachment step.
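To illustrate what attaching a UMI to a read name means, here is a minimal sketch. The separator (`:`) and the placement of the UMI directly after the read name are assumptions for illustration; the workflow's actual naming format is not documented here, and `attach_umi` is a hypothetical helper.

```python
def attach_umi(read_records, index_records, sep=":"):
    """Yield read records whose name carries the UMI from the matching index record.

    Each record is a (header, sequence, plus, quality) tuple of FASTQ lines.
    Records are assumed to appear in the same order in both inputs.
    """
    for (header, seq, plus, qual), (_, umi, _, _) in zip(read_records, index_records):
        # Split the header into the read name and the optional comment field.
        name, _, comment = header.partition(" ")
        new_header = f"{name}{sep}{umi}"
        if comment:
            new_header += f" {comment}"
        yield (new_header, seq, plus, qual)


# Illustrative records: one read and its index read holding the UMI sequence.
reads = [("@read1 1:N:0:ACGT", "GATTACA", "+", "IIIIIII")]
index = [("@read1 1:N:0:ACGT", "TTAGGC", "+", "IIIIII")]
tagged = list(attach_umi(reads, index))
print(tagged[0][0])
```

Keeping the UMI in the read name lets it travel through alignment so that later steps can recover it from the BAM file.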
3. Cutadapt:
Trims adapter sequences from reads and removes low-quality reads.
- Additional configurable options:
- minimumLength: Minimum length of reads after adapter trimming. Reads shorter than this will be discarded.
- index_adapter: Sequence of the adapter used for indexing.
- univ_adapter: Sequence of the universal adapter used (if applicable).
- cutadapt_ncpu: Number of CPU cores used for adapter trimming with Cutadapt.
- cutadapt_ramGB: RAM allocated for Cutadapt execution.
- cutadapt_disk: Disk space designated for Cutadapt outputs.
- cutadapt_docker: Docker image used for running Cutadapt.
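The options above map naturally onto a Cutadapt command line. The sketch below only builds the argument list and does not run anything. Assigning `index_adapter` to read 1 (`-a`) and `univ_adapter` to read 2 (`-A`) is an assumption, as are the file names and the placeholder adapter sequence; the workflow's real invocation may differ.

```python
import shlex


def cutadapt_command(r1, r2, out_r1, out_r2, index_adapter, univ_adapter,
                     minimum_length, ncpu):
    """Build a Cutadapt argument list for paired-end trimming (not executed here)."""
    return [
        "cutadapt",
        "-a", index_adapter,   # 3' adapter on read 1 (assumed mapping)
        "-A", univ_adapter,    # 3' adapter on read 2 (assumed mapping)
        "--minimum-length", str(minimum_length),  # shorter reads are discarded
        "--cores", str(ncpu),
        "-o", out_r1,
        "-p", out_r2,
        r1, r2,
    ]


# Placeholder values for illustration only.
cmd = cutadapt_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz",
    index_adapter="AGATCGGAAGAGC", univ_adapter="AGATCGGAAGAGC",
    minimum_length=20, ncpu=4,
)
print(shlex.join(cmd))
```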
4. Post-Trim FASTQC:
Quality control analysis of reads after adapter trimming.
- Additional configurable options:
- posttrim_fastqc_ncpu: Number of CPU cores allocated for post-trim FASTQC analysis.
- posttrim_fastqc_ramGB: Amount of RAM (in GB) dedicated to post-trim FASTQC.
- posttrim_fastqc_disk: Disk space (in GB) allocated for post-trim FASTQC outputs.
- fastqc_docker: Specify the Docker image used for running FASTQC.
5. MultiQC:
Generates a consolidated report combining results from pre- and post-trim FASTQC and Cutadapt.
- Additional configurable options:
- multiqc_ncpu: Number of CPU cores used for generating the MultiQC report.
- multiqc_ramGB: RAM allocated for MultiQC report generation.
- multiqc_disk: Disk space allocated for MultiQC report outputs.
- multiqc_docker: Docker image used for running MultiQC.
6. STAR Alignment:
Aligns reads to the reference genome using STAR.
- Additional configurable options:
- star_ncpu: Number of CPU cores dedicated to STAR alignment.
- star_ramGB: RAM allocated for STAR alignment execution.
- star_disk: Disk space designated for STAR alignment outputs.
- star_docker: Docker image used for running STAR.
7. FeatureCounts:
Quantifies gene expression levels based on read counts.
- Additional configurable options:
- feature_counts_ncpu: Number of CPU cores used for gene expression quantification with FeatureCounts.
- feature_counts_ramGB: RAM allocated for FeatureCounts execution.
- feature_counts_disk: Disk space allocated for FeatureCounts outputs.
- feature_counts_docker: Docker image used for running FeatureCounts.
8. RSEM Quantification:
Quantifies gene and transcript expression levels, including FPKMs and TPMs.
- Additional configurable options:
- rsem_ncpu: Number of CPU cores dedicated to RSEM quantification.
- rsem_ramGB: RAM allocated for RSEM execution.
- rsem_disk: Disk space designated for RSEM outputs.
- rsem_docker: Docker image used for running RSEM.
9. Contamination Estimation:
Uses Bowtie2 to estimate the level of globin, rRNA, and PhiX contamination in the samples.
- Additional configurable options:
- bowtie2_{globin,rrna,phix}_ncpu: Number of CPU cores for each Bowtie2 contamination estimation step.
- bowtie2_{globin,rrna,phix}_ramGB: RAM allocated for each Bowtie2 step.
- bowtie2_{globin,rrna,phix}_disk: Disk space for each Bowtie2 step outputs.
- bowtie_docker: Docker image used for running Bowtie2.
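The `{globin,rrna,phix}` shorthand in the option names above is read here as three separate parameters per resource type, one for each contamination target. A small sketch of that expansion, with `expand_braces` as a hypothetical helper:

```python
def expand_braces(pattern):
    """Expand a single {a,b,c} group in a parameter-name pattern."""
    head, _, rest = pattern.partition("{")
    choices, _, tail = rest.partition("}")
    return [f"{head}{choice}{tail}" for choice in choices.split(",")]


for pattern in ("bowtie2_{globin,rrna,phix}_ncpu",
                "bowtie2_{globin,rrna,phix}_ramGB",
                "bowtie2_{globin,rrna,phix}_disk"):
    print(expand_braces(pattern))
```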
10. Mark Duplicates:
Identifies and marks duplicate reads resulting from PCR amplification.
- Additional configurable options:
- markdup_ncpu: Number of CPU cores used for marking duplicate reads with Picard MarkDuplicates.
- markdup_ramGB: RAM allocated for Picard MarkDuplicates execution.
- markdup_disk: Disk space for Picard MarkDuplicates outputs.
11. Collect RNA-seq Metrics:
Collects various RNA-seq quality control metrics using Picard tools.
- Additional configurable options:
- rnaqc_ncpu: Number of CPU cores used for collecting RNAseq metrics with Picard CollectRnaSeqMetrics.
- rnaqc_ramGB: RAM allocated for Picard CollectRnaSeqMetrics execution.
- rnaqc_disk: Disk space for Picard CollectRnaSeqMetrics outputs.
12. UMI Duplication (Optional):
Estimates PCR duplication rates from UMI information if provided.
- Additional configurable options:
- umi_dup_ncpu: Number of CPU cores used for UMI-based duplication estimation.
- umi_dup_ramGB: RAM allocated for UMI duplication estimation.
- umi_dup_disk: Disk space designated for UMI duplication estimation outputs.
- umi_dup_docker: Docker image used for UMI duplication estimation.
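The idea behind UMI-based duplication estimation can be sketched as follows. This is a simplified illustration, not the workflow's actual method: it assumes that reads sharing the same chromosome, start position, strand, and UMI are PCR copies of one molecule, and it ignores UMI sequencing errors.

```python
from collections import Counter


def umi_duplication_rate(alignments):
    """Estimate the duplicate fraction from (chrom, position, strand, umi) tuples."""
    groups = Counter(alignments)
    total = sum(groups.values())
    if total == 0:
        return 0.0
    # Every read beyond the first in a group counts as a duplicate.
    return (total - len(groups)) / total


alignments = [
    ("chr1", 100, "+", "AACG"),
    ("chr1", 100, "+", "AACG"),  # same position and UMI: duplicate
    ("chr1", 100, "+", "TTGC"),  # same position, different UMI: distinct molecule
    ("chr2", 500, "-", "AACG"),
]
print(umi_duplication_rate(alignments))  # 0.25
```

The third read shows why UMIs help: a position-only approach would count it as a duplicate even though it comes from a different molecule.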
13. SAMTools Mapped:
Calculates the percentage of reads mapped to different genomic regions.
- Additional configurable options:
- mapped_ncpu: Number of CPU cores used for calculating mapped reads with SAMTools.
- mapped_ramGB: RAM allocated for SAMTools Mapped execution.
- mapped_disk: Disk space for SAMTools Mapped outputs.
- samtools_docker: Docker image used for running SAMTools.
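One way to obtain per-region mapping percentages is to parse the tab-separated output of `samtools idxstats` (reference name, length, mapped reads, unmapped reads). Whether this step actually uses `idxstats` is an assumption; the sketch below only shows the arithmetic on text in that format.

```python
def percent_mapped_by_reference(idxstats_text):
    """Return {reference_name: percent of mapped reads} from idxstats-style text."""
    counts = {}
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name != "*":  # the "*" row holds reads with no reference assigned
            counts[name] = int(mapped)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


# Illustrative input with made-up counts.
example = "chr1\t1000\t60\t0\nchrX\t500\t30\t0\nchrM\t16\t10\t0\n*\t0\t0\t5\n"
print(percent_mapped_by_reference(example))
```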
14. MultiQC Post-Alignment:
Generates a consolidated report combining results from STAR, Picard tools, and other post-alignment steps.
- Additional configurable options:
- mqc_postalign_ncpu: Number of CPU cores used for generating the post-alignment MultiQC report.
- mqc_postalign_ramGB: RAM allocated for post-alignment MultiQC report generation.
- mqc_postalign_disk: Disk space allocated for post-alignment MultiQC report outputs.
15. RNAseq QC Report:
Creates a comprehensive QC report for each sample using MultiQC reports and other log files.
- Additional configurable options:
- collect_qc_ncpu: Number of CPU cores used for generating the final RNAseq QC report.
- collect_qc_ramGB: RAM allocated for final QC report generation.
- collect_qc_disk: Disk space allocated for final QC report outputs.
- collect_qc_docker: Docker image used for running the final QC report.
16. Merge Results:
Merges quantification outputs from RSEM and FeatureCounts, along with QC reports, into final combined files.
- Additional configurable options:
- merge_results_ncpu: Number of CPU cores used for merging results.
- merge_results_ramGB: RAM allocated for merging results.
- merge_results_disk: Disk space allocated for merging results.
- merge_results_docker: Docker image used for merging results.
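Merging per-sample quantification into one table amounts to building a gene-by-sample matrix. The following is a minimal sketch, assuming each sample's values are already loaded into a dictionary; the `gene_id` column name, the fill value for missing genes, and the sample and gene names are illustrative choices rather than the workflow's actual output format.

```python
import csv
import io


def merge_samples(per_sample, missing="0"):
    """Combine {sample: {gene: value}} into rows of a gene-by-sample table."""
    samples = sorted(per_sample)
    genes = sorted({gene for values in per_sample.values() for gene in values})
    rows = [["gene_id"] + samples]
    for gene in genes:
        rows.append([gene] + [str(per_sample[s].get(gene, missing)) for s in samples])
    return rows


per_sample = {
    "sampleA": {"GENE1": 10, "GENE2": 0},
    "sampleB": {"GENE1": 7, "GENE3": 4},
}
rows = merge_samples(per_sample)

# Write the table as tab-separated text.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
print(buffer.getvalue(), end="")
```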
Workflow Outputs
- RSEM Gene Counts: A table containing the read counts estimated by RSEM for each gene across all samples.
- RSEM Gene TPMs: A table containing TPM (Transcripts Per Million) values for each gene across all samples.
- RSEM Gene FPKMs: A table containing FPKM (Fragments Per Kilobase of transcript per Million mapped reads) values for each gene across all samples.
- FeatureCounts File: A table containing raw read counts for each gene across all samples.
- QC Report File: A combined QC report summarizing results for all samples.
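For readers unfamiliar with the two normalized units in the outputs above, the sketch below computes FPKM and TPM from counts and gene lengths using the standard definitions. It is a simplification: RSEM works with estimated counts and effective lengths, so its reported values will not match this arithmetic exactly. The gene names and numbers are made up.

```python
def fpkm(counts, lengths):
    """FPKM per gene: fragments per kilobase of transcript per million fragments."""
    total = sum(counts.values())
    if total == 0:
        return {g: 0.0 for g in counts}
    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}


def tpm(counts, lengths):
    """TPM per gene: length-normalized rates rescaled to sum to one million."""
    rates = {g: counts[g] / lengths[g] for g in counts}
    scale = sum(rates.values())
    if scale == 0:
        return {g: 0.0 for g in counts}
    return {g: rates[g] / scale * 1e6 for g in counts}


counts = {"GENE1": 100, "GENE2": 300}     # fragments assigned to each gene
lengths = {"GENE1": 1000, "GENE2": 3000}  # gene lengths in bases

print(fpkm(counts, lengths))
print(tpm(counts, lengths))
```

Both genes have the same fragments-per-base rate here, so they receive equal FPKM and equal TPM even though their raw counts differ. TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM.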
Additional Notes
- Each step in the workflow has configurable options for adjusting computational resources (CPU, RAM, disk) and specifying docker images.
- The workflow is designed to be modular, allowing for customization and extension as needed.
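Because the resource options follow a regular naming scheme (`_ncpu`, `_ramGB`, `_disk`, `_docker`), a set of overrides can be sanity-checked before submission. This is a minimal sketch under stated assumptions: resource values are taken to be positive integers, the values shown are arbitrary, and how the overrides are actually passed to the workflow (for example, any required key prefix) is not covered here.

```python
import json

RESOURCE_SUFFIXES = ("_ncpu", "_ramGB", "_disk")


def validate_overrides(overrides):
    """Return a list of problems found in a dict of per-step option overrides."""
    problems = []
    for key, value in overrides.items():
        if key.endswith(RESOURCE_SUFFIXES):
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                problems.append(f"{key}: expected a positive integer, got {value!r}")
        elif key.endswith("_docker"):
            if not isinstance(value, str) or not value:
                problems.append(f"{key}: expected a non-empty image name")
        else:
            problems.append(f"{key}: not a resource or Docker option")
    return problems


# Arbitrary example values; rsem_disk is deliberately invalid.
overrides = {"star_ncpu": 16, "star_ramGB": 64, "rsem_disk": 0}
print(validate_overrides(overrides))
print(json.dumps(overrides, indent=2))
```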