RRBS

This workflow is designed for the analysis of Reduced Representation Bisulfite Sequencing (RRBS) data. It performs adapter trimming, alignment, methylation quantification, and quality control.

Workflow Inputs

  • Workflow Name (Optional): A name for your workflow to help identify it in the list of processed workflows.
  • FASTQ Folder: The folder containing your FASTQ files, including forward (R1), reverse (R2), and UMI (I1) reads. Files should follow the naming convention <Sample Name>_R1.fastq.gz, <Sample Name>_R2.fastq.gz, and <Sample Name>_I1.fastq.gz.
  • Sample Genome Index (TAR): Path to the reference genome index for the sample. The default is a pre-built rat genome index, but you can override it with your own.
  • Spike-in Genome Index (TAR): Path to the reference genome index for the spike-in control (e.g., lambda phage). The default is a pre-built lambda genome index, but you can override it with your own.
  • PhiX Genome Index (TAR): Path to the Bowtie2 reference index for PhiX. The default is a pre-built PhiX index, but you can override it with your own.
  • Output Report Name: A name or prefix to use for the generated reports and pipeline outputs.
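
As a rough pre-flight check, the FASTQ naming convention above can be validated before launching the workflow. The following is a minimal sketch, not part of the pipeline itself; the function name group_fastqs and the example file names are hypothetical.

```python
import re
from collections import defaultdict

# Naming convention from the FASTQ Folder input:
# <Sample Name>_R1.fastq.gz, <Sample Name>_R2.fastq.gz, <Sample Name>_I1.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>R1|R2|I1)\.fastq\.gz$")
REQUIRED_READS = {"R1", "R2", "I1"}


def group_fastqs(filenames):
    """Group file names by sample and list the reads each sample is missing."""
    samples = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match:  # files that do not follow the convention are skipped
            samples[match["sample"]][match["read"]] = name
    missing = {
        sample: sorted(REQUIRED_READS - set(reads))
        for sample, reads in samples.items()
        if REQUIRED_READS - set(reads)
    }
    return dict(samples), missing


# Hypothetical folder listing: liver_2 lacks its R2 and I1 files.
files = [
    "liver_1_R1.fastq.gz", "liver_1_R2.fastq.gz", "liver_1_I1.fastq.gz",
    "liver_2_R1.fastq.gz", "notes.txt",
]
samples, missing = group_fastqs(files)
print(sorted(samples))  # sample names found in the folder
print(missing)          # samples with incomplete read sets
```

Any sample reported in missing would need its remaining files added to the folder before the run.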

Workflow Steps

The RRBS pipeline consists of several steps, each with additional configurable options:

1. Pre-Trim FastQC

This step assesses the initial quality of raw reads before adapter trimming.

  • Additional Configurable Options:
  • pretrim_fastqc_ncpu: Number of CPU cores.
  • pretrim_fastqc_ramGB: Memory allocation in GB.
  • pretrim_fastqc_disk: Disk space allocation in GB.

2. Attach UMI

Appends UMI information from the I1 file to the read names in R1 and R2 files for downstream analysis.

  • Additional Configurable Options:
  • attach_umi_ncpu: Number of CPU cores.
  • attach_umi_ramGB: Memory allocation in GB.
  • attach_umi_disk: Disk space allocation in GB.
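
To illustrate the idea behind this step, the sketch below appends the I1 sequence to the read ID of each matching record. It is only an approximation under stated assumptions: the ":" separator, the function names, and the in-memory (header, sequence, quality) tuples are hypothetical, and the pipeline's actual header format may differ.

```python
def attach_umi(read_header, umi_sequence, separator=":"):
    """Append a UMI to the read ID (the first whitespace-delimited token of a FASTQ header)."""
    read_id, _, comment = read_header.partition(" ")
    tagged = f"{read_id}{separator}{umi_sequence}"
    return f"{tagged} {comment}" if comment else tagged


def attach_umis(read_records, index_records):
    """Yield read records whose headers carry the sequence of the matching I1 record.

    Both inputs are iterables of (header, sequence, quality) tuples in the same order.
    """
    for (header, seq, qual), (index_header, umi, _) in zip(read_records, index_records):
        # Guard against R1/R2 and I1 files that have drifted out of sync.
        if header.split()[0] != index_header.split()[0]:
            raise ValueError(f"Read/index mismatch: {header} vs {index_header}")
        yield attach_umi(header, umi), seq, qual


# Hypothetical single-record example.
r1 = [("@read1 1:N:0:ACGT", "TTGGA", "IIIII")]
i1 = [("@read1 1:N:0:ACGT", "GATTACA", "IIIIIII")]
tagged = list(attach_umis(r1, i1))
print(tagged[0][0])
```

Carrying the UMI in the read name lets it survive trimming and alignment, so the duplicate-marking step can read it back from the BAM records.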

3. Trim Galore (Regular Adapters)

Removes standard adapter sequences from the reads.

  • Additional Configurable Options:
  • trim_reg_adapt_ncpu: Number of CPU cores.
  • trim_reg_adapt_ramGB: Memory allocation in GB.
  • trim_reg_adapt_disk: Disk space allocation in GB.

4. Trim Diversity Adapters

Removes NuGEN-specific diversity adapters.

  • Additional Configurable Options:
  • trim_diversity_adapt_ncpu: Number of CPU cores.
  • trim_diversity_adapt_ramGB: Memory allocation in GB.
  • trim_diversity_adapt_disk: Disk space allocation in GB.

5. Post-Trim FastQC

Evaluates the quality of reads after adapter trimming.

  • Additional Configurable Options:
  • posttrim_fastqc_ncpu: Number of CPU cores.
  • posttrim_fastqc_ramGB: Memory allocation in GB.
  • posttrim_fastqc_disk: Disk space allocation in GB.

6. MultiQC

Aggregates and summarizes results from FastQC and Trim Galore steps.

  • Additional Configurable Options:
  • multiqc_ncpu: Number of CPU cores.
  • multiqc_ramGB: Memory allocation in GB.
  • multiqc_disk: Disk space allocation in GB.

7. Align Trimmed Reads (Sample & Spike-in)

Aligns trimmed reads to both the sample and spike-in reference genomes using Bismark.

  • Additional Configurable Options (for both Sample and Spike-in):
  • align_trim_sample/spike_in_ncpu: Number of CPU cores.
  • align_trim_sample/spike_in_ramGB: Memory allocation in GB.
  • align_trim_sample/spike_in_disk: Disk space allocation in GB.

8. Mark UMI Duplicates (Sample & Spike-in)

Identifies and tags UMI duplicates in the aligned BAM files.

  • Additional Configurable Options (for both Sample and Spike-in):
  • tag_udup_sample/spike_in_ncpu: Number of CPU cores.
  • tag_udup_sample/spike_in_ramGB: Memory allocation in GB.
  • tag_udup_sample/spike_in_disk: Disk space allocation in GB.
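
One common way to define a UMI duplicate is a read that shares its mapping position, strand, and UMI with an earlier read. The sketch below illustrates that definition only; the Alignment type and flag_umi_duplicates function are hypothetical, and the criteria applied by the tool in this pipeline may differ.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    name: str
    chrom: str
    start: int
    strand: str
    umi: str


def flag_umi_duplicates(alignments):
    """Return names of reads that repeat an already-seen (chrom, start, strand, UMI) key."""
    seen = set()
    duplicates = []
    for aln in alignments:
        key = (aln.chrom, aln.start, aln.strand, aln.umi)
        if key in seen:
            duplicates.append(aln.name)
        else:
            seen.add(key)
    return duplicates


# Hypothetical alignments: r2 repeats r1; r3 has the same position but a different UMI.
reads = [
    Alignment("r1", "chr1", 1000, "+", "GATTAC"),
    Alignment("r2", "chr1", 1000, "+", "GATTAC"),
    Alignment("r3", "chr1", 1000, "+", "CCGGAA"),
]
dups = flag_umi_duplicates(reads)
print(dups)
```

Because RRBS fragments start at restriction sites, many independent molecules share the same coordinates; the UMI is what distinguishes them from true PCR copies.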

9. Mark PCR Duplicates (Sample & Spike-in)

Identifies and marks PCR duplicates using Picard's MarkDuplicates.

  • Additional Configurable Options (for both Sample and Spike-in):
  • mark_dup_sample/spike_in_ncpu: Number of CPU cores.
  • mark_dup_sample/spike_in_ramGB: Memory allocation in GB.
  • mark_dup_sample/spike_in_disk: Disk space allocation in GB.

10. Quantify Methylation (Sample & Spike-in)

Quantifies methylation levels using Bismark's methylation extractor.

  • Additional Configurable Options (for both Sample and Spike-in):
  • quant_methyl_sample/spike_in_ncpu: Number of CPU cores.
  • quant_methyl_sample/spike_in_ramGB: Memory allocation in GB.
  • quant_methyl_sample/spike_in_disk: Disk space allocation in GB.
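
The methylation level at a cytosine is the share of reads that report it as methylated. The sketch below recomputes that percentage from the methylated and unmethylated counts in a Bismark coverage file, whose tab-separated columns are chromosome, start, end, methylation percentage, methylated count, and unmethylated count. The function names and example lines are hypothetical.

```python
def methylation_percent(methylated, unmethylated):
    """Percentage of methylated calls, or None when the site has no coverage."""
    total = methylated + unmethylated
    if total == 0:
        return None
    return 100.0 * methylated / total


def parse_cov_line(line):
    """Parse one tab-separated coverage line into (chrom, start, end, meth, unmeth)."""
    chrom, start, end, _pct, meth, unmeth = line.rstrip("\n").split("\t")
    return chrom, int(start), int(end), int(meth), int(unmeth)


# Hypothetical coverage lines for two cytosines.
lines = [
    "chr1\t10468\t10468\t75.0\t3\t1\n",
    "chr1\t10470\t10470\t0.0\t0\t4\n",
]
sites = [parse_cov_line(line) for line in lines]
per_site = [methylation_percent(meth, unmeth) for *_, meth, unmeth in sites]

# Pooled level across all sites: sum the counts first, then take the ratio.
pooled = methylation_percent(
    sum(site[3] for site in sites), sum(site[4] for site in sites)
)
print(per_site, pooled)
```

Pooling counts before dividing weights each site by its coverage, which is usually preferable to averaging the per-site percentages.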

11. Bowtie2 PhiX Alignment

Aligns reads to the PhiX genome using Bowtie2 to assess the level of PhiX contamination.

  • Additional Configurable Options:
  • bowtie2_phix_ncpu: Number of CPU cores.
  • bowtie2_phix_ramGB: Memory allocation in GB.
  • bowtie2_phix_disk: Disk space allocation in GB.

12. SAMTools Mapped

Calculates the percentage of reads mapped to each chromosome and contig.

  • Additional Configurable Options:
  • chrinfo_ncpu: Number of CPU cores.
  • chrinfo_ramGB: Memory allocation in GB.
  • chrinfo_disk: Disk space allocation in GB.
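
A per-chromosome mapping percentage can be derived from per-reference read counts. The sketch below assumes input in the tab-separated layout printed by samtools idxstats (reference name, sequence length, mapped reads, unmapped reads) and uses mapped reads as the denominator; both choices are assumptions, and the pipeline's exact calculation may differ. The function name and example counts are hypothetical.

```python
def percent_mapped_per_contig(idxstats_text):
    """Return {contig: percent of all mapped reads} from idxstats-style text."""
    counts = {}
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name != "*":  # the "*" row holds reads with no reference assigned
            counts[name] = int(mapped)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical counts for three contigs plus the unplaced-read row.
example = "chr1\t1000\t60\t0\nchrM\t100\t20\t0\nchrX\t500\t20\t0\n*\t0\t0\t5\n"
percentages = percent_mapped_per_contig(example)
print(percentages)
```

An unusually high share on a single contig, such as the mitochondrial genome, is a typical signal to investigate in the QC report.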

13. Collect QC Metrics

Gathers and summarizes quality control metrics from various steps in the pipeline.

  • Additional Configurable Options:
  • collect_qc_ncpu: Number of CPU cores.
  • collect_qc_ramGB: Memory allocation in GB.
  • collect_qc_disk: Disk space allocation in GB.

14. Merge Results

Combines reports and QC metrics files from all samples.

  • Additional Configurable Options:
  • merge_results_ncpu: Number of CPU cores.
  • merge_results_ramGB: Memory allocation in GB.
  • merge_results_disk: Disk space allocation in GB.

Workflow Outputs

The RRBS pipeline produces several outputs, including:

  • QC Report: A comprehensive report containing quality control metrics from all steps of the pipeline.
  • Alignment BAM files: Aligned reads in BAM format for both sample and spike-in.
  • Methylation quantification files: Files containing methylation levels for each covered cytosine, for both sample and spike-in.
  • Bismark reports: Reports generated by Bismark, including alignment and methylation extraction summaries.
  • MultiQC report: Aggregated report summarizing results from FastQC and Trim Galore.

Additional Notes

  • The default settings for the workflow are suitable for most RRBS datasets, but you may need to adjust them based on the specific characteristics of your data and computational resources.
  • For more detail on an individual step, refer to the documentation of the tool it uses: FastQC, Trim Galore, MultiQC, Bismark, Picard, Bowtie2, or SAMtools.
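
The resource options listed for each step share a naming pattern: a step prefix followed by _ncpu, _ramGB, or _disk. When adjusting several steps at once, the option names can be generated rather than typed by hand. This is a minimal sketch; the helper name, the numeric values, and the JSON output are hypothetical, and how overrides are actually supplied depends on how you launch the workflow.

```python
import json

# Suffixes used by every step's resource options: CPU cores, memory (GB), disk (GB).
RESOURCE_SUFFIXES = ("ncpu", "ramGB", "disk")


def resource_overrides(step_prefix, ncpu, ram_gb, disk_gb):
    """Build the three resource option names for one step and pair them with values."""
    values = (ncpu, ram_gb, disk_gb)
    return {
        f"{step_prefix}_{suffix}": value
        for suffix, value in zip(RESOURCE_SUFFIXES, values)
    }


# Hypothetical values; choose them to match your data size and available hardware.
overrides = {}
overrides.update(resource_overrides("pretrim_fastqc", ncpu=2, ram_gb=4, disk_gb=50))
overrides.update(resource_overrides("bowtie2_phix", ncpu=4, ram_gb=8, disk_gb=100))
print(json.dumps(overrides, indent=2))
```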