Methodology

See here for more information on the library generation of the sci-RNA-Seq3 protocol.

Major steps

  1. Checks the provided barcodes and sample sheet for sanity.
  2. Converts BCL files to paired-end .fq.gz files with PCR indexes in header (bcl2fastq).
  3. Merges multiple sequencing runs (path_bcl) into one experiment-based file (experiment_name).
  4. Splits paired-end .fq.gz files into smaller (evenly-sized) chunks for parallelization (fastqsplitter).
  5. Demultiplexes using the supplied sample-specific barcodes (sci-rocket).
  6. Finds exact or nearest match for PCR Index #1 (p5), PCR Index #2 (p7), ligation and/or RT barcode (single match with Hamming distance ≤1).
  7. Generates sample-specific .fastq.gz files with corrected R1 sequence (48nt) and added read-names in R2.
  8. Read-pairs without all four matching barcodes are discarded into separate .fastq.gz files with logs detailing which barcode(s) are (non-)matching.
  9. For samples with a specified hashing sheet, additional hashing procedures are performed.
  10. Performs adapter and low-quality base-trimming (fastp).
  11. Read-pairs with a mate ≤10nt after trimming are discarded.
  12. Aligns reads to the supplied reference genome and performs cell-barcode/UMI counting (STARSolo).
  13. STAR index can be generated based on supplied genome sequences and annotations.
  14. Per gene and cellular barcode, intronic, exonic and UTR-overlapping reads (UMI) are counted, and multi-mapping reads are distributed using the EM method.
  15. Generates a demultiplexing/alignment overview (sci-dash).
  16. Generates an HTML report with demultiplexing and alignment statistics.

Parallelization is performed per experiment_name and split chunk.

Optional steps

  1. (Mus musculus-only) Haplotype demultiplexing.
  2. Adds haplotype-specific read tags (HP) to the STARSolo BAM files using known haplotype-specific SNPs (MGP + haplotag).
  3. Generates haplotype-specific read-counts per gene per cell (H1, H2, UA) (umi_tools).
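
The haplotype-specific counting in the last step can be illustrated with a small Python sketch. It is not the pipeline's implementation (the pipeline uses umi_tools on the tagged BAM files). It assumes reads have already been reduced to hypothetical `(cell, gene, umi, hp)` tuples, where `hp` is the value of the HP tag (1, 2, or `None` when the read carries no tag), and it deduplicates on exact UMI identity only, which is a simplification.

```python
from collections import defaultdict

# HP tag value -> haplotype label; anything else is treated as unassigned (UA).
HAPLOTYPE_LABELS = {1: "H1", 2: "H2"}


def count_by_haplotype(reads):
    """Count unique UMIs per (cell, gene, haplotype label).

    `reads` is an iterable of (cell, gene, umi, hp) tuples; this input
    layout is an assumption made for illustration.
    """
    umis = defaultdict(set)
    for cell, gene, umi, hp in reads:
        label = HAPLOTYPE_LABELS.get(hp, "UA")
        umis[(cell, gene, label)].add(umi)
    return {key: len(values) for key, values in umis.items()}


reads = [
    ("cell1", "Xist", "AAAACCCC", 1),
    ("cell1", "Xist", "AAAACCCC", 1),  # duplicate UMI, counted once
    ("cell1", "Xist", "GGGGTTTT", 2),
    ("cell1", "Xist", "CCCCAAAA", None),
]
for key, n_umi in sorted(count_by_haplotype(reads).items()):
    print(key, n_umi)
```

A UMI that appears both with and without an HP tag is counted once under each label in this sketch; how the actual pipeline resolves such cases is not described here.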

Downstream analysis

For downstream analysis, we also maintain an R package called scir to analyze the results produced by sci-rocket.

Sample demultiplexing (without hashing)

Example of R1 sequence:

      @READNAME 1:N:0:CCGTATGATT+AGATGCAACT
                        |----p7---|+|----p5----|: p5 is reverse-complemented during demultiplexing.
      ACTTGATTGTCAGAGCTTTGGTATCCTACCAGTT

      The R1 sequence should adhere to the following scheme:
      First 9 or 10nt:  Ligation barcode
      Next 6nt:    Primer
      Next 8nt:    UMI
      Last 10nt:   RT Barcode (sample-specific)

      Anatomy of R1 (ligation of 10nt):
      |ACTTGATTGT| |CAGAGC| |TTTGGTAT| |CCTACCAGTT|
      |-LIGATION-| |Primer| |---UMI--| |----RT----|

      Anatomy of R1 (ligation of 9nt):
      |CTCGTTGAT| |CAGAGC| |TTTGGTAT| |CCTACCAGTT| |T|
      |-LIGATION| |Primer| |---UMI--| |----RT----| |.| <- Extra base.

      Corrected R1 sequence (48nt):
      |CCGTATGATT| |AGTTGCATCT| |CTCGTTGATG| |CCTACCAGTT| |TTTGGTAT|
      |----p7----| |----p5----| |-LIGATION-| |----RT----| |---UMI--|
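
The anatomy above can be sketched in Python as follows. This is a minimal illustration rather than sci-rocket's own code: the function names (`reverse_complement`, `split_r1`, `parse_indexes`) are hypothetical, and the ligation length is passed in explicitly, whereas a real implementation has to decide between 9nt and 10nt itself (for example by checking the known ligation barcodes).

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_indexes(read_name):
    """Extract (p7, p5) from a header like '@READNAME 1:N:0:<p7>+<p5>'.

    The p5 index is reverse-complemented, as in the diagram above.
    """
    p7, p5 = read_name.rsplit(":", 1)[1].split("+")
    return p7, reverse_complement(p5)


def split_r1(sequence, ligation_length):
    """Slice R1 into ligation barcode, primer (6nt), UMI (8nt) and RT barcode (10nt)."""
    if ligation_length not in (9, 10):
        raise ValueError("ligation barcode must be 9 or 10 nt")
    primer_start = ligation_length
    umi_start = primer_start + 6
    rt_start = umi_start + 8
    return {
        "ligation": sequence[:primer_start],
        "primer": sequence[primer_start:umi_start],
        "umi": sequence[umi_start:rt_start],
        "rt": sequence[rt_start:rt_start + 10],
    }


print(parse_indexes("@READNAME 1:N:0:CCGTATGATT+AGATGCAACT"))
print(split_r1("ACTTGATTGTCAGAGCTTTGGTATCCTACCAGTT", ligation_length=10))
print(split_r1("CTCGTTGATCAGAGCTTTGGTATCCTACCAGTTT", ligation_length=9))
```

With a 9nt ligation barcode, the trailing extra base of R1 simply falls outside the RT slice.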

For sample-demultiplexing, the following steps are performed:

  1. Extracts the p5 and p7 PCR indexes from the read name of R1, and the ligation barcode, RT barcode and UMI from the R1 sequence.
  2. If there is no exact match, corrects the p5, p7, ligation and/or RT barcode to the nearest match (at most 1nt difference). If multiple barcodes match equally closely, the read-pair is discarded.
  3. For ligation barcodes of 9nt in length, an extra G is added to the ligation sequence as padding so that the corrected R1 sequence is always 48nt.
  4. Adds the barcodes to the read names of read 1 and 2:
    @READNAME|P5-<p5>-P7-<p7>|<ligation>|<rt>_<UMI>
  5. Generates sample-specific paired-end fq.gz files with the corrected R1 sequence (48nt) and the R2 sequence.
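
The correction and renaming steps above can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the sci-rocket implementation: all function names are hypothetical, the padding G is assumed to be appended to the 3' end of the ligation barcode (the position is not stated here), and the barcode values are placed directly in the read name.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, known):
    """Return the exact match, else the single barcode at distance 1, else None.

    None signals that the read-pair should be discarded (no match within
    1nt, or several equally close candidates).
    """
    if observed in known:
        return observed
    close = [bc for bc in known
             if len(bc) == len(observed) and hamming(bc, observed) == 1]
    return close[0] if len(close) == 1 else None


def build_corrected_r1(p7, p5, ligation, rt, umi):
    """Concatenate p7, p5, ligation, RT and UMI into the 48nt corrected R1.

    Assumption: a 9nt ligation barcode is padded by appending a G.
    """
    return p7 + p5 + ligation.ljust(10, "G") + rt + umi


def build_read_name(read_name, p5, p7, ligation, rt, umi):
    """Format the read name as @READNAME|P5-<p5>-P7-<p7>|<ligation>|<rt>_<UMI>."""
    return f"{read_name}|P5-{p5}-P7-{p7}|{ligation}|{rt}_{umi}"


known_rt = {"CCTACCAGTT", "GGATTCCAAG"}  # hypothetical RT barcode set
rt = correct_barcode("CCTACCAGTA", known_rt)  # one mismatch, rescued
r1 = build_corrected_r1("CCGTATGATT", "AGTTGCATCT", "CTCGTTGAT", rt, "TTTGGTAT")
print(rt, len(r1))
print(build_read_name("@READNAME", "AGTTGCATCT", "CCGTATGATT", "CTCGTTGAT", rt, "TTTGGTAT"))
```

Returning `None` for ambiguous matches keeps the discard decision with the caller, which can then log which barcode failed to match.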

Hashing

Reads (R2) containing both a poly-A signal (AAAA) and a hashing barcode are flagged as hashing reads. These reads are used for collecting hashing metrics (together with their respective R1) and are subsequently removed from the analysis.

To flag reads as hashing reads, we first check for the presence of the poly-A signal (AAAA) in R2 (first occurrence). If this signal is present, we check for the presence of a hashing barcode in R2 prior to this poly-A signal. Hashing barcodes are assumed to be 10nt long and to lie directly upstream (5') of the poly-A signal, separated from it by a 1nt spacer.

If these 10nt do not match a hashing barcode exactly, we try again against the closest match (Hamming distance of 1). If no match can be rescued this way, we search for the presence of any hashing barcode in the entire R2 sequence prior to the poly-A signal.
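
The three-stage lookup described above can be sketched as follows. This is a hedged Python illustration, not the pipeline's code: `find_hash_barcode` is a hypothetical name, a rescued match is assumed to require exactly one barcode at Hamming distance 1, and the final fallback returns the first barcode (in sorted order) found upstream of the poly-A signal.

```python
POLY_A = "AAAA"
BARCODE_LENGTH = 10
SPACER_LENGTH = 1


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_hash_barcode(r2, hash_barcodes):
    """Return the hashing barcode found in R2, or None if R2 is not a hashing read."""
    poly_a_start = r2.find(POLY_A)  # first occurrence of the poly-A signal
    if poly_a_start == -1:
        return None

    # Stage 1 and 2: the 10nt window directly upstream of the 1nt spacer.
    window_end = poly_a_start - SPACER_LENGTH
    window_start = window_end - BARCODE_LENGTH
    if window_start >= 0:
        candidate = r2[window_start:window_end]
        if candidate in hash_barcodes:
            return candidate
        close = [bc for bc in hash_barcodes
                 if len(bc) == len(candidate) and hamming(bc, candidate) == 1]
        if len(close) == 1:  # assumption: ambiguous rescues are not accepted
            return close[0]

    # Stage 3: any hashing barcode anywhere upstream of the poly-A signal.
    upstream = r2[:poly_a_start]
    for barcode in sorted(hash_barcodes):
        if barcode in upstream:
            return barcode
    return None


barcodes = {"AGGTAGAGCT", "ACGTTGAATG"}
print(find_hash_barcode("GGAGGTAGAGCTCAAAAAAAATT", barcodes))
print(find_hash_barcode("GGAGGTTGAGCTCAAAAAA", barcodes))
print(find_hash_barcode("ACGTTGAATGCCCCCCCCCCCCAAAA", barcodes))
```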

The following metrics are generated from hashing reads, per cellular barcode / hash barcode combination:

sequencing_name    hash_barcode    cell_barcode             count    n_umi
test               AGGTAGAGCT      F07_D09_LIG98_P01-C08    100      10
test               ACGTTGAATG      F07_D09_LIG98_P01-C08    200      15

These metrics are used to determine the hashing efficiency and to correct for UMI bias in downstream analysis:

  • count: Total number of hashing reads for that specific cell-barcode / hash-barcode combination.
  • n_umi: Number of unique UMIs for that specific cell-barcode / hash-barcode combination.
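
As an illustration of how these two metrics relate, the sketch below aggregates hashing reads per cell-barcode / hash-barcode combination. It is a minimal Python example with a hypothetical function name and input layout (one `(hash_barcode, cell_barcode, umi)` tuple per hashing read), not the code used by sci-rocket.

```python
from collections import defaultdict


def summarise_hashing_reads(reads):
    """Return {(hash_barcode, cell_barcode): {"count": ..., "n_umi": ...}}.

    count is the total number of hashing reads for the combination;
    n_umi is the number of distinct UMIs among those reads.
    """
    counts = defaultdict(int)
    umis = defaultdict(set)
    for hash_barcode, cell_barcode, umi in reads:
        key = (hash_barcode, cell_barcode)
        counts[key] += 1
        umis[key].add(umi)
    return {key: {"count": counts[key], "n_umi": len(umis[key])} for key in counts}


reads = [
    ("AGGTAGAGCT", "F07_D09_LIG98_P01-C08", "TTTGGTAT"),
    ("AGGTAGAGCT", "F07_D09_LIG98_P01-C08", "TTTGGTAT"),  # same UMI again
    ("AGGTAGAGCT", "F07_D09_LIG98_P01-C08", "CCCGGTAT"),
    ("ACGTTGAATG", "F07_D09_LIG98_P01-C08", "TTTGGTAT"),
]
for key, metrics in sorted(summarise_hashing_reads(reads).items()):
    print(key, metrics)
```

A large gap between count and n_umi for a combination indicates that many of its hashing reads are duplicates of the same molecule.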

Haplotyping (optional; Mus musculus cross-experiments only)

As an optional procedure, sci-rocket can further haplotype the X chromosome of the demultiplexed samples, e.g. in the case of mouse F1 cross-hybrids; see here for more information. This step downloads (or symlinks) the MGP database and performs haplotype-specific read-counting using whatshap on F1-informative heterozygous SNPs.

Output

The major output files are the following:

  1. Sequence and sample-specific fastq file(s):
    • {experiment_name}/demux_reads/{sample_name}_R1.fastq.gz
    • {experiment_name}/demux_reads/{sample_name}_R2.fastq.gz
    • {experiment_name}/demux_reads/{sample_name}_R1_discarded.fastq.gz
    • {experiment_name}/demux_reads/{sample_name}_R2_discarded.fastq.gz
    • {experiment_name}/demux_reads/log_{sample_name}_discarded_reads.tsv.gz
  2. Alignment files:
    • {experiment_name}/alignment/{sample_name}_{species}_Aligned.sortedByCoord.out.bam/bai
    • {experiment_name}/alignment/{sample_name}_{species}_Solo.out/
  3. Demultiplexing/alignment overview:
    • {experiment_name}/sci-dash/
  4. Logging and benchmarking:
    • {experiment_name}/logs/
    • {experiment_name}/benchmarking/