CRUK-CI Demultiplexing Guide

The data files the CRUK-CI sequencing service provides after sequencing your samples are demultiplexed according to the barcodes provided in the submission spreadsheet. It sometimes happens that users of the service make a mistake when filling in this spreadsheet and put an incorrect index against one or more samples, or samples are omitted from the submission spreadsheet. This manifests as those samples' FASTQ files being much smaller than one might expect, or not present at all.

The good news is that the reads are still present in the data: they've just not been allocated to the sample. There is a file for each lane of sequencing called SLX-????.<flow cell id>.s_?.r_?.lostreads.fq.gz that contains all the reads that could not be allocated to any of the indexes listed in the submission spreadsheet.

The bad news is that correcting the submission information regarding the indexes attached to the samples is troublesome to the point of impossibility in our Clarity LIMS system. We cannot therefore fix your submission to rerun demultiplexing and attach those files so you can fetch them with the download tool or on the FTP site. Rerunning demultiplexing is also not an automatic process: a little time and effort will be needed to reprocess the FASTQ files.

That said, demultiplexing not a particularly difficult task. Running a demultiplexing program is simple: it is the fiddling with the configuration file for the demultiplexer that can be painful. This of course becomes more painful the greater the number of samples in the pool. This guide will take you through how to fix barcoding errors and demultiplex your sequencing data yourself.

Getting Started

The demultiplexing program the service uses to generate the demultiplexing report and to demultiplex FASTQ files1 is a tool written here at CRUK-CI. You can download the tool using the links below:

Redhat / CentOS binary (RHEL 7 or newer)
MAC OS X binary
Windows Cygwin binary (requires Cygwin 64 bit)
Source tar ball

If you build from source, the INSTALL file in the tar ball gives instructions on how to build the program.

Configuring the Demultiplexer

The demultiplexing program is driven by an index file that tells it which barcodes belong to which file. The format of the file is a tab-separated table. The section below gives examples.

Single Index Kits

AGTTCC  SLX-9214.A014.C6H1DANXX.s_3.r_1.fq.gz  Sample_01
AGTCAA  SLX-9214.A013.C6H1DANXX.s_3.r_1.fq.gz  Sample_02
CAGATC  SLX-9214.A007.C6H1DANXX.s_3.r_1.fq.gz  Sample_03
CTTGTA  SLX-9214.A012.C6H1DANXX.s_3.r_1.fq.gz  Sample_04
CGATGT  SLX-9214.A002.C6H1DANXX.s_3.r_1.fq.gz  Sample_05

The first column is the sequence of the barcode. The second column is the name of the file to write reads matching that barcode to. This should be a simple file name (not a path), and can be any name you wish. The output will be FASTQ and, optionally, gzip compressed. The third column is optional and is the sample name or some other user friendly identifier. It is used solely to make the summary report easier to read.

Dual Index Kits

GGACTCCT    CTCTCTAT    SLX-7655.N705_N502.C6A38ANXX.s_1.r_1.fq.gz    Sample_010
TCCTGAGC    CTCTCTAT    SLX-7655.N704_N502.C6A38ANXX.s_1.r_1.fq.gz    Sample_025
TAAGGCGA    TATCCTCT    SLX-7655.N701_N503.C6A38ANXX.s_1.r_1.fq.gz    Sample_101
TAGGCATG    CTCTCTAT    SLX-7655.N706_N502.C6A38ANXX.s_1.r_1.fq.gz    Sample_056
TAAGGCGA    CTCTCTAT    SLX-7655.N701_N502.C6A38ANXX.s_1.r_1.fq.gz    Sample_070

The first column is the sequence of the first barcode of the pair. The second column is the sequence of the second barcode of the pair. The third column is the name of the file to write reads matching the barcode pair to. This should be a simple file name (not a path), and can be any name you wish. The output will be FASTQ and, optionally, gzip compressed. The fourth column is optional and is the sample name or some other user friendly identifier. It is used solely to make the summary report easier to read.

Note that there are complications when demultiplexing FASTQ files with dual indexes on some sequencing platforms. Refer to the section Complications with Dual Index Barcoding for details.

All Kits

The demultiplexer must be run with the -e option on the command line if the additional identifier column is used (see below). The identifiers must not contain spaces.

Regardless of the type of barcoding kit, you will need to produce a configuration file per read for your data. If your data is single end, one file is sufficient. If it is paired end, you will need a second configuration file with different file names for read two. In our naming conventions (as above), this is the "r_?" part of the name. So we have r_1 for read one and r_2 for read two. The easiest way to produce the configuration files is to prepare for one read, check it is correct, make a copy and tweak the file names for the second read, keeping the same index sequences.

Running the Demultiplexer

The demultiplexer is run by calling demuxFQ on the command line with options and the source FASTQ files. The demultiplexer handles both gzip compressed and uncompressed FASTQ files transparently, and can accept multiple files.

The demultiplexer is run like any command line program:

./demuxFQ [<options>] <index file> <FASTQ file> [<FASTQ file>]*

The demultiplexer has several options. These are as follows.

-b [file]	Write reads that do not match any barcode declared in the index file to the given file. This is the option that produces the "lost reads" file of reads that didn't match a bar code in the initial demultiplexing. Without this option, reads that don't match are discarded. Note that when used with -o, this argument is relative to the directory given there.
-c	Compress the output FASTQ with gzip compression. If file names in the index file do not have the suffix ".gz", this is added unless the -n option is also present. Without this option, the FASTQ is written as plain text.
-d	Actually perform the demultiplexing. This causes the FASTQ files to be written. Without the option, the source FASTQ is analysed as if it were being demultiplexed but no output files are produced, giving just a report of what indexes were found in the source files.
-e	Add user friendly labels in the demultiplexing reports. This option tells the demultiplexer that the final column in the index file is a user friendly name associated with the barcode (typically a sample name). If the column is present in the index file, this option must be provided on the command line.
-i	The read headers are in (now) standard Illumina format. See the section Read Header Formats.
-l [number]	Explicitly tell the demultiplexer the length of the first index in a dual index kit. Sometimes more index cycles are created than the length of the barcode and this causes problems with dual index kits. Again, see the section Read Header Formats for further details.
-m	Generate MD5 checksums for the output files.
-n	Prevents the addition of the ".gz" suffix to output file names even when the -c option is used.
-o [directory]	Specify an output directory for the demultiplexed files. Without this option, the files are written to the current working directory.
-R	Use the reverse complement of the sequence given in the sample sheet for the second index of a dual index pair. This can be necessary depending on the sequencing technology used. See the section Complications with Dual Index Barcoding.
-r [number]	Report barcodes in the summary that appear with a frequency greater than this number. This is a fraction of the total number of reads. Barcodes that appear frequently that are not in the index file can indicate a mislabelling. The default is 0.001 (or 0.1%).
-S	Summarise the barcodes in the FASTQ files without a sample sheet. This gives a report that will tell you what indexes are found in the source files regardless of what indexes are expected (or indeed if you do not know the expected indexes).
-s [file]	Write a summary report to the given file (use "`-`" as the file name to write to standard out). The report gives details of the indexes found, their frequency and so forth. It is the report you can find in the sequencing section of this web site. See the section Report Information.
-t [number]	Allow up to the given number of mismatched nucleotides compared to the expected index (Hamming distance). Unless your group has explicitly asked us otherwise, the FASTQ you receive from the CRUK-CI sequencing service allows zero mismatches from the expected barcode sequence to include the read in that sample's FASTQ file.
-v	Print the version of the demultiplexer and exit.

For example:

./demuxFQ \
    -c -d -i -e -t 0 -r 0.01 \
    -o ../correctFastq \
    -b SLX-1234.lostreads.r_1.fq.gz \
    -s SLX-1234.demultiplexsummary.txt \
    SLX-1234.read1.index.txt \
    SLX-1234.*.r_1.fq.gz

This example will demultiplex all the FASTQ files matching the file pattern SLX-1234.*.r_1.fq.gz using the configuration file SLX-1234.read1.index.txt. The output files will be gzip compressed (-c); demultiplexed FASTQ files will be written (-d); the source FASTQ is in Illumina standard format (-i); the index file contains an additional column of user friendly labels (-e); reads can have zero mismatches between their index sequence and the expected index (-t); the demultiplexing report will report other indexes found with a frequency of 0.01 or more of the total reads (-r). The output files will be written into the directory ../correctFastq (-o) and any reads that cannot be allocated to a sample will go into the file ../correctFastq/SLX-1234.lostreads.r_1.fq.gz (-b, relative to -o). The report is written to SLX-1234.demultiplexsummary.txt (-s).

Read Header Formats

A read headers are not something that has always been standard. Illumina seem to have settled on a standard format:

@HISEQ:137:C6H1DANXX:3:1101:1163:1999 1:N:0:CCGTCC

And for dual index:

@M01712:44:000000000-KHKC7:1:2105:17404:1727 1:N:0:TATTAATATC+TCTGCTTTCT

This format appears in data from early to mid 2015 to the present day.

Demultiplexing Illumina Format FASTQ

For the purposes of demultiplexing, it is the last part of this header is the important part. In the Illumina format, the first part is the read number (1 or 2); the second is whether the read passed the purity filters (Y or N)2; the third is a number for Illumina control flags (ignore). The fourth part is the index sequence or sequences, and is the part of the header that is of interest here.

These are the bases found in the index cycles. If the sequencer was run with two index reads, the index sequences will be separated by a plus sign. This was not always the case, and if you have older data (prior to mid 2015) the sequences might be separated by a minus (older NextSeq runs) or have no separator at all, which provides an additional complication described below.

The demultiplexer will use the number of bases defined in the index file from the index sequences in the FASTQ file. You may have fewer bases in the index file than have been sequenced, but this is not a problem since the separator between the indexes tells demuxFQ where the second index read begins. So a single barcode of six bases would be matched against the first six bases in the read header, and a dual barcode of six bases each would match against the first six bases of the first index sequence and the first six bases of the second index sequence (i.e. after the separator). Any trailing bases in either index read are not used by the demultiplexer.

If your file matches this format then you need read no further in this section. Use the -i flag and your demultiplexing should work.

Demultiplexing Older Illumina Format FASTQ

For this section "older" format means the Illumina format where there was no separator between index reads. A check through the archive has found that the change in the Illumina format to include the plus sign came in during the first half of 2015, so if your files were produced mid 2015 or later, this does not apply.

In this format, which is seen in data produced October 2014 to mid 2015, the dual index example from above would look like:

@M01712:44:000000000-KHKC7:1:2105:17404:1727 1:N:0:TATTAATATCTCTGCTTTCT

If your demultiplexing index file specifies as many bases for index one as were sequenced for the index then there is no problem. The demultiplexer will use the first ten bases (in the example above) for index one and start index two at position eleven in the sequence.

The complication arises with dual indexes when the sequencing has read more bases in the index that the length of the barcode in the index file. We have seen this where one lane on a flowcell needs ten bases in the first index and eight in the second. Each lane may have a different kit, for example single index six bases, single index ten bases, and dual index eight plus eight bases. In a case like this, the demultiplexing program needs to be told how far into the index in the read header the second index starts (where relevant). This is where the -l option comes in.

Let's say you have two samples in a run. The first is a single index with six bases. The second is the sample with a dual index with eight and eight bases. The sequencer has been run with ten cycles for the first index (to accomodate someone else's sample on a different lane) and eight on the second. For the single index sample, there is no problem: the index starts at the first base in the header and uses the first six bases (same as the length in the index file). The trailing bases are ignored. However, for the dual index kit things are not as simple. Without explicit information to the contrary, the demultiplexer will assume that the dual index's second index will start at the ninth base in the header (the first eight being index one, the next eight being index two, and two unused cycles). In this case, you must run the demultiplexer with the flag -l 10 to tell it that the second index starts at base ten (here, numbering starts from zero, not one). It will then use the first eight bases as index one, skip two irrelevant bases for this sample, then use the next eight bases as index two. Note that if your data is from the NextSeq, you need to add 1 to the start position to take into account the delimiter between the sequences.

Demultiplexing Very Old Format FASTQ

In the original header format (before October 2014) we had a separator between the first and second index read, so the complications of the original Illumina format do not apply: the delimiter allows demuxFQ to work out where the second index starts.

We used to produce a header with a hash character to mark the start of the barcode, then the bases of the index sequence. If the barcoding kit was dual index, then the second index followed a second hash character (see below). The demuxFQ program looks for one or two hash characters based on whether the configuration file has one or two columns for the index sequences.

@HWI-D00491:108:5:1101:4662:1982#CGCTCATT#AGGCGAAG

If you have a FASTQ file in this format, make sure you do not use the -i option.

Complications with Dual Index Barcoding

The form of the second barcode sequence of a dual index pair may vary depending on the sequencing technology used to produce the FASTQ files. Files produced on the NovaSeq, HiSeq 4000 or NextSeq platforms have their second index read written as the reverse complement of the sequence one would expect from the kit manufacturers' documentation.

There are two ways of dealing with this. You may create your index files for the demultiplexer with the second barcode manually reverse complemented. A simpler approach is to prepare your index files in the normal manner and run the demultiplexer with the -R option. This will cause the demultiplexer to use the reverse complement of second barcode given in the index file, saving a tedious and error prone conversion by hand.

GAIIx, HiSeq 2000/2500 and MiSeq technologies do not require the second index to be the reverse complement. Do not use the -R option on files produced by those sequencers.

Report Information

With the -s option, the program will produce a summary of the contents of the FASTQ file, somewhat like the following:

  176602548 reads
  23511 distinct codes
   45548212   CTTGTAA
   50236885   GCCAATA
   51427561   GTGAAAA

which shows that there were about 177 million reads, with 23,511 distinct barcodes, but only three that occur more than 0.1% of the time (by default) frequency.

Or if a sample sheet is provided:

  176602548 reads
  18154293 10.27% lost
  1 = threshold for match
  3 = minimum distance between barcodes
  Expected:
  Index   Total   Balance     0       1
  GCCAAT  53878803    91.52%  50748618    28.73%  3130185 1.77%   sample1
  CTTGTA  48841472    82.96%  46308597    26.22%  2532875 1.43%   sample2
  GTGAAA  55727980    94.66%  52312119    29.62%  3415861 1.93%   sample3
  23511 distinct codes
   45548212   CTTGTAA
   50236885   GCCAATA
   51427561   GTGAAAA

This report provides:

the number of reads;
the number (and percentage) that did not match a barcode in the sample sheet;
the match threshold used (default 1) (i.e. number of errors allowed in a match);
the minimum distance among all pairs of barcodes in the sample;
the expected barcodes, and how many of each were seen;
the name of the sample associated with the barcode (if provided and run with -e);
the remaining output as if run without a sample sheet.

The numbers provided for each expected barcode are:

the number of reads matching this barcode;
the balance of reads, i.e. percentage of reads found versus expected, under the assumption that reads will be divided evenly among the expected barcodes;
the number of reads matching this barcode with zero errors;
the corresponding percentage (of total reads);
the number of reads matching this barcode with one error;
the corresponding percentage (of total reads).

The latter columns vary depending on the stringency. If the stringency is set to "K", then reads with 0, 1, ... K errors will be reported.

The balance column may require elaboration. Ideally, the reads will be divided evenly among the "K" barcodes in the sample sheet. So if there are "N" reads total, one would hope for around N ÷ K reads per barcode. The balance number is the ratio of actual reads matching this barcode to the expected number.

For example, if there are 176,602,548 reads total, and three barcodes in the sample sheet, we expect around 176,602,548 ÷ 3 = 58,867,516 reads to match each barcode. If 53,878,803 actually match a particular barcode, then the "balance" is 53878803 ÷ 58867516 = 91.5%. If 74,959,637 actually match, then the "balance" is 74959637 ÷ 58867516 = 127.3%. In a good run, the balances will all be near 100%.

If the -R option is used to reverse complement the second index of a dual index pair the sequence given in the report will also be the reverse complement. It will not be a direct match of the sequence given in the index file.

1. The initial demultiplexing is done by Illumina's bcl2fastq tool as part of its conversion from the internal Illumina files to FASTQ.

2. Unless we have been explicitly told otherwise, the CRUK-CI sequencing service will return data without reads that fail the purity filter.