The data files the CRUK-CI sequencing service provides after sequencing your samples are demultiplexed according to the barcodes provided in the submission spreadsheet. It sometimes happens that users of the service make a mistake when filling in this spreadsheet and put an incorrect index against one or more samples, or samples are omitted from the submission spreadsheet. This manifests as those samples' FASTQ files being much smaller than one might expect, or not present at all.
The good news is that the reads are still present in the data: they've just not been allocated to the sample. There is a file for each lane of sequencing called SLX-????.<flow cell id>.s_?.r_?.lostreads.fq.gz that contains all the reads that could not be allocated to any of the indexes listed in the submission spreadsheet.
The bad news is that correcting the submission information regarding the indexes attached to the samples is troublesome to the point of impossibility in our Clarity LIMS system. We cannot therefore fix your submission to rerun demultiplexing and attach those files so you can fetch them with the download tool or on the FTP site. Rerunning demultiplexing is also not an automatic process: a little time and effort will be needed to reprocess the FASTQ files.
That said, demultiplexing not a particularly difficult task. Running a demultiplexing program is simple: it is the fiddling with the configuration file for the demultiplexer that can be painful. This of course becomes more painful the greater the number of samples in the pool. This guide will take you through how to fix barcoding errors and demultiplex your sequencing data yourself.
The demultiplexing program the service uses to generate the demultiplexing report and to demultiplex FASTQ files1 is a tool written here at CRUK-CI. You can download the tool using the links below:
If you build from source, the INSTALL file in the tar ball gives instructions on how to build the program.
The demultiplexing program is driven by an index file that tells it which barcodes belong to which file. The format of the file is a tab-separated table. The section below gives examples.
Single Index Kits
AGTTCC SLX-9214.A014.C6H1DANXX.s_3.r_1.fq.gz Sample_01 AGTCAA SLX-9214.A013.C6H1DANXX.s_3.r_1.fq.gz Sample_02 CAGATC SLX-9214.A007.C6H1DANXX.s_3.r_1.fq.gz Sample_03 CTTGTA SLX-9214.A012.C6H1DANXX.s_3.r_1.fq.gz Sample_04 CGATGT SLX-9214.A002.C6H1DANXX.s_3.r_1.fq.gz Sample_05
The first column is the sequence of the barcode. The second column is the name of the file to write reads matching that barcode to. This should be a simple file name (not a path), and can be any name you wish. The output will be FASTQ and, optionally, gzip compressed. The third column is optional and is the sample name or some other user friendly identifier. It is used solely to make the summary report easier to read.
Dual Index Kits
GGACTCCT CTCTCTAT SLX-7655.N705_N502.C6A38ANXX.s_1.r_1.fq.gz Sample_010 TCCTGAGC CTCTCTAT SLX-7655.N704_N502.C6A38ANXX.s_1.r_1.fq.gz Sample_025 TAAGGCGA TATCCTCT SLX-7655.N701_N503.C6A38ANXX.s_1.r_1.fq.gz Sample_101 TAGGCATG CTCTCTAT SLX-7655.N706_N502.C6A38ANXX.s_1.r_1.fq.gz Sample_056 TAAGGCGA CTCTCTAT SLX-7655.N701_N502.C6A38ANXX.s_1.r_1.fq.gz Sample_070
The first column is the sequence of the first barcode of the pair. The second column is the sequence of the second barcode of the pair. The third column is the name of the file to write reads matching the barcode pair to. This should be a simple file name (not a path), and can be any name you wish. The output will be FASTQ and, optionally, gzip compressed. The fourth column is optional and is the sample name or some other user friendly identifier. It is used solely to make the summary report easier to read.
Note that there are complications when demultiplexing FASTQ files with dual indexes on some sequencing platforms. Refer to the section Complications with Dual Index Barcoding for details.
All Kits
The demultiplexer must be run with the -e option on the command line if the additional identifier column is used (see below). The identifiers must not contain spaces.
Regardless of the type of barcoding kit, you will need to produce a configuration file per read for your data. If your data is single end, one file is sufficient. If it is paired end, you will need a second configuration file with different file names for read two. In our naming conventions (as above), this is the "r_?" part of the name. So we have r_1 for read one and r_2 for read two. The easiest way to produce the configuration files is to prepare for one read, check it is correct, make a copy and tweak the file names for the second read, keeping the same index sequences.
The demultiplexer is run by calling demuxFQ on the command line with options and the source FASTQ files. The demultiplexer handles both gzip compressed and uncompressed FASTQ files transparently, and can accept multiple files.
The demultiplexer is run like any command line program:
./demuxFQ [<options>] <index file> <FASTQ file> [<FASTQ file>]*
The demultiplexer has several options. These are as follows.
-b [file] | Write reads that do not match any barcode declared in the index file to the given file. This is the option that produces the "lost reads" file of reads that didn't match a bar code in the initial demultiplexing. Without this option, reads that don't match are discarded. Note that when used with -o, this argument is relative to the directory given there. |
-c | Compress the output FASTQ with gzip compression. If file names in the index file do not have the suffix ".gz", this is added unless the -n option is also present. Without this option, the FASTQ is written as plain text. |
-d | Actually perform the demultiplexing. This causes the FASTQ files to be written. Without the option, the source FASTQ is analysed as if it were being demultiplexed but no output files are produced, giving just a report of what indexes were found in the source files. |
-e | Add user friendly labels in the demultiplexing reports. This option tells the demultiplexer that the final column in the index file is a user friendly name associated with the barcode (typically a sample name). If the column is present in the index file, this option must be provided on the command line. |
-i | The read headers are in (now) standard Illumina format. See the section Read Header Formats. |
-l [number] | Explicitly tell the demultiplexer the length of the first index in a dual index kit. Sometimes more index cycles are created than the length of the barcode and this causes problems with dual index kits. Again, see the section Read Header Formats for further details. |
-m | Generate MD5 checksums for the output files. |
-n | Prevents the addition of the ".gz" suffix to output file names even when the -c option is used. |
-o [directory] | Specify an output directory for the demultiplexed files. Without this option, the files are written to the current working directory. |
-R | Use the reverse complement of the sequence given in the sample sheet for the second index of a dual index pair. This can be necessary depending on the sequencing technology used. See the section Complications with Dual Index Barcoding. |
-r [number] | Report barcodes in the summary that appear with a frequency greater than this number. This is a fraction of the total number of reads. Barcodes that appear frequently that are not in the index file can indicate a mislabelling. The default is 0.001 (or 0.1%). |
-S | Summarise the barcodes in the FASTQ files without a sample sheet. This gives a report that will tell you what indexes are found in the source files regardless of what indexes are expected (or indeed if you do not know the expected indexes). |
-s [file] |
Write a summary report to the given file (use "- " as
the file name to write to standard out).
The report gives details of the indexes found, their frequency and
so forth. It is the report you can find in the
|
-t [number] | Allow up to the given number of mismatched nucleotides compared to the expected index (Hamming distance). Unless your group has explicitly asked us otherwise, the FASTQ you receive from the CRUK-CI sequencing service allows zero mismatches from the expected barcode sequence to include the read in that sample's FASTQ file. |
-v | Print the version of the demultiplexer and exit. |
For example:
./demuxFQ \ -c -d -i -e -t 0 -r 0.01 \ -o ../correctFastq \ -b SLX-1234.lostreads.r_1.fq.gz \ -s SLX-1234.demultiplexsummary.txt \ SLX-1234.read1.index.txt \ SLX-1234.*.r_1.fq.gz
This example will demultiplex all the FASTQ files matching the file
pattern SLX-1234.*.r_1.fq.gz using the configuration
file SLX-1234.read1.index.txt.
The output files will be gzip compressed (-c
); demultiplexed FASTQ
files will be written (-d
); the source FASTQ is in Illumina
standard format (-i
); the index file contains an additional column
of user friendly labels (-e
); reads can have zero mismatches
between their index sequence and the expected index (-t
); the
demultiplexing report will report other indexes found with a frequency of 0.01
or more of the total reads (-r
). The output files
will be written into the directory ../correctFastq
(-o
) and any reads that cannot be allocated to a sample will go
into the file
../correctFastq/SLX-1234.lostreads.r_1.fq.gz
(-b
, relative to -o
). The report is written to
SLX-1234.demultiplexsummary.txt (-s
).
A read headers are not something that has always been standard. Illumina seem to have settled on a standard format:
@HISEQ:137:C6H1DANXX:3:1101:1163:1999 1:N:0:CCGTCC
And for dual index:
@M01712:44:000000000-KHKC7:1:2105:17404:1727 1:N:0:TATTAATATC+TCTGCTTTCT
This format appears in data from early to mid 2015 to the present day.
Demultiplexing Illumina Format FASTQ
For the purposes of demultiplexing, it is the last part of this header is the important part. In the Illumina format, the first part is the read number (1 or 2); the second is whether the read passed the purity filters (Y or N)2; the third is a number for Illumina control flags (ignore). The fourth part is the index sequence or sequences, and is the part of the header that is of interest here.
These are the bases found in the index cycles. If the sequencer was run with two index reads, the index sequences will be separated by a plus sign. This was not always the case, and if you have older data (prior to mid 2015) the sequences might be separated by a minus (older NextSeq runs) or have no separator at all, which provides an additional complication described below.
The demultiplexer will use the number of bases defined in the index file
from the index sequences in the FASTQ file. You may have fewer bases in the
index file than have been sequenced, but this is not a problem since the
separator between the indexes tells demuxFQ
where the second
index read begins. So a single
barcode of six bases would be matched against the first six bases in the read
header, and a dual barcode of six bases each would match against the first
six bases of the first index sequence and the first six bases of the second
index sequence (i.e. after the separator). Any trailing bases in either index
read are not used by the demultiplexer.
If your file matches this format then you need read no further in this section. Use the -i flag and your demultiplexing should work.
Demultiplexing Older Illumina Format FASTQ
For this section "older" format means the Illumina format where there was no separator between index reads. A check through the archive has found that the change in the Illumina format to include the plus sign came in during the first half of 2015, so if your files were produced mid 2015 or later, this does not apply.
In this format, which is seen in data produced October 2014 to mid 2015, the dual index example from above would look like:
@M01712:44:000000000-KHKC7:1:2105:17404:1727 1:N:0:TATTAATATCTCTGCTTTCT
If your demultiplexing index file specifies as many bases for index one as were sequenced for the index then there is no problem. The demultiplexer will use the first ten bases (in the example above) for index one and start index two at position eleven in the sequence.
The complication arises with dual indexes when the sequencing has read more bases in the index that the length of the barcode in the index file. We have seen this where one lane on a flowcell needs ten bases in the first index and eight in the second. Each lane may have a different kit, for example single index six bases, single index ten bases, and dual index eight plus eight bases. In a case like this, the demultiplexing program needs to be told how far into the index in the read header the second index starts (where relevant). This is where the -l option comes in.
Let's say you have two samples in a run. The first is a single index with six bases. The second is the sample with a dual index with eight and eight bases. The sequencer has been run with ten cycles for the first index (to accomodate someone else's sample on a different lane) and eight on the second. For the single index sample, there is no problem: the index starts at the first base in the header and uses the first six bases (same as the length in the index file). The trailing bases are ignored. However, for the dual index kit things are not as simple. Without explicit information to the contrary, the demultiplexer will assume that the dual index's second index will start at the ninth base in the header (the first eight being index one, the next eight being index two, and two unused cycles). In this case, you must run the demultiplexer with the flag -l 10 to tell it that the second index starts at base ten (here, numbering starts from zero, not one). It will then use the first eight bases as index one, skip two irrelevant bases for this sample, then use the next eight bases as index two. Note that if your data is from the NextSeq, you need to add 1 to the start position to take into account the delimiter between the sequences.
Demultiplexing Very Old Format FASTQ
In the original header format (before October 2014) we had a separator
between the first and second index read, so the complications of the original
Illumina format do not apply: the delimiter allows demuxFQ
to work
out where the second index starts.
We used to produce a header with a hash character to mark
the start of the barcode, then the bases of the index sequence. If the
barcoding kit was dual index, then the second index followed a second
hash character (see below). The demuxFQ
program looks for one or two hash characters based on whether the
configuration file has one or two columns for the index sequences.
@HWI-D00491:108:5:1101:4662:1982#CGCTCATT#AGGCGAAG
If you have a FASTQ file in this format, make sure you do not use the -i option.
The form of the second barcode sequence of a dual index pair may vary depending on the sequencing technology used to produce the FASTQ files. Files produced on the NovaSeq, HiSeq 4000 or NextSeq platforms have their second index read written as the reverse complement of the sequence one would expect from the kit manufacturers' documentation.
There are two ways of dealing with this. You may create your index files for the demultiplexer with the second barcode manually reverse complemented. A simpler approach is to prepare your index files in the normal manner and run the demultiplexer with the -R option. This will cause the demultiplexer to use the reverse complement of second barcode given in the index file, saving a tedious and error prone conversion by hand.
GAIIx, HiSeq 2000/2500 and MiSeq technologies do not require the second index to be the reverse complement. Do not use the -R option on files produced by those sequencers.
With the -s option, the program will produce a summary of the contents of the FASTQ file, somewhat like the following:
176602548 reads 23511 distinct codes 45548212 CTTGTAA 50236885 GCCAATA 51427561 GTGAAAA
which shows that there were about 177 million reads, with 23,511 distinct barcodes, but only three that occur more than 0.1% of the time (by default) frequency.
Or if a sample sheet is provided:
176602548 reads 18154293 10.27% lost 1 = threshold for match 3 = minimum distance between barcodes Expected: Index Total Balance 0 1 GCCAAT 53878803 91.52% 50748618 28.73% 3130185 1.77% sample1 CTTGTA 48841472 82.96% 46308597 26.22% 2532875 1.43% sample2 GTGAAA 55727980 94.66% 52312119 29.62% 3415861 1.93% sample3 23511 distinct codes 45548212 CTTGTAA 50236885 GCCAATA 51427561 GTGAAAA
This report provides:
The numbers provided for each expected barcode are:
The latter columns vary depending on the stringency. If the stringency is set to "K", then reads with 0, 1, ... K errors will be reported.
The balance column may require elaboration. Ideally, the reads will be divided evenly among the "K" barcodes in the sample sheet. So if there are "N" reads total, one would hope for around N ÷ K reads per barcode. The balance number is the ratio of actual reads matching this barcode to the expected number.
For example, if there are 176,602,548 reads total, and three barcodes in the sample sheet, we expect around 176,602,548 ÷ 3 = 58,867,516 reads to match each barcode. If 53,878,803 actually match a particular barcode, then the "balance" is 53878803 ÷ 58867516 = 91.5%. If 74,959,637 actually match, then the "balance" is 74959637 ÷ 58867516 = 127.3%. In a good run, the balances will all be near 100%.
If the -R option is used to reverse complement the second index of a dual index pair the sequence given in the report will also be the reverse complement. It will not be a direct match of the sequence given in the index file.
1. The initial demultiplexing is done by Illumina's bcl2fastq tool as part of its conversion from the internal Illumina files to FASTQ.
2. Unless we have been explicitly told otherwise, the CRUK-CI sequencing service will return data without reads that fail the purity filter.