Overview
========

This package contains "demuxFQ", a program for demultiplexing Fastq
files generated by Illumina's sequencers (or any other Fastq file in a sufficiently
similar format).  The user provides a sample sheet and a Fastq file, then
the program generates the demultiplexed samples.

Sample Sheets
=============

"demuxFQ" reads sample sheets that describe the expected barcodes in the
Fastq file.  The format of a line in the sample sheet is::

  XXXXXX filename

or if the file has dual barcodes::

  XXXXXX YYYYYY filename

where "XXXXXX" and "YYYYYY" are barcode sequences, and "filename" is the
name of the file to which to write sequences matching this barcode (pair).
The white space separating the tokens may be spaces or tabs.

If the "-e" option ("extended" metadata) is added, then the sample sheet
contains an additional field::

  XXXXXX filename meta_data_here

which is arbitrary metadata that will be added to the summarization report
for documentation purposes.  Note that the metadata field must be separated
from the filename by a TAB, not other white space.

Blank lines and lines starting with "#" are ignored.
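
The rules above can be illustrated with a short Python sketch.  This is not
the parser used by "demuxFQ"; the function name "parse_sample_sheet" and the
tuple layout are hypothetical, and the sketch assumes that the metadata itself
contains no TAB characters.

.. code-block:: python

  def parse_sample_sheet(lines, extended=False):
      """Return a list of (barcodes, filename, metadata) tuples.

      Hypothetical helper, not part of demuxFQ.  'barcodes' is a tuple
      holding one or two barcode strings; 'metadata' is None unless
      'extended' is true.
      """
      entries = []
      for line in lines:
          line = line.rstrip("\r\n")
          # Blank lines and lines starting with "#" are ignored.
          if not line.strip() or line.startswith("#"):
              continue
          metadata = None
          if extended:
              # Assumption: the metadata is everything after the last TAB.
              line, _, metadata = line.rpartition("\t")
          # Barcodes and filename may be separated by spaces or tabs.
          fields = line.split()
          if len(fields) not in (2, 3):
              raise ValueError("malformed sample sheet line: %r" % line)
          entries.append((tuple(fields[:-1]), fields[-1], metadata))
      return entries

Every field except the last is treated as a barcode, so the same function
covers single and dual indices.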

An example of a sample sheet with single indices::

  GCCAATA jc1174_D1837ACXX_6_1.fq
  CTTGTAA jc1175_D1837ACXX_6_1.fq
  GTGAAAC jc1176_D1837ACXX_6_1.fq
  ACAGTGA jc1173_D1837ACXX_6_1.fq

An example with dual indices::

  ATTACTCG TAATCTTA SLX-5906.S12.r_1.fq
  ATTACTCG CAGGACGT SLX-5906.S13.r_1.fq
  ATTACTCG GTACTGAC SLX-5906.S14.r_1.fq
  TCCGGAGA TATAGCCT SLX-5906.S03.r_1.fq
  TCCGGAGA ATAGAGGC SLX-5906.S04.r_1.fq
  TCCGGAGA CCTATCCT SLX-5906.S15.r_1.fq

The expected format of the barcode in the Fastq header is described by this
regular expression::

  ^[^#]+#[ACGT]+(#[ACGT]+)?(\/\d+)?$

Examples include::

  @HWI-ST230:965:1:1101:13957:19936#CGTACGTA#TAATCTTA/1
  @HWI-ST230:965:1:1101:13957:19936#CGTACGTA/1
  @HWI-ST230:965:1:1101:13957:19936#CGTACGTA#TAATCTTA
  @HWI-ST230:965:1:1101:13957:19936#CGTACGTA

Note that the barcodes in the sample sheet may be (and in general will be)
shorter than the barcodes in the Fastq file; only the sequence up to the length
of the barcode in the sample sheet will be considered relevant in the Fastq
file.
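
As an illustration, the header pattern and the prefix comparison described
above might be sketched in Python as follows.  The capture groups and the
helper names "header_barcodes" and "same_barcode" are assumptions added for
this example; the sketch only covers exact matches and ignores the mismatch
threshold.

.. code-block:: python

  import re

  # The documented header pattern, with capture groups added for the barcodes.
  HEADER_RE = re.compile(r"^[^#]+#([ACGT]+)(?:#([ACGT]+))?(?:/\d+)?$")

  def header_barcodes(header):
      """Return the barcodes in a Fastq header, or None if it does not fit."""
      m = HEADER_RE.match(header)
      if m is None:
          return None
      # Drop the second group when the header carries a single barcode.
      return tuple(b for b in m.groups() if b is not None)

  def same_barcode(sheet_barcode, read_barcode):
      """Compare only the first len(sheet_barcode) bases of the read barcode."""
      return read_barcode[:len(sheet_barcode)] == sheet_barcode

  print(header_barcodes("@HWI-ST230:965:1:1101:13957:19936#CGTACGTA#TAATCTTA/1"))
  print(same_barcode("GCCAAT", "GCCAATA"))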

Command Line
============

This program demultiplexes a Fastq file, based on information in a sample
sheet, and/or generates a summary of the contents of the file.  The command line
syntax is::

  demuxFQ [options] [<sampleSheet>] <fastq>

If just demultiplexing is required, use the "-d" option (and the sample sheet
is mandatory in this case); if just a summary is required, use "-s"; if both
are required, then use both options (and the sample sheet is required).

WARNING: This program will blindly overwrite existing files, if a file name
in the sample sheet is the name of an existing file.

The Fastq file may be uncompressed, or gzipped.  If gzipped, it will be
uncompressed on the fly (i.e. without writing the uncompressed file to disk).

There are several options:

* -b <fname> -- Save reads with non-matching barcodes to file <fname>.
* -c -- Compress output files with gzip, appending a ".gz" suffix if it is
  not present in the filenames already.
* -d -- Generate demultiplexed output.
* -e -- Read extended metadata, appended as the 3rd or 4th field in the
  sample sheet.
* -n -- If -c is given, do not append a ".gz" suffix.
* -o <dirname> -- Write output files to directory <dirname>.
* -r <nn> -- Report reads with frequency at least <nn> (default 0.001).
* -R -- Reverse-complement the second index if there are 2, to allow for
  HiSeq 4000 and NextSeq reversing the second index read.
* -s -- Write summary to standard output.
* -t <N> -- Allow up to N mismatches per index.
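
To make the "-R" and "-t" options more concrete, the following Python sketch
shows what reverse-complementing an index and counting mismatches could look
like.  It is not the code used by "demuxFQ": the function names are
hypothetical, and it assumes that a mismatch means a substituted base
(Hamming distance) counted over the length of the sample sheet barcode.

.. code-block:: python

  # Translation table mapping each base to its complement.
  COMPLEMENT = str.maketrans("ACGT", "TGCA")

  def reverse_complement(seq):
      """Complement every base, then reverse the sequence."""
      return seq.translate(COMPLEMENT)[::-1]

  def mismatches(sheet_barcode, read_barcode):
      """Count differing positions over the sample sheet barcode.

      Extra bases at the end of the read barcode are ignored; bases
      missing from a too-short read barcode count as mismatches.
      """
      differing = sum(a != b for a, b in zip(sheet_barcode, read_barcode))
      return differing + max(0, len(sheet_barcode) - len(read_barcode))

  def matches(sheet_barcode, read_barcode, max_mismatches=1):
      """True if the read barcode is within the allowed mismatch count."""
      return mismatches(sheet_barcode, read_barcode) <= max_mismatches

  print(reverse_complement("TAATCTTA"))
  print(matches("GCCAAT", "GCCATTA", max_mismatches=1))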

Summaries
=========

With the "-s" option, the program will produce a summary of the contents of
the Fastq file, somewhat like the following::

  176602548 reads
  23511 distinct codes
   45548212   CTTGTAA
   50236885   GCCAATA
   51427561   GTGAAAA

which shows that there were about 177 million reads with 23,511 distinct
barcodes, but only 3 whose frequency reaches the reporting threshold (see the
"-r" option).
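
A summary of this kind can be sketched in a few lines of Python.  The
function name "summarize" and its return value are hypothetical, and the
ascending sort order is an assumption based on the sample output above rather
than on the program's source.

.. code-block:: python

  from collections import Counter

  def summarize(barcodes, min_frequency=0.001):
      """Return (total reads, distinct codes, [(count, code), ...]).

      Only codes whose share of all reads is at least 'min_frequency'
      are listed, sorted by ascending count.
      """
      counts = Counter(barcodes)
      total = sum(counts.values())
      frequent = sorted(
          (n, code)
          for code, n in counts.items()
          if n / total >= min_frequency
      )
      return total, len(counts), frequent

  total, distinct, frequent = summarize(["AAA"] * 6 + ["CCC"] * 3 + ["GGG"], 0.2)
  print(total, "reads")
  print(distinct, "distinct codes")
  for n, code in frequent:
      print(n, code)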

Or if a sample sheet is provided::

  176602548 reads
  18154293 10.27% lost
  1 = threshold for match
  3 = minimum distance between barcodes
  Expected:
  Index   Total   Balance     0       1
  GCCAAT  53878803    91.52%  50748618    28.73%  3130185 1.77%
  CTTGTA  48841472    82.96%  46308597    26.22%  2532875 1.43%
  GTGAAA  55727980    94.66%  52312119    29.62%  3415861 1.93%
  23511 distinct codes
   45548212   CTTGTAA
   50236885   GCCAATA
   51427561   GTGAAAA

This report provides:

* the number of reads;
* the number (and percentage) that did not match a barcode in the sample sheet;
* the match threshold used, i.e. the number of errors allowed in a match
  (default 1);
* the minimum distance among all pairs of barcodes in the sample sheet;
* the expected barcodes, and how many of each were seen;
* the remaining output as if run without a sample sheet.
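
The minimum distance between barcodes can be reproduced with a small Python
sketch.  It assumes that "distance" means Hamming distance (the number of
positions at which two barcodes differ); the function names are hypothetical.
Under that assumption, the three barcodes in the report above give a minimum
distance of 3, which agrees with the value shown.

.. code-block:: python

  from itertools import combinations

  def hamming(a, b):
      """Number of positions at which two equal-length barcodes differ."""
      return sum(x != y for x, y in zip(a, b))

  def minimum_distance(barcodes):
      """Smallest pairwise distance; needs at least two barcodes."""
      return min(hamming(a, b) for a, b in combinations(barcodes, 2))

  print(minimum_distance(["GCCAAT", "CTTGTA", "GTGAAA"]))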

The numbers provided for each expected barcode are:

* the number of reads matching this barcode;
* the balance of reads, i.e. percentage of reads found versus expected,
  under the assumption that reads will be divided evenly among the
  expected barcodes;
* the number of reads matching this barcode with zero errors;
* the corresponding percentage (of total reads);
* the number of reads matching this barcode with one error;
* the corresponding percentage (of total reads).

The latter columns vary depending on the match threshold.  If the threshold is
set to "T", then reads with 0, 1, ... "T" errors will be reported.

The balance column may require elaboration.  Ideally, the reads will be divided
evenly among the "K" barcodes in the sample sheet.  So if there are "N" reads
total, one would hope for around "N/K" reads per barcode.  The balance number
is the ratio of actual reads matching this barcode to the expected number.
For example, if there are 176602548 reads total, and 3 barcodes in the sample
sheet, we expect around 176602548/3 = 58867516 reads to match each barcode.  If
53878803 actually match a particular barcode, then the "balance"
is 53878803/58867516 = 91.5%.
If 74959637 actually match, then the "balance" is 74959637/58867516 = 127.3%.
In a good run, the balances will all be near 100%.
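
The arithmetic above can be written as a short Python function.  The name
"balance" and its parameters are chosen for this illustration only.

.. code-block:: python

  def balance(matched_reads, total_reads, n_barcodes):
      """Percentage of reads found versus an even split among barcodes."""
      expected = total_reads / n_barcodes
      return 100.0 * matched_reads / expected

  # The two worked examples from the text.
  print(round(balance(53878803, 176602548, 3), 1))
  print(round(balance(74959637, 176602548, 3), 1))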

