DataFormat

FASTQ

Applies to:  DNA Sequence Dataset and Numeric Dataset

Description

FASTQ is a text-based format for storing a biological sequence (usually a nucleotide sequence) together with its corresponding quality scores. For brevity, each sequence letter and each quality score is encoded as a single ASCII character.
The FASTQ format was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data, but has recently become the de facto standard for storing the output of high-throughput sequencing instruments such as the Illumina Genome Analyzer.

FASTQ encodes each sequence with four lines:
  1. The first line consists of the symbol @ immediately followed by a sequence identifier (similar to a regular FASTA header starting with the > symbol). The sequence identifier can optionally be followed by a description, but this is ignored by MotifLab.
  2. The second line contains the DNA sequence itself. Unlike regular FASTA, the sequence cannot be split across multiple lines.
  3. Line three is a repetition of the header, but this time prefixed by a plus sign instead of the @ symbol.
  4. The last line contains the quality scores for the sequence on line 2 and has exactly the same number of characters as that line. Each single character in this line represents a numeric value depending on the selected encoding scheme (see below).
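
The four-line layout described above can be read with a few lines of Python. This is only a minimal sketch and not MotifLab's own parser: the names FastqRecord and read_fastq are hypothetical, and the choice to skip blank lines between records is an assumption made so that files with extra spacing still parse.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after '@' up to the first whitespace
    sequence: str    # DNA sequence from line 2
    quality: str     # raw quality characters from line 4, not yet decoded


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per block of four non-empty lines (hypothetical helper)."""
    # Assumption: blank lines only separate records, so they can be dropped.
    lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per sequence")
    for start in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[start:start + 4]
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of header: {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' at start of line three: {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        # The optional description after the identifier is ignored.
        parts = header[1:].split(maxsplit=1)
        yield FastqRecord(parts[0] if parts else "", sequence, quality)


example = io.StringIO(
    "@SRR057629.1 HWUSI-EAS230-R:2:1:2:1844 length=36\n"
    "CAAAAAGTTGCAATCAAAGATCTCTTCATCTTATTG\n"
    "+SRR057629.1 HWUSI-EAS230-R:2:1:2:1844 length=36\n"
    "ababba`Y[_aaa^aaa_QaYaYa]]aa]`_T`XSW\n"
)
for record in read_fastq(example):
    print(record.identifier, len(record.sequence))  # SRR057629.1 36
```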

Example of sequence data in FASTQ format:
@SRR057629.1 HWUSI-EAS230-R:2:1:2:1844 length=36
CAAAAAGTTGCAATCAAAGATCTCTTCATCTTATTG
+SRR057629.1 HWUSI-EAS230-R:2:1:2:1844 length=36
ababba`Y[_aaa^aaa_QaYaYa]]aa]`_T`XSW

@SRR057629.2 HWUSI-EAS230-R:2:1:2:1910 length=36
GGAGTCCCAGCTTAGGGAGTCACTACTGGAGGCAGA
+SRR057629.2 HWUSI-EAS230-R:2:1:2:1910 length=36
^bT_X_baaa[^Xbbabb`T_^Y[\DOR\^]VbaTT

@SRR057629.3 HWUSI-EAS230-R:2:1:2:1325 length=36
CAAATGAAGGCGAATTCAAGGCTGAAGGAAATAGCA
+SRR057629.3 HWUSI-EAS230-R:2:1:2:1325 length=36
ZY\HTU[[TLD]ZXRLQ\YTRQOJXZWFNa_SYRBB


Quality encoding
The quality score for each DNA base is encoded with a single ASCII character from the following range:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

The first character (!) in this set of 94 different characters has ASCII code 33 and the last (~) has ASCII code 126. This set of characters can thus be used to represent values from 33 to 126 (or 0 to 93 if we offset the value range to include zero). Several mutually incompatible encoding schemes exist for converting quality scores to ASCII characters and back; they differ in the subset of characters they use and, in some cases, in the offset. For example, the Sanger encoding scheme uses the full range of ASCII characters above and an offset of 33 to represent quality scores in the range 0 to 93. The Solexa encoding scheme, on the other hand, uses only the last 68 characters and an offset of 64 to represent quality scores in the range -5 to 62.
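
As an illustration of the offset arithmetic, the following Python sketch turns a quality string into numeric scores by subtracting the offset from each character's ASCII code. The function name decode_quality is hypothetical and the input strings are made up for the example.

```python
def decode_quality(quality: str, offset: int) -> list[int]:
    """Convert quality characters to numeric scores (hypothetical helper)."""
    return [ord(char) - offset for char in quality]


# Offset 33 (Sanger): '!' is ASCII 33 and '~' is ASCII 126.
print(decode_quality("!I~", offset=33))   # [0, 40, 93]

# Offset 64 (Solexa): ';' is ASCII 59, the lowest character in that scheme.
print(decode_quality(";@h~", offset=64))  # [-5, 0, 40, 62]
```

The same character yields different scores under different offsets ('h' is 71 with offset 33 but 40 with offset 64), which is why the encoding scheme has to be known before the scores can be interpreted.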

The following table summarizes the three quality encoding schemes that are recognized by MotifLab:
                 ASCII characters      Quality score
Name             Range      Offset     Type      Range
Sanger           33-126     33         PHRED     0 to 93
Solexa           59-126     64         Solexa    -5 to 62
Illumina 1.3+    64-126     64         PHRED     0 to 62
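
The table can also be read in the other direction, from numeric scores to characters. The sketch below copies the offsets and score ranges from the table into a dictionary and rejects values outside the allowed range. The names SCHEMES and encode_quality are hypothetical, integer scores are assumed, and this is not a description of how MotifLab itself performs the check.

```python
# (offset, lowest score, highest score) for each scheme, taken from the table.
SCHEMES = {
    "Sanger": (33, 0, 93),
    "Solexa": (64, -5, 62),
    "Illumina 1.3+": (64, 0, 62),
}


def encode_quality(scores: list[int], scheme: str) -> str:
    """Encode integer quality scores as ASCII characters (hypothetical helper)."""
    offset, lowest, highest = SCHEMES[scheme]
    chars = []
    for score in scores:
        if not lowest <= score <= highest:
            raise ValueError(
                f"score {score} is outside {lowest} to {highest} for {scheme}"
            )
        chars.append(chr(score + offset))
    return "".join(chars)


print(encode_quality([0, 40, 93], "Sanger"))  # !I~
print(encode_quality([-5, 0, 62], "Solexa"))  # ;@~
```

A score of -5 is valid under Solexa but raises a ValueError under "Illumina 1.3+", whose range starts at 0.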


Using the FASTQ format
When outputting data in FASTQ format, MotifLab can merge a DNA track and a numeric track (containing quality scores) into a single output file. However, when using the format to parse input, only one track can be created at a time. Hence, to import a DNA sequence with associated quality scores from a FASTQ file, you must first use the "Import Data" function to load a "DNA Sequence Dataset" from the FASTQ file and then you must use the "Import Data" function once more to load a separate "Numeric Dataset" from the same file.

Arguments

Name                  Description
Quality encoding      This parameter specifies the encoding scheme used for the quality scores. Valid options are "Sanger", "Solexa" and "Illumina 1.3+" (these are explained above).
Quality scores        This output parameter specifies the numeric track containing the quality scores for the DNA sequence. MotifLab will complain if some of the values in the track are outside the allowed range for the selected encoding scheme.
Strand orientation    Controls which strand to output for each sequence. Valid options are "Direct" (output sequence from the genomic direct strand), "Reverse" (output sequence from the genomic reverse strand) and "Relative" (output sequence data relative to the orientation of the sequence, i.e. use the same strand as the one the sequence originates from). The quality values will be output in the same orientation as the DNA sequence, so that each character on the fourth line of each sequence encodes the quality of the DNA base exactly two lines above it.
Extra space           If selected, an extra empty line will be added between each sequence block (four lines) to separate the sequences visually. Note that some external programs might not be able to parse FASTQ files correctly if extra lines are added.
Convert uracil        If this input parameter is selected, uracils in the input sequence (U or u) will be automatically converted into thymines (T or t).
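
To make the "Strand orientation" and "Convert uracil" options more concrete, here is a rough Python sketch of what they imply for a single sequence. It is an illustration under stated assumptions, not MotifLab's implementation: the helper names are hypothetical, and only the letters A, C, G and T (in either case) are complemented, with any other symbol left unchanged.

```python
# Assumption: only A, C, G and T are complemented; other symbols pass through.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
URACIL_TO_THYMINE = str.maketrans("Uu", "Tt")


def reverse_strand(sequence: str, quality: str) -> tuple[str, str]:
    """Return the reverse complement and the quality string in matching order."""
    # Reversing the quality string keeps each score aligned with its base.
    return sequence.translate(COMPLEMENT)[::-1], quality[::-1]


def convert_uracil(sequence: str) -> str:
    """Replace U with T and u with t, leaving all other letters unchanged."""
    return sequence.translate(URACIL_TO_THYMINE)


print(reverse_strand("AACG", "!#%I"))  # ('CGTT', 'I%#!')
print(convert_uracil("ACGUu"))         # ACGTt
```

In the first call, the base G with quality character % ends up as C in the second position of the output, and its quality character moves with it.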

See Also: output, FASTA, DNA Sequence Dataset, Numeric Dataset