Description
FASTQ format is a text-based format for storing both a biological sequence
(usually nucleotide sequence) and its corresponding quality scores.
Each sequence letter and each quality score is encoded with a single
ASCII character for brevity.
The FASTQ format was originally developed at the Wellcome Trust Sanger
Institute to bundle a FASTA sequence and its quality data,
but has since become the de facto standard for storing the output of
high-throughput sequencing instruments such as the Illumina Genome Analyzer.
FASTQ encodes each sequence with four lines:
- The first line consists of the symbol @ immediately followed by a
sequence identifier (similar to a regular FASTA header starting with the
> symbol).
The sequence identifier can optionally be followed by a description, but
this is ignored by MotifLab.
- The second line contains the DNA sequence itself. Unlike regular
FASTA, the sequence cannot be split across multiple lines.
- Line three is a repetition of the header, but this time prefixed by a plus
sign instead of the @ symbol.
- The last line contains the quality scores for the sequence on line 2
and has exactly the same number of characters as that line. Each single character
in this line represents a numeric value depending on the selected encoding
scheme (see below).
Example of sequence data in FASTQ format:
@SRR057629.1 HWUSI-EAS230-R:2:1:2:1844 length=36
CAAAAAGTTGCAATCAAAGATCTCTTCATCTTATTG
+SRR057629.1 HWUSI-EAS230-R:2:1:2:1844 length=36
ababba`Y[_aaa^aaa_QaYaYa]]aa]`_T`XSW
@SRR057629.2 HWUSI-EAS230-R:2:1:2:1910 length=36
GGAGTCCCAGCTTAGGGAGTCACTACTGGAGGCAGA
+SRR057629.2 HWUSI-EAS230-R:2:1:2:1910 length=36
^bT_X_baaa[^Xbbabb`T_^Y[\DOR\^]VbaTT
@SRR057629.3 HWUSI-EAS230-R:2:1:2:1325 length=36
CAAATGAAGGCGAATTCAAGGCTGAAGGAAATAGCA
+SRR057629.3 HWUSI-EAS230-R:2:1:2:1325 length=36
ZY\HTU[[TLD]ZXRLQ\YTRQOJXZWFNa_SYRBB
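The four-line layout described above can be read with a short parser. The following Python sketch is only an illustration and is not MotifLab's own code; the names `FastqRecord` and `read_fastq` are hypothetical. It assumes that blank lines between records may be skipped and that the description after the identifier can be discarded.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text between '@' and the first whitespace
    sequence: str
    quality: str     # raw quality characters, still encoded


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per four-line block, ignoring blank lines."""
    lines = (line.rstrip("\r\n") for line in handle if line.strip())
    while True:
        header = next(lines, None)
        if header is None:
            return  # clean end of input
        try:
            sequence, plus_line, quality = next(lines), next(lines), next(lines)
        except StopIteration:
            raise ValueError(f"truncated record: {header!r}") from None
        if not header.startswith("@"):
            raise ValueError(f"header must start with '@': {header!r}")
        if not plus_line.startswith("+"):
            raise ValueError(f"third line must start with '+': {plus_line!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Keep only the identifier; any description after it is dropped.
        fields = header[1:].split(None, 1)
        yield FastqRecord(fields[0] if fields else "", sequence, quality)


# First record of the example above.
example = (
    "@SRR057629.1 HWUSI-EAS230-R:2:1:2:1844 length=36\n"
    "CAAAAAGTTGCAATCAAAGATCTCTTCATCTTATTG\n"
    "+SRR057629.1 HWUSI-EAS230-R:2:1:2:1844 length=36\n"
    "ababba`Y[_aaa^aaa_QaYaYa]]aa]`_T`XSW\n"
)
records = list(read_fastq(io.StringIO(example)))
```

Here `records[0].identifier` is `SRR057629.1`, and the quality string is returned undecoded because the encoding scheme cannot be determined from the record alone.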
Quality encoding
The quality score for each DNA base is encoded with a single ASCII character from the following range:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
The first character (!) in this set of 94 different characters has ASCII code 33 and the last (~) has ASCII code 126. This set of characters can thus be used to
represent values from 33 to 126 (or 0 to 93 if we offset the value range to include zero). Several different, incompatible encoding schemes
exist to convert quality scores to ASCII codes and vice versa; they differ in which subset of these characters they use and in the offset they apply.
For example, the
Sanger encoding scheme uses the full range of ASCII characters above and an offset of 33 to represent quality scores
in the range 0 to 93. The
Solexa encoding scheme, on the other hand, only uses the last 68 characters and an offset of 64 to represent
quality scores in the range -5 to 62.
The following table summarizes the three quality encoding schemes that are recognized by MotifLab:
Name | ASCII range | ASCII offset | Score type | Score range |
Sanger | 33-126 | 33 | PHRED | 0 to 93 |
Solexa | 59-126 | 64 | Solexa | -5 to 62 |
Illumina 1.3+ | 64-126 | 64 | PHRED | 0 to 62 |
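To make the table concrete, the Python sketch below converts between quality characters and numeric scores by subtracting or adding the offset and checking the result against the allowed range. It is an illustration only, not MotifLab's implementation: the names `ENCODINGS`, `decode_quality` and `encode_quality` are hypothetical, and scores are assumed to be integers.

```python
# (offset, lowest score, highest score), taken from the table above.
ENCODINGS = {
    "Sanger": (33, 0, 93),
    "Solexa": (64, -5, 62),
    "Illumina 1.3+": (64, 0, 62),
}


def decode_quality(quality, encoding="Sanger"):
    """Convert a string of quality characters into a list of integer scores."""
    offset, lowest, highest = ENCODINGS[encoding]
    scores = []
    for char in quality:
        score = ord(char) - offset
        if not lowest <= score <= highest:
            raise ValueError(f"{char!r} is not a valid {encoding} quality character")
        scores.append(score)
    return scores


def encode_quality(scores, encoding="Sanger"):
    """Convert integer scores into a string of quality characters."""
    offset, lowest, highest = ENCODINGS[encoding]
    chars = []
    for score in scores:
        if not lowest <= score <= highest:
            raise ValueError(f"score {score} is outside the {encoding} range")
        chars.append(chr(score + offset))
    return "".join(chars)


print(decode_quality("!I~", "Sanger"))          # [0, 40, 93]
print(decode_quality("ab`Y", "Illumina 1.3+"))  # [33, 34, 32, 25]
print(encode_quality([-5, 0, 62], "Solexa"))    # ;@~
```

The same character decodes to different scores under different schemes, so the encoding has to be chosen explicitly: `;` is a valid Solexa character (score -5) but is rejected under Illumina 1.3+.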
Using the FASTQ format
When outputting data in FASTQ format, MotifLab can merge a DNA track and a
numeric track (containing quality scores) into a single output file. However,
when using the format to parse input, only one track can be created at a time.
Hence, to import a DNA sequence with associated quality scores from a FASTQ
file, you must first use the "Import Data" function to load a "DNA Sequence Dataset"
from the FASTQ file and then you must use the "Import Data" function once more
to load a separate "Numeric Dataset" from the same file.
Name | Description |
Quality encoding |
This parameter specifies the encoding scheme used for the quality
scores. Valid options are "Sanger", "Solexa" and "Illumina 1.3+" (these
are explained above).
|
Quality scores |
This output parameter specifies the numeric track containing the quality
scores for the DNA sequence. MotifLab will complain if some of the values in the
track are outside the allowed range for the selected encoding scheme.
|
Strand orientation |
Controls which strand to output for each sequence. Valid
options are "Direct" (output sequence from genomic direct strand),
"Reverse" (output sequence from genomic reverse strand) and "Relative"
(output sequence data relative to the orientation of the
sequence, i.e. use the same strand as the one the sequence originates from).
The quality values will be output in the same orientation as the DNA
sequence, so that each character on the fourth line of each sequence
encodes the quality of the DNA base exactly two lines above it.
|
Extra space |
If selected, an extra empty line will be added between consecutive sequence
blocks (four lines each) to separate the sequences visually.
Note that some external programs might
not be able to parse FASTQ files correctly if extra lines are added.
|
Convert uracil |
If this input parameter is selected, uracils in the input sequence (U
and u) will be automatically converted into thymines (T and t).
|
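The effect of the "Strand orientation" and "Convert uracil" parameters can be sketched in Python as follows. This is a hedged illustration rather than MotifLab's actual code: the function names are hypothetical, and it assumes that output on the reverse strand means reverse-complementing the bases while reversing the quality string, so that each quality character stays aligned with its base.

```python
from typing import Tuple

# Complement table for the four DNA bases; other symbols (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
_URACIL_TO_THYMINE = str.maketrans("Uu", "Tt")


def convert_uracil(sequence: str) -> str:
    """Replace U/u with T/t, preserving case."""
    return sequence.translate(_URACIL_TO_THYMINE)


def to_reverse_strand(sequence: str, quality: str) -> Tuple[str, str]:
    """Reverse-complement the sequence and reverse the quality string with it."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return sequence.translate(_COMPLEMENT)[::-1], quality[::-1]


print(convert_uracil("ACGUu"))            # ACGTt
print(to_reverse_strand("AACG", "abcd"))  # ('CGTT', 'dcba')
```

Reversing both strings together preserves the property that each character on the quality line describes the base in the same column of the sequence line.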