DataFormat

FASTA

Applies to:  DNA Sequence Dataset

Description

The output for a sequence in FASTA format consists of a header line followed by one or more lines of sequence data. The header line is distinguished from the sequence data by a greater-than (">") symbol at the start of the line. The word following the ">" symbol is the identifier of the sequence, and this may be followed by additional descriptive text. The sequence data can be split across multiple lines for improved readability. Sequences are written to the output in the current sort order.

Example of sequence data in FASTA format:
>ENSG00000035403
GTAGTCGCTGCACAGTCTGTCTCTTCGCCGGTTCCCGGCC
CCGTGGATCCTACTTCTCTGTCGCCCGCGGTTCGCCGCCC
>ENSG00000100345
GCAGATCACCGCGGTTCCTGGGCAGGGCACGGAAGGCTAA
GCAAGGCTGACCTGCTGCAGCTCCCGCCTCGTGCGCTCGC
>ENSG00000107796
AACACCACCCAGTGTGGAGCAGCCCAGCCAAGCACTGTCA
GGGTAAGTGGCGCCAGGCCAAGGATGTGACTTATAGATTC
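
As a rough illustration of how the layout above can be read back, the following Python sketch collects each header and joins the sequence lines that follow it. The function name parse_fasta is hypothetical and is not part of MotifLab; skipping blank lines is an assumption made for convenience.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text.

    Illustrative sketch only, not MotifLab's own parser.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank separator lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Each returned header is the full text after the ">" symbol, so any descriptive text or extra fields are still present and can be processed separately.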

The header can contain other information in addition to the name of the sequence, provided the fields are separated by vertical bars ("|"). The fields are, in order: sequence name, sequence location, strand orientation, and organism/genome build. MotifLab version 2.0 can also recognize a fifth field specifying the gene name and location (positions of the TSS and TES). All the extra fields are optional, but the order is important: if you want to include information about the strand, for example, you must also include the sequence location field preceding it.

Example:
>ENSG00000035403|chr10:75425878-75428077|Direct strand|9606:hg18
>ENSG00000035403|chr10:75425878-75428077|Direct strand|9606:hg18|VCL:75427878-75549916
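
The positional nature of the fields can be sketched in Python as follows. The key names in FIELD_NAMES and the function split_header are hypothetical labels chosen for this illustration; they are not MotifLab identifiers.

```python
# Hypothetical labels for the five positional header fields.
FIELD_NAMES = ("name", "location", "strand", "build", "gene")

def split_header(header_line):
    """Split a header line on vertical bars and label the fields by position.

    Fields that are absent from the header are simply missing from the result.
    """
    text = header_line.lstrip(">").strip()
    fields = [field.strip() for field in text.split("|")]
    # zip stops at the shorter input, so optional trailing fields are skipped.
    return dict(zip(FIELD_NAMES, fields))
```

Because the labels are assigned purely by position, a header that omits the location but includes the strand would be mislabeled, which is why the field order matters.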

The sequence name must not contain spaces or characters other than letters, numbers or underscores. If the name contains spaces, only the first part of the name will be used. If the name contains other illegal characters, an error will be reported. The location must be given as "chromosome:start-end" (where the "chr" prefix for the chromosome is optional). For the orientation, strings starting with "direct", "+" or "1" are interpreted as the direct strand, whereas strings starting with "reverse" or "-" are interpreted as the reverse strand (any other string defaults to the direct strand). The "organism/genome build" field should be specified as two values separated by a colon, where the first value is an integer taxonomy identifier (or known organism name) and the second value is the genome build. Optionally, the genome build can be stated alone and the system will then try to infer the organism. The fifth "gene location" field introduced in MotifLab v2.0 has the form "gene name:TSS-TES".
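
The rules above could be approximated in Python as shown below. This is a sketch under stated assumptions, not MotifLab's implementation: all function names are hypothetical, strand matching is assumed to be case-insensitive, the returned chromosome is assumed to drop the "chr" prefix, and no attempt is made to infer the organism when only the genome build is given.

```python
import re

# "chromosome:start-end", with an optional "chr" prefix.
LOCATION_RE = re.compile(r"^(?:chr)?([^:]+):(\d+)-(\d+)$")
# "gene name:TSS-TES"
GENE_RE = re.compile(r"^([^:]+):(\d+)-(\d+)$")

def clean_name(field):
    """Keep the first space-separated part; reject other illegal characters."""
    parts = field.split()
    if not parts or not re.fullmatch(r"[A-Za-z0-9_]+", parts[0]):
        raise ValueError("illegal sequence name: %r" % field)
    return parts[0]

def parse_location(field):
    """Return (chromosome, start, end) from 'chromosome:start-end'."""
    match = LOCATION_RE.match(field.strip())
    if match is None:
        raise ValueError("bad location: %r" % field)
    return match.group(1), int(match.group(2)), int(match.group(3))

def parse_strand(field):
    """Return 'reverse' for 'reverse...' or '-...', otherwise 'direct'."""
    text = field.strip().lower()  # assumption: matching ignores case
    if text.startswith(("reverse", "-")):
        return "reverse"
    return "direct"  # "direct", "+", "1" and unrecognized strings

def parse_build(field):
    """Return (organism, build); organism is None if only the build is given."""
    first, sep, second = field.strip().partition(":")
    if not sep:
        return None, first
    return first, second

def parse_gene(field):
    """Return (gene_name, tss, tes) from 'gene name:TSS-TES'."""
    match = GENE_RE.match(field.strip())
    if match is None:
        raise ValueError("bad gene location: %r" % field)
    return match.group(1), int(match.group(2)), int(match.group(3))
```

For instance, parse_strand("Direct strand") and parse_strand("something else") both yield "direct" here, mirroring the default described above.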

Arguments

Name: Description
Strand orientation: This parameter controls which strand to output for each sequence. Valid options are "Direct" (output the sequence from the genomic direct strand), "Reverse" (output the sequence from the genomic reverse strand) and "Relative" (output the sequence relative to its own orientation, i.e. use the same strand as the one the sequence originates from).
Header: Specifies what information to include in the header (after the ">" sign). The default is to output only the name of the sequence, but additional fields (separated by vertical bars) can also be output, such as the genomic location of the sequence, the strand orientation of the sequence and the genome build of the sequence.
Column width: The number of sequence bases to output on each line. If the sequence is longer than the specified column width, the sequence data will be split across multiple lines. A common value is 80, but the special value 0 can be used to specify that the whole sequence should be output on a single line.
Extra space: If selected, an empty line will be added after the sequence data for each sequence (and before the header of the next sequence) to separate the sequences visually. Note that some external programs might not be able to parse FASTA files correctly if extra lines are added.
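
The effect of the strand orientation, column width and extra space arguments can be sketched in Python as follows. The function names and keyword arguments are hypothetical, and the sketch assumes that the input sequence is given on the direct strand and that only the bases A, C, G and T need complementing; MotifLab's actual output code may differ.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def orient(direct_seq, origin_strand, option):
    """Pick the strand to output.

    direct_seq is assumed to be on the direct strand; origin_strand is
    'direct' or 'reverse'; option is 'Direct', 'Reverse' or 'Relative'.
    """
    if option == "Direct":
        want_reverse = False
    elif option == "Reverse":
        want_reverse = True
    elif option == "Relative":
        want_reverse = (origin_strand == "reverse")
    else:
        raise ValueError("unknown strand option: %r" % option)
    return reverse_complement(direct_seq) if want_reverse else direct_seq

def format_record(name, seq, column_width=80, extra_space=False):
    """Format one sequence as FASTA text ending with a newline."""
    lines = [">" + name]
    if column_width <= 0:
        lines.append(seq)  # 0 means the whole sequence on a single line
    else:
        for start in range(0, len(seq), column_width):
            lines.append(seq[start:start + column_width])
    if extra_space:
        lines.append("")  # empty line before the next header
    return "\n".join(lines) + "\n"
```

With a column width of 4, a ten-base sequence is written as two full lines and one two-base line; with a column width of 0 it stays on a single line.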

See Also: output, DNA Sequence Dataset