Description
The output for a sequence in FASTA format consists of a header line
followed by one or more lines of sequence data. The header line is
distinguished from the sequence data by a greater-than (">") symbol at the
start of the line. The word following the ">" symbol is the identifier of the
sequence, and this may be followed by additional descriptive text. The
sequence data can be split across multiple lines for improved readability.
When several sequences are output, they are sorted according to the current
sort order.
Example of sequence data in FASTA format:
>ENSG00000035403
GTAGTCGCTGCACAGTCTGTCTCTTCGCCGGTTCCCGGCC
CCGTGGATCCTACTTCTCTGTCGCCCGCGGTTCGCCGCCC
>ENSG00000100345
GCAGATCACCGCGGTTCCTGGGCAGGGCACGGAAGGCTAA
GCAAGGCTGACCTGCTGCAGCTCCCGCCTCGTGCGCTCGC
>ENSG00000107796
AACACCACCCAGTGTGGAGCAGCCCAGCCAAGCACTGTCA
GGGTAAGTGGCGCCAGGCCAAGGATGTGACTTATAGATTC
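The layout above (a ">" header line followed by one or more sequence lines) can be read back with a few lines of Python. This is only a minimal sketch and not MotifLab's own parser; the function name `read_fasta` is hypothetical, and skipping blank lines is an assumption made here for convenience.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data split across lines is joined back together.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Passing the lines of the example above would yield three records, each with its two sequence lines joined into a single 80-base string.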
The header can contain other information in addition to the name of the
sequence if the fields are separated by vertical bars.
The fields are in order: sequence name, sequence location, strand orientation
and organism/genome build. MotifLab version 2.0 can also recognize a fifth field
specifying the gene name and location (position of TSS and TES). All the extra fields are optional, but the order
is important, so if you want to include information about the strand, you must
also include the sequence location field preceding it.
Example:
>ENSG00000035403|chr10:75425878-75428077|Direct strand|9606:hg18
>ENSG00000035403|chr10:75425878-75428077|Direct strand|9606:hg18|VCL:75427878-75549916
The sequence name must not contain spaces or characters
other than letters, numbers or underscores. If the name contains spaces, only
the first part of the name will be used. If the name contains other illegal
characters, an error will be reported.
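The naming rule can be illustrated with a small check. This is a sketch under stated assumptions, not MotifLab's implementation: the function name `clean_sequence_name` is hypothetical, "letters" is taken to mean ASCII letters, and a Python `ValueError` stands in for the reported error.

```python
import re

# Assumption: "letters" means ASCII letters; digits and underscores are also legal.
_LEGAL_NAME = re.compile(r"[A-Za-z0-9_]+")


def clean_sequence_name(raw: str) -> str:
    """Return the usable sequence name, or raise ValueError if it is illegal."""
    parts = raw.split()
    if not parts:
        raise ValueError("empty sequence name")
    name = parts[0]  # if the name contains spaces, only the first part is used
    if not _LEGAL_NAME.fullmatch(name):
        raise ValueError(f"illegal character in sequence name: {name!r}")
    return name
```

For example, `clean_sequence_name("ENSG00000035403 vinculin")` keeps only `ENSG00000035403`, while a name such as `ENSG-35403` is rejected.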
The location must be given as "chromosome:start-end" (where
the "chr" prefix for the chromosome is optional). For the orientation, strings
starting with "direct", "+" or "1" are interpreted as the direct strand,
whereas strings starting with "reverse" or "-" are interpreted as the reverse
strand (any other string defaults to the direct strand).
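The location and orientation rules can be sketched as two small helpers. The names `Location`, `parse_location` and `parse_strand` are hypothetical, and several details are assumptions rather than documented behaviour: matching is case-insensitive (the example headers use "Direct strand"), the optional "chr" prefix is stripped from the stored chromosome name, and both the ASCII hyphen and a typeset en dash are accepted as the reverse-strand sign.

```python
import re
from typing import NamedTuple


class Location(NamedTuple):
    chromosome: str
    start: int
    end: int


# Optional "chr" prefix, then chromosome name, colon, start, hyphen, end.
_LOCATION = re.compile(r"(?:chr)?([^:]+):(\d+)-(\d+)", re.IGNORECASE)


def parse_location(field: str) -> Location:
    """Parse 'chromosome:start-end'; the 'chr' prefix is optional."""
    match = _LOCATION.fullmatch(field.strip())
    if match is None:
        raise ValueError(f"not a valid location: {field!r}")
    return Location(match.group(1), int(match.group(2)), int(match.group(3)))


def parse_strand(field: str) -> int:
    """Return -1 for the reverse strand and +1 for the direct strand."""
    text = field.strip().lower()
    # Assumption: an en dash (U+2013) is treated like an ASCII hyphen.
    if text.startswith(("reverse", "-", "\u2013")):
        return -1
    # "direct", "+", "1" and any unrecognized string default to direct.
    return 1
```

With these helpers, `parse_location("chr10:75425878-75428077")` and `parse_location("10:75425878-75428077")` describe the same region.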
The "organism/genome build" field should be specified as two values
separated by a colon, where the first value is an integer taxonomy identifier
(or known organism name) and the second value is the genome
build. Optionally, the genome build can be stated alone and the system will
then try to infer the organism. The fifth "gene location" field introduced in
MotifLab v2.0 has the form "gene name:TSS-TES".
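Putting the field order together, a header line could be split along the following lines. This is a rough sketch, not MotifLab's parser: the `FastaHeader` class and `parse_header` function are hypothetical names, the location and strand fields are kept as raw strings, and no attempt is made to infer the organism when only the genome build is given.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FastaHeader:
    name: str
    location: Optional[str] = None   # raw "chromosome:start-end" text
    strand: Optional[str] = None     # raw orientation text
    organism: Optional[str] = None   # taxonomy ID or organism name
    build: Optional[str] = None      # genome build, e.g. "hg18"
    gene_name: Optional[str] = None
    tss: Optional[int] = None
    tes: Optional[int] = None


def parse_header(line: str) -> FastaHeader:
    """Split a header into its optional, position-dependent fields."""
    fields = [f.strip() for f in line.lstrip(">").split("|")]
    header = FastaHeader(name=fields[0])
    if len(fields) > 1:
        header.location = fields[1]
    if len(fields) > 2:
        header.strand = fields[2]
    if len(fields) > 3:
        organism, sep, build = fields[3].partition(":")
        if sep:
            header.organism, header.build = organism, build
        else:
            header.build = organism  # genome build stated alone
    if len(fields) > 4:
        # "gene name:TSS-TES"; int() raises ValueError on malformed input.
        gene, _, span = fields[4].rpartition(":")
        tss, _, tes = span.partition("-")
        header.gene_name = gene
        header.tss, header.tes = int(tss), int(tes)
    return header
```

Because the fields are identified purely by position, a header that wants to state the strand has to supply the location field as well, which is the ordering constraint described above.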
Name | Description |
Strand orientation | Controls which strand to output for each sequence. Valid options are "Direct" (output the sequence from the genomic direct strand), "Reverse" (output the sequence from the genomic reverse strand) and "Relative" (output the sequence relative to its own orientation, i.e. use the strand the sequence originates from). |
Header | Specifies what information to include in the header (after the ">" sign). The default is to output only the name of the sequence, but additional fields (separated by vertical bars) can also be output, such as the genomic location of the sequence, the strand orientation of the sequence and the genome build of the sequence. |
Column width | The number of sequence bases to output on each line. If the sequence is longer than the specified column width, the sequence data will be split across multiple lines. A common value is 80, but the special value of 0 can be used to specify that the whole sequence should be output on a single line. |
Extra space | If selected, an extra empty line will be added after the sequence data for each sequence (and before the header of the next sequence) to separate the sequences visually. Note that some external programs might not be able to parse FASTA files correctly if extra lines are added. |
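How these output parameters interact can be sketched in Python. This is an illustration under stated assumptions and not MotifLab's code: the function names and keyword arguments are hypothetical, and outputting the reverse strand is assumed to mean emitting the reverse complement of the direct-strand sequence.

```python
# Complement table for DNA bases; other characters (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    return sequence.translate(_COMPLEMENT)[::-1]


def oriented_sequence(direct_sequence: str, origin_strand: int, mode: str) -> str:
    """Apply the 'Strand orientation' option; origin_strand is +1 or -1."""
    if mode == "Direct":
        return direct_sequence
    if mode == "Reverse":
        return reverse_complement(direct_sequence)
    if mode == "Relative":
        # Use the strand the sequence originates from.
        if origin_strand >= 0:
            return direct_sequence
        return reverse_complement(direct_sequence)
    raise ValueError(f"unknown strand orientation: {mode!r}")


def format_fasta_record(name, sequence, header_fields=(),
                        column_width=80, extra_space=False) -> str:
    """Format one record according to the Header, Column width and
    Extra space options."""
    header = ">" + "|".join((name, *header_fields))
    if column_width > 0:
        body = [sequence[i:i + column_width]
                for i in range(0, len(sequence), column_width)]
    else:
        body = [sequence]  # 0 means the whole sequence on a single line
    lines = [header, *body]
    if extra_space:
        lines.append("")  # empty separator line after the sequence data
    return "\n".join(lines) + "\n"
```

For instance, `format_fasta_record("seq1", "ACGTACGTAC", column_width=4)` produces a header line followed by the lines `ACGT`, `ACGT` and `AC`.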