Description
The General Feature Format (GFF) is one of the most popular formats for
exchanging information about region-based features.
See the official GFF specification for full details. In brief, the format
describes one region per line, and each line consists of 8 (or optionally 9)
TAB-separated fields.
The fields are, in order:
- The name of the sequence
- The source of the feature
- The feature type
- The start coordinate of the region
- The end coordinate of the region
- A score value for the region
- The orientation of the region. This can be "+" or "-" (or "." if orientation is
unspecified)
- The reading frame. The value of this field is either 0, 1 or 2 (or
"." if the frame does not apply)
- Additional attributes. This optional field consists of a list of
attributes separated by semicolons. Each attribute has a key (or "tag") followed by
a value for the attribute (separated by an equals sign).
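The field layout listed above can be illustrated with a short Python sketch that splits one GFF line into its parts. This is not MotifLab's own parser: the `GffRegion` type and `parse_gff_line` function are hypothetical names, the sample line (including the source name and the `note` attribute) is made up, and treating a "." score as missing is an assumption.

```python
from typing import Dict, NamedTuple, Optional


class GffRegion(NamedTuple):
    """Hypothetical container for the 8 (or 9) GFF fields."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_gff_line(line: str) -> GffRegion:
    """Split one TAB-separated GFF line into its fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (8, 9):
        raise ValueError(f"expected 8 or 9 fields, got {len(fields)}")

    # The optional ninth field holds key=value pairs separated by semicolons.
    attributes: Dict[str, str] = {}
    if len(fields) == 9:
        for item in fields[8].split(";"):
            item = item.strip()
            if not item:
                continue
            key, _, value = item.partition("=")
            attributes[key.strip()] = value.strip()

    # Assumption: a score of "." means that no score is available.
    score = None if fields[5] == "." else float(fields[5])
    return GffRegion(fields[0], fields[1], fields[2], int(fields[3]),
                     int(fields[4]), score, fields[6], fields[7], attributes)


# Made-up example line for illustration only.
example = "ENSG00000120948\tMotifLab\tM00378\t483\t494\t5.963\t+\t.\tnote=example"
region = parse_gff_line(example)
print(region.feature, region.start, region.end, region.attributes)
```

A line with a different number of fields raises a `ValueError` in this sketch, which corresponds to the kind of parsing error that non-GFF header lines would cause.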
NOTE:
When importing regions from a GFF-file, the sequence name in the
first column must correspond to the name of an existing sequence in MotifLab,
and the region will then be added to that sequence. If the first column
contains a chromosome name, it will only be added to a sequence if there is
a sequence that is actually named after the chromosome; it is not enough that the
sequence covers the chromosomal segment that the region from the GFF-file
falls within. When the first column contains chromosome names, it is
recommended to use the GTF format instead (or to convert the
file to BED format).
Sequences output in GFF format are output according to the currently selected
sorting order of the sequences,
but within each sequence the user can specify whether to sort the regions by
position, score or type.
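As a rough illustration of sorting regions within a sequence, the following Python sketch applies up to three sort keys in order of priority, in the spirit of the "Sort1", "Sort2" and "Sort3" parameters. The region dictionaries and their values are made up, the `sort_regions` helper is hypothetical, and ascending order for every key is an assumption (the actual sort direction is not stated here).

```python
# Made-up regions for illustration only.
regions = [
    {"type": "M00313", "start": 8, "end": 15, "score": 5.7},
    {"type": "M00253", "start": 3, "end": 10, "score": 3.8},
    {"type": "M00253", "start": 8, "end": 15, "score": 6.1},
]

# Assumption: "Position" compares start first and then end,
# and all keys sort in ascending order.
KEY_FUNCS = {
    "Position": lambda r: (r["start"], r["end"]),
    "Type": lambda r: r["type"],
    "Score": lambda r: r["score"],
}


def sort_regions(regions, sort1="Position", sort2="Type", sort3="Score"):
    """Sort by sort1, break ties with sort2, then with sort3."""
    def combined_key(region):
        return tuple(KEY_FUNCS[name](region) for name in (sort1, sort2, sort3))
    return sorted(regions, key=combined_key)


for r in sort_regions(regions):
    print(r["type"], r["start"], r["end"], r["score"])
```

Building one tuple key per region means the second and third keys only matter when the earlier keys compare equal.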
The start and end positions of each region (fields 4 and 5) can
be output either as genomic coordinates
or as positions relative to the start of the sequence by setting the
"Position" option to either "Genomic" or "Relative".
If the "Relative" setting is chosen, the "Relative-offset" and "Orientation"
settings will also apply. The "Relative-offset"
setting specifies the coordinate of the first position in the sequence.
This will normally be 1 but can be set to other values if needed (for instance
0). The "Orientation" setting specifies which
orientation to use to determine the relative region coordinates. For example,
if a 100 bp long sequence on the direct strand
has a binding site region from position 80 to 90, the start and end
coordinates will be [80,90] if the "Direct" strand orientation
is selected or [10,20] if the "Reverse" orientation is selected. If the
"Orientation" is set to "From Sequence" the strand orientation
will be selected based on the orientation of the sequence itself, so that
sequences on the direct strand will be output in direct orientation
and those on the reverse strand will be output in reverse orientation. If the
"Opposite" strand orientation is selected, the orientation
will be the opposite of the orientation of the sequence.
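The four "Orientation" choices described above reduce to picking either the direct or the reverse strand for each sequence. The Python sketch below shows that decision only; it does not compute coordinates. The function name `resolve_orientation` is hypothetical, and it assumes that "Direct" and "Reverse" are applied as given regardless of the strand of the sequence.

```python
def resolve_orientation(setting: str, sequence_strand: str) -> str:
    """Return the effective orientation ("Direct" or "Reverse").

    setting         -- value of the "Orientation" option
    sequence_strand -- strand of the sequence itself
    """
    if sequence_strand not in ("Direct", "Reverse"):
        raise ValueError(f"unknown sequence strand: {sequence_strand!r}")

    if setting in ("Direct", "Reverse"):
        # Assumption: a fixed choice ignores the strand of the sequence.
        return setting
    if setting == "From Sequence":
        return sequence_strand
    if setting == "Opposite":
        return "Reverse" if sequence_strand == "Direct" else "Direct"
    raise ValueError(f"unknown orientation setting: {setting!r}")


for setting in ("Direct", "Reverse", "From Sequence", "Opposite"):
    print(setting, "->", resolve_orientation(setting, "Reverse"))
```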
If the standard GFF format is not adequate, the "Format" setting can be used
to specify an alternative output format.
The alternative format is specified by a string consisting of a mix of literal
characters and special field codes surrounded by braces
(e.g. {START} ). For each region, the field codes in the format string (if
recognized) will be replaced by the corresponding value of the field as it
applies to the target region before the string is output.
Some recognized fields are: SEQUENCENAME, FEATURE, SOURCE, START, END, SCORE,
STRAND and TYPE (note the capitalization).
TABs can be represented with the escape sequence \t.
For example, the following output format:
Binding site for {TYPE} at {START}-{END} with score={SCORE} in sequence {SEQUENCENAME}
will produce output that looks like this:
Binding site for M00378 at 483-494 with score=5.963 in sequence ENSG00000120948
Binding site for M00253 at 3-10 with score=3.801 in sequence ENSG00000116741
Binding site for M00313 at 8-15 with score=5.697 in sequence ENSG00000116741
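The field-code substitution described above can be sketched in a few lines of Python. This is an illustration and not MotifLab's implementation: the `render` function and the dictionary representation of a region are hypothetical, and leaving unrecognized field codes unchanged is an assumption, since the text only says that recognized codes are replaced.

```python
import re

# Matches a field code such as {START}; the name is captured in group 1.
FIELD_CODE = re.compile(r"\{([^{}]+)\}")


def render(format_string: str, region: dict) -> str:
    """Replace {FIELD} codes in format_string with values from region."""
    # Expand the \t and \n escapes written literally in the format string.
    text = format_string.replace("\\t", "\t").replace("\\n", "\n")

    def substitute(match):
        name = match.group(1)
        # Assumption: unrecognized field codes are left as they are.
        return str(region[name]) if name in region else match.group(0)

    return FIELD_CODE.sub(substitute, text)


# Values taken from the first line of the example output.
region = {
    "TYPE": "M00378",
    "START": 483,
    "END": 494,
    "SCORE": 5.963,
    "SEQUENCENAME": "ENSG00000120948",
}
fmt = ("Binding site for {TYPE} at {START}-{END} "
       "with score={SCORE} in sequence {SEQUENCENAME}")
line = render(fmt, region)
print(line)
```

Field names are matched case-sensitively in this sketch, in line with the note about capitalization.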
Name | Description |
Position | Specifies whether the coordinate positions [start-end] for a region should be given relative to the start of the chromosome ("Genomic"), relative to the upstream start of the sequence ("Relative") or relative to the transcription start site associated with the sequence ("TSS-Relative"). |
Relative-offset | If the "Position" setting is "Relative", this offset value specifies the coordinate of the first base in the sequence (common choices are 0 or 1). If the "Position" setting is "TSS-Relative", a value of 0 here will place the TSS at position +0, whereas any other value will place the TSS at position +1 (the coordinate system will then skip 0 and go directly from -1 to +1). |
Orientation | Orientation of relative coordinates (only applicable if the "Position" setting is "Relative"). |
Sort1, Sort2, Sort3 | These three parameters control how the regions are sorted in the output. The regions will be sorted first by "Sort1", then by "Sort2" (for regions with equal values for the "Sort1" property) and finally by "Sort3". Valid choices for these parameters are "Position", "Type" and "Score", and the three parameters should preferably have different values. The default is to sort first by "Position", then by "Type" and finally by "Score". |
Include module motifs | If selected, the constituent single TF binding sites making up a cis-regulatory module will also be included for each module region. Hence, if a module consists of three TFBS, the module region will be output first on one line, followed by three lines containing each of the TFBS regions. The third column in the output will have the value "module" for the module regions and "motif" for the individual TFBS. In addition, a "module_identifier" is output on each line, which can be used to group a module entry together with its corresponding motif (TFBS) entries. |
Skip header lines | This (hidden) parameter can be used to specify a number of lines that should be skipped at the start of the file (default is 0). These lines are assumed to contain comments or other information that does not conform to the standard GFF format and would therefore result in parsing errors if treated as regular input. |
Format | The "Format" parameter allows you to specify a different format to use rather than the standard GFF fields. In addition to literal text, the format string can contain field codes surrounded by braces, e.g. {TYPE}. These field codes will be replaced by the corresponding property value of the region. Standard recognized field codes include: SEQUENCENAME, START, END, TYPE, SCORE, STRAND and ATTRIBUTES. Other field codes can be used to refer to user-defined properties. Tabs can be inserted using \t and extra newlines can be inserted with \n. Example: Use the following format string to output a comma-separated list with the type of the region plus start and end coordinates in the sequence: {TYPE},{START},{END} |
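The "TSS-Relative" numbering described for the "Relative-offset" parameter, where any non-zero offset makes the coordinate system skip 0, can be sketched as follows. This is a minimal illustration and not MotifLab's code: the `tss_relative` function is hypothetical, and it assumes a sequence on the direct strand with genomic coordinates as input.

```python
def tss_relative(position: int, tss: int, relative_offset: int) -> int:
    """Convert a genomic position to a TSS-relative coordinate.

    Assumption: the sequence is on the direct strand, so positions
    upstream of the TSS have smaller genomic coordinates.
    """
    distance = position - tss
    if relative_offset == 0:
        # The TSS itself is at position +0.
        return distance
    # The TSS is at +1 and there is no position 0:
    # upstream positions stay negative, the rest shift up by one.
    return distance + 1 if distance >= 0 else distance


# Made-up TSS coordinate for illustration only.
tss = 1000
for offset in (0, 1):
    print(offset, [tss_relative(p, tss, offset) for p in range(998, 1003)])
```

With an offset of 0 the positions around the TSS run -2, -1, 0, 1, 2, while with any other offset they run -2, -1, 1, 2, 3.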