DataFormat

GFF

Applies to:  Region Dataset

Description

The General Feature Format (GFF) is one of the most popular formats for exchanging information about region-based features. The official GFF specification describes the format in full. In brief, the format outputs one region per line, and each line consists of 8 (or optionally 9) TAB-separated fields.

The fields are in order:
  1. The name of the sequence
  2. The source of the feature
  3. The feature type
  4. The start coordinate of the region
  5. The end coordinate of the region
  6. A score value for the region
  7. The orientation of the region. This can be "+" or "-" (or "." if orientation is unspecified)
  8. The reading frame. The value of this field is either 0, 1 or 2 (or "." if the frame does not apply)
  9. Additional attributes. This optional field consists of a list of attributes separated by semicolons. Each attribute has a key (or "tag") followed by a value for the attribute (separated by an equals sign).
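
The field layout above can be illustrated with a small parser. This is only a minimal sketch in Python and not MotifLab's own code: the names GffRegion and parse_gff_line, the source name "MotifScanner" and the attribute "note=example" are hypothetical, and treating "." in the score column as a missing score follows the general GFF convention rather than anything stated on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GffRegion:
    """One GFF line, with the fields in the order listed above (hypothetical type)."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    attributes: Dict[str, str] = field(default_factory=dict)


def parse_gff_line(line: str) -> GffRegion:
    """Split one TAB-separated GFF line into its 8 (or 9) fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) not in (8, 9):
        raise ValueError(f"expected 8 or 9 fields, got {len(cols)}")
    attributes = {}
    if len(cols) == 9:
        # Attributes are key=value pairs separated by semicolons.
        for pair in cols[8].split(";"):
            pair = pair.strip()
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[key.strip()] = value.strip()
    # Assumption: "." in the score column means that no score is given.
    score = None if cols[5] == "." else float(cols[5])
    return GffRegion(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                     score, cols[6], cols[7], attributes)


# Illustrative line; the source and attribute values are made up.
example = "ENSG00000120948\tMotifScanner\tM00378\t483\t494\t5.963\t+\t.\tnote=example"
region = parse_gff_line(example)
print(region.feature, region.start, region.end, region.attributes)
```

Lines with fewer than 8 or more than 9 columns are rejected here; a real importer may choose to be more lenient.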

NOTE:
When importing regions from a GFF-file, the sequence name in the first column must correspond to the name of an existing sequence in MotifLab, and the region will then be added to that sequence. If the first column contains a chromosome name, it will only be added to a sequence if there is a sequence that is actually named after the chromosome; it is not enough that the sequence covers the chromosomal segment that the region from the GFF-file falls within. When the first column contains chromosome names, it is suggested instead to use the GTF format (or convert the file to BED format).


Sequences output in GFF format are output according to the currently selected sorting order of the sequences, but within each sequence the user can specify whether to sort the regions by position, score or type.

The start and end positions of each region (fields 4 and 5) can be output as either genomic coordinates or as positions relative to the start of the sequence by setting the "Position" option to either "Genomic" or "Relative". If the "Relative" setting is chosen, the "Relative-offset" and "Orientation" settings will also apply. The "Relative-offset" setting specifies the coordinate of the first position in the sequence. This will normally be 1 but can be set to other values if needed (for instance 0).

The "Orientation" setting specifies which orientation to use to determine the relative region coordinates. For example, if a 100 bp long sequence on the direct strand has a binding site region from position 80 to 90, the start and end coordinates will be [80,90] if the "Direct" strand orientation is selected or [10,20] if the "Reverse" orientation is selected. If the "Orientation" is set to "From Sequence" the strand orientation will be selected based on the orientation of the sequence itself, so that sequences on the direct strand will be output in direct orientation and those on the reverse strand will be output in reverse orientation. If the "Opposite" strand orientation is selected, the orientation will be the opposite of the orientation of the sequence.
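
The effect of the "Relative-offset" setting can be sketched with a few lines of Python. This is an illustration under stated assumptions, not MotifLab's implementation: the function name to_relative and the genomic coordinates are hypothetical, and only the "Direct" orientation is covered.

```python
def to_relative(genomic_pos: int, sequence_start: int, relative_offset: int = 1) -> int:
    """Map a genomic position to a position relative to the sequence start.

    Assumes direct orientation: the first base of the sequence
    (genomic position sequence_start) is numbered relative_offset.
    """
    return genomic_pos - sequence_start + relative_offset


# Hypothetical 100 bp sequence covering genomic positions 1000..1099,
# with a region at genomic positions 1079..1089.
seq_start = 1000
region = (1079, 1089)

one_based = tuple(to_relative(p, seq_start, 1) for p in region)
zero_based = tuple(to_relative(p, seq_start, 0) for p in region)
print(one_based)   # (80, 90)
print(zero_based)  # (79, 89)
```

With an offset of 1 the region is reported at [80,90], matching the "Direct" case in the example above; with an offset of 0 every coordinate shifts down by one.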

If the standard GFF format is not adequate, the "Format" setting can be used to specify an alternative output format. The alternative format is specified by a string consisting of a mix of literal characters and special field codes surrounded by braces (e.g. {START}). For each region, the field codes in the format string (if recognized) will be replaced by the corresponding value of the field as it applies to the target region before the string is output. Some recognized fields are: SEQUENCENAME, FEATURE, SOURCE, START, END, SCORE, STRAND and TYPE (note the capitalization). TABs can be represented with the escape sequence \t.

For example, the following output format:
Binding site for {TYPE} at {START}-{END} with score={SCORE} in sequence {SEQUENCENAME}

will produce output that looks like this:
Binding site for M00378 at 483-494 with score=5.963 in sequence ENSG00000120948
Binding site for M00253 at 3-10 with score=3.801 in sequence ENSG00000116741
Binding site for M00313 at 8-15 with score=5.697 in sequence ENSG00000116741
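
The substitution of field codes can be mimicked with a short Python sketch. The function expand_format and the dictionary layout of a region are hypothetical and only illustrate the idea. Leaving unrecognized codes untouched is an assumption, since this page does not say how MotifLab treats them.

```python
import re

# A field code is any run of characters between a pair of braces.
FIELD_CODE = re.compile(r"\{([^{}]+)\}")


def expand_format(template: str, region: dict) -> str:
    """Replace {FIELD} codes in the template with values from the region."""
    # Turn the two-character escape \t into a real TAB.
    template = template.replace("\\t", "\t")

    def substitute(match):
        key = match.group(1)
        # Assumption: codes that are not recognized are left as they are.
        return str(region[key]) if key in region else match.group(0)

    return FIELD_CODE.sub(substitute, template)


# Values taken from the first output line of the example above.
region = {
    "TYPE": "M00378",
    "START": 483,
    "END": 494,
    "SCORE": 5.963,
    "SEQUENCENAME": "ENSG00000120948",
}
template = "Binding site for {TYPE} at {START}-{END} with score={SCORE} in sequence {SEQUENCENAME}"
print(expand_format(template, region))
```

Run on the values above, the sketch reproduces the first line of the example output.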

Arguments

Name: Description
Position: Specifies whether the coordinate positions [start-end] for a region should be given relative to the start of the chromosome ("Genomic"), relative to the upstream start of the sequence ("Relative") or relative to the transcription start site associated with the sequence ("TSS-Relative").
Relative-offset: If the "Position" setting is set to "Relative", this offset value specifies what position the first base in the sequence should start at (common choices are 0 or 1). If the "Position" setting is "TSS-Relative", a value of 0 here will place the TSS at position +0 whereas any other value will place the TSS at position +1 (and the coordinate system will then skip 0 and go directly from -1 to +1).
Orientation: Orientation of relative coordinates (only applicable if the "Position" setting is "Relative").
Sort1, Sort2, Sort3: These three parameters control how to sort the regions in the output. The regions will be sorted first by "Sort1", then by "Sort2" (for regions with the same value for the "Sort1" property) and finally by "Sort3". Valid choices for these parameters are "Position", "Type" and "Score", and the three parameters should preferably have different values. The default choice is to first sort by "Position", then "Type" and finally "Score".
Include module motifs: If selected, the constituent single TF binding sites making up a cis-regulatory module will also be included for each module region. Hence, if a module consists of three TFBS, the module region will be output first on one line followed by three lines containing each of the TFBS regions. The third column in the output will have the value "module" for the module regions and "motif" for the individual TFBS. Also, a "module_identifier" is output on each line that can be used to group together a module entry with its corresponding motif (TFBS) entries.
Skip header lines: This (hidden) parameter can be used to specify a number of lines that should be skipped at the start of the file (default is 0). These lines are assumed to contain comments or other information that does not conform to the standard GFF format and would therefore result in parsing errors if treated as regular input.
Format: The "Format" parameter allows you to specify a different format to use rather than the standard GFF fields. In addition to literal text, the format string can contain field codes surrounded by braces, e.g. {TYPE}. These field codes will be replaced by the corresponding property value of the region. Standard recognized field codes include: SEQUENCENAME, START, END, TYPE, SCORE, STRAND and ATTRIBUTES. Other field codes can be used to refer to user-defined properties. Tabs can be inserted using \t and extra newlines can be inserted with \n.
Example: Use the following format string to output a comma-separated list with the type of the region plus start and end coordinates in the sequence:
{TYPE},{START},{END}
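
Two of the arguments above, the "TSS-Relative" numbering controlled by "Relative-offset" and the three-level sort order, can be sketched in Python as follows. Both functions are hypothetical illustrations rather than MotifLab code. The sketch assumes direct orientation for the TSS numbering and ascending order for every sort key, neither of which is stated on this page, and the region of type M00100 is made up.

```python
def tss_relative(pos: int, tss: int, relative_offset: int) -> int:
    """Number a position relative to the TSS (direct orientation assumed)."""
    distance = pos - tss
    if relative_offset == 0:
        return distance  # the TSS itself is position 0
    # Any other offset: the TSS is +1 and there is no position 0.
    return distance + 1 if distance >= 0 else distance


def sort_regions(regions, sort1="Position", sort2="Type", sort3="Score"):
    """Sort regions by up to three properties, in priority order."""
    getters = {
        "Position": lambda r: r["start"],
        "Type": lambda r: r["type"],
        "Score": lambda r: r["score"],  # assumption: ascending order
    }
    order = [getters[name] for name in (sort1, sort2, sort3)]
    return sorted(regions, key=lambda r: tuple(get(r) for get in order))


print([tss_relative(p, 100, 0) for p in (98, 99, 100, 101)])  # [-2, -1, 0, 1]
print([tss_relative(p, 100, 1) for p in (98, 99, 100, 101)])  # [-2, -1, 1, 2]

regions = [
    {"type": "M00313", "start": 8, "end": 15, "score": 5.697},
    {"type": "M00253", "start": 3, "end": 10, "score": 3.801},
    {"type": "M00100", "start": 8, "end": 15, "score": 1.0},  # made-up region
]
print([r["type"] for r in sort_regions(regions)])
# ['M00253', 'M00100', 'M00313']
```

With the default order, the two regions that start at position 8 are tied on "Position" and are therefore ordered by "Type".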

See Also: output, EvidenceGFF, Region Dataset