Description
The EvidenceGFF format is an extension of the popular GFF format for region-based features.
The format allows the user to specify a list of additional properties that will be output alongside the standard GFF fields for each region.
The additional properties can be output either in semicolon-separated "key=value" format as part of the normal "attributes" field in the standard GFF format
or as additional fields separated by TAB (which will then extend the standard GFF format). Which format to use can be selected with the "Evidence format" setting.
The additional properties to output are specified as a string in the "Evidence" setting. This setting should be a list of comma-separated fields in "key=value" format.
(Alternatively, the list can be separated by semicolons instead of commas and colons can be used instead of "=" to separate the name of the key from its value).
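The syntax of the "Evidence" setting can be illustrated with a small parser. This is only a sketch of the rules stated above, not MotifLab's own parser; the function name parse_evidence is hypothetical, and the sketch assumes that keys and values never themselves contain commas or semicolons.

```python
import re


def parse_evidence(spec):
    """Split an Evidence string into (key, value) pairs, keeping their order."""
    pairs = []
    # Fields are separated by commas or, alternatively, by semicolons.
    for field in re.split(r"[,;]", spec):
        field = field.strip()
        if not field:
            continue
        # The key ends at the first "=" (or, alternatively, ":") in the field.
        match = re.match(r"([^=:]+)[=:](.*)", field)
        if match is None:
            raise ValueError(f"not in key=value format: {field!r}")
        pairs.append((match.group(1).strip(), match.group(2).strip()))
    return pairs


spec = "motif=short name,Conservation=average,Repeats=is overlapping,TFBS=list within 30"
for key, value in parse_evidence(spec):
    print(key, "->", value)
```

Each pair produced this way corresponds to one evidence value per output region.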
The "key" can either refer to a known feature dataset or be one of the special keywords region, motif, module, sequence or text.
The proper format of the "value" will depend on the type of the key as described in the table below:
If the key is the special keyword "region" the "value" can refer to any property associated with the region.
Some common region properties are:
- type
- Will output the type of the region
- score
- Will output the score value associated with the region
- orientation
- Will output the orientation of the region: 1 (direct),
-1 (reverse) or 0 (undetermined).
In versions 1.05+ the property orientationsymbol
or orientationstring will return a plus-symbol (+) for regions in
the direct orientation, a minus-symbol (-) for regions in the
reverse orientation and a dot (.) for regions with
undetermined orientation.
- sequence
- Will output the DNA sequence spanned by the region (this property is usually only defined for regions in motif tracks)
|
If the key is the special keyword "motif" the following formats for "value" are recognized:
- ID
- Will output the name of the motif (usually just an identifier)
- short name
- Will output a short name for the motif (but usually more descriptive than the ID)
- long name
- Will output a longer name for the motif
- consensus
- Will output the consensus binding sequence of the motif
- classification
- Will output the classification of the motif (based on the type of binding factor)
- factors
- Will output a list of transcription factors that bind to this motif
In MotifLab version 2 (starting with v2.0.-3), the "value" can refer to any standard or user-defined motif property.
Note that the "motif" keyword is only applicable for motif tracks where each region refers to a TFBS with an associated motif.
If the region is not associated with a motif, the special code "N/A" will be output.
|
If the key is the special keyword "module" the "value" can be any standard or user-defined module property.
This feature was added in MotifLab v2.0.-3.
Note that the "module" keyword is only applicable for module tracks where each region refers to a known cis-regulatory module type.
If the region is not associated with a module, the special code "N/A" will be output.
|
If the key is the special keyword "sequence" the following formats for "value" are recognized: (requires MotifLab version 1.05+)
- name
- Will output the name of the sequence
- gene name (or genename)
- Will output the name of the gene associated with the sequence (if specified)
- species (or organism)
- Will output the common name of the organism the
sequence originates from
- latin species (or latin organism)
- Will output the Latin name of the organism the sequence
originates from
- taxonomy
- Will output the species taxonomy identifier of the
organism the sequence originates from (e.g. for human
sequences this will be "9606")
- build
- Will output the genome build that the sequence
originates from
- start
- Will output the genomic coordinate for the start of the sequence
- end
- Will output the genomic coordinate for the end of the
sequence
- chromosome
- Will output the chromosome that the sequence resides on
- chr
- Same as "chromosome" above but with an added "chr" prefix.
- orientation
- Outputs a plus sign (+) if the sequence is from the
direct strand, a minus sign (-) if the sequence is from the
reverse strand or a dot (.) if the sequence orientation is unknown.
In MotifLab version 2 (starting with v2.0.-3), the "value" can refer to any standard or user-defined sequence property.
|
If the key is the special keyword "text" then the
corresponding value will be output verbatim.
E.g. the evidence code "text=BindingSite" will output "BindingSite" for every region.
|
If the key is the name of a DNA Sequence Dataset the following formats for "value" are recognized:
- direct
- Will output the DNA sequence spanned by the region. The DNA sequence will be from the direct strand.
- reverse
- Will output the DNA sequence spanned by the region. The DNA sequence will be from the reverse strand.
- relative
- Will output the DNA sequence spanned by the region. The DNA sequence will be from the strand relative to the orientation of the corresponding Sequence.
|
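The three DNA Sequence Dataset values can be sketched as follows. The function and parameter names are hypothetical, coordinates are taken to be inclusive genomic positions, and the sketch assumes that "from the reverse strand" means the reverse complement of the direct-strand bases, which the description above does not state explicitly.

```python
# Translation table for complementing DNA bases (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def region_dna(direct_seq, seq_start, region_start, region_end,
               mode, sequence_orientation=1):
    """Return the DNA spanned by a region.

    direct_seq           -- direct-strand DNA of the whole sequence
    seq_start            -- genomic coordinate of its first base
    region_start/_end    -- inclusive genomic coordinates of the region
    mode                 -- "direct", "reverse" or "relative"
    sequence_orientation -- 1 (direct) or -1 (reverse)
    """
    sub = direct_seq[region_start - seq_start:region_end - seq_start + 1]
    if mode == "direct":
        return sub
    if mode == "reverse":
        return reverse_complement(sub)
    if mode == "relative":
        # Follow the strand of the parent sequence.
        return reverse_complement(sub) if sequence_orientation == -1 else sub
    raise ValueError(f"unknown value: {mode!r}")


print(region_dna("ACGTTGCA", 100, 101, 104, "direct"))
print(region_dna("ACGTTGCA", 100, 101, 104, "reverse"))
```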
If the key is the name of a Numeric Dataset the following formats for "value" are recognized:
- minimum (or min)
- Will output the smallest value in the interval spanned by the region
- maximum (or max)
- Will output the largest value in the interval spanned by the region
- average (or avg)
- Will output the average value in the interval spanned by the region
- weighted average (or weighted avg)
- Will output the weighted average value in the interval spanned by the region.
This value is only defined if the region refers to a binding site for a motif and will default to the normal unweighted average for all other regions.
The value of the numeric track in each position is
weighted by the normalized information content of
the corresponding aligned PWM column of the motif.
- median
- Will output the median value in the interval spanned by the region
- sum
- Will output the sum of the values in the interval spanned by the region
- weighted sum
- The weighted sum of the values in the interval spanned by the region.
This value is only defined if the region refers to a binding site for a motif and will default to the normal unweighted sum for all other regions.
The value of the numeric track in each position is
weighted by the normalized information content of
the corresponding aligned PWM column of the motif.
- startValue
- Will output the value of the numeric track corresponding to the start position of the region (the smallest genomic coordinate)
- endValue
- Will output the value of the numeric track corresponding to the end position of the region (the largest genomic coordinate)
- relativeStartValue
- Will output the value of the numeric track corresponding to the position within the region located furthest upstream when viewed relative to the orientation of the Sequence
- relativeEndValue
- Will output the value of the numeric track corresponding to the position within the region located furthest downstream when viewed relative to the orientation of the Sequence
- regionStartValue
- Will output the value of the numeric track corresponding to the position within the region located furthest upstream when viewed relative to the orientation of the region
- regionEndValue
- Will output the value of the numeric track corresponding to the position within the region located furthest downstream when viewed relative to the orientation of the region
- centerValue
- Will output the value of the numeric track corresponding to the position at the center of the region
|
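The Numeric Dataset values listed above can be sketched in a few lines. This is an illustration under stated assumptions rather than MotifLab's implementation: the function name and input layout are hypothetical, the weighted average is assumed to be the weighted sum divided by the sum of the weights, an undetermined orientation (0) is treated like the direct orientation, and for regions of even length the center is taken to be the position just right of the midpoint.

```python
from statistics import mean, median


def numeric_evidence(values, keyword, sequence_orientation=1,
                     region_orientation=1, weights=None):
    """Evaluate one Numeric Dataset value for a single region.

    values  -- track values inside the region, ordered by increasing
               genomic coordinate (hypothetical input layout)
    weights -- per-position weights (e.g. normalized PWM information
               content), or None for regions without a motif
    """
    if weights is None:
        # Equal weights reduce to the normal unweighted sum and average.
        weights = [1.0] * len(values)
    weighted_sum = sum(w * v for w, v in zip(weights, values))

    # Index of the upstream end as seen from the sequence or the region.
    seq_up = -1 if sequence_orientation == -1 else 0
    reg_up = -1 if region_orientation == -1 else 0

    handlers = {
        "minimum": lambda: min(values),
        "maximum": lambda: max(values),
        "average": lambda: mean(values),
        "weighted average": lambda: weighted_sum / sum(weights),
        "median": lambda: median(values),
        "sum": lambda: sum(values),
        "weighted sum": lambda: weighted_sum,
        "startValue": lambda: values[0],
        "endValue": lambda: values[-1],
        "relativeStartValue": lambda: values[seq_up],
        "relativeEndValue": lambda: values[-1 - seq_up],
        "regionStartValue": lambda: values[reg_up],
        "regionEndValue": lambda: values[-1 - reg_up],
        "centerValue": lambda: values[len(values) // 2],
    }
    aliases = {"min": "minimum", "max": "maximum", "avg": "average",
               "weighted avg": "weighted average"}
    return handlers[aliases.get(keyword, keyword)]()


track = [1.0, 2.0, 3.0, 4.0]
print(numeric_evidence(track, "average"))
print(numeric_evidence(track, "regionStartValue", region_orientation=-1))
```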
If the key is the name of a Region Dataset (hereafter called the "target dataset") the "value" should be in the following format:
<operator> [qualifiers] <condition> [range] [additional]
The qualifiers field is optional but can contain a space-separated list of keywords as defined below.
The range field is only required when the condition is "within".
The additional field can be added when the operator is
"list". Allowed values for this field are described below in
connection with the list-operator.
Based on the condition, range and qualifiers, a set of target regions will be obtained from the target dataset.
The following conditions will determine which target regions are included in this set:
- overlapping
- The set will include those regions from the target dataset that overlap with the region being currently output by EvidenceGFF
- inside
- The set will include those regions from the target dataset that are fully inside the region being currently output by EvidenceGFF
- covering
- The set will include those regions from the target dataset that fully cover the region being currently output by EvidenceGFF
- within [range]
- The set will include those regions from the target dataset that overlap with an interval extending range bases on either side of the region being currently output by EvidenceGFF.
The range can be specified as a numeric literal or with a Numeric Variable or Numeric Map.
- present
- This set will include only those regions from the target dataset that are identical in every way to the region being currently output by EvidenceGFF.
This condition is mainly useful in statements like
"filteredRegions=is present", which will be true if the
region being output is also present in the track named "filteredRegions".
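The conditions above amount to simple interval tests. The following sketch uses a hypothetical Region type with inclusive start and end coordinates, and a hypothetical flank parameter standing in for the range; it shows one plausible reading of each condition, not MotifLab's actual code.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int          # inclusive
    end: int            # inclusive
    type: str = ""
    score: float = 0.0


def overlaps(a, b):
    """True if the two regions share at least one position."""
    return a.start <= b.end and b.start <= a.end


def select_targets(current, targets, condition, flank=0):
    """Return the target regions that satisfy the condition."""
    if condition == "overlapping":
        keep = lambda t: overlaps(t, current)
    elif condition == "inside":
        keep = lambda t: current.start <= t.start and t.end <= current.end
    elif condition == "covering":
        keep = lambda t: t.start <= current.start and current.end <= t.end
    elif condition == "within":
        # Extend the current region by `flank` bases on either side.
        window = Region(current.start - flank, current.end + flank)
        keep = lambda t: overlaps(t, window)
    elif condition == "present":
        # Identical in every way: NamedTuple equality compares all fields.
        keep = lambda t: t == current
    else:
        raise ValueError(f"unknown condition: {condition!r}")
    return [t for t in targets if keep(t)]


current = Region(100, 110)
targets = [Region(90, 105, "A"), Region(102, 108, "B"),
           Region(95, 120, "C"), Region(115, 125, "D")]
for condition in ("overlapping", "inside", "covering"):
    print(condition, [t.type for t in select_targets(current, targets, condition)])
```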
The resulting set can be further filtered by requiring the target regions to have additional qualifications:
- non-overlapping
- Only target regions that do not overlap with the region being currently output by EvidenceGFF will be kept.
(This qualifier is really only useful in conjunction with the "within" condition).
- interacting (or "interaction partner")
- Only target regions that bind transcription factors known to interact with factors bound by the region being currently output by EvidenceGFF will be kept.
(This qualifier is only useful when both the current region and the target region represent TFBS)
After the set of target regions has been obtained based on the condition and filtered based on the selected qualifiers,
the choice of operator will determine the final output. EvidenceGFF recognizes the following operators:
- is
- The final output will be a boolean value (YES/NO, TRUE/FALSE or similar) reflecting whether the set of target regions is non-empty (i.e. whether any target regions met the specified criteria).
- count
- The final output will be a numeric value reflecting the size of the set of target regions.
- list
- The final output will be a comma-separated list of type names for the target regions in the set.
As described above, an [additional] field may be appended
with one of the following values: "with
scores", "with distances" or "with scores and
distances". When the list of target regions is output
"with scores", the score of each target region is written
in parentheses after the type name of the target
region. If the list is output "with distances", the
shortest distance from the target region to the region
being currently output by EvidenceGFF is written in
brackets [] after the type name of the target. If the two
regions overlap, a distance of -1 will be output.
As of MotifLab v2.0.-3, the value of this field can also be "with [motif | module] <propertyname>"
which will output the value of the specified region property within parentheses.
The property name can be prefixed with either motif or module to signal that the name
instead refers to a property of the motif or module associated with the region.
- percentage (or percent). (Requires version 1.05+)
- This operator can only be used in combination with the
'overlapping' condition (i.e. "percentage overlapping") and
will output the largest fraction of overlap that the currently
output region has with any of the target regions.
As of MotifLab v2.0.-3 it is also possible to use "percentage all overlapping" to output a comma-separated list with percentage overlap
for every overlapping target region. Note that the order in which these percentages are listed is the same as the order of the regions output
with the corresponding "list overlapping" statement.
- distance to <qualifier>
- The final output will be a numeric value reflecting the distance to the closest qualified target region or the special value "N/A" if no qualified regions could be found.
The required qualifier can be "any" (or "closest") which will just output the distance to the closest target region, "interacting" (or "interaction partner") which will
output the distance to the nearest region representing a known interaction partner (assuming both regions are motif sites), or it can be the name of a Collection or Text Variable
which will output the distance to the nearest region whose type is a member of the Collection or Text Variable.
The qualifier "non-overlapping" can also be added to ignore overlapping target regions.
Note that if the "target dataset" is the same as the region dataset being currently output in EvidenceGFF format, the current region being output will never be included in the set of target regions described here.
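Once the set of target regions is known, the operators reduce it to a single output value. The sketch below covers "is", "count", "list" and "percentage"; all names are hypothetical. It assumes that the distance between two non-overlapping regions is the number of bases separating them, that scores are printed without rounding, and that "percentage" yields a fraction between 0 and 1. None of these details are specified above, so the real output may differ.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int      # inclusive
    end: int        # inclusive
    type: str
    score: float


def overlap_length(a, b):
    """Number of positions shared by the two regions (0 if none)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def distance(a, b):
    """Bases between two regions, or -1 if they overlap."""
    if overlap_length(a, b) > 0:
        return -1
    return max(a.start, b.start) - min(a.end, b.end) - 1


def apply_operator(operator, current, targets, additional=""):
    """Reduce the set of target regions to one output value."""
    if operator == "is":
        return len(targets) > 0
    if operator == "count":
        return len(targets)
    if operator == "list":
        items = []
        for t in targets:
            item = t.type
            if "scores" in additional:
                item += f"({t.score})"
            if "distances" in additional:
                item += f"[{distance(current, t)}]"
            items.append(item)
        return ",".join(items)
    if operator == "percentage":
        # Largest share of the current region covered by any one target.
        length = current.end - current.start + 1
        return max((overlap_length(current, t) / length for t in targets),
                   default=0.0)
    raise ValueError(f"unknown operator: {operator!r}")


current = Region(100, 110, "X", 1.0)
targets = [Region(90, 104, "A", 2.5), Region(115, 120, "B", 0.5)]
print(apply_operator("list", current, targets, "with scores and distances"))
```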
Examples: (keys are assumed to be referring to known Region Datasets)
DNaseHS=is overlapping
Will output YES or NO depending on whether the current region being output overlaps with any regions in the DNaseHS track.
ChIP_Seq_tags=count covering
Will output the number of ChIP_Seq_tags that are completely covering the current region being output (so that the current region is fully inside the tag region)
TFBS=list non-overlapping interacting within 20 with scores and distances
Will list the type names of TFBS regions that are overlapping
an interval extending 20 bp on either side of the current
region but not overlapping with the current region itself. The
target regions must be associated with motifs that are known
to interact with the motif associated with the current
region. The score of the target region will be output in
parentheses after its type name and this will be followed by
the distance between the target region and the current region
in brackets.
|
For example, the following "Evidence" format:
motif=short name,Conservation=average,Repeats=is overlapping,TFBS=list within 30
will add 4 new fields to the GFF format. The first new field will contain a short name of the motif associated with the region being output.
The second field will contain the average value of the "Conservation" track within the interval spanned by the region. The third field will contain a YES or NO
value depending on whether or not the region overlaps with a region in the "Repeats" track, and the fourth and last field will contain a list of type names for regions
in the "TFBS" track that are within 30 bp of the current region. The output could look something like this:
NTNG1 BindingSites M00378 48 59 5.963 - . V$PAX4_03 0.109 No
RPRM BindingSites M00253 296 303 3.801 + . V$CAP_01 0.235 Yes M00313
RPRM BindingSites M00313 301 308 5.697 + . V$GEN_INI2 0.0 Yes M00253,M00315
...
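The two ways of writing evidence values (as extra TAB-separated columns, or as semicolon-separated key=value pairs in a single field) can be sketched as follows. The function name and the as_columns flag are hypothetical, and using the evidence key as the label in key=value mode is an assumption.

```python
def format_line(standard_fields, evidence, as_columns=True):
    """Build one output line from standard GFF fields and evidence pairs.

    standard_fields -- the standard GFF values for the region
    evidence        -- list of (key, value) pairs in "Evidence" order
    as_columns      -- True: one TAB-separated column per evidence value;
                       False: all evidence in one key=value;... field
    """
    fields = [str(f) for f in standard_fields]
    if as_columns:
        fields.extend(str(value) for _, value in evidence)
    else:
        fields.append(";".join(f"{key}={value}" for key, value in evidence))
    return "\t".join(fields)


standard = ["RPRM", "BindingSites", "M00253", 296, 303, 3.801, "+", "."]
evidence = [("motif", "V$CAP_01"), ("Conservation", 0.235),
            ("Repeats", "Yes"), ("TFBS", "M00313")]
print(format_line(standard, evidence, as_columns=True))
print(format_line(standard, evidence, as_columns=False))
```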
Name | Description |
Position |
Specifies whether the
coordinate positions [start-end] for a region should be given relative
to the start of the chromosome ("Genomic"), relative to the upstream
start of the sequence ("Relative") or relative to the transcription
start site associated with the sequence ("TSS-Relative")
|
Relative-offset |
If the "Position" setting is "Relative", this offset value
specifies the position assigned to the first base in the sequence
(common choices are 0 or 1). If the "Position" setting is
"TSS-Relative", a value of 0 here will place the TSS at position +0,
whereas any other value will place the TSS at position +1 (the
coordinate system will then skip 0 and go directly from -1 to +1).
|
Orientation |
Orientation of relative coordinates (only applicable if the "Position"
setting is "Relative").
|
Sort1,Sort2,Sort3 |
These three parameters control how to sort the regions in the
output. The regions will be sorted first by "Sort1", then by "Sort2"
(for regions with similar values for the "Sort1" property) and finally
by Sort3. Valid choices for these parameters are "Position", "Type" and
"Score" and the three parameters should preferably have different
values. The default choice is to first sort by "Position", then "Type" and
finally "Score".
|
Include header |
If selected, a single header line (starting with #) will be output at the beginning of the
output-document. The header contains a specification of all the fields included in the output.
|
Skip standard fields |
If selected, the standard GFF fields will not be output, only the evidence fields.
|
Boolean format |
This parameter specifies how boolean values should be formatted in the
output. Either as "Yes" versus "No" (alternatively "Y" versus "N"), "True" versus
"False" (alternatively "T" versus "F") or "1" versus "0".
|
Evidence format |
Specifies how the "evidence" should be output for each region. Options
are to output each evidence value in a column of its own or to output
all evidences in a single column in key=value pairs (separated by semicolons).
|
Evidence |
The "evidence" parameter should be a comma-separated list of key=value pairs
specifying additional information that should be output for each region.
See above for a complete description of recognized evidence codes.
|
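The effect of the "Relative-offset" setting on coordinates can be illustrated with a short sketch. The function names are hypothetical, and the sketch only covers sequences on the direct strand, since the handling of reverse-strand sequences is not described above.

```python
def relative_position(genomic_pos, sequence_start, offset=1):
    """Position relative to the upstream start of the sequence.

    The first base of the sequence gets the position `offset`.
    """
    return genomic_pos - sequence_start + offset


def tss_relative_position(genomic_pos, tss, offset=0):
    """Position relative to the transcription start site.

    With offset 0 the TSS is at position 0. With any other offset the
    TSS is at +1 and position 0 is skipped (... -2, -1, +1, +2 ...).
    """
    delta = genomic_pos - tss
    if offset == 0:
        return delta
    return delta + 1 if delta >= 0 else delta


# Positions around a TSS at genomic coordinate 2000, skipping zero.
print([tss_relative_position(p, 2000, offset=1) for p in range(1998, 2003)])
```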