DataFormat

EvidenceGFF

Applies to:  Region Dataset

Description

The EvidenceGFF format is an extension of the popular GFF format for region-based features. The format lets the user specify a list of additional properties that will be output alongside the standard GFF fields for each region. The additional properties can be output either in semicolon-separated "key=value" format as part of the normal "attributes" field of the standard GFF format, or as additional TAB-separated fields (which then extend the standard GFF format). The "Evidence format" setting selects which of the two to use.
The additional properties to output are specified as a string in the "Evidence" setting. This setting should be a comma-separated list of fields in "key=value" format. (Alternatively, the list can be separated by semicolons instead of commas, and colons can be used instead of "=" to separate the name of the key from its value.)
The "key" can either refer to a known feature dataset or be one of the special keywords region, motif, module, sequence or text.
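As a rough illustration of the "Evidence" syntax described above, the following Python sketch splits a setting string into key/value pairs. The function name parse_evidence is hypothetical and not part of MotifLab, and the sketch assumes that keys and values never contain the separator characters themselves.

```python
import re


def parse_evidence(setting):
    """Split an "Evidence" string into a list of (key, value) pairs."""
    pairs = []
    # Entries are separated by commas (or, alternatively, by semicolons).
    for entry in re.split(r"[,;]", setting):
        entry = entry.strip()
        if not entry:
            continue
        # The key is separated from its value by "=" (or, alternatively, ":").
        parts = re.split(r"[=:]", entry, maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"expected key=value, got {entry!r}")
        key, value = parts
        pairs.append((key.strip(), value.strip()))
    return pairs


for key, value in parse_evidence("motif=short name,Conservation=average,Repeats=is overlapping"):
    print(key, "->", value)
```

The same function accepts the alternative notation, e.g. parse_evidence("motif:ID;region:score").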

The proper format of the "value" depends on the type of the key, as described below:

If the key is the special keyword "region" the "value" can refer to any property associated with the region.
Some common region properties are:
type
Will output the type of the region
score
Will output the score value associated with the region
orientation
Will output the orientation of the region: 1 (direct), -1 (reverse) or 0 (undetermined).
In versions 1.05+ the property orientationsymbol or orientationstring will return a plus-symbol (+) for regions in the direct orientation, a minus-symbol (-) for regions in the reverse orientation and a dot (.) for regions with undetermined orientation.
sequence
Will output the DNA sequence spanned by the region (this property is usually only defined for regions in motif tracks)
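The relationship between the numeric "orientation" property and the "orientationsymbol" property can be pictured with a small lookup table. This is only a sketch; the names below are hypothetical and not taken from MotifLab.

```python
# Numeric orientation -> symbol, as described for "orientation" and "orientationsymbol".
ORIENTATION_SYMBOLS = {1: "+", -1: "-", 0: "."}


def orientation_symbol(orientation):
    """Return "+", "-" or "." for an orientation of 1, -1 or 0."""
    return ORIENTATION_SYMBOLS[orientation]


print(orientation_symbol(-1))
```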
If the key is the special keyword "motif" the following formats for "value" are recognized:
ID
Will output the name of the motif (usually just an identifier)
short name
Will output a short name for the motif (but usually more descriptive than the ID)
long name
Will output a longer name for the motif
consensus
Will output the consensus binding sequence of the motif
classification
Will output the classification of the motif (based on the type of binding factor)
factors
Will output a list of transcription factors that bind to this motif
In MotifLab version 2 (starting with v2.0.-3), the "value" can refer to any standard or user-defined motif property. Note that the "motif" keyword is only applicable for motif tracks where each region refers to a TFBS with an associated motif. If the region is not associated with a motif, the special code "N/A" will be output.
If the key is the special keyword "module" the "value" can be any standard or user-defined module property.

This feature was added in MotifLab v2.0.-3. Note that the "module" keyword is only applicable for module tracks where each region refers to a known cis-regulatory module type. If the region is not associated with a module, the special code "N/A" will be output.
If the key is the special keyword "sequence" the following formats for "value" are recognized:
(requires MotifLab version 1.05+)
name
Will output the name of the sequence
gene name (or genename)
Will output the name of the gene associated with the sequence (if specified)
species (or organism)
Will output the common name of the organism the sequence originates from
latin species (or latin organism)
Will output the Latin name of the organism the sequence originates from
taxonomy
Will output the species taxonomy identifier of the organism the sequence originates from (e.g. for human sequences this will be "9606")
build
Will output the genome build that the sequence originates from
start
Will output the genomic coordinate for the start of the sequence
end
Will output the genomic coordinate for the end of the sequence
chromosome
Will output the chromosome that the sequence resides on
chr
Same as "chromosome" above but with an added "chr" prefix.
orientation
Outputs a plus sign (+) if the sequence is from the direct strand, a minus sign (-) if the sequence is from the reverse strand or a dot (.) if the sequence orientation is unknown.
In MotifLab version 2 (starting with v2.0.-3), the "value" can refer to any standard or user-defined sequence property.
If the key is the special keyword "text" then the corresponding value will be output verbatim.
E.g. the evidence code "text=BindingSite" will output "BindingSite" for every region.
If the key is the name of a DNA Sequence Dataset the following formats for "value" are recognized:
direct
Will output the DNA sequence spanned by the region. The DNA sequence will be from the direct strand.
reverse
Will output the DNA sequence spanned by the region. The DNA sequence will be from the reverse strand.
relative
Will output the DNA sequence spanned by the region. The DNA sequence will be from the strand relative to the orientation of the corresponding Sequence.
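The three strand options can be sketched in Python as follows. The sketch assumes that "reverse strand" means the reverse complement of the direct-strand segment and that region coordinates are 0-based inclusive offsets into the sequence; both are assumptions of this example, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement every base and reverse the result."""
    return dna.translate(COMPLEMENT)[::-1]


def region_dna(direct_strand, start, end, mode, sequence_orientation=1):
    """Return the DNA spanned by a region for mode "direct", "reverse" or "relative".

    direct_strand is the direct-strand DNA of the whole sequence; start and end
    are 0-based inclusive offsets (an assumption of this sketch).
    """
    segment = direct_strand[start:end + 1]
    if mode == "direct":
        return segment
    if mode == "reverse":
        return reverse_complement(segment)
    if mode == "relative":
        # Follow the orientation of the parent sequence (-1 means reverse strand).
        return reverse_complement(segment) if sequence_orientation == -1 else segment
    raise ValueError(f"unknown mode: {mode!r}")


print(region_dna("GATTACAGG", 2, 5, "direct"))
print(region_dna("GATTACAGG", 2, 5, "reverse"))
```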
If the key is the name of a Numeric Dataset the following formats for "value" are recognized:
minimum (or min)
Will output the smallest value in the interval spanned by the region
maximum (or max)
Will output the largest value in the interval spanned by the region
average (or avg)
Will output the average value in the interval spanned by the region
weighted average (or weighted avg)
Will output the weighted average value in the interval spanned by the region. This value is only defined if the region refers to a binding site for a motif and will default to the normal unweighted average for all other regions. The value of the numeric track in each position is weighted by the normalized information content of the corresponding aligned PWM column of the Motif
median
Will output the median value in the interval spanned by the region
sum
Will output the sum of the values in the interval spanned by the region
weighted sum
Will output the weighted sum of the values in the interval spanned by the region. This value is only defined if the region refers to a binding site for a motif and will default to the normal unweighted sum for all other regions. The value of the numeric track in each position is weighted by the normalized information content of the corresponding aligned PWM column of the Motif
startValue
Will output the value of the numeric track corresponding to the start position of the region (the smallest genomic coordinate)
endValue
Will output the value of the numeric track corresponding to the end position of the region (the largest genomic coordinate)
relativeStartValue
Will output the value of the numeric track corresponding to the position within the region located furthest upstream when viewed relative to the orientation of the Sequence
relativeEndValue
Will output the value of the numeric track corresponding to the position within the region located furthest downstream when viewed relative to the orientation of the Sequence
regionStartValue
Will output the value of the numeric track corresponding to the position within the region located furthest upstream when viewed relative to the orientation of the region
regionEndValue
Will output the value of the numeric track corresponding to the position within the region located furthest downstream when viewed relative to the orientation of the region
centerValue
Will output the value of the numeric track corresponding to the position at the center of the region
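The numeric summaries listed above can be sketched in a few lines of Python. The function names are hypothetical. The weights stand in for the normalized information content of the aligned PWM columns, and dividing the weighted sum by the total weight to obtain the weighted average is an assumption of this sketch rather than a documented formula.

```python
from statistics import median


def summarize(values, weights=None):
    """Summarize the numeric-track values inside a region (given in genomic order)."""
    total = sum(values)
    result = {
        "minimum": min(values),
        "maximum": max(values),
        "average": total / len(values),
        "median": median(values),
        "sum": total,
        "startValue": values[0],   # smallest genomic coordinate
        "endValue": values[-1],    # largest genomic coordinate
    }
    if weights is None:
        # Regions without a motif fall back to the unweighted values.
        result["weighted sum"] = total
        result["weighted average"] = result["average"]
    else:
        weighted_total = sum(w * v for w, v in zip(weights, values))
        result["weighted sum"] = weighted_total
        # Assumption: normalize by the total weight.
        result["weighted average"] = weighted_total / sum(weights)
    return result


def oriented_start_end(values, orientation):
    """Return (upstream value, downstream value) for orientation 1 or -1."""
    if orientation == -1:
        return values[-1], values[0]
    return values[0], values[-1]


print(summarize([1.0, 4.0, 2.0, 5.0], weights=[1.0, 0.5, 0.5, 0.0]))
```

Passing the orientation of the Sequence to oriented_start_end corresponds to relativeStartValue and relativeEndValue, while passing the orientation of the region corresponds to regionStartValue and regionEndValue.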
If the key is the name of a Region Dataset (hereafter called the "target dataset") the "value" should be in the following format:
        <operator> [qualifiers] <condition> [range] [additional]

The qualifiers field is optional but can contain a space-separated list of keywords as defined below.
The range field is only required when the condition is "within".
The additional field can be added when the operator is "list". Allowed values for this field are described below in connection with the list-operator.
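To make the structure of this value format concrete, here is a minimal Python sketch that breaks a value into its parts. The function name parse_region_value is hypothetical, the sketch only covers the operators "is", "count", "list" and "percentage" (not "distance to" or "percentage all overlapping"), and it does no validation of the additional field.

```python
OPERATORS = ("is", "count", "list", "percentage", "percent")
QUALIFIERS = ("non-overlapping", "interacting")
CONDITIONS = ("overlapping", "inside", "covering", "within", "present")


def parse_region_value(value):
    """Split "<operator> [qualifiers] <condition> [range] [additional]" into parts."""
    # Treat the two-word alias as the single keyword "interacting".
    tokens = value.strip().replace("interaction partner", "interacting").split()
    if not tokens or tokens[0] not in OPERATORS:
        raise ValueError(f"unknown operator in {value!r}")
    operator = tokens.pop(0)
    qualifiers = []
    while tokens and tokens[0] in QUALIFIERS:
        qualifiers.append(tokens.pop(0))
    if not tokens or tokens[0] not in CONDITIONS:
        raise ValueError(f"missing condition in {value!r}")
    condition = tokens.pop(0)
    range_ = None
    if condition == "within":
        if not tokens:
            raise ValueError("the 'within' condition requires a range")
        # May be a numeric literal or the name of a Numeric Variable or Numeric Map.
        range_ = tokens.pop(0)
    additional = " ".join(tokens) or None
    return {
        "operator": operator,
        "qualifiers": qualifiers,
        "condition": condition,
        "range": range_,
        "additional": additional,
    }


print(parse_region_value("list non-overlapping interacting within 20 with scores and distances"))
```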

Based on the condition, range and qualifiers, a set of target regions will be obtained from the target dataset.
The following conditions will determine which target regions are included in this set:
overlapping
The set will include those regions from the target dataset that overlap with the region being currently output by EvidenceGFF
inside
The set will include those regions from the target dataset that are fully inside the region being currently output by EvidenceGFF
covering
The set will include those regions from the target dataset that fully cover the region being currently output by EvidenceGFF
within [range]
The set will include those regions from the target dataset that overlap with an interval extending range bases on either side of the region being currently output by EvidenceGFF. The range can be specified as a numeric literal or with a Numeric Variable or Numeric Map.
present
This set will include only those regions from the target dataset that are identical in every way to the region being currently output by EvidenceGFF. This condition is really only useful in statements like "filteredRegions=is present" which will be true if the region being output is also present in the track named "filteredRegions"
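The conditions above amount to simple interval tests. The following Python sketch shows one way to express them, assuming inclusive start and end coordinates; the Region class and the function names are hypothetical and are not MotifLab's own data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    type: str
    start: int  # inclusive
    end: int    # inclusive
    score: float = 0.0


def matches(condition, current, target, range_=0):
    """Test whether a target region satisfies a condition relative to the current region."""
    if condition == "overlapping":
        return target.start <= current.end and current.start <= target.end
    if condition == "inside":
        return current.start <= target.start and target.end <= current.end
    if condition == "covering":
        return target.start <= current.start and current.end <= target.end
    if condition == "within":
        # Overlap with the current region extended by range_ bases on either side.
        return target.start <= current.end + range_ and current.start - range_ <= target.end
    if condition == "present":
        # Dataclass equality compares every field.
        return target == current
    raise ValueError(f"unknown condition: {condition!r}")


def select_targets(condition, current, targets, range_=0):
    """Return the target regions that satisfy the condition."""
    return [t for t in targets if matches(condition, current, t, range_)]


current = Region("M1", 100, 110)
targets = [Region("A", 95, 102), Region("B", 103, 105), Region("C", 90, 120),
           Region("D", 115, 125), Region("E", 200, 210)]
print([t.type for t in select_targets("within", current, targets, range_=10)])
```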
The resulting set can be further filtered by requiring the target regions to have additional qualifications:
non-overlapping
Only target regions that do not overlap with the region being currently output by EvidenceGFF will be kept. (This qualifier is really only useful in conjunction with the "within" condition).
interacting   (or "interaction partner")
Only target regions that bind transcription factors known to interact with factors bound by the region being currently output by EvidenceGFF will be kept. (This qualifier is only useful when both the current region and the target region represent TFBS)

After the set of target regions has been obtained based on the condition and filtered based on the selected qualifiers, the choice of operator will determine the final output. EvidenceGFF recognizes the following operators:
is
The final output will be a boolean value (YES/NO, TRUE/FALSE or similar) reflecting whether the set of target regions is non-empty (i.e. whether any target regions met the specified criteria).
count
The final output will be a numeric value reflecting the size of the set of target regions.
list
The final output will be a comma-separated list of type names for the target regions in the set.
As described above, an [additional] field may be appended having one of the following values: "with scores", "with distances" or "with scores and distances". When the list of target regions is output "with scores", the score of each target region is written out in parentheses after the type name of the target region. If the list is output "with distances", the shortest distance from the target region to the region being currently output by EvidenceGFF is written out in brackets [] after the type name of the target. If the two regions overlap, a distance of -1 will be output.
As of MotifLab v2.0.-3, the value of this field can also be "with [motif | module] <propertyname>" which will output the value of the specified region property within parentheses. The property name can be prefixed with either motif or module to signal that the name instead refers to a property of the motif or module associated with the region.
percentage (or percent). (Requires version 1.05+)
This operator can only be used in combination with the 'overlapping' condition (i.e. "percentage overlapping") and will output the largest fraction of overlap that the currently output region has with any of the target regions.
As of MotifLab v2.0.-3 it is also possible to use "percentage all overlapping" to output a comma-separated list with percentage overlap for every overlapping target region. Note that the order in which these percentages are listed is the same as the order of the regions output with the corresponding "list overlapping" statement.
distance to <qualifier>
The final output will be a numeric value reflecting the distance to the closest qualified target region or the special value "N/A" if no qualified regions could be found. The required qualifier can be "any" (or "closest") which will just output the distance to the closest target region, "interacting" (or "interaction partner") which will output the distance to the nearest region representing a known interaction partner (assuming both regions are motif sites), or it can be the name of a Collection or Text Variable which will output the distance to the nearest region whose type is a member of the Collection or Text Variable. The qualifier "non-overlapping" can also be added to ignore overlapping target regions.
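The "is", "count" and "list" operators can be sketched as follows, operating on a set of target regions that has already been selected and filtered. Regions are plain (type, start, end, score) tuples here, the function names are hypothetical, and the distance between non-overlapping regions is assumed to be the number of bases between them (so adjacent regions are at distance 0). The target M00315 and its coordinates are made up for the example.

```python
def distance(a_start, a_end, b_start, b_end):
    """Gap between two inclusive intervals, or -1 if they overlap."""
    if a_start <= b_end and b_start <= a_end:
        return -1
    if b_start > a_end:
        return b_start - a_end - 1  # assumption: adjacent regions are at distance 0
    return a_start - b_end - 1


def apply_operator(operator, current, targets, additional=None):
    """Turn the final set of target regions into an output value."""
    if operator == "is":
        return bool(targets)
    if operator == "count":
        return len(targets)
    if operator == "list":
        items = []
        for target_type, start, end, score in targets:
            item = target_type
            if additional in ("with scores", "with scores and distances"):
                item += f"({score})"
            if additional in ("with distances", "with scores and distances"):
                item += f"[{distance(current[1], current[2], start, end)}]"
            items.append(item)
        return ",".join(items)
    raise ValueError(f"unsupported operator: {operator!r}")


current = ("M00313", 301, 308, 5.697)
targets = [("M00253", 296, 303, 3.801), ("M00315", 320, 327, 4.2)]
print(apply_operator("list", current, targets, "with scores and distances"))
```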

Note that if the "target dataset" is the same as the region dataset being currently output in EvidenceGFF format, the current region being output will never be included in the set of target regions described here.

Examples: (keys are assumed to be referring to known Region Datasets)

 DNaseHS=is overlapping 
Will output YES or NO depending on whether the current region being output overlaps with any regions in the DNaseHS track.

 ChIP_Seq_tags=count covering 
Will output the number of ChIP_Seq_tags that are completely covering the current region being output (so that the current region is fully inside the tag region)

 TFBS=list non-overlapping interacting within 20 with scores and distances 
Will list the type names of TFBS regions that are overlapping an interval extending 20 bp on either side of the current region but not overlapping with the current region itself. The target regions must be associated with motifs that are known to interact with the motif associated with the current region. The score of the target region will be output in parentheses after its type name and this will be followed by the distance between the target region and the current region in brackets.


For example, the following "Evidence" format:
motif=short name,Conservation=average,Repeats=is overlapping,TFBS=list within 30

will add 4 new fields to the GFF format. The first new field will contain a short name of the motif associated with the region being output. The second field will contain the average value of the "Conservation" track within the interval spanned by the region. The third field will contain a YES or NO value depending on whether or not the region overlaps with a region in the "Repeats" track, and the fourth and last field will contain a list of type names for regions in the "TFBS" track that are within 30 bp of the current region. The output could look something like this:
NTNG1 BindingSites  M00378   48   59  5.963 - . V$PAX4_03   0.109  No   
RPRM  BindingSites  M00253  296  303  3.801 + . V$CAP_01    0.235  Yes  M00313
RPRM  BindingSites  M00313  301  308  5.697 + . V$GEN_INI2  0.0    Yes  M00253,M00315
...
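The two "Evidence format" options can be pictured with a small Python sketch that assembles one output line from the standard GFF fields and a list of evidence values. The function name is hypothetical, and the key names written to the attributes field are an assumption of this sketch; the exact text that MotifLab emits may differ.

```python
def format_line(gff_fields, evidence, as_columns=True):
    """Build one output line.

    gff_fields: the leading standard GFF fields, as strings.
    evidence:   list of (key, value) pairs, already converted to strings.
    """
    if as_columns:
        # One extra TAB-separated column per evidence value.
        return "\t".join(list(gff_fields) + [value for _, value in evidence])
    # All evidence in a single attributes column, as key=value pairs.
    attributes = ";".join(f"{key}={value}" for key, value in evidence)
    return "\t".join(list(gff_fields) + [attributes])


fields = ["RPRM", "BindingSites", "M00253", "296", "303", "3.801", "+", "."]
evidence = [("motif", "V$CAP_01"), ("Conservation", "0.235"),
            ("Repeats", "Yes"), ("TFBS", "M00313")]
print(format_line(fields, evidence, as_columns=True))
print(format_line(fields, evidence, as_columns=False))
```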

Arguments

Position
Specifies whether the coordinate positions [start-end] for a region should be given relative to the start of the chromosome ("Genomic"), relative to the upstream start of the sequence ("Relative") or relative to the transcription start site associated with the sequence ("TSS-Relative").
Relative-offset
If the "Position" setting is set to "Relative", this offset-value specifies what position the first base in the sequence should start at (common choices are 0 or 1). If the "Position" setting is "TSS-Relative", a value of 0 here will place the TSS at position +0 whereas any other value will place the TSS at position +1 (and the coordinate-system will then skip 0 and go directly from -1 to +1).
Orientation
Orientation of relative coordinates (only applicable if the "Position" setting is "Relative").
Sort1, Sort2, Sort3
These three parameters control how to sort the regions in the output. The regions will be sorted first by "Sort1", then by "Sort2" (for regions with similar values for the "Sort1" property) and finally by "Sort3". Valid choices for these parameters are "Position", "Type" and "Score", and the three parameters should preferably have different values. The default choice is to first sort by "Position", then "Type" and finally "Score".
Include header
If selected, a single header line (starting with #) will be output at the beginning of the output-document. The header contains a specification of all the fields included in the output.
Skip standard fields
If selected, the standard GFF fields will not be output, only the evidence fields.
Boolean format
This parameter specifies how boolean values should be formatted in the output: either as "Yes" versus "No" (alternatively "Y" versus "N"), "True" versus "False" (alternatively "T" versus "F") or "1" versus "0".
Evidence format
Specifies how the "evidence" should be output for each region. Options are to output each evidence value in a column of its own or to output all evidences in a single column in key=value pairs (separated by semicolons).
Evidence
The "evidence" parameter should be a comma-separated list of key=value pairs specifying additional information that should be output for each region. See above for a complete description of recognized evidence codes.
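The effect of the "Relative-offset" setting on the "Relative" and "TSS-Relative" coordinate systems can be sketched in Python as below. The function names are hypothetical, and the sketch assumes a sequence in the direct orientation, so that upstream positions have smaller genomic coordinates.

```python
def relative_position(genomic_pos, sequence_start, relative_offset=1):
    """Position relative to the start of the sequence.

    The first base of the sequence gets the value relative_offset (commonly 0 or 1).
    """
    return genomic_pos - sequence_start + relative_offset


def tss_relative_position(genomic_pos, tss, relative_offset=0):
    """Position relative to the transcription start site.

    With relative_offset 0 the TSS is at position 0. With any other value the
    TSS is at +1 and position 0 is skipped (..., -2, -1, +1, +2, ...).
    """
    delta = genomic_pos - tss
    if relative_offset == 0:
        return delta
    return delta + 1 if delta >= 0 else delta


print(relative_position(1050, 1000, relative_offset=1))
print([tss_relative_position(p, 1200, relative_offset=1) for p in range(1198, 1203)])
```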

See Also: output, GFF, Region Dataset