Region Datasets (also called region tracks) contain sets of regions, which are discrete segments of the sequence with associated properties.
Such regions can represent, for example, genes, exons, coding regions, DNase hypersensitive sites, ChIP-seq peak regions, CpG islands, repeat regions, SNPs or transcription factor binding sites.
Each region has a location within its parent sequence defined by a start and end position, and by extension also a length (which technically could be 0 but not negative) and a genomic location (if the genomic location of the parent sequence is known).
Other standard properties of regions include a type, a numeric score value and a strand orientation (which can be "direct", "reverse" or "undetermined", and is relative to the genome, not the parent sequence).
Additional user-defined properties can be specified for regions as well, for example the start and end coordinates of CDS subregions of genes, or a "sequence" property for TFBS regions denoting the actual binding sequence at the particular site.
These user-defined properties can have boolean, numeric or textual values.
Regions in the same track may overlap with each other, and regions are also allowed to extend beyond the boundaries of their parent sequence (and could in theory also be located fully outside the sequence).
The consequences of regions extending outside of a sequence may differ depending on the particular operation or analysis applied to region tracks.
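To make the data model above concrete, the following Python sketch (not MotifLab code) models a single region. The class name `Region`, its method names, and the use of inclusive coordinates relative to the parent sequence are assumptions made for illustration; the text above does not specify MotifLab's internal representation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple, Union

PropertyValue = Union[bool, float, str]  # user-defined properties: boolean, numeric or textual


@dataclass
class Region:
    # Assumption: positions are relative to the parent sequence and inclusive.
    start: int
    end: int
    type: str = "unknown"
    score: float = 0.0
    strand: str = "undetermined"  # "direct", "reverse" or "undetermined"
    properties: Dict[str, PropertyValue] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.strand not in ("direct", "reverse", "undetermined"):
            raise ValueError(f"unknown strand orientation: {self.strand}")
        if self.length < 0:
            raise ValueError("region length may be 0 but not negative")

    @property
    def length(self) -> int:
        # With inclusive coordinates, end == start - 1 gives a length of 0.
        return self.end - self.start + 1

    def genomic_location(self, sequence_genomic_start: Optional[int]) -> Optional[Tuple[int, int]]:
        # Only defined when the genomic location of the parent sequence is known.
        # Simplification: the parent sequence is assumed to lie on the direct strand.
        if sequence_genomic_start is None:
            return None
        return (sequence_genomic_start + self.start, sequence_genomic_start + self.end)

    def extends_outside(self, sequence_length: int) -> bool:
        # True if any part of the region lies beyond the sequence boundaries.
        return self.start < 0 or self.end >= sequence_length

    def fully_outside(self, sequence_length: int) -> bool:
        # True if no part of the region lies within the sequence.
        return self.end < 0 or self.start >= sequence_length
```

For example, `Region(start=-5, end=4)` overlaps the start of a 100 bp sequence: it extends outside the sequence without being fully outside it.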
Motif track
A motif track is a special kind of region dataset where the type properties of the regions refer to known motifs.
Some operations, like motifDiscovery and motifScanning, will always return motif tracks, and the motif track status is normally preserved when tracks are manipulated with other operations as well.
When region datasets are imported from files or preconfigured tracks, the regions are checked to see if they could potentially correspond to motif sites by comparing the regions' names and lengths to currently defined motifs.
If enough regions match with known motifs, the dataset will automatically be converted to a motif track.
Region datasets can also be converted to motif tracks manually by right-clicking on a region dataset in the Features Panel and selecting "Convert to Motif Track" from the context menu, or with the following display setting command:
$motifTrack(<trackname>)=true
Motif tracks are listed with names in boldface in the Features Panel in MotifLab's graphical user interface.
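The automatic detection described above can be pictured with the following Python sketch (not MotifLab code). The function name and the `min_fraction` threshold are hypothetical: the text only says that "enough" regions must match, without giving the actual criterion.

```python
from typing import Dict, Iterable, Tuple


def looks_like_motif_track(regions: Iterable[Tuple[str, int]],
                           motif_lengths: Dict[str, int],
                           min_fraction: float = 0.5) -> bool:
    """Guess whether a region dataset could be a motif track.

    regions       -- (name, length) pairs for the imported regions
    motif_lengths -- length of each currently defined motif, keyed by name
    min_fraction  -- hypothetical cutoff for "enough" matching regions
    """
    regions = list(regions)
    if not regions:
        return False
    # A region matches if its name is a known motif with the same length.
    matches = sum(1 for name, length in regions
                  if motif_lengths.get(name) == length)
    return matches / len(regions) >= min_fraction
```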
Module track
A module track is a special kind of region dataset where the type properties of the regions refer to known modules.
Some operations, like moduleDiscovery and moduleScanning, will always return module tracks, and the module track status is normally preserved when tracks are manipulated with other operations as well.
When region datasets are imported from files or preconfigured tracks, the regions are checked to see if they could potentially correspond to module sites by comparing them to currently defined modules.
If enough regions match with known modules, the dataset will automatically be converted to a module track.
Region datasets can also be converted to module tracks manually by right-clicking on a region dataset in the Features Panel and selecting "Convert to Module Track" from the context menu, or with the following display setting command:
$moduleTrack(<trackname>)=true
Module tracks are listed with names in bold italics in the Features Panel in MotifLab's graphical user interface.
Nested track
A nested track is a special kind of region dataset where the regions may contain nested child regions.
For example, in a gene annotation track the top-level gene regions could contain nested regions corresponding to exons within each gene.
The module track type described above is actually a kind of nested track where the nested regions correspond to individual motif sites within the module.
The extract operation can be used to create new (un-nested) tracks based on only the top-level regions or the child regions of a nested track.
Nested tracks are listed with names in italics in the Features Panel in MotifLab's graphical user interface.
Region annotation tracks (based on e.g. data from UCSC Genome Browser or other databases) can be imported from preconfigured tracks or loaded from files in various formats.
Operations that search for particular patterns within DNA sequences (including motifDiscovery, motifScanning, moduleDiscovery, moduleScanning and search) will usually return the resulting matches as a region track, and regions can also be derived from numeric tracks with the convert operation.
The extract operation can extract child regions from a nested track and also extract the start, end and center positions of regions.
# Import the preconfigured "RepeatMasker" annotation track for the current sequences
Repeats = new Region Dataset(DataTrack:RepeatMasker)
# Import a region track from file in BED format.
Genes = new Region Dataset(File:"C:\RefSeqGenes.bed", Format=BED)
# Create a new 'empty' track with no regions
Empty = new Region Dataset
# Create a new region track based on all the regions from three other tracks
AllRegions = combine_regions track1,track2,track3
# The search operation returns a new region dataset with regions matching the search pattern
Matches = search DNA for "CAssTG" on both strands
# The motifDiscovery operation will return both a Region Dataset (motif track)
# with the discovered binding sites and a collection with the newly discovered motifs
[TFBS,Motifs] = motifDiscovery in DNA with MEME { ... }
# Create a new region track with regions based on consecutive segments in the sequence
# with values above 0.8 in the Conservation track
ConservedRegions = convert Conservation to region where Conservation > 0.8
# Extract individual TFBS "child regions" from a module track
BindingSites = extract "TFBS" from ModuleTrack as Region Dataset
# Create a new track with 1bp long regions corresponding to gene transcription start sites
# by extracting the first position from each gene region (relative to its own orientation)
TSS = extract "regionStart" from EnsemblGenes as Region Dataset
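The convert example above turns consecutive positions with values above a threshold into regions. The following Python sketch (not MotifLab code) illustrates that idea on a plain list of values; the function name and the inclusive, 0-based coordinates are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple


def regions_above(values: Sequence[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for each run of consecutive values > threshold."""
    regions = []
    run_start = None  # start of the run currently being scanned, if any
    for pos, value in enumerate(values):
        if value > threshold:
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            # The run ended at the previous position.
            regions.append((run_start, pos - 1))
            run_start = None
    if run_start is not None:
        # The last run extends to the end of the track.
        regions.append((run_start, len(values) - 1))
    return regions
```

Because the comparison is strict, a position with a value of exactly 0.8 would not be included in a region when the threshold is 0.8, mirroring the `Conservation > 0.8` condition.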
Operations targeting region tracks will either modify the properties of existing regions, remove regions from the track (filter and prune) or merge regions together.
The start and end positions of regions cannot normally be manipulated directly (with e.g. set or arithmetic operations), but some operations like extend can change the size of regions and thereby also alter their location.
Most numerical operations that can be used to modify numeric tracks, numeric maps and numeric variables can also be applied to modify numeric properties of regions.
Text properties can be altered with the set and replace operations.
If the arithmetic operations (increase, decrease, multiply and divide) are applied to text properties of regions, they will function like set operations, treating the properties as (comma-separated) lists of values.
The increase and multiply operations will then function like set addition (union), whereas the decrease and divide operations will function like set subtraction.
However, if arithmetic operations are applied to boolean region properties, they function like the following boolean operators: increase = OR, multiply = AND, decrease = NOR, divide = NAND.
There are currently no operations that can add new regions to an existing region track, but the GUI's draw tool allows users to draw new regions directly into the visualized track, to delete existing regions and to modify a region's properties in a popup dialog.
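The behavior of the arithmetic operations on text and boolean properties described above can be summarized in a small Python sketch (not MotifLab code). The function names are hypothetical, and details such as whitespace trimming and the ordering of values in the result are assumptions.

```python
def apply_to_text(op: str, current: str, operand: str) -> str:
    """Treat comma-separated text properties as sets of values."""
    items = [v.strip() for v in current.split(",") if v.strip()]
    operand_items = [v.strip() for v in operand.split(",") if v.strip()]
    if op in ("increase", "multiply"):
        # Set addition (union): append values that are not already present.
        result = list(items)
        for value in operand_items:
            if value not in result:
                result.append(value)
    elif op in ("decrease", "divide"):
        # Set subtraction: drop values that occur in the operand.
        result = [v for v in items if v not in operand_items]
    else:
        raise ValueError(f"unsupported operation: {op}")
    return ",".join(result)


def apply_to_boolean(op: str, current: bool, operand: bool) -> bool:
    """Map arithmetic operations to the boolean operators listed in the text."""
    if op == "increase":
        return current or operand          # OR
    if op == "multiply":
        return current and operand         # AND
    if op == "decrease":
        return not (current or operand)    # NOR
    if op == "divide":
        return not (current and operand)   # NAND
    raise ValueError(f"unsupported operation: {op}")
```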
# Remove all predicted TFBS regions that are within gene regions
filter TFBS where region inside EnsemblGenes
# Remove overlapping TFBS regions representing the same binding motif (as defined in the partition)
# and keep only the top scoring region from each cluster
prune TFBS remove "alternatives" from MotifPartition1 keep "top scoring"
# Reduce the score of TFBS regions by half if they overlap with repeat regions
divide TFBS by 2 where region overlaps RepeatMasker
# Set the "conservation" property of every TFBS region to the average value from the Conservation track within each site
set TFBS[conservation] to average Conservation
# Increase the numeric region property "count" by a value defined in the variable for all regions
increase TFBS[count] by NumericVariable1
# This command goes through every RepeatMasker region and looks up its type property in the NameMap map
# Then it replaces the type of the region with the corresponding value from the map
replace NameMap in RepeatMasker property "type"
# Increase the size of all DNaseHS regions by 20 bp in both directions
extend DNaseHS by 20
# Extend all promoter regions in the upstream direction until they hit the closest gene
extend Promoter upstream until inside EnsemblGenes
# Merge overlapping ChIPseq regions of the same type into single regions
merge similar ChIPseq
# Merge all DNaseHS regions located closer than 10 bp apart from each other
# (Replace the original regions with a new region beginning at the start of the first region
# and ending at the end of the last region)
merge DNaseHS closer than 10
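The last merge example above replaces nearby regions with one region spanning from the start of the first to the end of the last. A Python sketch (not MotifLab code) of that idea follows. It assumes inclusive coordinates and measures distance as the number of bases between two regions; how MotifLab itself measures the distance is not stated here.

```python
from typing import Iterable, List, Tuple


def merge_closer_than(regions: Iterable[Tuple[int, int]],
                      max_distance: int) -> List[Tuple[int, int]]:
    """Merge (start, end) regions whose gap is smaller than max_distance."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(regions):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end - 1  # bases between the two regions (negative if overlapping)
            if gap < max_distance:
                # New region begins at the first start and ends at the furthest end.
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged
```

With `max_distance=10`, the regions `(0, 9)` and `(15, 20)` are 5 bases apart and would be merged into `(0, 20)`, while a region at `(40, 50)` would be left unchanged.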
The primary purpose of MotifLab is to predict transcription factor binding sites and cis-regulatory modules within DNA sequences, and region datasets are used to represent such sites.
Apart from being descriptive and informative, region tracks can also be used in conditions to limit operations to certain portions of the sequence.
Several different analyses can be applied to region datasets to examine the coverage of the regions in a single dataset, to compare the overlap between two datasets, or to count the number of occurrences of each type of region in a dataset and compare this to another frequency distribution.
# Search for potential transcription factor binding sites in the DNA sequence
# and output the predicted sites in BED format
TFBS = motifScanning in DNA with MATCH { ... }
output TFBS in BED format
# Use the RepeatMasker dataset in a condition to mask only
# segments of the DNA sequence that fall within repeat regions
mask DNA with "N" where inside RepeatMasker
# Count the number of TFBS regions for each motif type and compare these counts to a background
# frequency distribution to determine which motifs are overrepresented in this dataset
Analysis1 = analyze count motif occurrences {Motif track=TFBS, Motifs=JASPAR,
Background frequencies=ExpectedFreq,
Significance threshold=0.05,
Bonferroni correction="All motifs"}
# Count the number of TFBS regions for each motif type within two sequence subsets
# representing respectively upregulated and downregulated genes.
# Compare these counts between the two sets and use a binomial test to determine
# which motifs are over- or underrepresented in one of the sets compared to the other
Analysis2 = analyze compare motif occurrences {Motif track=TFBS, Motifs=JASPAR,
Target set=UpregulatedGenes,
Control set=DownregulatedGenes,
Statistical test="Binomial",
Significance threshold=0.05,
Bonferroni correction="All motifs"}
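As a rough illustration of what comparing motif counts to a background frequency distribution can involve, the following Python sketch (not MotifLab code) counts regions per type and applies a one-sided binomial test with Bonferroni correction over all motifs. The choice of test for this comparison, the interpretation of the background frequencies as the expected fraction of sites per motif, and all function names are assumptions; MotifLab's analyses may compute their statistics differently.

```python
from collections import Counter
from math import comb
from typing import Dict, Iterable


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Probability of observing k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def overrepresented(region_types: Iterable[str],
                    expected_freq: Dict[str, float],
                    alpha: float = 0.05) -> Dict[str, float]:
    """Return p-values of motifs that occur more often than expected."""
    counts = Counter(region_types)
    total = sum(counts.values())
    # Bonferroni correction over all motifs in the background distribution.
    corrected_alpha = alpha / len(expected_freq)
    significant = {}
    for motif, freq in expected_freq.items():
        p_value = binomial_upper_tail(counts[motif], total, freq)
        if p_value <= corrected_alpha:
            significant[motif] = p_value
    return significant
```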