DataType

Region Dataset

Region Datasets (also called region tracks) contain sets of regions, which are discrete segments of the sequence with associated properties. Such regions could represent, for example, genes, exons, coding regions, DNase hypersensitive sites, ChIP-seq peak regions, CpG islands, repeat regions, SNPs and transcription factor binding sites. Each region has a location within its parent sequence defined by a start and end position, and by extension also a length (which technically could be 0 but not negative) and a genomic location (if the genomic location of the parent sequence is known). Other standard properties of regions include a type, a numeric score value and a strand orientation (which can be "direct", "reverse" or "undetermined" and is relative to the genome, not the parent sequence). Additional user-defined properties can be specified for regions as well, for example the start and end coordinates of CDS subregions of genes, or a "sequence" property for TFBS regions denoting the actual binding sequence at the particular site. These user-defined properties can have boolean, numeric or textual values.
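
The properties listed above can be pictured as a small record type. The following Python sketch is only an illustration and is not MotifLab's own implementation: the class name, the field names, the inclusive end coordinate and the 0-based offset used for the genomic location are all assumptions made for the example, and the parent sequence is assumed to lie on the direct strand.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union

# User-defined properties may hold boolean, numeric or textual values
PropertyValue = Union[bool, int, float, str]


@dataclass
class Region:
    """Illustrative stand-in for a region (hypothetical, not MotifLab's class)."""
    start: int                      # position relative to the parent sequence
    end: int                        # ASSUMPTION: inclusive end coordinate
    type: str = ""
    score: float = 0.0
    strand: str = "undetermined"    # "direct", "reverse" or "undetermined"
    properties: Dict[str, PropertyValue] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.strand not in ("direct", "reverse", "undetermined"):
            raise ValueError("unknown strand orientation: " + self.strand)
        if self.length < 0:
            raise ValueError("region length may be 0 but not negative")

    @property
    def length(self) -> int:
        # With an inclusive end, a zero-length region has end == start - 1
        return self.end - self.start + 1

    def genomic_location(self, sequence_genomic_start: int) -> Tuple[int, int]:
        # ASSUMPTION: start/end are 0-based offsets into a direct-strand sequence
        return (sequence_genomic_start + self.start,
                sequence_genomic_start + self.end)


site = Region(start=10, end=15, type="TFBS", score=0.9, strand="reverse",
              properties={"sequence": "CACGTG"})
print(site.length, site.genomic_location(1000))
```

The genomic location is derived rather than stored, which mirrors the statement that it is only available when the genomic location of the parent sequence is known.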

Regions in the same track may overlap with each other, and regions are also allowed to extend beyond the boundaries of their parent sequence (and could in theory also be located fully outside the sequence). The consequences of regions extending outside of a sequence may differ depending on the particular operation or analysis applied to region tracks.
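
As a rough illustration of what overlapping and out-of-bounds regions mean in practice, the Python sketch below tests two regions for overlap and clips a region to its parent sequence. Both helper functions are hypothetical, regions are represented as plain (start, end) tuples with an inclusive end, and clipping is just one possible policy; individual MotifLab operations may treat out-of-bounds regions differently.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # (start, end), ASSUMPTION: end is inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two regions share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def clip_to_sequence(region: Interval, sequence_length: int) -> Optional[Interval]:
    """Return the part of the region inside positions 0..sequence_length-1,
    or None if the region lies fully outside the sequence."""
    start = max(region[0], 0)
    end = min(region[1], sequence_length - 1)
    return (start, end) if start <= end else None


print(overlaps((10, 20), (20, 30)))      # regions sharing position 20
print(clip_to_sequence((-5, 10), 100))   # region starting before the sequence
print(clip_to_sequence((150, 160), 100)) # region fully outside the sequence
```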

Motif track
A motif track is a special kind of region dataset where the type properties of the regions refer to known motifs. Some operations, like motifDiscovery and motifScanning, will always return motif tracks, and the motif track status is normally preserved when tracks are manipulated with other operations as well. When region datasets are imported from files or preconfigured tracks, the regions are checked to see if they could correspond to motif sites by comparing the regions' names and lengths to currently defined motifs. If enough regions match known motifs, the dataset is automatically converted to a motif track. Region datasets can also be converted to motif tracks manually by right-clicking on a region dataset in the Features Panel and selecting "Convert to Motif Track" from the context menu, or with the following display setting command: $motifTrack(<trackname>)=true.
Motif tracks are listed with names in boldface in the Features Panel in MotifLab's graphical user interface.
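
The automatic detection described above can be pictured roughly as follows. This Python sketch is a guess at the general idea, not MotifLab's actual check: the function name is hypothetical, and the min_fraction cutoff is a made-up parameter, since the text does not state how many matching regions count as "enough".

```python
from typing import Dict, List, Tuple


def looks_like_motif_track(regions: List[Tuple[str, int]],
                           motif_lengths: Dict[str, int],
                           min_fraction: float = 0.5) -> bool:
    """regions: (name, length) pairs; motif_lengths: known motif name -> length.
    HYPOTHETICAL rule: the track qualifies when at least min_fraction of its
    regions have a name and length matching a known motif."""
    if not regions:
        return False
    matches = sum(1 for name, length in regions
                  if motif_lengths.get(name) == length)
    return matches / len(regions) >= min_fraction


# Placeholder motif names and lengths, invented for the example
known_motifs = {"M1": 6, "M2": 8}
imported = [("M1", 6), ("M2", 8), ("repeat", 300)]
print(looks_like_motif_track(imported, known_motifs))
```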

Module track
A module track is a special kind of region dataset where the type properties of the regions refer to known modules. Some operations, like moduleDiscovery and moduleScanning, will always return module tracks, and the module track status is normally preserved when tracks are manipulated with other operations as well. When region datasets are imported from files or preconfigured tracks, the regions are checked to see if they could correspond to module sites by comparing them to currently defined modules. If enough regions match known modules, the dataset is automatically converted to a module track. Region datasets can also be converted to module tracks manually by right-clicking on a region dataset in the Features Panel and selecting "Convert to Module Track" from the context menu, or with the following display setting command: $moduleTrack(<trackname>)=true.
Module tracks are listed with names in bold italics in the Features Panel in MotifLab's graphical user interface.

Nested track
A nested track is a special kind of region dataset where the regions may contain nested child regions. For example, in a gene annotation track the top-level gene regions could contain nested regions corresponding to exons within each gene. The module track type described above is actually a kind of nested track where the nested regions correspond to individual motif sites within the module. The extract operation can be used to create new (un-nested) tracks based on only the top-level regions or the child regions of a nested track.
Nested tracks are listed with names in italics in the Features Panel in MotifLab's graphical user interface.
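
The relationship between top-level regions and their children can be sketched with a small tree structure. The Python below is illustrative only: the class and function names are hypothetical, and child coordinates are assumed to be relative to the sequence rather than to the parent region. It mimics the effect of extracting either the top-level regions or the child regions into a new un-nested track.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NestedRegion:
    """Hypothetical region that may carry nested child regions."""
    start: int
    end: int
    type: str
    children: List["NestedRegion"] = field(default_factory=list)


def top_level_regions(track: List[NestedRegion]) -> List[NestedRegion]:
    """New un-nested track holding copies of the parents without children."""
    return [NestedRegion(r.start, r.end, r.type) for r in track]


def child_regions(track: List[NestedRegion]) -> List[NestedRegion]:
    """New un-nested track holding only the nested child regions."""
    return [NestedRegion(c.start, c.end, c.type)
            for r in track for c in r.children]


# A module region containing two motif sites (names invented for the example)
module = NestedRegion(100, 140, "ModuleA", children=[
    NestedRegion(100, 107, "MotifX"),
    NestedRegion(130, 140, "MotifY"),
])
print([r.type for r in top_level_regions([module])])
print([r.type for r in child_regions([module])])
```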

Creating Region Datasets

Region annotation tracks (based on e.g. data from UCSC Genome Browser or other databases) can be imported from preconfigured tracks or loaded from files in various formats. Operations that search for particular patterns within DNA sequences (including motifDiscovery, motifScanning, moduleDiscovery, moduleScanning and search) will usually return the resulting matches as a region track, and regions can also be derived from numeric tracks with the convert operation. The extract operation can extract child regions from a nested track and also extract the start, end and center positions of regions.

# Import the preconfigured "RepeatMasker" annotation track for the current sequences
Repeats = new Region Dataset(DataTrack:RepeatMasker)

# Import a region track from file in BED format.
Genes = new Region Dataset(File:"C:\RefSeqGenes.bed", Format=BED)

# Create a new 'empty' track with no regions
Empty = new Region Dataset

# Create a new region track based on all the regions from three other tracks
AllRegions = combine_regions track1,track2,track3

# The search operation returns a new region dataset with regions matching the search pattern
Matches = search DNA for "CAssTG" on both strands

# The motifDiscovery operation will return both a Region Dataset (motif track)
# with the discovered binding sites and a collection with the newly discovered motifs
[TFBS,Motifs] = motifDiscovery in DNA with MEME { ... }

# Create a new region track with regions based on consecutive segments in the sequence
# with values above 0.8 in the Conservation track
ConservedRegions = convert Conservation to region where Conservation > 0.8

# Extract individual TFBS "child regions" from a module track
BindingSites = extract "TFBS" from ModuleTrack as Region Dataset

# Create a new track with 1bp long regions corresponding to gene transcription start sites
# by extracting the first position from each gene region (relative to its own orientation)
TSS = extract "regionStart" from EnsemblGenes as Region Dataset
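
To make the convert example above more concrete, the following Python sketch shows one way that consecutive positions with values above a threshold could be grouped into regions. It is a simplified illustration and not MotifLab's implementation: the function name is hypothetical, the numeric track is modelled as a plain list of per-position values, and region ends are assumed to be inclusive.

```python
from typing import List, Tuple


def regions_above(values: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for each maximal run of consecutive
    positions whose value is strictly greater than the threshold."""
    regions = []
    start = None
    for pos, value in enumerate(values):
        if value > threshold:
            if start is None:
                start = pos              # a new run begins here
        elif start is not None:
            regions.append((start, pos - 1))  # the run ended at the previous position
            start = None
    if start is not None:                # a run that reaches the end of the sequence
        regions.append((start, len(values) - 1))
    return regions


conservation = [0.1, 0.9, 0.95, 0.5, 0.85, 0.9, 0.9]
print(regions_above(conservation, 0.8))
```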

Modifying Region Datasets

Operations targeting region tracks will either modify the properties of existing regions, remove regions from the track (filter and prune) or merge regions together. The start and end positions of regions cannot normally be manipulated directly (with e.g. set or arithmetic operations), but some operations like extend can change the size of regions and thereby also alter their location.

Most numerical operations that can be used to modify numeric tracks, numeric maps and numeric variables can also be applied to numeric properties of regions. Text properties can be altered with the set and replace operations. If the arithmetic operations (increase, decrease, multiply and divide) are applied to text properties of regions, the properties are treated as comma-separated lists of values and the operations behave like mathematical set operations: increase and multiply perform set addition (union), whereas decrease and divide perform set subtraction. If arithmetic operations are applied to boolean region properties, they behave like the following boolean operators: increase = OR, multiply = AND, decrease = NOR, divide = NAND.
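
The behaviour of arithmetic operations on text and boolean properties can be illustrated with a short Python sketch. The helper names are hypothetical, and details such as whitespace trimming and preserving the original order of values are assumptions; only the mapping of operations to union, subtraction and the four boolean operators comes from the description above.

```python
from typing import Callable, Dict, List


def as_items(text: str) -> List[str]:
    """Split a comma-separated property value into its individual entries."""
    return [item.strip() for item in text.split(",") if item.strip()]


def text_union(current: str, operand: str) -> str:
    """increase / multiply on a text property: set addition (union)."""
    items: List[str] = []
    for item in as_items(current) + as_items(operand):
        if item not in items:            # keep each value only once
            items.append(item)
    return ",".join(items)


def text_subtract(current: str, operand: str) -> str:
    """decrease / divide on a text property: set subtraction."""
    removed = set(as_items(operand))
    return ",".join(item for item in as_items(current) if item not in removed)


# Arithmetic operations applied to boolean properties
BOOLEAN_OPS: Dict[str, Callable[[bool, bool], bool]] = {
    "increase": lambda a, b: a or b,          # OR
    "multiply": lambda a, b: a and b,         # AND
    "decrease": lambda a, b: not (a or b),    # NOR
    "divide":   lambda a, b: not (a and b),   # NAND
}

print(text_union("a,b", "b,c"))
print(text_subtract("a,b,c", "b"))
print(BOOLEAN_OPS["decrease"](False, False))
```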

There are currently no operations that can add new regions to an existing region track, but the GUI's draw tool allows users to draw new regions directly into the visualized track, to delete existing regions and to modify a region's properties in a popup dialog.

# Remove all predicted TFBS regions that are within gene regions
filter TFBS where region inside EnsemblGenes

# Remove overlapping TFBS regions representing the same binding motif (as defined in the partition)
# and keep only the top scoring region from each cluster
prune TFBS remove "alternatives" from MotifPartition1 keep "top scoring"

# Reduce the score of TFBS regions by half if they overlap with repeat regions
divide TFBS by 2 where region overlaps RepeatMasker

# Set the "conservation" property of every TFBS region to the average value from the Conservation track within each site
set TFBS[conservation] to average Conservation

# Increase the numeric region property "count" by a value defined in the variable for all regions
increase TFBS[count] by NumericVariable1

# This command goes through every RepeatMasker region and looks up its type property in the NameMap map
# Then it replaces the type of the region with the corresponding value from the map
replace NameMap in RepeatMasker property "type"

# Increase the size of all DNaseHS regions by 20 bp in both directions
extend DNaseHS by 20

# Extend all promoter regions in the upstream direction until they hit the closest gene
extend Promoter upstream until inside EnsemblGenes

# Merge overlapping ChIPseq regions of the same type into single regions
merge similar ChIPseq

# Merge all DNaseHS regions located closer than 10 bp apart from each other
# (Replace the original regions with a new region beginning at the start of the first region
# and ending at the end of the last region)
merge DNaseHS closer than 10
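
The effect of the last extend and merge examples can be sketched in Python as follows. This is an illustration under stated assumptions rather than MotifLab's algorithm: the function names are hypothetical, regions are plain (start, end) tuples with an inclusive end, and the distance between two regions is taken to be the number of bases strictly between them.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), ASSUMPTION: end is inclusive


def extend(regions: List[Interval], amount: int) -> List[Interval]:
    """Grow every region by `amount` positions in both directions."""
    return [(start - amount, end + amount) for start, end in regions]


def merge_closer_than(regions: List[Interval], max_gap: int) -> List[Interval]:
    """Merge neighbouring regions separated by fewer than max_gap bases.
    Each merged region runs from the start of the first region to the
    end of the last region in its group."""
    merged: List[Interval] = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] - 1 < max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


print(extend([(100, 110)], 20))
print(merge_closer_than([(0, 10), (15, 20), (40, 50)], 10))
```

Because overlapping regions have a negative gap under this definition, they are merged as well.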

Using Region Datasets

The primary purpose of MotifLab is to predict transcription factor binding sites and cis-regulatory modules within DNA sequences, and region datasets are used to represent such sites. Besides being descriptive, region tracks can be used in conditions to limit operations to certain portions of the sequence. Several analyses can be applied to region datasets: to examine the coverage of the regions in a single dataset, to compare the overlap between two datasets, or to count the number of occurrences of each type of region in a dataset and compare this to another frequency distribution.

# Search for potential transcription factor binding sites in the DNA sequence
# and output the predicted sites in BED format
TFBS = motifScanning in DNA with MATCH { ... }
output TFBS in BED format

# Use the RepeatMasker dataset in a condition to mask only
# segments of the DNA sequence that fall within repeat regions
mask DNA with "N" where inside RepeatMasker

# Count the number of TFBS regions for each motif type and compare these counts to a background
# frequency distribution to determine which motifs are overrepresented in this dataset
Analysis1 = analyze count motif occurrences {Motif track=TFBS, Motifs=JASPAR,
                                             Background frequencies=ExpectedFreq,
                                             Significance threshold=0.05,
                                             Bonferroni correction="All motifs"}

# Count the number of TFBS regions for each motif type within two sequence subsets
# representing respectively upregulated and downregulated genes.
# Compare these counts between the two sets and use a binomial test to determine
# which motifs are over- or underrepresented in one of the sets compared to the other
Analysis2 = analyze compare motif occurrences {Motif track=TFBS, Motifs=JASPAR,
                                               Target set=UpregulatedGenes,
                                               Control set=DownregulatedGenes,
                                               Statistical test="Binomial",
                                               Significance threshold=0.05,
                                               Bonferroni correction="All motifs"}
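
The statistical idea behind the occurrence analyses can be illustrated with a generic Python sketch of an upper-tail binomial test followed by a Bonferroni correction. This is a textbook formulation and not necessarily what MotifLab computes: the function names are hypothetical, and how MotifLab defines the number of trials and the expected frequency for each motif is not stated in this section, so the inputs below are invented for the example.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the probability of seeing at least
    k occurrences in n trials when each trial succeeds with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def bonferroni(p_value: float, number_of_tests: int) -> float:
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * number_of_tests)


# Invented example: a motif observed 9 times in 20 trials with an expected
# frequency of 0.2, tested alongside 49 other motifs (50 tests in total)
raw = binomial_upper_tail(9, 20, 0.2)
corrected = bonferroni(raw, 50)
print(raw, corrected, corrected < 0.05)
```

Multiplying by the number of tests corresponds to correcting for all motifs at once, which is the conservative choice when many motifs are tested in the same analysis.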