Numeric Datasets (also called numeric tracks) represent information with one numeric value for each position within a sequence segment.
The information stored in numeric datasets could be, for instance, per-base phylogenetic conservation levels, physical or statistical characteristics of the DNA sequence or double helix (e.g. helix twist and roll, or local GC content), the distance from each sequence position to some target feature, per-base quality scores (for sequence reads), ChIP-seq tag counts per position, or position-specific priors used to guide motif discovery, to list but a few examples.
Numeric annotation tracks (based on e.g. data from the UCSC Genome Browser or other databases) can be imported from preconfigured tracks or loaded from files in various formats.
Numeric tracks can also be derived from information in other types of tracks. For example, Priors Generators can be trained with machine learning methods to predict the location of certain features based on combined information from several different tracks. The output from a Priors Generator is a numeric track where the value at each position reflects a prior probability (or likelihood) that the position overlaps with the target feature (for example a TF binding site).
# Import the "PhastCons100way" annotation track for the current sequences
Conservation = new Numeric Dataset(DataTrack:PhastCons100way)
# Import a conservation track from file in WIG format.
Conservation = new Numeric Dataset(File:"C:\phastcons.wig", Format=WIG)
# Create a new 'empty' numeric track where each position has a value of zero
Empty = new Numeric Dataset
# Create a new numeric track where each position is assigned the initial value 42
Answer = new Numeric Dataset(42)
# Create a new numeric track where the value at each position is the average of the values
# from three other tracks
AverageValueTrack = combine_numeric track1,track2,track3 using average
# Convert the existing region track "CpG_islands" into a numeric track such that all positions
# within the original regions are assigned the value 100 and all other positions are assigned a value of 0
convert CpG_islands to numeric with value = 100
# Create a new track by counting the number of TFBS regions that overlap with a 5bp window
# centered around every position in the track
CountTrack = count number of regions in TFBS overlapping window of size 5 with anchor at center
# Create a new track where the value in each position is the distance (in bp)
# to the closest annotated EnsemblGenes region
DistanceToClosestGene = distance from EnsemblGenes
# Create a new track based on a measure of predicted 'propeller twist' along the DNA helix
twist = physical property "propeller twist" derived from DNA using window of size 10 with anchor at center
# Use the TFBSoracle priors generator to derive a new positional priors track based on
# an (implicit) set of feature tracks known to the priors generator object
TFBS_prior = predict with TFBSoracle
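To illustrate what the region-based derivations above compute, here is a minimal Python sketch. It is not MotifLab code and does not reflect MotifLab's internal implementation. The function names are hypothetical, regions are assumed to be 0-based inclusive `(start, end)` pairs, and the treatment of positions inside a region (distance 0) is an assumption.

```python
def regions_to_numeric(length, regions, value=100.0):
    """Assign `value` to every position covered by a region and 0 elsewhere."""
    track = [0.0] * length
    for start, end in regions:
        # Clip each region to the bounds of the sequence segment.
        for pos in range(max(start, 0), min(end, length - 1) + 1):
            track[pos] = value
    return track


def count_overlapping(length, regions, window=5):
    """Count the regions overlapping a window centered on each position.

    Assumes an odd window size so that the anchor sits exactly at the center.
    """
    half = window // 2
    counts = []
    for pos in range(length):
        lo, hi = pos - half, pos + half
        counts.append(sum(1 for start, end in regions if start <= hi and end >= lo))
    return counts


def distance_to_closest(length, regions):
    """Distance (in bp) from each position to the nearest region.

    Positions inside a region get distance 0 (an assumption of this sketch);
    if there are no regions at all, every position gets None.
    """
    distances = []
    for pos in range(length):
        best = None
        for start, end in regions:
            if start <= pos <= end:
                d = 0
            elif pos < start:
                d = start - pos
            else:
                d = pos - end
            if best is None or d < best:
                best = d
        distances.append(best)
    return distances
```

For a 10 bp segment with regions `[(2, 4), (7, 7)]`, position 5 lies 1 bp from the closest region, and a 5 bp window centered on it overlaps both regions.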
Existing numeric datasets can be modified with arithmetic operations (increase, decrease, multiply and divide) or assigned explicit values with the set operation. They can also be transformed with various mathematical functions (including square root, logarithm and random number), and their values can be normalized to a new range or thresholded to create "binary valued" tracks. All of these operations work on a position-by-position basis. The apply operation, by contrast, transforms tracks with sliding window functions, which allow the new value at each position to be derived from the values of several positions in a neighbourhood around it.
In addition, the GUI's draw tool allows users to manipulate numeric datasets by drawing directly into the visualized track.
# Increase the values in the Conservation track by 2 for every position
increase Conservation by 2
# Increase the values in the Conservation track by the values from another track (position by position)
increase Conservation by DistanceToClosestGeneTrack
# Assign the Conservation track a value of 0 within all repeat regions
# Return the results in a new track
MaskedConservation = set Conservation to 0 where inside RepeatMasker
# Return a new track based on the absolute values of the Conservation track (negative values converted to positive)
Track2 = transform Conservation with absolute
# Rescale Track1 so that the values fall within the new range 10 to 100.
# (i.e. the smallest value in the track will now be 10 and the largest value will now be 100)
normalize Track1 from range [dataset.min,dataset.max] to range [10,100]
# Transform the Conservation track so that all values previously above (or equal to) 0.5 will be set to 1
# and those below will be set to 0
threshold Conservation with cutoff=0.5 set values above cutoff to 1 and values below cutoff to 0
# Smooth the Conservation track by applying a 25bp wide "Bartlett" sliding window.
# This will assign each position a new value based on a weighted average of the values in its vicinity
SmoothConservation = apply Bartlett window of size 25 with anchor at center to Conservation
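The arithmetic behind the normalize, threshold and sliding-window examples above can be sketched in Python as follows. This is an illustration only, not MotifLab's implementation. The function names are hypothetical, the triangular weights are one common Bartlett-style choice that may differ from the weights MotifLab uses, and the handling of sequence edges (averaging over only the in-bounds part of the window) is an assumption.

```python
def normalize(values, new_min, new_max):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # A constant track has no range to stretch; map everything to new_min.
        return [float(new_min)] * len(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]


def threshold(values, cutoff, above=1.0, below=0.0):
    """Set values at or above the cutoff to `above` and all others to `below`."""
    return [above if v >= cutoff else below for v in values]


def triangular_weights(size):
    """Bartlett-style weights that peak at the window center and fall off linearly."""
    center = (size - 1) / 2
    return [1.0 - abs(i - center) / (center + 1) for i in range(size)]


def smooth(values, size):
    """Weighted moving average with the window anchored at its center.

    Assumes an odd window size. Near the sequence ends, only the in-bounds
    part of the window contributes and the weights are renormalized.
    """
    weights = triangular_weights(size)
    half = size // 2
    smoothed = []
    for pos in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        for i, w in enumerate(weights):
            j = pos - half + i
            if 0 <= j < len(values):
                total += w * values[j]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed
```

For example, `normalize([0, 5, 10], 10, 100)` returns `[10.0, 55.0, 100.0]`, and smoothing a constant track leaves it unchanged because the weights are renormalized at every position.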
MotifLab is an expansion of an earlier program called PriorsEditor, whose primary purpose was to create numeric tracks that could be used as position-specific priors to guide the motif discovery process.
Apart from being descriptive and informative, numeric tracks can also be used in conditions to limit operations to certain positions in the sequence or to regions with certain value distributions within their sites.
# Search for motifs and binding sites with MEME using the "Conservation" track as positional priors
[TFBS,MEMEmotifs] = motifDiscovery in DNA with MEME {Positional priors=Conservation, ... }
# Mask positions in the DNA sequence with low conservation
mask DNA with "N" where Conservation < 0.2
# Remove predicted TFBS regions with low conservation within the site
filter TFBS_predicted where region's average Conservation < 0.2
# Use the statistic operation to find the maximum tag count value across all positions in a track.
# The result is returned as a Sequence Numeric Map with maximum values for each individual sequence
# and with a default map value reflecting the highest count across all sequences
Max_tag_count = statistic "maximum value" in ChIPseq_tag_counts
# Discover whether TF binding sites are more conserved than other parts of the genome
# by analyzing the distribution of conservation track values inside versus outside TFBS regions
Analysis1 = analyze numeric dataset distribution {Numeric dataset = Conservation, Region dataset = TFBS}
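As a rough illustration of two of the examples above, the Python sketch below shows how a region filter based on average track value and a per-sequence maximum statistic could be computed. It is not MotifLab code. The function names, the 0-based inclusive `(start, end)` region representation and the dictionary of per-sequence tracks are all assumptions made for this sketch.

```python
def filter_regions_by_average(track, regions, min_average):
    """Keep only regions whose average track value is at least min_average."""
    kept = []
    for start, end in regions:
        values = track[start:end + 1]
        # Skip empty slices (regions that fall outside the track).
        if values and sum(values) / len(values) >= min_average:
            kept.append((start, end))
    return kept


def maximum_per_sequence(tracks):
    """Return the maximum value for each sequence and the overall maximum.

    `tracks` maps a sequence name to its list of values. The overall maximum
    plays the role of the default value of the resulting map.
    """
    per_sequence = {name: max(values) for name, values in tracks.items()}
    overall = max(per_sequence.values())
    return per_sequence, overall
```

With a cutoff of 0.2, a region is kept only if the track values inside it average 0.2 or more, which corresponds to removing regions where the average is below 0.2.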