Numeric Datasets (also called numeric tracks) store one numeric value for each position within a sequence segment. Examples of the information they can hold include per-base phylogenetic conservation levels, physical or statistical characteristics of the DNA sequence or double helix (e.g. helix twist and roll, or local GC-content), the distance from each sequence position to some target feature, per-base quality scores (for sequence reads), ChIP-seq tag counts per position, and position-specific priors used to guide motif discovery.
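To make the "one value per position" idea concrete, the following is a minimal Python sketch (not MotifLab code) that derives a local GC-content track from a DNA string. The function name `gc_content_track`, the window size, and the choice to truncate windows at the sequence ends are assumptions made for illustration only.

```python
def gc_content_track(sequence, window=5):
    """Return one value per position: the GC fraction in a window centered there.

    Windows are truncated at the sequence ends (an assumption of this sketch).
    """
    half = window // 2
    track = []
    for i in range(len(sequence)):
        start = max(0, i - half)
        end = min(len(sequence), i + half + 1)
        segment = sequence[start:end].upper()
        gc = sum(1 for base in segment if base in "GC")
        track.append(gc / len(segment))
    return track


# A numeric track always has exactly as many values as the sequence has positions.
gc_track = gc_content_track("ATGCGCAT", window=3)
```

The result is a plain list with the same length as the sequence, which is the shape every example below assumes for a numeric track.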

Creating Numeric Datasets

Numeric annotation tracks (based on e.g. data from UCSC Genome Browser or other databases) can be imported from preconfigured tracks or loaded from files in various formats. Numeric tracks can also be derived from information in other types of tracks. For example, Priors Generators can be trained with machine learning methods to predict the location of certain features based on combined information from several different tracks. The output from a Priors Generator is a numeric track where each position reflects a prior probability (or likelihood) that the position could overlap with the target feature (for example a TF binding site).

# Import the "PhastCons100way" annotation track for the current sequences
Conservation = new Numeric Dataset(DataTrack:PhastCons100way)

# Import a conservation track from file in WIG format.
Conservation = new Numeric Dataset(File:"C:\phastcons.wig", Format=WIG)

# Create a new 'empty' numeric track where each position has a value of zero
Empty = new Numeric Dataset

# Create a new numeric track where each position is assigned the initial value 42
Answer = new Numeric Dataset(42)

# Create a new numeric track where the value at each position is the average of the values
# from three other tracks
AverageValueTrack = combine_numeric track1,track2,track3 using average

# Convert the existing region track "CpG_islands" into a numeric track such that all positions
# within the original regions are assigned the value 100 and all other positions are assigned a value of 0
convert CpG_islands to numeric with value = 100

# Create a new track by counting the number of TFBS regions that overlap with a 5bp window
# centered around every position in the track
CountTrack = count number of regions in TFBS overlapping window of size 5 with anchor at center

# Create a new track where the value in each position is the distance (in bp)
# to the closest annotated EnsemblGenes region
DistanceToClosestGene = distance from EnsemblGenes

# Create a new track based on a measure of predicted 'propeller twist' along the DNA helix
twist = physical property "propeller twist" derived from DNA using window of size 10 with anchor at center

# Use the TFBSoracle priors generator to derive a new positional priors track based on
# an (implicit) set of feature tracks known to the priors generator object
TFBS_prior = predict with TFBSoracle
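As a rough illustration of what the region-based operations above compute, here is a Python sketch (not MotifLab code) of converting regions to a numeric track, counting regions that overlap a centered window, and measuring the distance to the closest region. Regions are assumed to be 0-based, inclusive `(start, end)` pairs, and all function names are hypothetical. How MotifLab itself handles sequence edges or sequences without any regions is not stated in the text, so those choices are assumptions here.

```python
def regions_to_numeric(length, regions, value=100.0):
    """Positions inside any region get `value`; all other positions get 0."""
    track = [0.0] * length
    for start, end in regions:
        for pos in range(max(0, start), min(length, end + 1)):
            track[pos] = value
    return track


def count_overlapping(length, regions, window=5):
    """Count the regions that overlap a window centered on each position."""
    half = window // 2
    counts = []
    for pos in range(length):
        lo, hi = pos - half, pos + half
        counts.append(sum(1 for start, end in regions if start <= hi and end >= lo))
    return counts


def distance_to_closest(length, regions):
    """Distance in bp from each position to the nearest region (0 inside a region).

    Returns None for every position if there are no regions (sketch assumption).
    """
    distances = []
    for pos in range(length):
        best = None
        for start, end in regions:
            if start <= pos <= end:
                d = 0
            elif pos < start:
                d = start - pos
            else:
                d = pos - end
            best = d if best is None else min(best, d)
        distances.append(best)
    return distances


# Two hypothetical regions on a 12 bp sequence.
example_regions = [(2, 4), (8, 9)]
as_numeric = regions_to_numeric(12, example_regions, value=100.0)
overlap_counts = count_overlapping(12, example_regions, window=5)
closest = distance_to_closest(12, example_regions)
```

All three results are tracks of length 12, so they can be combined position by position with other numeric tracks for the same sequence.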

Modifying Numeric Datasets

Existing numeric datasets can be modified with arithmetic operations (increase, decrease, multiply and divide) or assigned explicit values with the set operation. They can also be transformed with various mathematical functions (including square root, logarithm and random number), normalized to a new range, or thresholded to create "binary valued" tracks. All of these operations work position by position. The apply operation, by contrast, transforms tracks with sliding window functions, so that the new value in each position is derived from the values of several positions in a neighbourhood around it.
In addition, the GUI's draw tool allows users to manipulate numeric datasets by drawing directly into the visualized track.

# Increase the values in the Conservation track by 2 for every position
increase Conservation by 2

# Increase the values in the Conservation track by the values from another track (position by position)
increase Conservation by DistanceToClosestGeneTrack

# Assign the Conservation track a value of 0 within all repeat regions
# Return the results in a new track
MaskedConservation = set Conservation to 0 where inside RepeatMasker

# Return a new track based on the absolute values of Track1 (negative values converted to positive)
Track2 = transform Track1 with absolute

# Rescale Track1 so that the values fall within the new range 10 to 100.
# (i.e. the smallest value in the track will now be 10 and the largest value will now be 100)
normalize Track1 from range [dataset.min,dataset.max] to range [10,100]

# Transform the Conservation track so that all values previously above (or equal to) 0.5 will be set to 1
# and those below will be set to 0
threshold Conservation with cutoff=0.5 set values above cutoff to 1 and values below cutoff to 0

# Smooth the Conservation track by applying a 25bp wide "Bartlett" sliding window.
# This will assign each position a new value based on a weighted average of the values in its vicinity
SmoothConservation = apply Bartlett window of size 25 with anchor at center to Conservation
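The difference between position-by-position operations and sliding-window operations can be sketched in Python (not MotifLab code) as follows. The `normalize` and `threshold` functions touch each value independently, while `bartlett_smooth` derives each new value from its neighbours using triangular (Bartlett-style) weights. The function names, the exact weights, and the handling of sequence edges are assumptions of this sketch and may differ from what MotifLab does.

```python
def normalize(track, new_min, new_max):
    """Linearly rescale so the smallest value maps to new_min and the largest to new_max."""
    old_min, old_max = min(track), max(track)
    if old_max == old_min:
        # A constant track has no range to stretch; map it to new_min (sketch assumption).
        return [float(new_min)] * len(track)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in track]


def threshold(track, cutoff, above=1.0, below=0.0):
    """Values at or above the cutoff become `above`; all others become `below`."""
    return [above if v >= cutoff else below for v in track]


def bartlett_smooth(track, window=5):
    """Weighted average with triangular weights centered on each position.

    Weights that fall outside the sequence are dropped and the remaining
    weights are renormalized (an assumption about edge handling).
    """
    half = window // 2
    offsets = range(-half, half + 1)
    weights = [half + 1 - abs(k) for k in offsets]
    smoothed = []
    for i in range(len(track)):
        total = 0.0
        weight_sum = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < len(track):
                total += w * track[j]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


rescaled = normalize([0.0, 0.5, 1.0], 10, 100)
binary = threshold([0.2, 0.5, 0.7], cutoff=0.5)
smooth = bartlett_smooth([0.0, 0.0, 6.0, 0.0, 0.0], window=3)
```

In the smoothing example a single peak is spread over its neighbours, which is the effect a sliding window has that no position-by-position operation can produce.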

Using Numeric Datasets

MotifLab is an expansion of an earlier program called PriorsEditor, whose primary purpose was to create numeric tracks that could be used as position-specific priors to guide the motif discovery process. Besides being descriptive and informative in their own right, numeric tracks can be used in conditions to limit operations to certain positions in the sequence or to regions with certain value distributions within their sites.

# Search for motifs and binding sites with MEME using the "Conservation" track as positional priors
[TFBS,MEMEmotifs] = motifDiscovery in DNA with MEME {Positional priors=Conservation, ... }

# Mask positions in the DNA sequence with low conservation
mask DNA with "N" where Conservation < 0.2

# Remove predicted TFBS regions with low conservation within the site
filter TFBS_predicted where region's average Conservation < 0.2

# Use the statistic operation to find the maximum tag count value across all positions in a track.
# The result is returned as a Sequence Numeric Map with maximum values for each individual sequence
# and with a default map value reflecting the highest count across all sequences
Max_tag_count = statistic "maximum value" in ChIPseq_tag_counts

# Discover whether TF binding sites are more conserved than other parts of the genome
# by analyzing the distribution of conservation track values inside versus outside TFBS regions
Analysis1 = analyze numeric dataset distribution {Numeric dataset = Conservation, Region dataset = TFBS}
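The condition-based uses above can be approximated in a few lines of Python (not MotifLab code). The sketch masks bases where the track falls below a cutoff, drops regions whose average track value is below the cutoff, and computes a per-sequence maximum together with an overall maximum. Function names, the 0-based inclusive `(start, end)` region convention, and the dictionary standing in for a Sequence Numeric Map are all assumptions made for illustration.

```python
def mask_sequence(dna, track, cutoff=0.2, mask_char="N"):
    """Replace each base whose track value is below the cutoff with mask_char."""
    return "".join(mask_char if v < cutoff else base for base, v in zip(dna, track))


def filter_regions(regions, track, cutoff=0.2):
    """Remove regions whose average track value is below the cutoff."""
    kept = []
    for start, end in regions:
        values = track[start:end + 1]
        if sum(values) / len(values) >= cutoff:
            kept.append((start, end))
    return kept


def maximum_per_sequence(tracks):
    """Return ({sequence name: max value}, overall max across all sequences)."""
    per_sequence = {name: max(values) for name, values in tracks.items()}
    overall = max(per_sequence.values())
    return per_sequence, overall


# Hypothetical data: one short sequence with a conservation value per base.
conservation = [0.9, 0.1, 0.5, 0.0, 0.3, 0.2]
masked = mask_sequence("ACGTAC", conservation, cutoff=0.2)
kept_sites = filter_regions([(0, 1), (3, 3), (4, 5)], conservation, cutoff=0.2)
max_map, max_default = maximum_per_sequence({"seq1": [1, 5, 2], "seq2": [7, 0]})
```

Note that the region filter uses the average over the whole site, so a site can survive even if some of its individual positions fall below the cutoff.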