DNA Sequence Datasets (also called DNA tracks or DNA sequence tracks) hold the DNA sequence for a sequence segment, represented with one base letter for each position within the sequence.
Most often, objects of this type will hold the original DNA sequence from that location, but this does not have to be the case.
The DNA sequence could instead be a slightly modified version of the original sequence, a scrambled version or even a fully artificially created sequence.
The base letters are normally A, C, G or T, but any letter is allowed in the sequence. For instance, N's or X's can be used to mask portions of a sequence.
Base letters can be in either uppercase or lowercase, and the case may or may not be important depending on the context and the tools used to analyze the sequence.
For example, lowercase letters can be used to indicate repetitive segments of a sequence that should be ignored by a motif discovery tool.
DNA sequences are always stored relative to the direct strand internally in MotifLab (independent of the annotated strand orientation of the sequence), but they can be converted on-the-fly to display or manipulate the sequence relative to either strand when necessary.
DNA Sequence Datasets are normally imported from predefined tracks or loaded from files (in FASTA or 2bit format), but they can also be artificially created based on a background distribution.
# Import the DNA sequence for the current sequences from the preconfigured track called "Genomic DNA"
DNA = new DNA Sequence Dataset(DataTrack:Genomic DNA)
# Import the DNA sequences for the current sequences from a FASTA file. Note that the sequence objects
# must already have been created and match the names and lengths of the sequences in the FASTA file.
DNA = new DNA Sequence Dataset(File:"C:\data.fas", Format=FASTA)
# Create a new 'empty' DNA sequence track consisting of only N's
DNA = new DNA Sequence Dataset()
# Create a new DNA sequence track consisting of only A's (on the direct strand)
DNA = new DNA Sequence Dataset('A')
# Create an artificial DNA sequence track by randomly sampling base letters from the distribution
# defined in the background model object "EDP_human_3"
DNA = new DNA Sequence Dataset(EDP_human_3)
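The strand handling described above (sequences stored relative to the direct strand and converted on-the-fly when the other strand is needed) can be illustrated with a small Python sketch. This is not MotifLab code: the function name `view_on_strand` is hypothetical, and the sketch assumes plain Watson-Crick complementation in which masking letters such as N or X, and the case of each letter, are left unchanged.

```python
# Hypothetical helper for illustration only; not part of MotifLab.
# Maps each base to its complement while preserving upper/lowercase.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def view_on_strand(direct_sequence: str, reverse_strand: bool) -> str:
    """Return a direct-strand sequence as it reads on the requested strand."""
    if not reverse_strand:
        # The stored sequence is already relative to the direct strand.
        return direct_sequence
    # Complement every base, then reverse the order of the letters.
    # Letters outside ACGT (such as N or X) pass through untouched.
    return direct_sequence.translate(_COMPLEMENT)[::-1]
```

Applying the conversion twice returns the original string, which is why a single stored copy on the direct strand is enough to serve both orientations.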
The main operation for modifying DNA Sequence Datasets is mask, which can replace base letters in certain positions with new letters or change the case of the letters.
In addition, the plant operation can insert new binding motifs for transcription factors into an existing DNA sequence.
The GUI's draw tool allows users to manipulate the DNA sequence by drawing or typing directly into the visualized track.
# Replace the DNA sequence letters with the letter X within RepeatMasker regions
mask DNA with "X" where inside RepeatMasker
# Replace the DNA sequence letters with the letter "A" within RepeatMasker regions
# taking the strand orientation of the sequences into account
mask DNA on relative strand with "A" where inside RepeatMasker
# Change the case of all DNA bases outside of gene regions to lowercase.
# Return the result as a new track named "DNA_masked"
DNA_masked = mask DNA with lowercase where not inside EnsemblGenes
# Replace bases within TFBS regions with new bases randomly sampled from the background model "EDP_human_3"
# (This will destroy the binding motifs)
mask DNA on relative strand with EDP_human_3 where inside TFBS
# Replace bases within TFBS regions with the "sequence" property annotated in these regions
mask DNA with TFBS
# Insert the motif M00003 at a random location in each sequence (overwriting the current sequence).
# Return the modified sequence in a new track called "SequenceWithMotif".
# The region track "PlantedMotifs" indicates where the motif was planted in each sequence.
[SequenceWithMotif,PlantedMotifs] = plant M00003 in DNA
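To make the effect of masking concrete, here is a minimal Python sketch of the idea behind replacing letters or changing case inside a set of regions. It is an illustration under stated assumptions, not MotifLab's implementation: the function `mask_regions` is hypothetical, and regions are assumed to be 0-based, half-open `(start, end)` pairs on the direct strand.

```python
from typing import List, Optional, Tuple


def mask_regions(sequence: str,
                 regions: List[Tuple[int, int]],
                 replacement: Optional[str] = "X") -> str:
    """Mask the given regions of a sequence (hypothetical helper).

    If `replacement` is a single letter, every base inside a region is
    replaced by it. If `replacement` is None, bases inside a region are
    converted to lowercase instead.
    """
    bases = list(sequence)
    for start, end in regions:
        # Clip each region to the bounds of the sequence.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower() if replacement is None else replacement
    return "".join(bases)
```

For example, `mask_regions("ACGTACGT", [(2, 5)], "X")` yields `"ACXXXCGT"`, while passing `None` as the replacement lowercases the same positions without changing the letters.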
DNA sequence tracks are used as input to motif discovery and motif scanning tools (and also module discovery/scanning) and similar operations or tools that search DNA sequences for specific patterns (such as the search and score operations).
Background Models can be derived from DNA tracks, and base frequency statistics can also be derived with the statistic operation or the GC-content analysis.
Sequence-dependent characteristics of the DNA helix, such as stacking energy and propeller twist, can be derived from a DNA track with the physical operation and represented with numeric tracks.
In MotifLab v2 it is possible to extract the corresponding amino acid sequence from the DNA sequence for all six reading frames.
DNA sequence tracks can also be referenced in conditions, as demonstrated in the last example below. Here, segments of a DNA sequence masked with X's are used to derive a new Region Dataset representing these masked portions.
This is done by first creating a Numeric Dataset with value 1 for every position with an X and then converting this numeric track to a region track.
# Search for the pattern "CACGTG" within the DNA sequence and return matching regions in a new track
Matches = search DNA for "CACGTG" on both strands
# Use the MATCH algorithm to scan for matches to JASPAR motifs in the DNA sequence
TFBS = motifScanning in DNA with MATCH {Motif collection=JASPAR,Matrix threshold=0.9}
# Use the DNA track (on the relative strand) to derive a second-order Markov model of the base distribution
BGmodel = new BackGround Model {Track:DNA, Order=2, Strand=Relative}
# Count the number of T's in each sequence. Return the result as a Sequence Numeric Map
T_count = statistic "T-count" in DNA on relative strand
# Derive the GC-frequency from annotated CpG island regions of each sequence
GC_content = statistic "GC-content" in DNA where inside CpG_islands
# Perform GC-content analysis. Results are returned as an Analysis object rather than a numeric map
GC_content = analyze GC-content {DNA track = DNA}
# Derive a measure of 'propeller twist' along the DNA helix
twist = physical property "propeller twist" derived from DNA using window of size 10 with anchor at center
# Derive the amino acid sequence corresponding to the DNA sequence on the direct strand
# using a reading frame offset 2bp from the start of the sequence. The AA sequence is returned
# as a region track with consecutive 3bp regions named after the amino acids
AA_frame2 = extract "Direct-2" from DNA as Region Dataset
# Derive a Region Dataset representing the masked regions of a DNA sequence.
MaskedRegions = new Numeric Dataset(0)
set MaskedRegions to 1 where DNA equals "X"
convert MaskedRegions to region where MaskedRegions > 0
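The two-step idea in the last example (flag every masked position with 1, then merge consecutive flagged positions into regions) can be sketched in Python as follows. This is only an illustration of the logic, not MotifLab code: the function `masked_regions` is hypothetical, and the resulting regions are assumed to be 0-based, half-open `(start, end)` pairs.

```python
from typing import List, Tuple


def masked_regions(sequence: str, mask_letter: str = "X") -> List[Tuple[int, int]]:
    """Return the runs of `mask_letter` in a sequence as (start, end) regions."""
    # Step 1: build a numeric track with 1 at masked positions and 0 elsewhere.
    flags = [1 if base == mask_letter else 0 for base in sequence]

    # Step 2: convert each run of positive values into one region.
    regions = []
    start = None
    for position, flag in enumerate(flags):
        if flag > 0 and start is None:
            start = position                    # a new run begins here
        elif flag == 0 and start is not None:
            regions.append((start, position))   # the run ended just before here
            start = None
    if start is not None:
        # The sequence ended while still inside a run.
        regions.append((start, len(flags)))
    return regions
```

For the sequence `"ACXXGTXA"` this returns `[(2, 4), (6, 7)]`, one region per contiguous stretch of X's.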