# Translate DNA (or RNA) sequences into sequences of amino acids
So I accidentally sat in on a Biology lecture today and it gave me an idea for another little MATLAB project: turning sequences of nucleotides into sequences of amino acids.
If you’re interested you can read about translation here, but the short of it is that translation is the process by which an mRNA strand is read three nucleotides (one codon) at a time, with each codon specifying one amino acid.
So, let us say you have some strand of DNA of the sequence ‘GTAGGTATGTGCGGTGGATTCAACCTTTGGACCGAAATTGGG’ and you wanted to know what amino acids this sequence codes for. Normally this entails a tedious process where (assuming the sequence contains your start codon) you identify the start codon and then you match each following 3-nucleotide length sequence to a chart which indicates its corresponding amino acid. However, it’s rather trivial to write a script which will identify the sequence for you.
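To make the manual procedure concrete, here is a minimal sketch of the "find the start codon, then cut into triplets" step on the sequence above. The variable names (`hits`, `nCodons`, `codons`) are only illustrative and are not part of the functions later in this post.

```matlab
% Sequence from the post
seq = 'GTAGGTATGTGCGGTGGATTCAACCTTTGGACCGAAATTGGG';

% Position of the first start codon (ATG)
hits = strfind(seq, 'ATG');
start = hits(1);

% Number of complete codons from the start codon onward
nCodons = floor((length(seq) - start + 1) / 3);

% Reshape into an nCodons-by-3 char matrix, one codon per row
codons = reshape(seq(start : start + 3*nCodons - 1), 3, [])';

disp(codons(1:3, :))   % show the first three codons
```

Each row of `codons` is what you would otherwise look up by hand in a codon chart.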
Nucleotide sequence to amino acid database
I won’t get into the specifics (because they’re boring) but first you start by creating a database which stores the identities of amino acids and their corresponding nucleotide sequences. Since I’ve created one, you can just use mine here and save it to a directory on your MATLAB path.
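If you would rather build the database yourself, the sketch below shows one layout that should work. It is an assumption inferred from the field names used by the `dnamatcher` function in this post (`list.codingmatrix` and `list.namematrix`); it holds only three codons for illustration, and the file name `DNA_demo.mat` is hypothetical.

```matlab
% Hypothetical miniature database: three codons only.
% Field names follow the dnamatcher function; the rest is assumed.
list.codingmatrix = ['ATG'; 'TGC'; 'GGT'];                       % one codon per row
list.namematrix   = char('Methionine', 'Cysteine', 'Glycine');   % rows padded with spaces to equal width

% Saved under a demo name so it does not overwrite a real DNA.mat
save('DNA_demo.mat', 'list');
```

A full version would need one row per codon (64 in total), with the rows of the two matrices in matching order.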
Matching sequences to amino acids
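The first function you’ll want to create is something that will match a given three-nucleotide string to its amino acid based on the supplied database.

    function match = dnamatcher(sequence)
    % Return the amino acid name for a single 3-nucleotide codon.
    load('DNA.mat')   % provides the struct "list"

    % Rows of the codon table that agree with each position of the codon
    i = find(list.codingmatrix(:,1) == sequence(1));
    j = find(list.codingmatrix(:,2) == sequence(2));
    k = find(list.codingmatrix(:,3) == sequence(3));

    % The row common to all three positions is the matching codon
    p = intersect(i, j);
    q = intersect(j, k);
    r = intersect(p, q);
    match = list.namematrix(r, :);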
This function returns the amino acid identity of any 3-nucleotide sequence.
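As an aside, the column-by-column matching can also be written as a single row lookup with `ismember(..., 'rows')`. The sketch below is an alternative, not the function above; it uses a hypothetical three-entry table in place of `DNA.mat` so that it runs on its own.

```matlab
% Hypothetical three-entry table standing in for DNA.mat
codingmatrix = ['ATG'; 'TGC'; 'GGT'];
namematrix   = char('Methionine', 'Cysteine', 'Glycine');

codon = 'TGC';

% Find the row of codingmatrix equal to the codon, if any
[found, row] = ismember(codon, codingmatrix, 'rows');
if found
    name = strtrim(namematrix(row, :));   % drop the padding spaces
else
    name = 'Unknown';                     % codon not in the table
end
disp(name)
```

The explicit `found` check also gives you a place to handle codons that are missing from the table.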
General scanner
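The next step is to create a function which identifies the start codon and splits the rest of the sequence into 3-nucleotide segments, each of which is passed to the previous function.

    function aminoacidsequence = dnascan(sequence, type)
    % Translate a nucleotide sequence, starting at the first ATG.
    if strcmp(type, 'RNA')
        sequence(sequence == 'U') = 'T';   % in case you want to use an RNA sequence instead
    end
    start = min(regexp(sequence, 'ATG'));  % position of the first start codon
    aminoacidsequence = '';
    for i = start:3:length(sequence)-2     % skip a trailing partial codon
        % (i-start)/3+1 numbers the codons 1, 2, 3, ... from the start codon
        aminoacidsequence((i-start)/3+1, :) = dnamatcher(sequence(i:i+2));
    end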
And there we go.
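If you want to check the RNA handling on its own, the snippet below applies the U-to-T substitution that `dnascan` is meant to perform and then locates the start codon. The RNA string is a made-up example (the first 15 bases of the sequence above with T replaced by U), and `strrep` is used here simply as a compact way to do the substitution.

```matlab
rna = 'GUAGGUAUGUGCGGU';            % hypothetical RNA input
dna = strrep(rna, 'U', 'T');        % rewrite in the DNA alphabet
start = min(regexp(dna, 'ATG'));    % position of the first start codon
disp(dna)
disp(start)
```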
Demo
    >> dnascan('GTAGGTATGTGCGGTGGATTCAACCTTTGGACCGAAATTGGG','DNA')
    ans =
    Methionine
    Cysteine
    Glycine
    Glycine
    Phenylalanine
    Asparagine
    Leucine
    Tryptophan
    Threonine
    Glutamic acid
    Isoleucine
    Glycine