perlsOfLondon

 

parsing_101_lab

Page history last edited by Darin London 1 yr ago

Take a look at the EMBL file /home/londo003/perl_class2/data_files/EMBL_example.txt.  This is an EMBL file, from the European Molecular Biological Laboratory.  It represents the sequence string, and ID assigned to that sequence by the EMBL, Genbank, and DDBJ, along with all of the collective annotation that these institutions have decided to track for each sequence.

 

 

1.  Write a script which takes this file as input, and produces a file with all sequence records in the following format:

 

 

ANSWER

 

#!/usr/bin/perl                                                                                                                

 

use strict; # this will always help you                                                                                        

 

# we could just work on <>, but this is more helpful to the user                                                               

my $embl_file = shift or die "embl_processor.pl path_to_embl_file\n";

 

# open the filehandle into a scalar variable                                                                                   

# this will become more useful later on in the class                                                                           

open (my $embl, "<".$embl_file) or die "Couldnt open $embl_file $!\n";

 

# set up an array to hold                                                                                                      

my @records = ();

 

# this is a state variable that is set to true after the SQ line is encountered                                                

# which tells the line processecor to process sequence entry lines, since there                                                

# is no identifiable pattern that can be matched for them                                                                      

my $inseq;

 

# this will hold our current record                                                                                            

# it is a hashref because we intend to add it to the                                                                           

# @records array.  We could make it a hash, and reference                                                                      

# using \%curr_record later, but it is always easier to just                                                                   

# get used to using a data container as a reference from                                                                       

# the start                                                                                                                    

my $curr_record = {}; # this is a hashref                                                                                      

 

LINE: while (my $line = <$embl>) {

    chomp $line;

    next if ($line eq 'XX'); # these are spacer lines                                                                          

 

    if ($line =~ m/^ID/) {

        # harvest the id                                                                                                       

        my $id = (split /\s+/, $line)[1];

        $id =~ s/\;$//;

        $curr_record->{id} = $id;

    }

    elsif ($line =~ m/^AC/) {

        # harvest the accession                                                                                                

        my $acc = (split /\s+/, $line)[1];

        $acc =~ s/\;$//;

        $curr_record->{accession} = $acc;

    }

    elsif ($line =~ m/^RL\s+(.*)/) {

        # harvest the RL line                                                                                                  

        my $rl = $1;

 

        # note, we could push this                                                                                             

        # onto an arrayref keyed to rl, as well                                                                                

        $curr_record->{rl} .= ' '.$rl; # prepend it with a space                                                               

    }

    elsif ($line =~ m/^FT/) {

        # there are several of these that we are interested in                                                                 

        if ($line =~ m/\/mol_type\=(.*)/) {

            # harvest the mol_type                                                                                             

            my $mol_type = $1;

            $mol_type =~ s/\"//;

            $curr_record->{mol_type} = $mol_type;

        }

        elsif ($line =~ m/\/chromosome\=(.*)/) {

            # harvest the chromosome                                                                                           

            my $chromosome = $1;

            $chromosome =~ s/\"//;

            $curr_record->{chromosome} = $chromosome;

        }

        elsif ($line =~ m/\/organism\=(.*)/) {

            # harvest the organism                                                                                             

            my $organism = $1;

            $organism =~ s/\"//;

            $curr_record->{organism} = $organism;

        }

        else {

            # skip all other FT lines                                                                                          

            next LINE;

        }

    }

    elsif ($line =~ m/^SQ\s+(.*)/) {

        # harvest the sequence bp info

        my $seq_info = $1;

 

        # there are other ways of doing the following, this is just one way                                                    

        my @parts = split /\;\s+/, $seq_info; # note the split pattern takes care of whitespace between records                

 

        $curr_record->{total_bp} = (split /\s+/, $parts[0])[2];

        $curr_record->{num_a} = (split /\s+/, $parts[1])[1];

        $curr_record->{num_c} = (split /\s+/, $parts[2])[1];

        $curr_record->{num_g} = (split /\s+/, $parts[3])[1];

        $curr_record->{num_t} = (split /\s+/, $parts[4])[1];

 

        # reset the internal state of the program to know that it is                                                           

        # harvesting sequence bps for future lines until the state gets reset                                                  

        $inseq = 1;

    }

    elsif ($inseq) {

        if ($line eq '//') {

            # reset inseq to undefined to denote that                                                                          

            # we are no longer harvesting sequence lines.                                                                      

            # This is necessary to allow the program to skip                                                                   

            # the attributes of the EMBL file that we are not interested in                                                    

            # otherwise, it will shove these lines onto the sequence                                                           

            $inseq = undef; # reset inseq state                                                                                

 

            # add the record onto the end of the @records array                                                                

            push @records, $curr_record;

 

            # refresh the curr_record to a new, empty reference to make sure                                                   

            # we dont have contamination of record information                                                                 

            $curr_record = {};

        }

        else {

            $line =~ s/^\s+//; # get rid of leading whitespace                                                                 

 

            # there are other ways of doing the following, this is just one way                                                

            my @parts = split /\s+/, $line;

            pop @parts; # throw away the bp count                                                                              

            $curr_record->{sequence} .= join('', @parts); # concatenation into one line                                        

        }

    }

    else {

        # skip all other lines                                                                                                 

        next LINE;

    }

}

close $embl;

 

foreach my $record (@records) {

    # this is called a heredoc                                                                                                 

    # it prints everything from the next line to the string (I use EOR, but this is arbitrary) occurring                       

    # by itself, on a line of the code.  If the String is double quoted, variables are interpolated.                           

    # If it is single quoted, variables are not interpolated.                                                                  

    # It is a redimentary form of templating.  For more powerful templating, read up on the Template Toolkit,                  

    # which is contained in the CPAN 'Template' Package.                                                                       

 

    print <<"EOR";

record_start

  ID $record->{id}

  Accession $record->{accession}

  RL $record->{rl}

  Molecular Type: $record->{mol_type}

  Organism: $record->{organism}

  Chromosome: $record->{chromosome}

 

  Number of Base Pairs: $record->{total_bp}

  Number of As: $record->{num_a}

  Number of Cs: $record->{num_c}

  Number of Gs: $record->{num_g}

  Number of Ts: $record->{num_t}

 

  sequence_start

  $record->{sequence}

  sequence_end

record_end

EOR

}

exit;

 

 

record_start

  ID $id

  Accession $acc

  RL $rl_line

  Molecular Type: $mol_type

  Organism: $organism

  Chromosome: $chrom

 

  Number of Base Pairs:  $num_bp

  Number of As: $num_a

  Number of Cs: $num_c

  Number of Gs: $num_g

  Number of Ts: $num_t

 

  sequence_start

  $seq

  sequence_end

record_end

.

.

.

 

Note that the EMBL file puts a number denoting the position in the sequence after each line of sequence, and splits each line of sequence into chunks of 10.  Your sequence must not contain the location numbers.  If you want to remove the whitespaces between each chunk, or just print the entire sequence out on one line, that is up to you.  Also, your RL line must be one line of text only, e.g. concatenate all RL entries in the file.  Also, we only want identifiers for ID and AC, nothing else (including the ;).  One nice thing about EMBL is that it does have '//' as a record separator, which is a signal that you can process the current_record.  This makes it easier to stream, as we will see later.

 

Here is some psuedocode to help you get started:

 

my inseq # a state variable denoting whether we are collecting sequence data

my record

while (line = input) {

  cleanup lineending;

  if (line begins with ID) {

    harvest id into record data container

  }

  elsif (line begins with AC) {

    harvest acc into record

  }

  elsif (line begins with RL) {

    add RL line to record

  }

  elsif (line begins with FT) {

    if (line contains '/mol_type') {

      harvest the mol_type into record

    }

    elsif (line contains '/chromosome') {

       harvest the chromosome into record

    }

    elsif (line contains '/organism') {

       harvest the organism into record

    }

    else {

      next

    }

  }

  elsif (line begins with SQ) {

    inseq = true

    harvest num_bp, num_a, num_c, num_g, num_t into record

  }

  elsif ($inseq) {

    if (line eq '//') {

      undef inseq # reset the state

      add record onto stack of records

      clean record data container

     }

     else {

       add sequence bps to record

     }

  }

  else {

    next;

  }

}

 

foreach record (records) {

  print "record_start\n"

  print "  ID: "record:id"\n"

  print "  Accession: "record:acc"\n"

  print "  RL: "record:RL"\n"

  print "  Molecular Type: "record:mol_type"\n"

  ...

  print "  sequence_start\n";

  print record:seq"\n"

  print "  sequence_end\n";

  print "record_end\n";

}

 

exit

 

 

Please make sure you have finished question 1 before the next class, as the labs in the rest of the class assume that you have a working parser capable of producing this output.  If you have finshed question 1, and there is still time, or if you want further practice outside of class, you should be able to quickly modify/cut-and-paste/copy it to do the following:

2.  Write these sequence records out to different files based on Molecular Type.

 

3. Write out Fasta formatted records to different files based on Chromosome.  Use the following ID line format:

>$id $accession chr:$chr bp:$num_bp a:$num_a c:$num_c g:$num_g t:$num_t

$seq

.

.

 

Note, it is perfectly fine to print the entire sequence out on one line, if you want.  You must remove the whitespace between each 10bp chunk in the original file.  As extra credit, see if you can print out each sequence in 80bp lines.

 

Comments (0)

You don't have permission to comment on this page.