perlsOfLondon

 

cutandpaste_to_subroutines_lab

Page history last edited by Darin 1 yr ago

1.  Take the algorithm that you wrote to parse the EMBL_example.txt file and place it into a subroutine.  Then take the algorithm that you wrote to write out each sequence record in the specified format and put it into another subroutine.  The script should function exactly like it did before, except that it should use a subroutine to get the next sequence record from the input handle, instead of processing each line within the main section itself.  You should probably make a copy of the original working script, and modify the copy to contain the subroutine.  You can then compare the output of the new script to that of the 'reference' script using diff to make sure it still works.

 

ANSWER

#!/usr/bin/perl                                                                                                                                                               

use strict; # this will always help you                                                                                                                                                              

# we could just work on <>, but this is more helpful to the user                                                                                                                                     
my $embl_file = shift or die "embl_processor.pl path_to_embl_file\n";

# open the filehandle into a scalar variable                                                                                                                                                         
# this will become more useful later on in the class                                                                                                                                                 
open (my $embl, "<".$embl_file) or die "Couldnt open $embl_file $!\n";

while (my $seq = &next_seq($embl)) {
    &write_seq($seq);
}
close $embl;
exit;

sub next_seq {
    my $handle = shift;

    # this will hold our current record                                                                                                                                                              
    my $curr_record; # this must start undef, so that it returns undef on EOF                                                                                                                        

    # this is a state variable that is set to true after the SQ line is encountered                                                                                                                  
    # which tells the line processecor to process sequence entry lines, since there                                                                                                                  
    # is no identifiable pattern that can be matched for them                                                                                                                                        
    my $inseq;

    LINE: while (my $line = <$embl>) {
        chomp $line;
        next if ($line eq 'XX'); # these are spacer lines                                                                                                                                            

        if ($line =~ m/^ID/) {
            # harvest the id                                                                                                                                                                         
            my $id = (split /\s+/, $line)[1];
            $id =~ s/\;$//;
            $curr_record->{id} = $id;
        }
        elsif ($line =~ m/^AC/) {
            # harvest the accession                                                                                                                                                                  
            my $acc = (split /\s+/, $line)[1];
            $acc =~ s/\;$//;
            $curr_record->{accession} = $acc;
        }
        elsif ($line =~ m/^RL\s+(.*)/) {
            # harvest the RL line                                                                                                                                                                    
            my $rl = $1;

 

            # note, we could push this                                                                                                                                                                  

            # onto an arrayref keyed to rl, as well                                                                                                                                                     

            $curr_record->{rl} .= ' '.$rl; # prepend it with a space                                                                                                                                    

        }

        elsif ($line =~ m/^FT/) {

            # there are several of these that we are interested in                                                                                                                                      

            if ($line =~ m/\/mol_type\=(.*)/) {

                # harvest the mol_type                                                                                                                                                                  

                my $mol_type = $1;

                $mol_type =~ s/\"//;

                $curr_record->{mol_type} = $mol_type;

            }

            elsif ($line =~ m/\/chromosome\=(.*)/) {

                # harvest the chromosome                                                                                                                                                                

                my $chromosome = $1;

                $chromosome =~ s/\"//;

                $curr_record->{chromosome} = $chromosome;

            }

            elsif ($line =~ m/\/organism\=(.*)/) {

                # harvest the organism                                                                                                                                                                  

                my $organism = $1;

                $organism =~ s/\"//;

                $curr_record->{organism} = $organism;

            }

            else {

                # skip all other FT lines                                                                                                                                                               

                next LINE;

            }

        }

        elsif ($line =~ m/^SQ\s+(.*)/) {

            # harvest the sequence bp info                                                                                                                                                              

            my $seq_info = $1;

 

            # there are other ways of doing the following, this is just one way                                                                                                                         

            my @parts = split /\;\s+/, $seq_info; # note the split pattern takes care of whitespace between records                                                                                     

 

            $curr_record->{total_bp} = (split /\s+/, $parts[0])[2];

            $curr_record->{num_a} = (split /\s+/, $parts[1])[1];

            $curr_record->{num_c} = (split /\s+/, $parts[2])[1];

            $curr_record->{num_g} = (split /\s+/, $parts[3])[1];

            $curr_record->{num_t} = (split /\s+/, $parts[4])[1];

 

            # reset the internal state of the program to know that it is                                                                                                                                

            # harvesting sequence bps for future lines until the state gets reset                                                                                                                       

            $inseq = 1;

        }

        elsif ($inseq) {

            if ($line eq '//') {

                # note that because of the self-contained nature of the subroutine, each time it is called                                                                                              

                # the $inseq state variable is set to undefined, so we dont need to worry about resetting it                                                                                            

                # manually here anymore.                                                                                                                                                                

                last LINE;

            }

            else {

                $line =~ s/^\s+//; # get rid of leading whitespace                                                                                                                                      

 

                # there are other ways of doing the following, this is just one way                                                                                                                     

                my @parts = split /\s+/, $line;

                pop @parts; # throw away the bp count                                                                                                                                                   

                $curr_record->{sequence} .= join('', @parts); # concatenation into one line                                                                                                             

            }

        }

        else {

            # skip all other lines                                                                                                                                                                      

            next LINE;

        }

    }

 

    return $curr_record;

}

 

sub write_seq {

    my $record = shift;

 

    # this is called a heredoc                                                                                                                                                                          

    # it prints everything from the next line to the string (I use EOR, but this is arbitrary) occurring                                                                                                

    # by itself, on a line of the code.  If the String is double quoted, variables are interpolated.                                                                                                    

    # If it is single quoted, variables are not interpolated.                                                                                                                                           

    # It is a redimentary form of templating.  For more powerful templating, read up on the Template Toolkit,                                                                                           

    # which is contained in the CPAN 'Template' Package.                                                                                                                                                

 

    print <<"EOR";

record_start

  ID $record->{id}

  Accession $record->{accession}

  RL $record->{rl}

  Molecular Type: $record->{mol_type}

  Organism: $record->{organism}

  Chromosome: $record->{chromosome}

 

  Number of Base Pairs: $record->{total_bp}

  Number of As: $record->{num_a}

  Number of Cs: $record->{num_c}

  Number of Gs: $record->{num_g}

  Number of Ts: $record->{num_t}

 

  sequence_start

  $record->{sequence}

  sequence_end

record_end

EOR

}

 

Here is some psuedocode to get you started:

 

  $handle = open file EMBL_example.txt

  while (my $seq  = &next_seq($handle)) {

    &write_seq($seq);

  }

 

sub next_seq {

  my $handle = shift;

   my $seq;

   while (my $line = <$handle>) {

     # your parsing algorithm goes here

     # be sure it jumps out of the while loop each time it gets to the '//' line

   }

   return $seq;  # note, if there were no more lines in $handle, this will be undefined

}

 

sub write_seq {

  my $seq = shift;

  place your writer code here

}  

 

2.  Write another script which uses this subroutine (yes, you will have to cut and paste the subroutine) to write sequence records out to different files based on Molecular Type.

 

3.  Copy your script in problem 1, and modify the next_seq subroutine method to add the sequence features to your $seq datastructure.  Each sequence feature will, itself, be a datastructure which has:

 

  name:  The 'Key' of the feature (e.g. CDS, mRNA, source,...)

  build_description: The string which describes how to build the feature from the underlying sequence (e.g. join(1..1234,4433..4566)). 

 

Dont worry about parsing the build_description strings, just remove the newline from them and store them as-is.  In addition to these attributes, each sequence feature will store a series of 'Attribute_key=Attribute_value' definitions (e.g. it might have a 'clone' attribute with value 'XX-15E1').

 

You should then store each feature datastructure into an arrayref stored as the value on the original $seq datastructure with the key being the feature name (e.g. you will have one or more 'CDNA' features, one or more mRNA features, etc. on each $seq).

 

Your script should modify the write_seq subroutine to produce the following information:

 

record_start

  ID $id

  Accession $acc

  RL $rl_line

  Molecular Type: $mol_type

  Organism: $organism

  Chromosome: $chrom

 

  Number of Base Pairs:  $num_bp

  Number of As: $num_a

  Number of Cs: $num_c

  Number of Gs: $num_g

  Number of Ts: $num_t

 

 

  features_start

    feature_type: $name

      feature_start

        build_description: $build_description

        att: $att

        .

        .

      feature_end

      .

      .

   feature_type: $name2

     .

     .

     .

  features_end

 

  sequence_start

  $seq

  sequence_end

record_end

 

4.  Do the same with the script that you wrote for problem 2.  Wouldnt this be easier if you could just store the next_seq and write_seq subroutines into a central place that can be shared by multiple scripts? :)

 

Comments (0)

You don't have permission to comment on this page.