1. Take the algorithm that you wrote to parse the EMBL_example.txt file and place it into a subroutine. Then take the algorithm that you wrote to write out each sequence record in the specified format and put it into another subroutine. The script should function exactly like it did before, except that it should use a subroutine to get the next sequence record from the input handle, instead of processing each line within the main section itself. You should probably make a copy of the original working script, and modify the copy to contain the subroutine. You can then compare the output of the new script to that of the 'reference' script using diff to make sure it still works.
ANSWER
#!/usr/bin/perl
use strict; # this will always help you
# we could just work on <>, but this is more helpful to the user
my $embl_file = shift or die "embl_processor.pl path_to_embl_file\n";
# open the filehandle into a scalar variable
# this will become more useful later on in the class
open (my $embl, "<".$embl_file) or die "Couldnt open $embl_file $!\n";
while (my $seq = &next_seq($embl)) {
&write_seq($seq);
}
close $embl;
exit;
sub next_seq {
my $handle = shift;
# this will hold our current record
my $curr_record; # this must start undef, so that it returns undef on EOF
# this is a state variable that is set to true after the SQ line is encountered
# which tells the line processecor to process sequence entry lines, since there
# is no identifiable pattern that can be matched for them
my $inseq;
LINE: while (my $line = <$embl>) {
chomp $line;
next if ($line eq 'XX'); # these are spacer lines
if ($line =~ m/^ID/) {
# harvest the id
my $id = (split /\s+/, $line)[1];
$id =~ s/\;$//;
$curr_record->{id} = $id;
}
elsif ($line =~ m/^AC/) {
# harvest the accession
my $acc = (split /\s+/, $line)[1];
$acc =~ s/\;$//;
$curr_record->{accession} = $acc;
}
elsif ($line =~ m/^RL\s+(.*)/) {
# harvest the RL line
my $rl = $1;
# note, we could push this
# onto an arrayref keyed to rl, as well
$curr_record->{rl} .= ' '.$rl; # prepend it with a space
}
elsif ($line =~ m/^FT/) {
# there are several of these that we are interested in
if ($line =~ m/\/mol_type\=(.*)/) {
# harvest the mol_type
my $mol_type = $1;
$mol_type =~ s/\"//;
$curr_record->{mol_type} = $mol_type;
}
elsif ($line =~ m/\/chromosome\=(.*)/) {
# harvest the chromosome
my $chromosome = $1;
$chromosome =~ s/\"//;
$curr_record->{chromosome} = $chromosome;
}
elsif ($line =~ m/\/organism\=(.*)/) {
# harvest the organism
my $organism = $1;
$organism =~ s/\"//;
$curr_record->{organism} = $organism;
}
else {
# skip all other FT lines
next LINE;
}
}
elsif ($line =~ m/^SQ\s+(.*)/) {
# harvest the sequence bp info
my $seq_info = $1;
# there are other ways of doing the following, this is just one way
my @parts = split /\;\s+/, $seq_info; # note the split pattern takes care of whitespace between records
$curr_record->{total_bp} = (split /\s+/, $parts[0])[2];
$curr_record->{num_a} = (split /\s+/, $parts[1])[1];
$curr_record->{num_c} = (split /\s+/, $parts[2])[1];
$curr_record->{num_g} = (split /\s+/, $parts[3])[1];
$curr_record->{num_t} = (split /\s+/, $parts[4])[1];
# reset the internal state of the program to know that it is
# harvesting sequence bps for future lines until the state gets reset
$inseq = 1;
}
elsif ($inseq) {
if ($line eq '//') {
# note that because of the self-contained nature of the subroutine, each time it is called
# the $inseq state variable is set to undefined, so we dont need to worry about resetting it
# manually here anymore.
last LINE;
}
else {
$line =~ s/^\s+//; # get rid of leading whitespace
# there are other ways of doing the following, this is just one way
my @parts = split /\s+/, $line;
pop @parts; # throw away the bp count
$curr_record->{sequence} .= join('', @parts); # concatenation into one line
}
}
else {
# skip all other lines
next LINE;
}
}
return $curr_record;
}
sub write_seq {
my $record = shift;
# this is called a heredoc
# it prints everything from the next line to the string (I use EOR, but this is arbitrary) occurring
# by itself, on a line of the code. If the String is double quoted, variables are interpolated.
# If it is single quoted, variables are not interpolated.
# It is a redimentary form of templating. For more powerful templating, read up on the Template Toolkit,
# which is contained in the CPAN 'Template' Package.
print <<"EOR";
record_start
ID $record->{id}
Accession $record->{accession}
RL $record->{rl}
Molecular Type: $record->{mol_type}
Organism: $record->{organism}
Chromosome: $record->{chromosome}
Number of Base Pairs: $record->{total_bp}
Number of As: $record->{num_a}
Number of Cs: $record->{num_c}
Number of Gs: $record->{num_g}
Number of Ts: $record->{num_t}
sequence_start
$record->{sequence}
sequence_end
record_end
EOR
}
Here is some psuedocode to get you started:
$handle = open file EMBL_example.txt
while (my $seq = &next_seq($handle)) {
&write_seq($seq);
}
sub next_seq {
my $handle = shift;
my $seq;
while (my $line = <$handle>) {
# your parsing algorithm goes here
# be sure it jumps out of the while loop each time it gets to the '//' line
}
return $seq; # note, if there were no more lines in $handle, this will be undefined
}
sub write_seq {
my $seq = shift;
place your writer code here
}
2. Write another script which uses this subroutine (yes, you will have to cut and paste the subroutine) to write sequence records out to different files based on Molecular Type.
3. Copy your script in problem 1, and modify the next_seq subroutine method to add the sequence features to your $seq datastructure. Each sequence feature will, itself, be a datastructure which has:
name: The 'Key' of the feature (e.g. CDS, mRNA, source,...)
build_description: The string which describes how to build the feature from the underlying sequence (e.g. join(1..1234,4433..4566)).
Dont worry about parsing the build_description strings, just remove the newline from them and store them as-is. In addition to these attributes, each sequence feature will store a series of 'Attribute_key=Attribute_value' definitions (e.g. it might have a 'clone' attribute with value 'XX-15E1').
You should then store each feature datastructure into an arrayref stored as the value on the original $seq datastructure with the key being the feature name (e.g. you will have one or more 'CDNA' features, one or more mRNA features, etc. on each $seq).
Your script should modify the write_seq subroutine to produce the following information:
record_start
ID $id
Accession $acc
RL $rl_line
Molecular Type: $mol_type
Organism: $organism
Chromosome: $chrom
Number of Base Pairs: $num_bp
Number of As: $num_a
Number of Cs: $num_c
Number of Gs: $num_g
Number of Ts: $num_t
features_start
feature_type: $name
feature_start
build_description: $build_description
att: $att
.
.
feature_end
.
.
feature_type: $name2
.
.
.
features_end
sequence_start
$seq
sequence_end
record_end
4. Do the same with the script that you wrote for problem 2. Wouldnt this be easier if you could just store the next_seq and write_seq subroutines into a central place that can be shared by multiple scripts? :)
Comments (0)
You don't have permission to comment on this page.