Subroutines and Code Reuse
Everyone who has written more than 5 programs, regardless of language, has probably needed to copy code which performs a critical function from one part of a program to another, or from one program to another. This is perfectly fine until you need to change that process, such as when the data it is designed to work with changes, or you want new functionality from it. At that point, you need to change the same code multiple times, potentially in multiple files. The way to prevent this is to code these tasks into subroutines. Once they are written as subroutines, there are ways to share the same subroutine between multiple programs (day 3), but today we will focus on the process of coercing code into a subroutine.
Subroutine Structure
All subroutines have the same basic structure.
* They either have a defined name or are stored in a variable, which you use in place of the one or more lines of code that the subroutine performs.
* They can take one or more arguments as inputs, in the form of a list.
* They process one or more statements, subject to the same syntax rules as the rest of the language.
* They can return one or more values as output, in the form of a list.
This subroutine definition is much the same in most programming languages. Here we will delve into each of these aspects.
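As a minimal sketch of the four parts listed above, here is a small named subroutine. The name min_max and the sample numbers are made up for illustration and are not part of the fasta parser.

```perl
use strict;
use warnings;

# hypothetical example: a name, arguments in, statements, a list out
sub min_max {
    my @numbers = @_;                # arguments arrive as a list in @_
    my ($min, $max) = ($numbers[0], $numbers[0]);
    foreach my $n (@numbers) {       # ordinary Perl statements
        $min = $n if $n < $min;
        $max = $n if $n > $max;
    }
    return ($min, $max);             # output is returned as a list
}

my ($low, $high) = min_max(4, 8, 15, 16, 23, 42);
print "low: $low, high: $high\n";    # prints "low: 4, high: 42"
```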
Definition
All subroutines must either be given a name, or must be stored into a variable. This name or variable is used in place of the code that it is performing throughout your code.
# name definition
sub print_greeting {
print "HELLO WORLD\n";
}
# subroutine referenced to scalar variable
my $print_greeting = sub {
print "Hello World\n";
}; # note the semicolon here. It is absolutely necessary
#here is how to use the above
&print_greeting();
$print_greeting->();
Subroutines have package scope, meaning that they are available for use anywhere in the same package (including the main package). We will get a better understanding of what this means when we talk about packages, but for now, assume it means that you can use a named subroutine anywhere in the same script that the subroutine is defined in.
Subrefs are variables, and so have the same scope issues as normal variables, but these can be passed as arguments to other subroutines, which makes them a little more useful in some contexts.
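To illustrate passing a subref as an argument, here is a short sketch. The subroutine greet_all and the two greeter subrefs are hypothetical names chosen for this example.

```perl
use strict;
use warnings;

# hypothetical helper: applies whatever subref it is given to each name
sub greet_all {
    my ($greeter, @names) = @_;
    my @greetings;
    foreach my $name (@names) {
        push @greetings, $greeter->($name);
    }
    return @greetings;
}

my $polite = sub { my $name = shift; return "Good day, $name"; };
my $casual = sub { my $name = shift; return "Hi $name"; };

# the same named subroutine behaves differently depending on the subref
print "$_\n" foreach greet_all($polite, 'Dave', 'Ann');
print "$_\n" foreach greet_all($casual, 'Dave', 'Ann');
```

The named subroutine stays the same while the caller decides which behavior to plug in, which is the main reason to reach for a subref.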
Inputs and Outputs
In order to take code that you have written to perform a task and make it into a subroutine, you have to have an understanding of its inputs and outputs. Perl provides the @_ array to your subroutine. When you call a subroutine, or subref, this is automatically populated with all of the arguments that you pass into the () after the subname or reference.
sub print_greeting_to {
my $name = shift; # shift by default works on @_ in a subroutine or subref
print "HELLO $name\n";
}
my $print_greeting_to = sub { my $name = shift; print "HELLO $name\n"; };
&print_greeting_to("Dave");
$print_greeting_to->("Dave");
Note that @_ is a single array. If you pass in two or more arrays or hashes without first taking references to them, they are flattened into one list, and the subroutine can no longer tell where one ends and the next begins.
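A small sketch of the difference, using made-up data and the hypothetical subroutine names count_flat and count_each:

```perl
use strict;
use warnings;

# flattened: both arrays arrive as one list, so the boundary is lost
sub count_flat {
    my @everything = @_;
    return scalar @everything;
}

# by reference: each array arrives as one scalar and stays separate
sub count_each {
    my ($first, $second) = @_;
    return (scalar @$first, scalar @$second);
}

my @ids  = ('seq1', 'seq2');
my @lens = (120, 98, 77);

my $flat = count_flat(@ids, @lens);
my ($n_ids, $n_lens) = count_each(\@ids, \@lens);

print "flat: $flat\n";                    # prints "flat: 5"
print "separate: $n_ids and $n_lens\n";   # prints "separate: 2 and 3"
```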
Access To Variables
Ideally, subroutines should be written as if they are a complete black box, without any access to data that is not explicitly passed into it as arguments, or that it creates itself within its block. You will experience great joy the first time you go to copy a subroutine from one script to another, or move it into its own Package, if you follow this advice. Otherwise, you will probably be frustrated. In fact, in some cases, it is possible to break a script just by moving the subroutine around within it, if the subroutine attempts to access variables defined outside of its block. Here are a few examples, with notes on whether they work or not:
# works, but is fragile
my $value = 'Dave';
my $printer = sub { print "Hello $value\n"; };
&print_greeting();
$printer->();
exit;
sub print_greeting {
print "Hello $value\n";
}
# does not work at all
sub print_greeting {
print "Hello $value\n";
}
my $printer = sub { print "Hello $value\n"; };
my $value = "Dave";
$printer->();
&print_greeting();
exit;
# much more robust
my $value = "Dave";
my $printer = sub { my $name = shift; print "Hello $name\n"; };
&print_greeting($value);
$printer->($value);
exit;
sub print_greeting {
my $name = shift;
print "Hello $name\n";
}
# works just as well as the previous example
sub print_greeting {
my $name = shift;
print "Hello $name\n";
}
my $printer = sub { my $name = shift; print "Hello $name\n"; };
my $value = "Dave";
&print_greeting($value);
$printer->($value);
exit;
There is one exception to this advice which you may, or may not, wish to learn about. It is sometimes very useful to create something called a closure. A closure is a subroutine reference that makes use of one or more my variables declared outside of its own block, typically inside a subroutine which creates those variables and then returns the subref. Here is a timer that you can use in code to time how long it takes to perform one or more statements.
sub timer {
my $time = time; # this is the perl time function
return sub {
return time - $time;
}
}
Note the way that the subref returned by the timer subroutine makes use of the $time variable defined within the timer subroutine block with my. As long as you have a reference to the subref returned by this subroutine in scope somewhere in your code, the original value of time at the moment timer was called and its reference was assigned to that scalar is available for use by the subref.
my $timer1 = timer();
# ... do some stuff ...
print $timer1->(), "\n"; # seconds elapsed since $timer1 was created
my $timer2 = timer(); # contains a different $time than timer1
# ... do some more stuff ...
print $timer1->(), "\n"; # still measured from the first call to timer()
print $timer2->(), "\n"; # measured from the second call to timer()
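Because the timer's output depends on the clock, the same idea may be easier to see with a counter. This is only a sketch. The name make_counter is hypothetical, but it is built the same way as timer: each call creates a fresh my variable that only the returned subref can reach.

```perl
use strict;
use warnings;

# hypothetical counter factory, structured like timer() above
sub make_counter {
    my $count = 0;              # private to each returned subref
    return sub {
        $count++;
        return $count;
    };
}

my $counter1 = make_counter();
my $counter2 = make_counter();  # gets its own, separate $count

$counter1->();
$counter1->();
my $first  = $counter1->();     # third call on counter1
my $second = $counter2->();     # first call on counter2

print "counter1: $first, counter2: $second\n";  # prints "counter1: 3, counter2: 1"
```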
Example Fasta Parser Subroutine
Here I will outline the above analysis applied to our fasta parser.
1. Definition
The first thing to do is copy the entire block which reads from the handle into a subroutine definition block named 'next_seq', and replace it with the call that we want in its place.
...
my @seqs;
while (my $current_seq = &next_seq()) {
push @seqs, $current_seq;
}
...
sub next_seq {
while (my $line = <>) {
chomp $line;
if ($line =~ m/^>/) {
if ($current_seq->{'id'} && !$current_seq->{'sequence'}) {
die "Missing Sequence Error\n";
}
if ($current_seq->{'sequence'}) {
push @seqs, $current_seq;
}
my ($id, $acc, $description) = split /\|/, $line;
$current_seq = {
'id' => $id,
'accession' => $acc,
'description' => $description,
};
}
else {
unless ($current_seq->{'id'}) {
die "Non Fasta File Error\n";
}
else {
$current_seq->{'sequence'} .= $line;
}
}
}
}
2. Input and Output Analysis
Now we can identify the inputs and outputs that the subroutine needs in order to function. The caller expects a hashref back, based on what it does with each entry stored in the @seqs array later on. We can also see that the subroutine reads from <>. Here we want to make a slight change to how we worked with the filehandle in the non-subroutine version. Instead of relying on the magic of Perl's <> operator to read the file named on the command line and provide its lines to our code, we want to process this argument explicitly: die with an error if it isn't present, otherwise open the path into a filehandle stored in a scalar variable, which we can pass to the subroutine as an argument. The advantages of this will become readily apparent when we move the subroutine into a package and later an object.
my $file = shift or die "usage: fasta_processor.pl path_to_fasta_file\n";
# note, open stores the filehandle in a scalar variable ($fh) when you give it one in place of an all-caps bareword
open (my $fh, '<', $file) or die "Could not open $file $!\n";
my @seqs;
while (my $current_seq = &next_seq($fh)) {
push @seqs, $current_seq;
}
...
sub next_seq {
my $fh = shift;
my $current_seq = {};
while (my $line = <$fh>) {
chomp $line;
if ($line =~ m/^>/) {
if ($current_seq->{'id'} && !$current_seq->{'sequence'}) {
die "Missing Sequence Error\n";
}
if ($current_seq->{'sequence'}) {
push @seqs, $current_seq;
}
my ($id, $acc, $description) = split /\|/, $line;
$current_seq = {
'id' => $id,
'accession' => $acc,
'description' => $description,
};
}
else {
unless ($current_seq->{'id'}) {
die "Non Fasta File Error\n";
}
else {
$current_seq->{'sequence'} .= $line;
}
}
}
return $current_seq;
}
Now the basic outline of our subroutine is in place, but we don't want the line processor to work through the entire file. We want it to stop as soon as it is finished with its current record. Let's give the while loop a name, and use the last operator on that name when we are ready to jump out of the loop.
my $file = shift or die "usage: fasta_processor.pl path_to_fasta_file\n";
# note, open stores the filehandle in a scalar variable ($fh) when you give it one in place of an all-caps bareword
open (my $fh, '<', $file) or die "Could not open $file $!\n";
my @seqs;
while (my $current_seq = &next_seq($fh)) {
push @seqs, $current_seq;
}
...
sub next_seq {
my $fh = shift;
my $current_seq = {};
SEQLINE: while (my $line = <$fh>) {
chomp $line;
if ($line =~ m/^>/) {
if ($current_seq->{'id'} && !$current_seq->{'sequence'}) {
die "Missing Sequence Error\n";
}
if ($current_seq->{'sequence'}) {
push @seqs, $current_seq;
last SEQLINE;
}
my ($id, $acc, $description) = split /\|/, $line;
$current_seq = {
'id' => $id,
'accession' => $acc,
'description' => $description,
};
}
else {
unless ($current_seq->{'id'}) {
die "Non Fasta File Error\n";
}
else {
$current_seq->{'sequence'} .= $line;
}
}
}
return $current_seq;
}
Now we have a barebones system, ready for the next analysis.
3. Access to Variables
Here we see two problems. The original design wants to be able to access our @seqs array, and push the sequence onto it when it hits the next id line. We could code the subroutine to do this, but it would prevent us from moving that subroutine into a package or object method later on, and it is usually not good practice to do this. Related to this is the fact that we want to finish processing the current record the moment we hit a new idline, but we need to save this idline for later use, as it is not possible to reverse the file reader to get it again the next time the subroutine is called. For now, we will break our rule of never allowing the subroutine to have access to variables declared outside of its scope by creating a scalar variable outside of the subroutine for it to use. This is a good example of getting a working solution in place which we can optimize later.
my $file = shift or die "usage: fasta_processor.pl path_to_fasta_file\n";
# note, open stores the filehandle in a scalar variable ($fh) when you give it one in place of an all-caps bareword
open (my $fh, '<', $file) or die "Could not open $file $!\n";
my @seqs;
while (my $current_seq = &next_seq($fh)) {
push @seqs, $current_seq;
}
...
my $last_id_line; # make it undef to start
sub next_seq {
my $fh = shift;
my $current_seq = {};
SEQLINE: while (my $line = <$fh>) {
chomp $line;
if ($line =~ m/^>/) {
if ($current_seq->{'id'} && !$current_seq->{'sequence'}) {
die "Missing Sequence Error\n";
}
if ($current_seq->{'sequence'}) {
$last_id_line = $line; # here is where we set the last_id_line for use the next time the subroutine is called
last SEQLINE;
}
my ($id, $acc, $description) = split /\|/, $line;
$current_seq = {
'id' => $id,
'accession' => $acc,
'description' => $description,
};
}
else {
unless ($current_seq->{'id'}) {
die "Non Fasta File Error\n";
}
else {
$current_seq->{'sequence'} .= $line;
}
}
}
return $current_seq;
}
Now we need to process that last_id_line if it is defined when the subroutine is called. It is actually easier if we slightly adjust the subroutine code to always read one line of input from the filehandle into that variable if it is not already defined every time the subroutine is called. Then we can process that line, including error checking, and returning undefined if it is still undefined, meaning we have reached the EOF.
my $file = shift or die "usage: fasta_processor.pl path_to_fasta_file\n";
# note, open stores the filehandle in a scalar variable ($fh) when you give it one in place of an all-caps bareword
open (my $fh, '<', $file) or die "Could not open $file $!\n";
my @seqs;
while (my $current_seq = &next_seq($fh)) {
push @seqs, $current_seq;
}
...
my $last_id_line; # make it undef to start
sub next_seq {
my $fh = shift;
my $current_seq = {};
unless ($last_id_line) {
$last_id_line = <$fh>;
chomp $last_id_line if defined $last_id_line; # the read returns undef at EOF
}
return unless $last_id_line; # this happens at EOF
my ($id, $acc, $description) = split /\|/, $last_id_line;
$current_seq = {
'id' => $id,
'accession' => $acc,
'description' => $description,
};
undef $last_id_line; # this is critical in order to return undef on EOF
SEQLINE: while (my $line = <$fh>) {
chomp $line;
if ($line =~ m/^>/) {
if ($current_seq->{'id'} && !$current_seq->{'sequence'}) {
die "Missing Sequence Error\n";
}
if ($current_seq->{'sequence'}) {
$last_id_line = $line; # here is where we set the last_id_line for use the next time the subroutine is called
last SEQLINE;
}
}
else {
unless ($current_seq->{'id'}) {
die "Non Fasta File Error\n";
}
else {
$current_seq->{'sequence'} .= $line;
}
}
}
return $current_seq;
}
Notice how we moved the code which processes the idline outside of the while loop. There are other things we can do to clean that loop up, as well, given the way the subroutine is being used.
a. We can move the check for missing sequence out to just before the record is returned, which is a logical place to check to make sure that we have a sequence.
b. We can move the check for non-fasta compliance into the block that processes the idline, since that is the best place for it anyway.
c. Since the subroutine always processes the idline itself, the loop that reads from the handle only has to store sequence or set last_id_line. This makes the loop much simpler to code, as it doesn't need to check for sequence when the next idline is hit, and it doesn't need to check for the non-fasta compliant error.
my $last_id_line; # make it undef to start
sub next_seq {
my $fh = shift;
my $current_seq = {};
unless ($last_id_line) {
$last_id_line = <$fh>;
chomp $last_id_line if defined $last_id_line; # the read returns undef at EOF
# if this isn't EOF, and we get a line, test for Fasta Compliance
if ($last_id_line && !($last_id_line =~ m/^>/)) {
die "Non Fasta File Error\n";
}
}
return unless $last_id_line; # this happens at EOF
my ($id, $acc, $description) = split /\|/, $last_id_line;
$current_seq = {
'id' => $id,
'accession' => $acc,
'description' => $description,
};
undef $last_id_line; # this is critical in order to return undef on EOF
SEQLINE: while (my $line = <$fh>) {
chomp $line;
if ($line =~ m/^>/) {
$last_id_line = $line; # here is where we set the last_id_line for use the next time the subroutine is called
last SEQLINE;
}
else {
$current_seq->{'sequence'} .= $line;
}
}
if ($current_seq->{'id'} && !$current_seq->{'sequence'}) {
die "Missing Sequence Error\n";
}
return $current_seq;
}
And we are essentially done. This will work in place of the original looping code. If we want to move the subroutine into its own package, we simply need to make sure that we move the my $last_id_line into the package as well. Or, we could use a closure to create a subref which the code can call to get each sequence, which stores the last_id_line within the closure in a manner which more purely adheres to our rules for subroutines not accessing variables outside of their scope.
use strict;
my $file = shift or die "fasta_processor_sub.pl <path_to_fasta_file>\n";
open (my $fh, '<', $file) or die "Could not open $file $!\n";
my @seqs;
my $next_seq = &next_seq_closure();
while (my $current_seq = $next_seq->($fh)) {
push @seqs, $current_seq;
}
foreach my $seq (@seqs) {
printf "ID: %s\nACC: %s\nDescription: %s\nSeq:\n%s\n",
$seq->{'id'},
$seq->{'accession'},
$seq->{'description'},
$seq->{'sequence'};
}
exit;
sub next_seq_closure {
my $last_id_line;
return sub {
my $fh = shift;
my $current_seq; # we want this to be undef, so it can be passed back undef at EOF
# this will only pull in the next line on the first call to the subroutine
# and the call after the last sequence in the file
unless ($last_id_line) {
$last_id_line = <$fh>;
if ($last_id_line && !($last_id_line =~ m/^>/)) {
die "Non Fasta File Error\n";
}
chomp $last_id_line if ($last_id_line);
}
return unless $last_id_line; # this returns at EOF
my ($id, $acc, $description) = split /\|/, $last_id_line;
$current_seq = {
'id' => $id,
'accession' => $acc,
'description' => $description,
};
undef $last_id_line; # this sets up the ability to return undef on EOF
SEQLINE: while (my $line = <$fh>) {
chomp $line;
if ($line =~ m/^>/) {
$last_id_line = $line;
last SEQLINE;
}
else {
$current_seq->{'sequence'} .= $line;
}
}
if ($current_seq->{'id'} && !$current_seq->{'sequence'}) {
die "Missing Sequence Error\n";
}
return $current_seq;
}
}
This really is a matter of taste. Either way, the subroutine works, and can be pulled into a package or object later on without much trouble.
Now, it is possible to change how and when we access sequence records.
my $current_seq = &next_seq($fh);
if ($current_seq->{'description'} eq 'Something We Dont Care About') {
exit;
}
else {
&process_seq($current_seq);
while (my $current_seq = &next_seq($fh)) {
&process_seq($current_seq);
}
}
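One way to try the parser without a file on disk is to open an in-memory filehandle on a string, which works because the subroutine only ever sees the handle it is passed. The sketch below uses a compact copy of the closure version, and the two records in $fasta are made up for illustration.

```perl
use strict;
use warnings;

# compact copy of the closure version of the parser
sub next_seq_closure {
    my $last_id_line;
    return sub {
        my $fh = shift;
        unless ($last_id_line) {
            $last_id_line = <$fh>;
            return unless defined $last_id_line;    # EOF
            die "Non Fasta File Error\n" unless $last_id_line =~ m/^>/;
            chomp $last_id_line;
        }
        my ($id, $acc, $description) = split /\|/, $last_id_line;
        my $current_seq = {
            'id'          => $id,
            'accession'   => $acc,
            'description' => $description,
        };
        undef $last_id_line;
        while (my $line = <$fh>) {
            chomp $line;
            if ($line =~ m/^>/) {
                $last_id_line = $line;              # save for the next call
                last;
            }
            $current_seq->{'sequence'} .= $line;
        }
        die "Missing Sequence Error\n" unless $current_seq->{'sequence'};
        return $current_seq;
    };
}

# made-up records, for illustration only
my $fasta = ">id1|ACC1|first test record\nACGT\nTTGA\n"
          . ">id2|ACC2|second test record\nGGCC\n";

# opening a reference to a scalar gives an in-memory filehandle
open(my $fh, '<', \$fasta) or die "Could not open in-memory handle $!\n";

my $next_seq = next_seq_closure();
my @seqs;
while (my $current_seq = $next_seq->($fh)) {
    push @seqs, $current_seq;
}
close $fh;

# prints "2 records, first sequence: ACGTTTGA"
printf "%d records, first sequence: %s\n", scalar @seqs, $seqs[0]->{'sequence'};
```

Note that the id still carries the leading > character, because the parser splits the idline as it is.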