perlsOfLondon

 


Page history last edited by Darin London 1 yr ago

Transitioning Packages to Objects

Transitioning a package to an object is very straightforward.  Object-oriented code in Perl takes advantage of the way perl treats the '->' operator when it is applied to a package name, or to a reference that has been 'blessed' using the perl bless function.

 

From the package lecture, you will recall that any subroutine in a package can be accessed using the fully-qualified package syntax.  For instance, if you have a package MyPackage which has a do_stuff subroutine, you can access it in your code as such:

 

use MyPackage;

MyPackage::do_stuff();

 

The '->' operator takes things one step further.  You can replace the final '::' between the package name (including its namespace, with all earlier '::' left intact) and its subroutine name with '->', and perl will do something special.  It will transparently make the traditional call for you, but it will automatically pass the package name string (including namespace) as the first argument to the subroutine.  Thus, for the same package living in a MyDomain namespace:

 

use MyDomain::MyPackage;

MyDomain::MyPackage->do_stuff();

 

Is equivalent to:

 

MyDomain::MyPackage::do_stuff('MyDomain::MyPackage');

 

Passing arguments to a subroutine called in this manner is unaffected.  You use the same syntax as expected, e.g.

 

MyDomain::MyPackage->do_stuff("A",1,$ref);

 

is magically transformed in perl to:

MyDomain::MyPackage::do_stuff('MyDomain::MyPackage',"A",1,$ref);
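
 

The equivalence can be checked with a small script.  This is only a sketch: the do_stuff body here is made up for the demonstration and simply returns whatever arguments it received, so the two calling styles can be compared side by side.

```perl
use strict;
use warnings;

package MyDomain::MyPackage;

# Hypothetical body: hand the argument list straight back to the caller.
sub do_stuff {
    return @_;
}

package main;

# Arrow syntax: perl prepends the package name string for us.
my @via_arrow = MyDomain::MyPackage->do_stuff('A', 1);

# Traditional syntax: we have to pass the package name ourselves.
my @via_colon = MyDomain::MyPackage::do_stuff('MyDomain::MyPackage', 'A', 1);

print "arrow call received: @via_arrow\n";
print "plain call received: @via_colon\n";
```

Both lines should list the same three values, with the package name first.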

 

Perl does some further magic when the '->' operator is used in the context of blessed references instead of package names.  Blessed references are normal references that have been passed to the 'bless' function, along with a string package name to bless the reference to, e.g.

 

bless $ref, "MyDomain::MyPackage";

 

results in a $ref which is blessed into the MyDomain::MyPackage package namespace.  With this blessed reference, you can then call a subroutine on it using the '->' operator in the following manner:

 

$ref->do_stuff(@args);

 

And perl will magically turn it into the following:

MyDomain::MyPackage::do_stuff($ref, @args);

 

Notice the subtle but profound difference here.  Instead of passing the string package name as the first argument to the subroutine, perl passes the blessed reference.  Thus, it is possible for the subroutine to use the blessed reference in the same way that the calling code uses it.
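
 

A short sketch makes the difference visible.  The do_stuff body is again hypothetical: it just returns its first argument, so we can look at what perl passed in for a call on the package name and for a call on a blessed reference.

```perl
use strict;
use warnings;

package MyDomain::MyPackage;

# Hypothetical body: return whatever arrived as the first argument.
sub do_stuff {
    my ($invocant, @args) = @_;
    return $invocant;
}

package main;

my $ref = bless { name => 'example' }, 'MyDomain::MyPackage';

my $from_class  = MyDomain::MyPackage->do_stuff(1, 2);
my $from_object = $ref->do_stuff(1, 2);

print "class call passed:  $from_class\n";                 # the package name string
print "object call passed: a ", ref($from_object), " reference\n";
print "it is the very same reference\n" if $from_object == $ref;   # compares addresses
```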

 

Notice that, because object subroutines are ultimately called with their fully qualified names, objects no longer need to use and extend Exporter.  Instead, the person who codes an object has a slight burden placed on them to handle these extra arguments properly, in exchange for preserving the expected argument passing syntax in the calling code.  This is how it should be.  You only have to handle the magic code in one place, instead of in the tens, hundreds, or even millions (consider how many people use DBI) of scripts which use the package.

 

Notice also the logical link between the package name version and the blessed reference version.  Calling a subroutine with the package name results in the string package name being passed as the first argument.  This is extremely handy in so-called constructor subroutines (typically called 'new', but sometimes called other things), which pass it, along with a reference, to the bless function to create a reference blessed into their own package name.  The other subroutines, which are now called accessors or methods (see below), simply need to account for the blessed reference being passed in as the first argument, ahead of any other arguments, and make use of that blessed reference to store or access any data associated with that particular instance of the object.

 

Writing and Using Objects

Here is an example of a very simple object, called a stack.  Stacks have a push method, which pushes things onto the end of the stack, and a pop method, which removes the last thing on the stack and returns it.  A stack can also have a count method, which returns the current number of things on the stack.  The underlying functionality is easily accomplished using an array reference.  We will use this as an example of the things that module writers must do, as well as how module users should use the module.

 

=head1 Stack

 

=head2 Synopsis

 

use Stack;

my $stack = Stack->new;

$stack->push("A");

print $stack->count;

my $a = $stack->pop;

exit;

 

=head2 Description

 

This object implements a traditional stack data structure.  It allows scalars (or references) to be pushed onto the stack in order, and then retrieved in last-in-first-out order using the pop method.  It has one attribute, the current count of items in a particular instance of the stack.

 

=cut

 

package Stack;

 

=head2 new

 

This is the constructor for a stack.

 

=cut

 

sub new {

  my $class = shift;

  return bless [] , $class;

}

 

=head2 push

 

This method takes a scalar argument, and adds it to the end of the stack, pushing anything already on inward.

 

arguments: scalar

returns undef

 

=cut

 

sub push {

  my ($self, $thing) = @_;

  return unless defined $thing;  # 0 and '' are legitimate values to push

  # CORE:: makes it explicit that the builtin is meant, not this package's push
  CORE::push @{$self}, $thing;

  return;

}

 

=head2 pop

 

This method removes the last item from the stack and returns it, permanently changing the size of the stack.

 

=cut

 

 

sub pop {

  my $self = shift;

  return CORE::pop @{$self};  # CORE:: names the builtin, not this package's pop

}

 

=head2 count

 

This attribute returns the current number of items in the stack.

 

=cut

 

sub count {

  my $self = shift;

  return scalar @{$self};

}

1;

 

Calling code would use this in the following manner:

 

use Stack;

my $stack1 = Stack->new;

my $stack2 = Stack->new;

 

$stack1->push('A');

$stack2->push('B');

$stack2->push("C");

print "There are currently ".$stack1->count." things on stack 1\n";

my $thing = $stack2->pop();

print "There are currently ".$stack2->count." things on stack 2 after ${thing} was popped off\n";

exit;

 

Notice that you could read the source code for stack, and write the above code in the following manner (remember that the blessed reference that is returned from new is the same thing that is being passed into the methods).

 

use Stack;

my $stack1 = Stack->new;

my $stack2 = Stack->new;

 

push @{$stack1}, 'A';

push @{$stack2}, 'B';

push @{$stack2}, "C";

print "There are currently ".scalar @{$stack1}." things on stack 1\n";

my $thing = pop @{$stack2};

print "There are currently ".scalar @{$stack2}." things on stack 2 after ${thing} was popped off\n";

exit;

 

But this is a very bad idea.  What if I, the module writer, decide to change the underlying model of the object to the following:

 

sub new {

  my $class = shift;

  bless {

               'current_index' => 0,

               'values' => []

             }, $class

}

 

sub push {

  my ($self, $thing) = @_;  # shift would only fill $self and leave $thing undef

  $self->{'values'}->[$self->{'current_index'}++] = $thing;

}

 

sub pop {

  my $self = shift;

  my $thing = pop @{$self->{'values'}};

  $self->{'current_index'}-- if $self->{'current_index'} > 0;  # never go below empty

  return $thing;

}

 

sub count {

  my $self = shift;

  return $self->{'current_index'};  # the index of the next free slot equals the item count

}

1;

 

Your first script continues to work as expected.  However, your second script breaks.  In order to fix it, you have to go read the source code again, and rework your script (or scripts).  This could happen every time the module is updated by the module writer.  Basically, an object presents a specific interface, called the application programming interface (API), to all users of the module.  In fact, most module writers document their objects with perldoc, so that users do not even have to read the source code in order to use them.  This API represents a contract with the user, and developers are very wary of breaking this contract.  But the way that object oriented programming works is that module writers can change all of the invisible stuff that they want, as long as they don't break the binding contract between the module and its users.  For this reason, you should avoid, at all costs, the use of the underlying model in your own code, and stick to the documented API.
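
 

The point can be sketched in a single file.  ArrayStack and HashStack are hypothetical names, used only so that two different underlying models can sit side by side; the exercise subroutine stands in for client code that touches nothing but the documented push, pop, and count methods.

```perl
use strict;
use warnings;

# Hypothetical package names: same API, two different underlying models.
package ArrayStack;
sub new   { my $class = shift; return bless [], $class }
sub push  { my ($self, $thing) = @_; CORE::push @{$self}, $thing; return }
sub pop   { my $self = shift; return CORE::pop @{$self} }
sub count { my $self = shift; return scalar @{$self} }

package HashStack;
sub new   { my $class = shift; return bless { values => [] }, $class }
sub push  { my ($self, $thing) = @_; CORE::push @{ $self->{values} }, $thing; return }
sub pop   { my $self = shift; return CORE::pop @{ $self->{values} } }
sub count { my $self = shift; return scalar @{ $self->{values} } }

package main;

# Client code that relies on the documented methods only.
sub exercise {
    my $stack = shift;
    $stack->push($_) for qw(A B C);
    my $popped = $stack->pop;
    return ($popped, $stack->count);
}

for my $class (qw(ArrayStack HashStack)) {
    my ($popped, $count) = exercise($class->new);
    print "$class: popped $popped, $count left\n";
}
```

The client subroutine never needs to know which model it was given.  A version of exercise written with push @{$stack} would work for the first class and fail for the second.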

 

Types of Objects

When you start to use and develop objects, you will come across a few distinct patterns that objects can fall into.  These are called Design Patterns.  Here are a couple of patterns that you might come into contact with.

 

* Factory:  Objects fitting this pattern are designed to construct and return other objects.  In addition, they may use arguments passed to them (via the constructor, or the method being called) to decide which type of object among many other types which implement the same interface to construct and return.

 

* Object Pool: Objects fitting this pattern are designed to provide access to a pool of many instances of the same object.  This could be accomplished using a simple algorithm like our Stack object, but usually these objects use caching, lazy loading, and other techniques to minimize the amount of memory required to provide access to the objects in the pool at any one time, which may involve them implementing an iterator pattern (see next).

 

* Iterator: Objects fitting this pattern provide some method to get access to a (potentially infinite) stream of objects in a sequential manner.

 

* Singleton: Objects fitting this pattern force each 'instantiation' of the object to use the exact same underlying data container(s).  What this means is that the object is only created once, and each subsequent time the constructor is called, no matter where in the code, the same underlying blessed reference is returned.  Object Pools will typically be implemented as singletons.

 

See the link above for many other examples.
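
 

As one illustration, the Singleton pattern can be sketched in a few lines.  My::Registry and its set and get methods are invented for this example; the only essential part is the file-scoped $instance variable, which lets the constructor hand back the same blessed reference on every call.

```perl
use strict;
use warnings;

package My::Registry;   # hypothetical class name

my $instance;           # file-scoped: shared by every call to new

sub new {
    my $class = shift;
    # Build the object on the first call only; afterwards return the same one.
    $instance ||= bless { settings => {} }, $class;
    return $instance;
}

sub set { my ($self, $key, $value) = @_; $self->{settings}{$key} = $value; return }
sub get { my ($self, $key) = @_; return $self->{settings}{$key} }

package main;

my $first  = My::Registry->new;
my $second = My::Registry->new;

$first->set(verbose => 1);
print "second sees verbose = ", $second->get('verbose'), "\n";
print "both variables hold the same object\n" if $first == $second;
```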

 

Types of Object Subroutines

We have already been using a bit of lingo with regard to objects that I want to clear up in this section.  When talking about objects, the word subroutine, as it is originally understood, becomes ambiguous because of the different contexts in which subroutines can be used.  We need a few more terms to clear up this ambiguity.  Just remember that all of these terms ultimately relate to a subroutine used in a specific context.

 

* Constructor:  Constructor subroutines take the package name as argument, and construct a new blessed reference (or in the case of a Singleton return the same blessed reference, possibly with meta information tacked onto it) with the package name.  Typically, objects contain a single 'new' constructor method, but this is not a hard and fast rule.  Other types of constructors may include a copy constructor which takes no arguments and returns an exact copy (ideally with all underlying references dereferenced and copied) of the object; a bridge, which takes some other type of object and copies its relevant data into a new instance of itself; and others.

 

* Accessor: Accessor subroutines take the blessed reference and optional argument, and provide access to the attributes of the object.  Note, not all objects have attributes.  There are a couple of different ways accessors can be presented:

 

    - getter/setter:  In this pattern, each attribute X is represented by a separate get_X and set_X subroutine.  The get_X subroutine returns the current value assigned to the attribute, even if it is undefined.   The set_X subroutine takes the reference, and a value as argument (including undef), and sets the attribute to this value.

    - accessor/unsetter: In this pattern, each attribute X is represented by an X and unset_X subroutine.  The X subroutine takes the reference, and an optional value as arguments.  If a defined value is passed in, the attribute is reset to this value.  The subroutine always returns the value associated with the attribute (after setting the value if done).  The unset_X method is simply designed to allow users to unset the value of the attribute.  There are cases where objects have accessors without unsetters. If unsetting the value of an attribute is not important, this is perfectly ok.

 

* Method:  Method subroutines take the blessed reference, and optional arguments, and perform some kind of computation based on the data in the reference and the arguments.  This is where object oriented programming is more powerful than traditional package based access to subroutines.  The code for performing complex computations on specific objects is written into the same source code as the code for storing and representing the object, instead of into separate packages, as is typically the case.
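
 

The kinds of subroutine listed above can be put side by side in one small class.  This is a sketch only: the Sample class, its name attribute, and the name_length method are made up for illustration, and a real object would normally pick one accessor style rather than offer both.

```perl
use strict;
use warnings;

package Sample;   # hypothetical class with a single 'name' attribute

# constructor
sub new { my $class = shift; return bless { name => undef }, $class }

# getter/setter style
sub get_name { my $self = shift; return $self->{name} }
sub set_name { my ($self, $value) = @_; $self->{name} = $value; return }

# accessor/unsetter style
sub name {
    my ($self, $value) = @_;
    $self->{name} = $value if defined $value;   # only a defined value resets it
    return $self->{name};
}
sub unset_name { my $self = shift; undef $self->{name}; return }

# method: computes something from the data stored on the object
sub name_length {
    my $self = shift;
    return defined $self->{name} ? length($self->{name}) : 0;
}

package main;

my $s = Sample->new;
$s->set_name('alpha');
print $s->get_name, "\n";
print $s->name('beta'), "\n";
print $s->name_length, "\n";
$s->unset_name;
print defined $s->name ? "still set\n" : "unset\n";
```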

 

Examples: IGSP::SeqFactory and IGSP::FastaSeqO

Transitioning IGSP::FastaSeq to an object is pretty straightforward.  We simply remove the use and extension of Exporter, code a constructor to take the filehandle as argument, and code the next_seq method to use the filehandle stored on the blessed self instead of taking it as an argument.  In addition, we should rename the object to better reflect what it does.  It is a Factory object designed to create sequence resources from Fasta formatted files.  We can call it a SeqFactory. This makes it a bit easier to modify in the future to handle other types of sequences.

 

In addition, we can get rid of that private last_id_line variable, and replace it with an instance value on the object.  This is important.  If you try to use a my variable to store this value, like the package does, then it behaves like a Singleton.  Even if you create two IGSP::SeqFactory objects with two different filehandles (as you very well may want to do), you will find that the last_id_line from one is shared with all the others, and clashes with them.  If you move it to an instance variable (in our model, this is tied to the 'last_id_line' key on the blessed hash reference, but it could be tied to an index on a blessed array, or something else), then each instance of the object can store its own value in this field.
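
 

The clash is easy to reproduce in miniature.  SharedReader and InstanceReader are hypothetical stand-ins, not part of the IGSP code; the first keeps its state in a file-scoped my variable, the second on the blessed hash.

```perl
use strict;
use warnings;

package SharedReader;    # hypothetical: state lives in a file-scoped variable
my $last_line;           # one variable for the whole package
sub new       { my $class = shift; return bless {}, $class }
sub remember  { my ($self, $line) = @_; $last_line = $line; return }
sub last_line { return $last_line }

package InstanceReader;  # hypothetical: state lives on the blessed hash
sub new       { my $class = shift; return bless { last_line => undef }, $class }
sub remember  { my ($self, $line) = @_; $self->{last_line} = $line; return }
sub last_line { my $self = shift; return $self->{last_line} }

package main;

for my $class (qw(SharedReader InstanceReader)) {
    my ($one, $two) = ($class->new, $class->new);
    $one->remember('>seq1');
    $two->remember('>seq2');
    # With the shared variable, the first object has lost its own value.
    print "$class: first object remembers ", $one->last_line, "\n";
}
```

Only the InstanceReader objects keep their values apart, which is the behaviour we want from two factories reading two different files.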

 

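package IGSP::SeqFactory;

 

use Scalar::Util qw(reftype);

 

sub new {

    my ($class, $fh) = @_;

    # reftype() looks past any blessing, so this accepts a plain glob reference
    # (\*STDIN, or a lexical handle from open) as well as an IO::File object;
    # calling ->isa('GLOB') on the handle itself does not test this

    die ("IGSP::SeqFactory requires a FileHandle!\n")

    unless (

             $fh

             && ( ( reftype($fh) || '' ) eq 'GLOB' )

           );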

 

    my $self = {

        'fh' => $fh,

        'last_id_line' => undef,

    };

    return bless $self, $class;

}

 

sub next_seq {

    my $self = shift;

    my $fh = $self->{'fh'};

    my $current_seq;  # we want this to be undef, so it can be passed back undef at EOF                                                     

 

    # this will only pull in the next line on the first call to the subroutine                                                             

    # and the call after the last sequence in the file                                                                                      

    unless ($self->{'last_id_line'}) {

        $self->{'last_id_line'} = <$fh>;

        if ($self->{'last_id_line'} && !($self->{'last_id_line'} =~ m/^>/)) {

            die "Non Fasta File Error\n";

        }

 

        chomp $self->{'last_id_line'} if ($self->{'last_id_line'});

    }

    return unless $self->{'last_id_line'}; # this returns at EOF                                                                            

 

    my ($id, $acc, $description) = split /\|/, $self->{'last_id_line'};

    $id =~ s/^>//; # get rid of the leading >

    $current_seq = {

        'id' => $id,

        'accession' => $acc,

        'description' => $description,

    };

    undef $self->{'last_id_line'}; # this sets up the ability to return undef on EOF                                                        

 

    SEQLINE: while (my $line = <$fh>) {

        chomp $line;

        if ($line =~ m/^>/) {

            $self->{'last_id_line'} = $line;

            last SEQLINE;

        }

        else {

            $current_seq->{'sequence'} .= $line;

        }

    }

 

    if ($current_seq->{'id'} && !$current_seq->{'sequence'}) {

        die "Missing Sequence Error\n";

    }

 

    return $current_seq;

}

1;

 

Now, this is all well and good, but why not go ahead and convert our $current_seq hash into an object itself.  Then we can create a documented API on it that programmers can use, and still have the flexibility of changing the underlying model we use if something comes around in the future which makes it easier to represent fasta sequences, and we can add some interesting methods on the object which might be of interest to users for various purposes.   We can call it IGSP::FastaSeqO, to differentiate it from our IGSP::FastaSeq package.

 

package IGSP::FastaSeqO;

 

sub new {

    my $class = shift;

    my $self = {};

 

    # allow a hashref with id, accession,                                                                                                   

    # description, and sequence to be passed into                                                                                           

    # the constructor                                                                                                                       

    my $in = shift;

    if ($in && (ref $in eq 'HASH')) {

        $self = { map { $_ => $in->{$_} } qw/id accession description sequence/ };

    }

 

    bless $self, $class;

    return $self;

}

 

sub id {

    my ($self, $id) = @_;

    if (defined $id) {  # any defined value, including 0, resets the attribute

        $self->{'id'} = $id;

    }

    return $self->{'id'};

}

 

sub unset_id {

    my $self = shift;

    undef $self->{'id'};

}

 

sub accession {

    my ($self, $acc) = @_;

    if (defined $acc) {

        $self->{'accession'} = $acc;

    }

    return $self->{'accession'};

}

 

sub unset_accession {

    my $self = shift;

    undef $self->{'accession'};

}

 

sub description {

    my ($self, $desc) = @_;

    if (defined $desc) {

        $self->{'description'} = $desc;

    }

    return $self->{'description'};

}

 

sub unset_description {

    my $self = shift;

    undef $self->{'description'};

}

 

sub sequence {

    my ($self, $seq) = @_;

    if (defined $seq) {

        $self->{'sequence'} = $seq;

    }

    return $self->{'sequence'};

}

 

sub unset_sequence {

    my $self = shift;

    undef $self->{'sequence'};

}

 

sub seq_codons {

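    my $self = shift;

    # split with a capturing group returns each 3-character match; grep drops
    # the empty strings between them (a trailing partial codon is kept)
    my @codons = grep { length($_) > 0 } split /(\w\w\w)/, $self->{'sequence'};

    # a list in list context, an array reference in scalar context
    return (wantarray) ? @codons : \@codons;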

}

1;

 

The way we defined the constructor on this object makes it a snap to change the next_seq method on IGSP::SeqFactory to return an IGSP::FastaSeqO object instead of the hash reference.  The constructor takes this hash reference as an argument and, if one is passed in, copies its id, accession, description, and sequence values to set its own attributes.  Because of this, we only have to change two lines of code in IGSP::SeqFactory to make it return IGSP::FastaSeqO objects.  First, we have to load the new class by adding the following line:

 

use IGSP::FastaSeqO;

 

Then we have to change the last line of the next_seq method from:

 

return $current_seq;

 

to:

 

return IGSP::FastaSeqO->new($current_seq);

 

And now we have a nice object oriented system to handle fasta sequence files.

Here is how our client code could use it (it is not much different from the code that used the package version).

 

use strict;

use IGSP::SeqFactory;

 

my $file = shift or die "fasta_processor_sub.pl <path_to_fasta_file>\n";

open (my $fh, '<', $file) or die "Couldn't open $file: $!\n";  # three-argument open keeps odd file names safe

 

my $seqf = IGSP::SeqFactory->new($fh);

my @seqs;

while (my $current_seq = $seqf->next_seq()) {

    push @seqs, $current_seq;

}

 

foreach my $seq (@seqs) {

  printf "ID: %s\nACC: %s\nDescription: %s\nSeq:\n%s\n",

         $seq->id,

         $seq->accession,

         $seq->description,

         $seq->sequence;

}

exit;

 

This actually gets us down to a size of code that could be written as a perl one-liner at the command line.

 

unix> cat /path/to/fasta | perl -MIGSP::SeqFactory -le 'my $sf = IGSP::SeqFactory->new(\*STDIN); while (my $seq = $sf->next_seq) { ... }'

 

One thing that you could do with this, if you felt inclined, is make other SeqO objects capable of representing other sequence files, such as our EMBL sequence.  Then, IGSP::SeqFactory could be coded to determine which type of sequence object to construct from its filehandle using an argument to the constructor.  Since the attributes and methods that you would expect from a FastaSeqO object are a subset of what you would expect from the EMBLSeqO object, it should be easy to do this without changing the client code which uses these objects too much.   This is what makes Object Oriented Programming so useful.
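
 

Here is a rough sketch of how such a factory might choose a class from a constructor argument.  Everything in it is an assumption made for illustration: the Demo:: package names, the 'fasta' and 'embl' format strings, the organism attribute, and the build_seq method, which takes an already parsed record so that the file parsing can be left out.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the sequence classes; both answer to id.
package Demo::FastaSeqO;
sub new { my ($class, $data) = @_; return bless { %{$data} }, $class }
sub id  { my $self = shift; return $self->{id} }

package Demo::EMBLSeqO;
our @ISA = ('Demo::FastaSeqO');   # inherits new and id, adds one EMBL-only attribute
sub organism { my $self = shift; return $self->{organism} }

package Demo::SeqFactory;
my %class_for = (fasta => 'Demo::FastaSeqO', embl => 'Demo::EMBLSeqO');

sub new {
    my ($class, $format) = @_;
    # Assumed convention: the format defaults to fasta when none is given.
    my $seq_class = $class_for{ lc($format || 'fasta') }
        or die "Unknown sequence format: $format\n";
    return bless { seq_class => $seq_class }, $class;
}

# Parsing is omitted; the record is passed in directly to keep the sketch short.
sub build_seq {
    my ($self, $record) = @_;
    return $self->{seq_class}->new($record);
}

package main;

for my $format (qw(fasta embl)) {
    my $factory = Demo::SeqFactory->new($format);
    my $seq     = $factory->build_seq({ id => 'X1', organism => 'E. coli' });
    print ref($seq), " has id ", $seq->id, "\n";   # client code is the same for both
}
```

Because the EMBL class offers everything the Fasta class does, the loop body does not care which one the factory returns.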

 
