Introduction
BioPerl is a comprehensive object-oriented software system designed to help bioinformaticians automate their work in the Perl programming language. Its aim is to provide access to a variety of bioinformatics file formats and programs through a unified, systematic set of common interfaces and designs. Because of the scale of its remit, new programmers are often overwhelmed when they first encounter it. In the next two classes, we will gain a better understanding of the overall structure of the BioPerl system and learn to use some of its features in our own projects.
The best places to start learning about BioPerl are the BioPerl wiki and HOWTOs and the BioPerl Tutorial. I would also recommend learning how to use the BioPerl Deobfuscator. Currently, BioPerl is broken down into four basic installation packages. These are:
bioperl-core: This is the largest and most important installation. All of the parsers are contained here, organized into categories based on their overall functionality.
bioperl-run: This installation contains all of the objects designed to wrap the running of other bioinformatics programs, such as Blast, Fasta, ClustalW, RepeatMasker, etc.
bioperl-xs: This installation can improve the performance of many of the core parsers by replacing their pure-Perl code with faster C code, transparently to the user.
bioperl-db: This installation contains objects related to storing and retrieving sequences and annotations in central repositories. Repositories may be those provided by other software, such as the online GenBank and EMBL web services or the indexed flat files that Fasta creates, or they may be stored in a special database schema called BioSQL, developed by many of the same people who developed BioPerl, BioPython, BioRuby, BioJava, etc., and designed to work with all of these language toolkits.
If you need to install BioPerl on your own machine, I would recommend installing only bioperl-core, and possibly bioperl-db. At the IGSP we have the bioperl-core, bioperl-run, and bioperl-db packages installed on compute.igsp.duke.edu, and will likely install them on the Sun Grid Engine nodes if they have not already been installed. We will be using compute for our work in this class.
Design Overview
The overall design of BioPerl takes the vast array of programs and formats and categorizes them into as few systems as possible. As far as possible, each system is represented by one or a few 'interface' objects (denoted by a name ending in I, such as Bio::SeqI), which store the most important information common to all formats in the system, and by one or a few IO factory objects, which read any of the formats into these objects and write the objects back out in any of the formats. Here are a few of these categories:
Sequences
BioPerl represents all of the major sequence formats (EMBL, GenBank, DDBJ, Fasta, etc.) in the Seq namespace.
Sequences are represented by the following main interfaces, which will most likely provide the methods you want
to use to get at information about sequences:
- Bio::PrimarySeqI: This object represents the data that is stored in so-called light formats like Fasta, which
store only the sequence string and its names/descriptions. In addition, Bio::PrimarySeqI provides basic methods for
accessing the reverse complement and translation of nucleotide sequences, and information on what type of sequence
is being represented (one of 'dna', 'rna', or 'protein').
- Bio::SeqI: This object takes the Bio::PrimarySeqI object, and extends it to support Bio::SeqFeatureI features, which
are annotations tied to a specific location on the sequence.
- Bio::Seq::RichSeqI: This interface extends Bio::SeqI to provide accession numbers, versions, and other aspects of the richly annotated sequences
provided by the major sequence databases.
- Bio::AnnotationCollectionI/Bio::AnnotationI: The Bio::AnnotationCollectionI interface provides access to 'interesting' things that are attached to the sequence
record, in the form of Bio::AnnotationI implementing objects. In an EMBL file, you will see 'CC' comments, 'RT' and 'RL' reference lines, etc. All of these end up being represented
as Bio::AnnotationI objects stored in the sequence's Bio::AnnotationCollectionI object under a particular key. You can get the collection from the sequence with the annotation method,
list all of the keys it holds with get_all_annotation_keys, and then retrieve the Bio::AnnotationI objects stored under a given key with
get_Annotations.
- Bio::SeqFeatureI/Bio::LocationI: These interfaces present a uniform layer around many different types of
sequence annotations. A Bio::SeqFeatureI implementing object presents a series of tag-value pairs of information.
An individual tag can be tied to a single value or to multiple values. Bio::SeqFeatureI objects can also have features
of their own, called subseq features. Every Bio::SeqFeatureI object can be located on the sequence using the
Bio::LocationI object returned by its location method. This object simply presents a start, end, strand, and seq_id for the
exact location of something on the sequence. You can then get the sequence for that location using the
subseq method of Bio::PrimarySeqI.
- All of these formats are parsed with a single IO system, called Bio::SeqIO.
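The interfaces listed above can be tried out with a short script. What follows is only a minimal sketch, assuming bioperl-core is installed: the Fasta record, the 'misc_feature' feature, its 'note' tags, and the comment text are all made up for illustration, and the record is read from an in-memory filehandle so that no input file is needed.

```perl
use strict;
use warnings;
use Bio::SeqIO;
use Bio::SeqFeature::Generic;
use Bio::Annotation::Comment;

# A made-up Fasta record (hypothetical data), read from memory instead of a file.
my $fasta = ">demo1 hypothetical example sequence\nATGGCCATTGTAATGGGCCGC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory Fasta: $!";

# Bio::SeqIO is the IO factory; next_seq returns a Bio::SeqI implementing object.
my $seqio = Bio::SeqIO->new(-fh => $fh, -format => 'fasta');
my $seq   = $seqio->next_seq;

# Methods that come from Bio::PrimarySeqI.
my $id       = $seq->display_id;
my $desc     = $seq->desc;
my $alphabet = $seq->alphabet;          # guessed from the residues
my $revcom   = $seq->revcom->seq;       # reverse complement as a string
my $protein  = $seq->translate->seq;    # translation as a string

# Attach a hand-made feature (Bio::SeqFeatureI) with a multi-valued tag.
my $feature = Bio::SeqFeature::Generic->new(
    -start   => 1,
    -end     => 9,
    -strand  => 1,
    -primary => 'misc_feature',
);
$feature->add_tag_value('note', 'first three codons');
$feature->add_tag_value('note', 'added by hand');
$seq->add_SeqFeature($feature);

# Walk the features: collect tag values, and use each Bio::LocationI
# to pull the matching region out of the sequence with subseq.
my (%tags, @regions);
for my $feat ($seq->get_SeqFeatures) {
    my $loc = $feat->location;
    push @regions, $seq->subseq($loc->start, $loc->end);
    for my $tag ($feat->get_all_tags) {
        $tags{$tag} = [ $feat->get_tag_values($tag) ];
    }
}

# Store a comment in the annotation collection, then read it back by key.
$seq->annotation->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'toy record'));
my @keys     = $seq->annotation->get_all_annotation_keys;
my @comments = map { $_->text } $seq->annotation->get_Annotations('comment');

print "$id ($alphabet): $desc\n";
print "revcom:   $revcom\n";
print "protein:  $protein\n";
print "feature:  $regions[0] [@{ $tags{note} }]\n";
print "comments: @comments (keys: @keys)\n";
```

A Fasta record carries no features or annotations of its own, which is why they are added by hand here. With a rich format such as EMBL or GenBank, the same get_SeqFeatures and get_Annotations calls would return what the parser found in the file.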
Searching
BioPerl represents many of the systems designed to find sequences based on their characteristics, such as blast, fasta, hmmer, wise, etc., in a single Search namespace. It provides access to the reports from these tools through a single IO system, called Bio::SearchIO. In addition, it generalizes the results of all of these reports into a hierarchical abstract system. At the top are Bio::Search::Result::ResultI implementing objects, which have various attributes and give access to zero or more Bio::Search::Hit::HitI implementing objects. Each hit has its own attributes and gives access to zero or more Bio::Search::HSP::HSPI implementing objects, each of which has its own attributes and methods describing a High Scoring Pair.
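The Result/Hit/HSP hierarchy maps directly onto three nested loops. The sketch below assumes bioperl-core is installed and that you already have a report file on disk; summarize_report is a hypothetical helper name, and the handful of attributes it collects is just one possible selection.

```perl
use strict;
use warnings;
use Bio::SearchIO;

# Hypothetical helper: walk a search report and return one hash per HSP.
sub summarize_report {
    my ($file, $format) = @_;
    my $in = Bio::SearchIO->new(-file => $file, -format => $format);
    my @rows;
    while (my $result = $in->next_result) {        # one ResultI per query
        while (my $hit = $result->next_hit) {      # zero or more HitI
            while (my $hsp = $hit->next_hsp) {     # zero or more HSPI
                push @rows, {
                    query    => $result->query_name,
                    hit      => $hit->name,
                    evalue   => $hsp->evalue,
                    identity => $hsp->percent_identity,
                    length   => $hsp->length('total'),
                };
            }
        }
    }
    return @rows;
}

# Runs only when a report file (and optionally a format) is given
# on the command line; the format defaults to 'blast'.
if (@ARGV) {
    my ($file, $format) = @ARGV;
    for my $row (summarize_report($file, $format || 'blast')) {
        printf "%s\t%s\t%s\t%.1f\t%d\n",
            @{$row}{qw(query hit evalue identity length)};
    }
}
```

Because every parser produces the same three kinds of objects, the loops should not need to change when the report comes from a different search tool; only the -format argument does.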
Alignments
BioPerl represents many popular alignment formats (bl2seq, fasta, clustalw, phylip, etc.) in a similar abstract system of Bio::Align::AlignI implementing objects, which can be read in from, and written out to, any of these formats using Bio::AlignIO. Thus, it is possible to read in an alignment in one format and write it out in another. Similarly, you could call the get_aln method on a Bio::Search::HSP::HSPI high scoring pair returned from a blast hit to get a simple type of Bio::Align::AlignI implementing object called Bio::SimpleAlign, and write this out in clustalw, fasta, or phylip format using Bio::AlignIO.
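Converting an alignment from one format to another takes only a few lines. This is a rough sketch, assuming bioperl-core is installed: the two aligned sequences are made up, and both the input and the output use in-memory filehandles so that no files are needed.

```perl
use strict;
use warnings;
use Bio::AlignIO;

# A made-up two-sequence alignment in aligned-Fasta format (hypothetical data).
my $fasta_aln = <<'END';
>seq1
ATG-CATGCA
>seq2
ATGACATG-A
END

open my $in_fh, '<', \$fasta_aln or die "Cannot open in-memory input: $!";
my $clustal_text = '';
open my $out_fh, '>', \$clustal_text or die "Cannot open in-memory output: $!";

# One Bio::AlignIO object per format: read with one, write with the other.
my $in  = Bio::AlignIO->new(-fh => $in_fh,  -format => 'fasta');
my $out = Bio::AlignIO->new(-fh => $out_fh, -format => 'clustalw');

my $aln     = $in->next_aln;      # a Bio::Align::AlignI implementing object
my @members = $aln->each_seq;     # the aligned sequences
my $columns = $aln->length;       # alignment length, gaps included

$out->write_aln($aln);
$out->close;

print "Read ", scalar(@members), " sequences, $columns columns\n";
print $clustal_text;
```

To work with real files, replace the -fh arguments with -file arguments, using a leading '>' on the output file name to open it for writing.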
Running other programs in BioPerl
BioPerl provides a completely separate installation of modules designed to wrap the process of running other bioinformatics software applications, and returning their results in the appropriate BioPerl context. This is the Bio::Run system. If you want to automate the process of, say, running Blast (remotely against NCBI, or against your own installation), or RepeatMasker, etc, this is where to look. We will examine this system more tomorrow.