As mentioned in the introduction, BioPerl attempts to categorize and represent related entities and reports as abstract, hierarchical generalizations. The best way to learn how this works is to get familiar with the Bio::SeqIO system, which provides access to the great variety of established sequence formats in the wild through the same abstract Bio::PrimarySeqI and Bio::SeqI interfaces. To write code that answers questions about sequences, you first need to know what kinds of information each format provides, regardless of what BioPerl does with it. Once you have an idea of the information that you wish to collect from these sequences, you can work your way through the Bio::PrimarySeqI and Bio::SeqI interfaces to determine whether they provide direct access to this information through their attributes and methods. If they do not, then you may need to determine which Bio::SeqI implementation is actually being used to represent the format (Bio::PrimarySeqI really only has one implementation currently), and use its perldoc to determine how to access the information you want. Chances are, though, that the most important information is going to be represented in either Bio::SeqI or Bio::PrimarySeqI.
Note, also, that once you have a good understanding of how Bio::SeqI works with things like GenBank and EMBL files, you can make use of these same features in a great variety of other Perl systems, such as Ensembl and GBrowse, through their own implementations of this interface.
A. Bio::PrimarySeqI
Bio::PrimarySeqI is a very basic representation of a sequence. Its design is to provide uniform access to the kinds of information you might find in the following typical fasta sequence representation (note, fasta is notorious for its id line being 'overloaded', meaning your mileage may vary with how Bio::PrimarySeqI parses it into its attributes):
>display_id description
sequence (may be split up across multiple lines)
display_id is supposed to be a string with no spaces. The first whitespace separates the identifier from the description. The description, if present, can contain spaces and non-alphanumeric characters. Bio::PrimarySeqI provides access to these with the following attributes:
display_id: returns the first string of non-whitespace characters on the header line.
id: synonym of display_id
desc: returns the description, which is all characters after the first whitespace(s) on the header line.
seq: returns the entire sequence string, with any newlines or whitespace that may be in the actual file removed.
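As a minimal sketch of these four attributes, the following parses a made-up fasta record from an in-memory string (the record contents are an assumption for illustration, and BioPerl is assumed to be installed) and prints what each attribute returns.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical fasta record, held in a string so that no input file is needed.
my $fasta = ">seq1 example description, with punctuation\nATGGCC\nAAATAG\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory fasta: $!";

# -fh takes an already opened filehandle in place of -file.
my $in  = Bio::SeqIO->new(-fh => $fh, -format => 'fasta');
my $seq = $in->next_seq;

print "display_id: ", $seq->display_id, "\n";   # first non-whitespace token
print "id:         ", $seq->id,         "\n";   # synonym of display_id
print "desc:       ", $seq->desc,       "\n";   # rest of the header line
print "seq:        ", $seq->seq,        "\n";   # sequence with newlines removed
```

Pointing the same code at a real file only requires swapping -fh for -file.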
In addition, Bio::PrimarySeqI provides access to some useful sequence slicing/dicing/manipulation methods.
Some of these methods return information about the sequence in the form of strings/numbers, and behave like attributes.
length: returns the number of residues (bases or amino acids) in the sequence
alphabet: returns one of 'dna', 'rna', or 'protein'
subseq: takes a start and end as arguments, and returns the part of the sequence string between the specified start and end, inclusive. Start must be <= end.
is_circular: This will return true if the seq is based on a piece of circular dna, rna, or protein.
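The attribute-like methods above can be tried on a sequence object built by hand. This is a sketch only: it assumes Bio::PrimarySeq as the concrete implementation, and the id and sequence are made up.

```perl
use strict;
use warnings;
use Bio::PrimarySeq;

# Hypothetical 12 base DNA sequence, used only for illustration.
my $seq = Bio::PrimarySeq->new(
    -display_id => 'seq1',
    -seq        => 'ATGGCCAAATAG',
    -alphabet   => 'dna',
);

print "length:   ", $seq->length,   "\n";
print "alphabet: ", $seq->alphabet, "\n";

# subseq uses 1-based, inclusive coordinates and returns a plain string.
my $codon2 = $seq->subseq(4, 6);
print "subseq(4, 6): $codon2\n";

# is_circular is false unless the record (or the constructor) says otherwise.
print "circular: ", ($seq->is_circular ? 'yes' : 'no'), "\n";
```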
Other methods return new Bio::PrimarySeqI implementing objects.
trunc: Behaves like subseq, but returns a new Bio::PrimarySeqI implementing object with the same attributes, but the truncated sequence as its seq.
revcom: If alphabet is dna or rna, this returns a new Bio::PrimarySeqI implementing object with the same attributes, but a reverse-complemented seq. Throws an error if alphabet is protein.
translate: For sequences of alphabet dna or rna, this will return a new Bio::PrimarySeqI implementing object with the same attributes, but the seq translated into a protein, using full IUPAC ambiguity codes. It is possible, with optional arguments, to change things like the codon table, terminator symbol, unknown amino acid symbol, etc, if you are so inclined.
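A short sketch of the three object-returning methods, again assuming Bio::PrimarySeq as the implementation and a made-up sequence. Each call returns a new object, so the sequence string is fetched with seq.

```perl
use strict;
use warnings;
use Bio::PrimarySeq;

# Hypothetical sequence: ATG GCC AAA TAG (Met, Ala, Lys, stop).
my $seq = Bio::PrimarySeq->new(
    -display_id => 'seq1',
    -seq        => 'ATGGCCAAATAG',
    -alphabet   => 'dna',
);

# trunc takes the same coordinates as subseq, but returns an object.
my $first_two_codons = $seq->trunc(1, 6);
print "trunc:     ", $first_two_codons->seq, "\n";

# revcom returns a new object holding the reverse complement.
my $rc = $seq->revcom;
print "revcom:    ", $rc->seq, "\n";

# translate with no arguments uses the default codon table and '*' terminator.
my $protein = $seq->translate;
print "translate: ", $protein->seq, " (", $protein->alphabet, ")\n";
```

Because the results are objects, calls can be chained, for example translating the reverse strand with revcom followed by translate.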
In addition to being used to represent the basic fasta sequence record, Bio::PrimarySeqI is reused in the Bio::SeqI implementations to store their information. To support this, Bio::PrimarySeqI also makes the following attributes and methods available. Note, these may not return any useful information for some implementations, such as a true fasta record.
accession_number: this is designed to hold the unique biological id assigned to the record by the database, if it is present. If no accession is defined for a specific format (such as fasta), this will return the string 'unknown'.
primary_id: unlike accession_number, this is designed to guarantee the return of a unique id for each individual record in the database. Even if two records have the same accession_number, or do not have defined accession_numbers, this will attempt to assign some kind of unique id that you can use as a hash key for the sequence object for future identification. Note that each Bio::SeqI implementation has its own idea of what the primary_id is. The fasta parser usually sets this equal to the display_id. Assuming you don't have different records with the same display_id in your fasta database, this should uniquely represent the record. Other implementations may return something like a memory address string, if there is not a well defined id specified by the backend system.
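The following sketch shows both attributes on hand-built Bio::PrimarySeq objects. The ids and the accession 'X00001' are made up, and the exact value of primary_id depends on the implementation and BioPerl version, so the code only relies on it being defined.

```perl
use strict;
use warnings;
use Bio::PrimarySeq;

# One record without an accession, one with a hypothetical accession.
my $plain = Bio::PrimarySeq->new(-display_id => 'seq1', -seq => 'ATGGCC');
my $named = Bio::PrimarySeq->new(
    -display_id       => 'seq2',
    -seq              => 'AAATAG',
    -accession_number => 'X00001',
);

print "plain accession: ", $plain->accession_number, "\n";
print "named accession: ", $named->accession_number, "\n";

# primary_id is meant to be usable as a hash key for the object.
my %by_primary_id;
for my $seq ($plain, $named) {
    $by_primary_id{ $seq->primary_id } = $seq;
}
print scalar(keys %by_primary_id), " records indexed by primary_id\n";
```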
In addition, some Bio::SeqI implementations may store different information into some of the Bio::PrimarySeqI attributes and methods, or prevent them from being accessed at all. Again, you should first write code to assume that these attributes and methods return what you would expect, and then troubleshoot problems by drilling down into the actual implementations when they occur (this is where the deobfuscator comes in handy).
B. Bio::SeqI
Bio::SeqI takes the basic Bio::PrimarySeqI object and enhances it. In object oriented lingo, it 'extends' Bio::PrimarySeqI to add its own attributes and methods. You should be aware that Bio::SeqI is an interface, and, as such, can be implemented in a variety of ways for different database sources. However, if an object states that it implements Bio::SeqI, it must provide access to its data through the attributes and methods of the interface in some way. Some implementations may add attributes or methods of their own, but it is always best to code to the interface if at all possible. Then your code can be applied to Bio::SeqI objects coming from other sources, and you (maybe) will not be surprised by changes.
The main thing that Bio::SeqI adds to Bio::PrimarySeqI is access to a potentially hierarchical set of Sequence Features in the form of Bio::SeqFeatureI implementing objects. These are basically key-value annotations tied to a specific Location on the sequence. The location itself is stored in a Bio::LocationI implementing object, which in the simple case is a start, an end, and a strand. It is these sequence features that hold a lot of the meaningful information about the richer sequences that are provided by GenBank, EMBL, DDBJ, etc.
In some instances, generic 'interesting facts' about a sequence (not necessarily tied to a location) can be stored in the annotation objects that are provided by Bio::SeqI. The annotation method returns a Bio::AnnotationCollectionI implementing object, which provides access to a collection of Bio::AnnotationI implementing objects. See the BioPerl Feature-Annotation HowTo for more information on these.
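To make features and annotations concrete, here is a sketch that attaches one feature and one comment to a hand-built sequence and then reads them back through the interface methods. Bio::Seq, Bio::SeqFeature::Generic and Bio::Annotation::Comment are assumed as the concrete implementations, and the gene name and comment text are invented.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::Annotation::Comment;

my $seq = Bio::Seq->new(
    -display_id => 'seq1',
    -seq        => 'ATGGCCAAATAG',
    -alphabet   => 'dna',
);

# A feature is a set of tag/value pairs tied to a location on the sequence.
my $cds = Bio::SeqFeature::Generic->new(
    -start       => 1,
    -end         => 12,
    -strand      => 1,
    -primary_tag => 'CDS',
    -tag         => { gene => 'hypA' },    # hypothetical gene name
);
$seq->add_SeqFeature($cds);

# An annotation is a fact about the whole sequence, with no location.
my $note = Bio::Annotation::Comment->new(-text => 'made-up example record');
$seq->annotation->add_Annotation('comment', $note);

for my $feat ($seq->get_SeqFeatures) {
    printf "%s %d..%d strand %d\n",
        $feat->primary_tag, $feat->start, $feat->end, $feat->strand;
    for my $tag ($feat->get_all_tags) {
        print "  $tag = ", join(', ', $feat->get_tag_values($tag)), "\n";
    }
}

for my $ann ($seq->annotation->get_Annotations('comment')) {
    print "comment: ", $ann->text, "\n";
}
```

The same two loops should work unchanged on objects returned by a GenBank or EMBL parser, since they only use interface methods.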
C. Bio::Seq::RichSeqI
Bio::Seq::RichSeqI takes the Bio::SeqI interface and enhances it to provide attributes and methods for accessing the accession numbers and sequence versions that are assigned by the major Sequence Centers such as NCBI, EMBL, and DDBJ.
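A small sketch of the extra accessors, assuming Bio::Seq::RichSeq as the concrete implementation. The accessions and version number are invented; in practice these values are filled in by a rich-format parser such as the GenBank or EMBL one.

```perl
use strict;
use warnings;
use Bio::Seq::RichSeq;

my $rich = Bio::Seq::RichSeq->new(
    -display_id => 'EXAMPLE1',
    -seq        => 'ATGGCCAAATAG',
);

# Hypothetical values; a rich-format parser would normally set these.
$rich->accession_number('X00001');
$rich->seq_version(2);
$rich->add_secondary_accession('X00002');

print "accession.version: ", $rich->accession_number, ".", $rich->seq_version, "\n";
print "secondary:         ", join(', ', $rich->get_secondary_accessions), "\n";
```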
D. Bio::SeqIO
Bio::SeqIO is the first example of a factory object that we will encounter, and you should see it as a prototype for all the other 'IO' objects in BioPerl. Its constructor takes arguments that tell the object which sequence format you are dealing with, and whether you want to read entries in or write them out (a single object cannot do both). It then gives you a uniform way of reading sequence entries from, or writing them to, their respective database files.
One use of the Bio::SeqIO objects is in writing out one format of sequence file database from the entries contained in another sequence file database. This is most useful when going from so called 'rich' databases like GenBank to leaner databases like fasta, or from one rich format to another. It is not advised to try to create a rich format database from a lean format database (e.g. fasta to GenBank), no matter how much data is stuffed into the id line on each entry in the file :).
Bio::SeqIO is really simple. It takes two arguments in the constructor: a -format, and a -file (or -fh, if you already have a filehandle open). The argument to -file also specifies the access type: a path with no prefix, or with '<' prepended, is opened for reading entries in, and a path with '>' prepended is opened for writing entries out. The object then provides a next_seq method which will return either a defined Bio::SeqI implementing object for the next record in the file, or undef when the end-of-file has been reached. That is it.
Here is a perl one-liner which will open the file named on the command line, read in the first 2 entries in it, and then exit. Keep this handy, as you can use it to answer questions about how the data for a record in a particular file format is stored into the Bio::SeqI or Bio::PrimarySeqI interfaces.
shell>perl -MBio::SeqIO -e 'my $f = shift; my $sio = Bio::SeqIO->new(q(-file) => $f, q(-format) => q(embl)); my $i = 2; while ($i && (my $seq = $sio->next_seq)) { print $seq->id."\n"; $i--; } $sio->close;' your_file.embl
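The format conversion described above comes down to one reader, one writer, and a loop. The sketch below uses a hypothetical helper, convert_records, which is not part of BioPerl. To stay self-contained it reads fasta from a string and writes fasta back into a string; for a real rich-to-lean conversion you would pass 'genbank' or 'embl' as the input format and filehandles opened on real files.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: copy every record from one handle to another,
# changing the format on the way. Returns the number of records written.
sub convert_records {
    my ($in_fh, $in_format, $out_fh, $out_format) = @_;
    my $in  = Bio::SeqIO->new(-fh => $in_fh,  -format => $in_format);
    my $out = Bio::SeqIO->new(-fh => $out_fh, -format => $out_format);
    my $count = 0;
    while (my $seq = $in->next_seq) {
        $out->write_seq($seq);
        $count++;
    }
    return $count;
}

# Made-up input records, held in memory so that no files are needed.
my $input = ">seq1 first record\nATGGCC\n>seq2 second record\nAAATAG\n";
open(my $in_fh, '<', \$input) or die "Cannot open input string: $!";

my $output = '';
open(my $out_fh, '>', \$output) or die "Cannot open output string: $!";

my $written = convert_records($in_fh, 'fasta', $out_fh, 'fasta');
close($out_fh);

print "wrote $written records\n";
print $output;
```

Keeping the formats as arguments means the loop itself never needs to know which parser or writer is in use, which is the point of going through the Bio::SeqI interface.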