perlsOfLondon

 

getting_sequences_with_Bio::DB

Page history last edited by Darin London 1 yr ago

Bio::DB

The BioPerl Bio::DB suite provides objects that give you access to sequence records stored in a variety of recognized indexed flat files, in relational databases, or on remote servers over the web.  A good tutorial on this topic is the Various Ways to get Sequences from Remote and Local Repositories.

 

Here is an example of how to turn a directory of FASTA files on your filesystem into a single indexed database that supports very fast lookups.  If you have used the fasta program, this will look familiar.

 

use Bio::DB::Fasta;

my $source_fasta_dir = shift or die "please provide directory of fasta files";

die "$source_fasta_dir does not exist\n" unless -d $source_fasta_dir;

 

my $db      = Bio::DB::Fasta->new($source_fasta_dir);

 

There are two different ways of accessing the sequences from this object:

1.  As a single stream of fasta records, which can be iterated over as if it were a single file:

 

my $stream  = $db->get_PrimarySeq_stream;

while (my $seq = $stream->next_seq) {

  print $seq->id, "\n";    # ... do something with $seq

}

 

2.  As a database that can be queried for individual records by ID:

 

my $seq = $db->get_Seq_by_id($seqid);
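
The two access styles above can be tried end to end with a small throwaway dataset. The following is a minimal sketch, not part of the original class material: the temporary directory, the file name example.fa, and the IDs seqA and seqB are made up for illustration, and it assumes BioPerl with Bio::DB::Fasta is installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Bio::DB::Fasta;

# Build a tiny throwaway FASTA directory so the sketch is self-contained.
# The file name and the sequence IDs (seqA, seqB) are hypothetical.
my $dir = tempdir(CLEANUP => 1);
open(my $fh, '>', "$dir/example.fa") or die "cannot write example.fa: $!";
print $fh ">seqA first hypothetical record\nACGTACGTAC\n";
print $fh ">seqB second hypothetical record\nGGGGCCCCAA\n";
close($fh);

# new() indexes the directory (writing an index file next to the FASTA files).
my $db = Bio::DB::Fasta->new($dir);

# Query a single record by ID and guard against a missing record.
my $seq = $db->get_Seq_by_id('seqA');
die "seqA not found" unless defined $seq;
print $seq->id, " has length ", $seq->length, "\n";

# Fetch only part of a sequence (1-based, inclusive coordinates).
my $sub = $db->seq('seqB', 3, 6);
print "seqB 3..6 = $sub\n";

# List every ID known to the index.
my @ids = sort $db->get_all_primary_ids;
print "IDs: @ids\n";
```

Fetching a subsequence with seq() avoids pulling a whole record into memory when only a region is needed, which matters mostly for chromosome-sized sequences.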

 

You might also be interested in using Bio::DB::GenBank to retrieve Bio::Seq objects from the NCBI web server through its Entrez web service.

 

use Bio::DB::GenBank;

my $gb = Bio::DB::GenBank->new;

 

my $seq = $gb->get_Seq_by_id('MUSIGHBA1'); # Unique ID

 

# or ...

 

$seq = $gb->get_Seq_by_acc('J00522'); # Accession Number

$seq = $gb->get_Seq_by_version('J00522.1'); # Accession.version

$seq = $gb->get_Seq_by_gi('405830'); # GI Number
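
Because these calls go over the network, it can help to trap failures and to see what the returned object offers. The sketch below is only an illustration under some assumptions: the helper summarize() is a hypothetical name introduced here, the script needs network access to NCBI, and in recent BioPerl releases Bio::DB::GenBank may have to be installed separately from the core distribution.

```perl
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;

# Hypothetical helper: build a one-line summary from a Bio::Seq-like object.
sub summarize {
    my ($seq) = @_;
    return join("\t", $seq->display_id, $seq->length, $seq->desc // '');
}

my $gb = Bio::DB::GenBank->new;

# Remote lookups can fail (no network, retired ID, server error), and
# BioPerl reports such failures by throwing, so trap them with eval.
my $seq = eval { $gb->get_Seq_by_acc('J00522') };

if ($seq) {
    print summarize($seq), "\n";

    # Print the record in FASTA format to STDOUT; pass
    # -file => '>out.fa' instead of -fh to write it to a file.
    my $out = Bio::SeqIO->new(-format => 'fasta', -fh => \*STDOUT);
    $out->write_seq($seq);
}
else {
    warn "could not fetch J00522: ", ($@ || "no record returned"), "\n";
}
```

Keeping the formatting in a small helper means it can be exercised with a locally built Bio::Seq object, without contacting NCBI.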

 

 

OBDA

A newer standard for getting access to sequences, called OBDA, has also emerged.  It uses a set of configuration files that can be shared between BioPerl, BioPython, BioJava, etc.

It lets you set up links to remote sources such as EMBL and GenBank, as well as to local flat files that have been indexed with the OBDA index.  Once these sources are set up, it is easy to get sequences by ID, location, etc., through a common access point, without specifying paths to files or directories, or URLs to remote systems.  We will not cover its usage in the class; for details, consult the OBDA HowTo.
