perlsOfLondon

 

BioPerl_lab1

Page history last edited by Darin London 1 yr ago

All of the lab scripts we run today will work if you work on compute.igsp.duke.edu.  Remember perldoc is your friend, and so is google.

 

1. Copy /home/londo003/perl_class3/data_files/EMBL_example.txt into your home directory.  Write a perl script which Bio::SeqIO to read in all sequences in this file and write each entry out in the following format:

 

record_start

  ID $id

  Accession $accession

  Molecular Type: $mol_type

  Species: $species

  Chromosome: $chromosome

 

  sequence_start

  $sequence

  sequence_end

record_end

 

Answer:

use Bio::SeqIO;

 

my $tmpl = q(

record_start

     ID: %s

     Accession: %s

     Molecular Type: %s

     Species: %s

     Chromosome: %s

 

     sequence_start

     %s

     sequence_end

record_end

);

 

my $file = shift or die "usage: embl_reader1.pl /path/to/EMBL_file\n";

 

my $input = Bio::SeqIO->new('-file' => '<'.$file, '-format' => 'EMBL')  or die $!;

 

while (my $seq = $input->next_seq) {

        my $source_feature = (grep { $_->primary_tag eq 'source' } $seq->get_SeqFeatures)[0];

    printf $tmpl,

        $seq->id,

        $seq->accession_number,

        join(' ', $source_feature->get_tag_values('mol_type')),

        $seq->species->binomial,

        join(' ', $source_feature->get_tag_values('chromosome')),

        $seq->seq;;

}

$input->close;

exit;

 

 

Note that, since the mol_type and chromosome are part of the FT entries, which are sequence features, you must search the top Bio::SeqFeatureI objects returned from the sequence object.  You might also like to know that the primary_tag of the Bio::SeqFeatureI object containing these features is 'source'.

 

Here is some psuedocode to get you started:

use Bio::SeqIO;

 

my $tmpl = q(

record_start

     ID %s

     Accession:%s

     Molecular Type: %s

     Species:  %s

     Chromosome:  %s

 

     sequence_start

     %s

     sequence_end

  record_end

);

 

my file = shift or die "usage: my_script.pl /path/to/EMBL file\n";

 

my input = new Bio::SeqIO  from file '<'.file in format EMBL  or die $!;

 

foreach seq in input:

  source_feature = find the 'top SeqFeature' whose primary_tag is 'source' (hint, grep is your friend, but it returns a list)   

   # use printf on the template, notice each entry in the array supplied after $tmpl is pasted into the %s in the $tmpl above

  printf $tmpl,

     $seq->id,

     $seq->accession_number (note this is from Bio::Seq::RichSeq),

     join(' ', $source_feature->get_tag_values('mol_type')),

     $seq->species->binomial (note, species returns a Bio::Species object),

     join(' ', $source_feature->get_tag_values('chromosome')),

     $seq->seq;

 

 

 

2. Copy and modify your script from problem 1 to read in EMBL_example.txt in, and write out each sequence entry to separate FASTA files in the /home/<you userid>/interesting_sequences directory, named the id for the record.

 

Answer:

use strict;

use Bio::SeqIO;

 

my $file = shift or die "usage: embl_reader1.pl /path/to/EMBL_file\n";

my $input = Bio::SeqIO->new('-file' => '<'.$file, '-format' => 'EMBL')  or die $!;

 

my %outputs = ();

while (my $seq = $input->next_seq) {

        my $id = $seq->id;

        $id =~ s/\;//g;

        my $output = Bio::SeqIO->new(

                '-file' => '>/home/londo003/interesting_sequences/'.$id.'.fa',

                '-format' => 'Fasta'

        );

        $output->write_seq($seq);

        $output->close;

}

$input->close;

map { $outputs{$_}->close } keys %outputs;

exit;

 

Here is some psuedocode to get you started:

(use mkdir at the unix commandline to make the ~/interesting_sequences directory in your homedir)

 

use Bio::SeqIO;

 

 

my file = shift or die "usage: my_script.pl /path/to/EMBL file\n";

 

my input = new Bio::SeqIO  (from file '<'.file in format EMBL)  or die $!;

 

foreach seq in input:

  my id = seq id

  get rid of the trailing ';' on the id

  output = open a Bio::SeqIO file in /home/<your userid>/interesting_sequences/id.fa of format fasta

  write seq to output

  close output

input->close;

 

 

3. Use Bio::DB::Fasta to index the FASTA files that you created in /home/<your userid>/interesting_sequences and query for the FASTA record with 'BC107146;'.  Use the translate method on the resulting Bio::PrimarySeqI object to get a new Bio::PrimarySeqI object, write the FASTA formatted output of the resulting translation to STDOUT.

 

 

Answer:

use Bio::DB::Fasta;
use Bio::SeqIO;

my $dir = shift or die "please provide a directory of fasta files\n";
my $id = shift or die "please provide an id to query\n";
$id .= ';' unless ($id =~ m/\;$/);

my $db = Bio::DB::Fasta->new($dir) or die $!;

my $output = Bio::SeqIO->new('-fh' => *STDOUT, '-format' => 'Fasta');
my $seq = $db->get_Seq_by_id($id);
$output->write_seq($seq->translate);
exit;

 

Here is some psuedocode to get you started:

 

use Bio::DB::Fasta;

use Bio::SeqIO;

 

my dir = shift or die "please provide a directory of fasta files\n"; # you will want to feed in path to /home/<your userid>/interesting_sequences

 

my db = open Bio::DB::Fasta on dir

my $output = Bio::SeqIO->new('-fh' => *STDOUT, '-format' => 'Fasta');

my seq = get Seq by id 'BC107146;' from db; # note the ';' at the end is important, as EMBL file placed these into the id, and BioPerl did not strip them

output->write_seq(seq->translate);

 

4.  Copy the script from problem 3 and modify it to iterate over all entries in the Bio::DB::Fasta stream.  For each entry in the stream, query GenBank using Bio::DB::GenBank to find out the mol_type of the entry. Write the resulting GenBank record out to an EMBL file named for the mol_type using Bio::SeqIO.

 

Answer1, uses the stream from the Bio::DB::Fasta object, and queries each sequence from GenBank individually:

use Bio::DB::GenBank;
use Bio::DB::Fasta;
use Bio::SeqIO;

my $dir = shift or die "please provide a directory of fasta files\n";
my $local_db = Bio::DB::Fasta->new($dir) or die $!;
my $gb_db = Bio::DB::GenBank->new;

my %outputs;
my $stream = $local_db->get_PrimarySeq_stream;
while (my $seq = $stream->next_seq) {
        my $id = $seq->id;
        $id =~ s/\;//;

    my $gb_seq = $gb_db->get_Seq_by_id($id);
        my $source_feature = (grep { $_->primary_tag eq 'source' } $gb_seq->get_SeqFeatures)[0];
    my $mol_type = join(' ', $source_feature->get_tag_values('mol_type'));
        $mol_type =~ s/\s+/\_/g;
        unless (exists $outputs{$mol_type}) {
                $outputs{$mol_type} = Bio::SeqIO->new(
                        '-file' => '>'.$mol_type.'.embl',
                        '-format' => 'embl'
        );
    }
        $outputs{$mol_type}->write_seq($gb_seq);
}
map { $outputs{$_}->close } keys %outputs;
exit;

 

Answer2: uses the ids function of Bio::DB::Fasta to get an array of ids, and uses the get_Stream_by_id function on the reference to those ids to get a stream (like Bio::SeqIO) from GenBank:

use Bio::DB::GenBank;
use Bio::DB::Fasta;
use Bio::SeqIO;

my $dir = shift or die "please provide a directory of fasta files\n";
my $local_db = Bio::DB::Fasta->new($dir) or die $!;
my $gb_db = Bio::DB::GenBank->new;

my %outputs;
my @ids = $local_db->ids;
my $stream = $gb_db->get_Stream_by_id(\@ids);
while (my $gb_seq = $stream->next_seq) {
        my $source_feature = (grep { $_->primary_tag eq 'source' } $gb_seq->get_SeqFeatures)[0];
    my $mol_type = join(' ', $source_feature->get_tag_values('mol_type'));
        $mol_type =~ s/\s+/\_/g;
        unless (exists $outputs{$mol_type}) {
                $outputs{$mol_type} = Bio::SeqIO->new(
                        '-file' => '>'.$mol_type.'.embl',
                        '-format' => 'embl'
        );
    }
        $outputs{$mol_type}->write_seq($gb_seq);
}
map { $outputs{$_}->close } keys %outputs;
exit;

 

Here is some psuedocode to get you started:

 

use Bio::DB::Fasta;

use Bio::DB::GenBank;

use Bio::SeqIO;

 

my dir = shift or die "please provide a directory of fasta files\n";

 

my local_db = open Bio::DB::Fasta on dir

gb_db = new Bio::DB::GenBank;

 

my %outputs = ();

my local_stream = get PrimarySeq stream from local_db

while (my seq = local_stream->next_seq)

  my gb_seq = get Seq by id seq->id from gb_db

 

  my source_feature = from gb_seq find the 'top SeqFeature' whose primary_tag is 'source' (hint, grep is your friend, but it returns a list)   

  my mol_type = join(' ', source_feature->get_tag_values('mol_type))

  unless mol_type exists in %outputs:

    $outputs{mol_type} = new Bio::SeqIO (to file '>'.mol_type.'.embl' in format EMBL) or die $!;

  $outputs{mol_type}->write_seq(gb_seq)

 

map { $outputs{$_}->close } keys %outputs;

 

Comments (0)

You don't have permission to comment on this page.