All of the lab scripts we run today will work if you work on compute.igsp.duke.edu. Remember perldoc is your friend, and so is google.
1. Copy /home/londo003/perl_class3/data_files/EMBL_example.txt into your home directory. Write a perl script which Bio::SeqIO to read in all sequences in this file and write each entry out in the following format:
record_start
ID $id
Accession $accession
Molecular Type: $mol_type
Species: $species
Chromosome: $chromosome
sequence_start
$sequence
sequence_end
record_end
Answer:
use Bio::SeqIO;
my $tmpl = q(
record_start
ID: %s
Accession: %s
Molecular Type: %s
Species: %s
Chromosome: %s
sequence_start
%s
sequence_end
record_end
);
my $file = shift or die "usage: embl_reader1.pl /path/to/EMBL_file\n";
my $input = Bio::SeqIO->new('-file' => '<'.$file, '-format' => 'EMBL') or die $!;
while (my $seq = $input->next_seq) {
my $source_feature = (grep { $_->primary_tag eq 'source' } $seq->get_SeqFeatures)[0];
printf $tmpl,
$seq->id,
$seq->accession_number,
join(' ', $source_feature->get_tag_values('mol_type')),
$seq->species->binomial,
join(' ', $source_feature->get_tag_values('chromosome')),
$seq->seq;;
}
$input->close;
exit;
Note that, since the mol_type and chromosome are part of the FT entries, which are sequence features, you must search the top Bio::SeqFeatureI objects returned from the sequence object. You might also like to know that the primary_tag of the Bio::SeqFeatureI object containing these features is 'source'.
Here is some psuedocode to get you started:
use Bio::SeqIO;
my $tmpl = q(
record_start
ID %s
Accession:%s
Molecular Type: %s
Species: %s
Chromosome: %s
sequence_start
%s
sequence_end
record_end
);
my file = shift or die "usage: my_script.pl /path/to/EMBL file\n";
my input = new Bio::SeqIO from file '<'.file in format EMBL or die $!;
foreach seq in input:
source_feature = find the 'top SeqFeature' whose primary_tag is 'source' (hint, grep is your friend, but it returns a list)
# use printf on the template, notice each entry in the array supplied after $tmpl is pasted into the %s in the $tmpl above
printf $tmpl,
$seq->id,
$seq->accession_number (note this is from Bio::Seq::RichSeq),
join(' ', $source_feature->get_tag_values('mol_type')),
$seq->species->binomial (note, species returns a Bio::Species object),
join(' ', $source_feature->get_tag_values('chromosome')),
$seq->seq;
2. Copy and modify your script from problem 1 to read in EMBL_example.txt in, and write out each sequence entry to separate FASTA files in the /home/<you userid>/interesting_sequences directory, named the id for the record.
Answer:
use strict;
use Bio::SeqIO;
my $file = shift or die "usage: embl_reader1.pl /path/to/EMBL_file\n";
my $input = Bio::SeqIO->new('-file' => '<'.$file, '-format' => 'EMBL') or die $!;
my %outputs = ();
while (my $seq = $input->next_seq) {
my $id = $seq->id;
$id =~ s/\;//g;
my $output = Bio::SeqIO->new(
'-file' => '>/home/londo003/interesting_sequences/'.$id.'.fa',
'-format' => 'Fasta'
);
$output->write_seq($seq);
$output->close;
}
$input->close;
map { $outputs{$_}->close } keys %outputs;
exit;
Here is some psuedocode to get you started:
(use mkdir at the unix commandline to make the ~/interesting_sequences directory in your homedir)
use Bio::SeqIO;
my file = shift or die "usage: my_script.pl /path/to/EMBL file\n";
my input = new Bio::SeqIO (from file '<'.file in format EMBL) or die $!;
foreach seq in input:
my id = seq id
get rid of the trailing ';' on the id
output = open a Bio::SeqIO file in /home/<your userid>/interesting_sequences/id.fa of format fasta
write seq to output
close output
input->close;
3. Use Bio::DB::Fasta to index the FASTA files that you created in /home/<your userid>/interesting_sequences and query for the FASTA record with 'BC107146;'. Use the translate method on the resulting Bio::PrimarySeqI object to get a new Bio::PrimarySeqI object, write the FASTA formatted output of the resulting translation to STDOUT.
Answer:
use Bio::DB::Fasta;
use Bio::SeqIO;
my $dir = shift or die "please provide a directory of fasta files\n";
my $id = shift or die "please provide an id to query\n";
$id .= ';' unless ($id =~ m/\;$/);
my $db = Bio::DB::Fasta->new($dir) or die $!;
my $output = Bio::SeqIO->new('-fh' => *STDOUT, '-format' => 'Fasta');
my $seq = $db->get_Seq_by_id($id);
$output->write_seq($seq->translate);
exit;
Here is some psuedocode to get you started:
use Bio::DB::Fasta;
use Bio::SeqIO;
my dir = shift or die "please provide a directory of fasta files\n"; # you will want to feed in path to /home/<your userid>/interesting_sequences
my db = open Bio::DB::Fasta on dir
my $output = Bio::SeqIO->new('-fh' => *STDOUT, '-format' => 'Fasta');
my seq = get Seq by id 'BC107146;' from db; # note the ';' at the end is important, as EMBL file placed these into the id, and BioPerl did not strip them
output->write_seq(seq->translate);
4. Copy the script from problem 3 and modify it to iterate over all entries in the Bio::DB::Fasta stream. For each entry in the stream, query GenBank using Bio::DB::GenBank to find out the mol_type of the entry. Write the resulting GenBank record out to an EMBL file named for the mol_type using Bio::SeqIO.
Answer1, uses the stream from the Bio::DB::Fasta object, and queries each sequence from GenBank individually:
use Bio::DB::GenBank;
use Bio::DB::Fasta;
use Bio::SeqIO;
my $dir = shift or die "please provide a directory of fasta files\n";
my $local_db = Bio::DB::Fasta->new($dir) or die $!;
my $gb_db = Bio::DB::GenBank->new;
my %outputs;
my $stream = $local_db->get_PrimarySeq_stream;
while (my $seq = $stream->next_seq) {
my $id = $seq->id;
$id =~ s/\;//;
my $gb_seq = $gb_db->get_Seq_by_id($id);
my $source_feature = (grep { $_->primary_tag eq 'source' } $gb_seq->get_SeqFeatures)[0];
my $mol_type = join(' ', $source_feature->get_tag_values('mol_type'));
$mol_type =~ s/\s+/\_/g;
unless (exists $outputs{$mol_type}) {
$outputs{$mol_type} = Bio::SeqIO->new(
'-file' => '>'.$mol_type.'.embl',
'-format' => 'embl'
);
}
$outputs{$mol_type}->write_seq($gb_seq);
}
map { $outputs{$_}->close } keys %outputs;
exit;
Answer2: uses the ids function of Bio::DB::Fasta to get an array of ids, and uses the get_Stream_by_id function on the reference to those ids to get a stream (like Bio::SeqIO) from GenBank:
use Bio::DB::GenBank;
use Bio::DB::Fasta;
use Bio::SeqIO;
my $dir = shift or die "please provide a directory of fasta files\n";
my $local_db = Bio::DB::Fasta->new($dir) or die $!;
my $gb_db = Bio::DB::GenBank->new;
my %outputs;
my @ids = $local_db->ids;
my $stream = $gb_db->get_Stream_by_id(\@ids);
while (my $gb_seq = $stream->next_seq) {
my $source_feature = (grep { $_->primary_tag eq 'source' } $gb_seq->get_SeqFeatures)[0];
my $mol_type = join(' ', $source_feature->get_tag_values('mol_type'));
$mol_type =~ s/\s+/\_/g;
unless (exists $outputs{$mol_type}) {
$outputs{$mol_type} = Bio::SeqIO->new(
'-file' => '>'.$mol_type.'.embl',
'-format' => 'embl'
);
}
$outputs{$mol_type}->write_seq($gb_seq);
}
map { $outputs{$_}->close } keys %outputs;
exit;
Here is some psuedocode to get you started:
use Bio::DB::Fasta;
use Bio::DB::GenBank;
use Bio::SeqIO;
my dir = shift or die "please provide a directory of fasta files\n";
my local_db = open Bio::DB::Fasta on dir
gb_db = new Bio::DB::GenBank;
my %outputs = ();
my local_stream = get PrimarySeq stream from local_db
while (my seq = local_stream->next_seq)
my gb_seq = get Seq by id seq->id from gb_db
my source_feature = from gb_seq find the 'top SeqFeature' whose primary_tag is 'source' (hint, grep is your friend, but it returns a list)
my mol_type = join(' ', source_feature->get_tag_values('mol_type))
unless mol_type exists in %outputs:
$outputs{mol_type} = new Bio::SeqIO (to file '>'.mol_type.'.embl' in format EMBL) or die $!;
$outputs{mol_type}->write_seq(gb_seq)
map { $outputs{$_}->close } keys %outputs;
Comments (0)
You don't have permission to comment on this page.