Take a look at the EMBL file /home/londo003/perl_class2/data_files/EMBL_example.txt. This is an EMBL file, from the European Molecular Biological Laboratory. It represents the sequence string, and ID assigned to that sequence by the EMBL, Genbank, and DDBJ, along with all of the collective annotation that these institutions have decided to track for each sequence.
1. Write a script which takes this file as input, and produces a file with all sequence records in the following format:
ANSWER
#!/usr/bin/perl
use strict; # this will always help you
# we could just work on <>, but this is more helpful to the user
my $embl_file = shift or die "embl_processor.pl path_to_embl_file\n";
# open the filehandle into a scalar variable
# this will become more useful later on in the class
open (my $embl, "<".$embl_file) or die "Couldnt open $embl_file $!\n";
# set up an array to hold
my @records = ();
# this is a state variable that is set to true after the SQ line is encountered
# which tells the line processecor to process sequence entry lines, since there
# is no identifiable pattern that can be matched for them
my $inseq;
# this will hold our current record
# it is a hashref because we intend to add it to the
# @records array. We could make it a hash, and reference
# using \%curr_record later, but it is always easier to just
# get used to using a data container as a reference from
# the start
my $curr_record = {}; # this is a hashref
LINE: while (my $line = <$embl>) {
chomp $line;
next if ($line eq 'XX'); # these are spacer lines
if ($line =~ m/^ID/) {
# harvest the id
my $id = (split /\s+/, $line)[1];
$id =~ s/\;$//;
$curr_record->{id} = $id;
}
elsif ($line =~ m/^AC/) {
# harvest the accession
my $acc = (split /\s+/, $line)[1];
$acc =~ s/\;$//;
$curr_record->{accession} = $acc;
}
elsif ($line =~ m/^RL\s+(.*)/) {
# harvest the RL line
my $rl = $1;
# note, we could push this
# onto an arrayref keyed to rl, as well
$curr_record->{rl} .= ' '.$rl; # prepend it with a space
}
elsif ($line =~ m/^FT/) {
# there are several of these that we are interested in
if ($line =~ m/\/mol_type\=(.*)/) {
# harvest the mol_type
my $mol_type = $1;
$mol_type =~ s/\"//;
$curr_record->{mol_type} = $mol_type;
}
elsif ($line =~ m/\/chromosome\=(.*)/) {
# harvest the chromosome
my $chromosome = $1;
$chromosome =~ s/\"//;
$curr_record->{chromosome} = $chromosome;
}
elsif ($line =~ m/\/organism\=(.*)/) {
# harvest the organism
my $organism = $1;
$organism =~ s/\"//;
$curr_record->{organism} = $organism;
}
else {
# skip all other FT lines
next LINE;
}
}
elsif ($line =~ m/^SQ\s+(.*)/) {
# harvest the sequence bp info
my $seq_info = $1;
# there are other ways of doing the following, this is just one way
my @parts = split /\;\s+/, $seq_info; # note the split pattern takes care of whitespace between records
$curr_record->{total_bp} = (split /\s+/, $parts[0])[2];
$curr_record->{num_a} = (split /\s+/, $parts[1])[1];
$curr_record->{num_c} = (split /\s+/, $parts[2])[1];
$curr_record->{num_g} = (split /\s+/, $parts[3])[1];
$curr_record->{num_t} = (split /\s+/, $parts[4])[1];
# reset the internal state of the program to know that it is
# harvesting sequence bps for future lines until the state gets reset
$inseq = 1;
}
elsif ($inseq) {
if ($line eq '//') {
# reset inseq to undefined to denote that
# we are no longer harvesting sequence lines.
# This is necessary to allow the program to skip
# the attributes of the EMBL file that we are not interested in
# otherwise, it will shove these lines onto the sequence
$inseq = undef; # reset inseq state
# add the record onto the end of the @records array
push @records, $curr_record;
# refresh the curr_record to a new, empty reference to make sure
# we dont have contamination of record information
$curr_record = {};
}
else {
$line =~ s/^\s+//; # get rid of leading whitespace
# there are other ways of doing the following, this is just one way
my @parts = split /\s+/, $line;
pop @parts; # throw away the bp count
$curr_record->{sequence} .= join('', @parts); # concatenation into one line
}
}
else {
# skip all other lines
next LINE;
}
}
close $embl;
foreach my $record (@records) {
# this is called a heredoc
# it prints everything from the next line to the string (I use EOR, but this is arbitrary) occurring
# by itself, on a line of the code. If the String is double quoted, variables are interpolated.
# If it is single quoted, variables are not interpolated.
# It is a redimentary form of templating. For more powerful templating, read up on the Template Toolkit,
# which is contained in the CPAN 'Template' Package.
print <<"EOR";
record_start
ID $record->{id}
Accession $record->{accession}
RL $record->{rl}
Molecular Type: $record->{mol_type}
Organism: $record->{organism}
Chromosome: $record->{chromosome}
Number of Base Pairs: $record->{total_bp}
Number of As: $record->{num_a}
Number of Cs: $record->{num_c}
Number of Gs: $record->{num_g}
Number of Ts: $record->{num_t}
sequence_start
$record->{sequence}
sequence_end
record_end
EOR
}
exit;
record_start
ID $id
Accession $acc
RL $rl_line
Molecular Type: $mol_type
Organism: $organism
Chromosome: $chrom
Number of Base Pairs: $num_bp
Number of As: $num_a
Number of Cs: $num_c
Number of Gs: $num_g
Number of Ts: $num_t
sequence_start
$seq
sequence_end
record_end
.
.
.
Note that the EMBL file puts a number denoting the position in the sequence after each line of sequence, and splits each line of sequence into chunks of 10. Your sequence must not contain the location numbers. If you want to remove the whitespaces between each chunk, or just print the entire sequence out on one line, that is up to you. Also, your RL line must be one line of text only, e.g. concatenate all RL entries in the file. Also, we only want identifiers for ID and AC, nothing else (including the ;). One nice thing about EMBL is that it does have '//' as a record separator, which is a signal that you can process the current_record. This makes it easier to stream, as we will see later.
Here is some psuedocode to help you get started:
my inseq # a state variable denoting whether we are collecting sequence data
my record
while (line = input) {
cleanup lineending;
if (line begins with ID) {
harvest id into record data container
}
elsif (line begins with AC) {
harvest acc into record
}
elsif (line begins with RL) {
add RL line to record
}
elsif (line begins with FT) {
if (line contains '/mol_type') {
harvest the mol_type into record
}
elsif (line contains '/chromosome') {
harvest the chromosome into record
}
elsif (line contains '/organism') {
harvest the organism into record
}
else {
next
}
}
elsif (line begins with SQ) {
inseq = true
harvest num_bp, num_a, num_c, num_g, num_t into record
}
elsif ($inseq) {
if (line eq '//') {
undef inseq # reset the state
add record onto stack of records
clean record data container
}
else {
add sequence bps to record
}
}
else {
next;
}
}
foreach record (records) {
print "record_start\n"
print " ID: "record:id"\n"
print " Accession: "record:acc"\n"
print " RL: "record:RL"\n"
print " Molecular Type: "record:mol_type"\n"
...
print " sequence_start\n";
print record:seq"\n"
print " sequence_end\n";
print "record_end\n";
}
exit
Please make sure you have finished question 1 before the next class, as the labs in the rest of the class assume that you have a working parser capable of producing this output. If you have finshed question 1, and there is still time, or if you want further practice outside of class, you should be able to quickly modify/cut-and-paste/copy it to do the following:
2. Write these sequence records out to different files based on Molecular Type.
3. Write out Fasta formatted records to different files based on Chromosome. Use the following ID line format:
>$id $accession chr:$chr bp:$num_bp a:$num_a c:$num_c g:$num_g t:$num_t
$seq
.
.
Note, it is perfectly fine to print the entire sequence out on one line, if you want. You must remove the whitespace between each 10bp chunk in the original file. As extra credit, see if you can print out each sequence in 80bp lines.
Comments (0)
You don't have permission to comment on this page.