Parsing
Parsing is the art of taking some kind of data input intended for some other use and programatically coercing its contents into a form that facilitates its usage within a computer program. Data can come in almost any form, from a text file on the filesystem, to a webpage viewed on the internet. The data may have been created for a purpose completely unrelated to the task that you are attempting to perform with it, e.g. it might be a human-readable report with formatting and highlighting to make things stand out visually, or it might be in a format specifically designed for some other computer program. None of this matters, as long as it is composed of text characters that the programming language can understand. Here are a few key concepts to understand when you start to approach a particular datastream for parsing.
*Database
In general, you can consider the entire stream, e.g. file or webpage returned, as a single database with one or more entries. In fact, it is perfectly possible to imagine of a spectrum of data storage structures which range from the less structured, such as arbitrary files on a filesystem, to the more advanced relational databases such as Oracle which do all of the parsing of their files for you, and present a unified interface to the data.
*Record
Each entry in the database can be represented as a single record. It is the record that is most interest to you, and the art of parsing efficiently and effectively is the art of getting you from database to individual records using the least amount of time and computational effort.
*Field
Each record will have one or more fields associated with it which are tied to the data that you find interesting. Fields can be related to single values, or multiple values for each record.
*Data Structure
The job of parsing involves examining a stream, or some small representative sample of the stream, to get an idea of how to programmatically access and store each record in the database, with all of the field values that you find interesting, into some collection of data storage containers that the programming language provides. In this class, we will discuss several ways of creating data structures in perl. In general, datastructures in perl are based on references, be they the normal types of references that you see in multi-dimensional hash-array systems, or the blessed references that are the underpinnings of object oriented programming in perl.
*Record and Field Separator
In some cases, each record, and/or each field in a record, is separated by a well-defined character or string of some kind. An example is a file which is formatted in what is called csv, or comma-separated-values, format. In this format, each record is separated by the next record by a newline, and each field is separated by the next by a comma. Actually, csv files can have any combination of simple record/field separators, e.g. they could use pipes (|) instead of newlines to separate records, and they could use tabs or spaces, to separate fields within the record. These formats are much easier to parse into a data structure. However, many streams do not have simple record or field separators, and require code to extract their contents.
Program State: A Short Diversion into Computer Science Theory
One of the things that new programmers struggle with as they begin to learn to program is how to understand, and model the state of their program at any particular moment. This is a critical idea, and fundamental to all computer programming. For this reason, we will spend a bit of time talking about program state.
You can imagine that any computer program is essentially a special computer. Running the program is basically just like switching on the computer, allowing it to start listening for input. When input it recieved, it will either ignore and discard the input, or respond to the input. When the computer responds to input, it will typically change its internal state in one or more of the following ways:
*saving some or all of the input into a data structure
*changing its overall response state
*writing some part of its data structure somewhere else, such as to a file
The program determines whether to ignore or respond to an input based on its state. In this way, a response to input in one state can cause the computer to change its state so that it may react differently to input the next time it is recieved. Eventually, your program will enter into a halt state. This may come about when the machine it is running on is turned off, but it is usually good practice to specifically write into your programs a well-defined halt state.
When you begin to design a parser, the first thing you should do is determine what kind of input you will be recieving, and how the input will change the state of the program. If you are dealing with csv files, much of the work is done when you read in a new line. However, most interesting formats do not use the newline as a record separator, and records span multiple lines. Part of your state transition model should encompass the process of recognizing the character/strings that mark the beginning, and sometimes the end, of a record. Your program will respond to inputs which include these character/strings differently than it will other input lines, typically by processing the data structure for the previous record, and creating a new data structure to store the contents of the new record in. You must also make the model understand when it is done parsing, e.g. its halt state. In cases where there is a clearly defined record separator, such as the newline in csv, or some kind of distinct end-of-record string, you will have already processed the record when the halt state is reached. If you are dealing with multi-line records without a distinct record separator, such as FASTA, it is important to process the data structure for the last encountered record before moving on or ending your program. A common mistake that people make when they write, say, a FASTA parser is that they forget to process the last record in the file before halting.
To get an understanding of how to model state in your program, you might want to study the literature of Turing Machines and Finite State Machines.
The Perl Parsing Arsenal
Perl provides a few tools that are essential in the design of a parser. Here are the most important:
* <> : This is the standard record reader that perl provides to read contents in from a stream, called an input handle. If you do not pass in an already defined handle as an argument between the open < and close > (e.g. <$handle> or <HANDLE>) perl uses the default STDIN handle. Contrary to what you may think, this operator is not a line reader. It just so happens that the default record separator in perl is the newline character, which makes <> act as a line reader for most cases. However, it is possible to change the record separator in perl, which is stored in the global $/ variable (Perl Special Variables) in ways that make it behave differently. For most common parsing tasks, the newline character is the most logical choice, even if it means that a real record usually spans multiple lines.
* split: This command takes two arguments, a regular expression which defines the field separator, and a string to split up into an array of fields separated by your field separator regular expression. If you use a grouping construct on the field separator regular expression, then the characters which match the regular expression are returned as seperate fields in the returned array. Otherwise, only the strings which occur between the field separator are returned. Here is an example:
# line is 23 46 32 12
my @things = split /\s+/, $line;
# @things is now (23,46,32,12)
but be sure to chomp the $line first, as @things is actually
(23,46,32,'12\n').
instead do:
chomp $line;
# line is 23 46 32 12 without a newline
my @things = split /\s+/, $line;
# @things is now (23,46,32,12)
chomp $line;
# line is 23|32|46|12
my @things = split /(\|)/, $line;
# @things is now (23,'|',46,'|',32,'|',12)
watch out for leading and trailing whitespace in certain patterns:
chomp $line;
# line is 23 | 32 | 46 | 12
my @things = split /\|/, $line;
# @things is now ('23 ' ,' 32 ',' 46 ',' 12')
# you will need to strip off leading and trailing whitespace here, see s/// below
alternatively, you can use:
my @things = split /\s*\|\s*/, $line;
which would produce what you probably expected
You can separate each character in the input into individual elements in an array by splitting on 'nothing'.
# line is part of the sequence of nucleotides
my @nucs = split //, $line;
but remember to chomp, or you will end up with the last element in @things being a newline character
* m//, s/// : these are the perl match and substitution regular expressions. They are extremely powerful, especially when used with grouping constructs to provide access to the $1, $2, $3, ... backreferences.
The match operator provides you with a general beginning-of-record/end-of-record recognition system.
if ($line =~ m/^\>//) {
# this is the start of a new record, process the id
}
else {
# we are harvesting the sequence characters for the current record
}
And, it gives you a general purpose field parsing system.
if ($line =~ m/ID\=(.*)/) {
$id = $1;
}
The substitution operator provides you with the ability to clean up and further process fields.
$id =~ s/\s+//; # get rid of leading and trailing whitespace
Sometimes you need to split on one regular expression, but need to further process the strings returned by the split.
#line is >ao12343|c3324 Regulator Sequence
my @things = split /\|/, $line;
$things [0] =~ s/^\>//;
my ($accession, $description) = split /\s+/, $things[1];
* references: It is certainly possible to parse each field in a record into its own scalar, array, or hash variable.
e.g. $id, $description, @sequence_nucleotides. But what if you want to collect some or all of the records in your database into an array to process later? You could place the scalars into their own arrays, e.g. have a @ids array, and a @descriptions array, and store each id and sequence into these arrays in the same order. But what about the @sequence_nucleotides? You could store a reference to this array into its own @sequences array, but why not create a single arrayref, or hashref to store all of the fields in a record together, and then push that reference onto an array for later processing, e.g.
$seq = {
'id' => $id,
'description' => $description,
'sequences => \@nucleotide_sequence
};
or
my $seq;
while ($line = <>) {
if ($line =~ m/^\>(\w+)\|(\w+)\s+(\w+)/) {
push @seqs, $seq if ($seq); # process the previous record
$seq = {}; # recreate a new reference
$seq->{'id'} = $1;
$seq->{'accession'} = $2;
$seq->{'description'} = $3;
...
}
References are also the basis for objects, which are just blessed references. In this class, we will be designing our data structures using references.
An Example: The FASTA File
In this class, you will be focusing on parsing a sequence file that is in a format provided by the European Molecular Biology Laborotory. EMBL, Genbank, and the DDBJ in Japan are part of a world-wide consortium which provides free access to sequence data produced by researchers around the world, including the output of the various publicly supported Genome Projects. While each institution still maintains its own standard format for representing a sequence entry, they have refactored them to make sure they present the same annotation data around each sequence in such a way that it is possible to convert an EMBL entry into a Genbank entry, or vice versa, etc. To get an idea of the process of designing a parser for this file, I will go over the process of designing a parser for a much simpler format, the FASTA file.
Part 1. FASTA Format
The FASTA format is very simple. The beginning of a new record is signaled by the line beginning with '>'. There are some standards for what goes on this line defined by NCBI, but you will notice if you look at many fasta files that these are typically ignored, and each file really has its own format for what goes on the id line. Almost anything minus the sequence itself can be found on this line, with almost any kind of formatting, field definitions, etc. After this line, you will see one or more lines of sequence characters, either nucleotide or amino acids. You should not see strange characters in the sequence strings, although you will see variation in the way that capitalization is used in the sequence strings.
>id acc description
acgtacacaatttgggggacg
cagtacgggaaattt
>id2 acc2 description2
accggttttaaatcgtaaaaaaa
cagaattttgc
One gotcha about the FASTA format is that it doesnt have a distinct end-of-record symbol. This becomes especially problematic when you finish parsing all of the lines in the file. It is impossible for the system to do anything with the current record until it knows it is finished processing it, which it only knows when it hits the id line of the next record. This will not happen for the last record. It is important to code your system to handle this appropriately.
Part 2. Perl Data Structure for FASTA
One of the common mantras in the Perl programming community is 'There is More Than One Way To Do It', or TIMTOWTDI (pronounced timtowdee). We could store each ID into a hash, as a key with its index in the value. Then, we could store each other piece of information that we want into separate arrays at the position with that index. So, if the sequence with id a0004433 is the third sequence in the fasta file, it could be stored into the following three containers:
$seq_index{'a0004433'} = 2; #notice that we subtract one for zero based array indexing
$accessions[2] = 244532;
$descriptions[2] = 'An important sequence';
$sequences[2] = 'acgtaagggccc';
And then we could manipulate the entries in the appropriate array depending on what we want to manipulate. But, there is a better way. With references, you can store all of the information for a particular record together in the same container. Then, you can store that container in other containers as needed. In this case we could have a simple $seq variable which is itself a hash reference. Note, we could just as easily use a hash instead of a hash reference, but it will need to be dereferenced before it can be stored into another array or hash, so why not just get used to working with it as a reference right up front. Note, also, that we dont have any extraneous data to store, such as its position in the file, as we do with the previous method.
$seq->{'id'} = 'a0004433';
$seq->{'accession'} = 244532;
$seq->{'description'} = 'An important sequence';
$seq->{'sequence'} = 'acgtaagggccc';
Another thing we could do, if it makes things easier later on, is store the sequence itself as a reference to an array of individual nucleotides.
$seq->{'sequence'} = [ split //, $sequence ]; # note the split return is surrounded by the array reference [ and ]
We could then push that seq entry onto an array
push @sequences, $seq;
or tie it to a key based on some value of our choosing:
$sequences_by_accession{$seq->{'accession'}} = $seq;
When we start to talk about subroutines, packages, and objects, the utility of this type of data structure will become very apparent.
Part 3. State Transition Model
Our fasta parser needs 5 states:
process previous seq
Line Begins with '>'
|----------|
| |
Line Begins with '>' V | eof
start -> Expecting ID --------------------------->Expecting Sequence-----------------------------
eof | | Line Does Not Begin with '>' | Line Begins with '>' |
___| | | |
| V V | process last seq
| Non Fasta File Error Missing Sequence Error |
| |
| |
|------------------------------->Halt<---------------------------------------------------------|
Part 4. Code
Here is a first stab at writing the above state model into a perl script.
use strict;
my @seqs;
my $expecting_id = 1;
my $expecting_sequence = 0;
my $current_seq = {};
while (my $line = <>) {
chomp $line;
if ($line =~ m/^>/) {
if ($expecting_sequence && !$current_seq->{'sequence'}) {
die "Missing Sequence Error\n";
}
if ($current_seq->{'sequence'}) {
push @seqs, $current_seq;
}
$expecting_sequence = 1;
$expecting_id = 0;
my ($id, $acc, $description) = split /\|/, $line;
$id =~ s/^>//;
$current_seq = {
'id' => $id,
'accession' => $acc,
'description' => $description,
};
}
else {
if ($expecting_id) {
die "Non Fasta File Error\n";
}
else {
$current_seq->{'sequence'} .= $line;
}
}
}
# process last seq
if ($current_seq->{'sequence'}) {
push @seqs, $current_seq;
}
# this is the halting process
foreach my $seq (@seqs) {
printf "ID: %s\nACC: %s\nDescription: %s\nSeq:\n%s\n",
$seq->{'id'},
$seq->{'accession'},
$seq->{'description'},
$seq->{'sequence'};
}
exit;
This works as advertised, but it is actually a bit bloated, as it contains redundant information. In particular, it is seldom necessary to code each state into its own variable explicitly. Sometimes, the presence or absence of data in the data structure is enough to know what state the system is in. Here we clean up this redundancy by using the presence of an 'id' value on the current_seq hash as the 'expecting_sequence' state, and the presence of a 'sequence' value on the current_seq hash in place of the 'expecting_id' state.
use strict;
my @seqs;
my $current_seq = {};
while (my $line = <>) {
chomp $line;
if ($line =~ m/^>/) {
if ($current_seq->{'id'} && !$current_seq->{'sequence'}) {
die "Missing Sequence Error\n";
}
if ($current_seq->{'sequence'}) {
push @seqs, $current_seq;
}
my ($id, $acc, $description) = split /\|/, $line;
$current_seq = {
'id' => $id,
'accession' => $acc,
'description' => $description,
};
}
else {
unless ($current_seq->{'id'}) {
die "Non Fasta File Error\n";
}
else {
$current_seq->{'sequence'} .= $line;
}
}
}
# this is the halting process
if ($current_seq->{'sequence'}) {
push @seqs, $current_seq;
}
foreach my $seq (@seqs) {
printf "ID: %s\nACC: %s\nDescription: %s\nSeq:\n%s\n",
$seq->{'id'},
$seq->{'accession'},
$seq->{'description'},
$seq->{'sequence'};
}
exit;
The advantages of removing this redundancy will become apparent when we study how to code this algorithm into a subroutine.
Comments (0)
You don't have permission to comment on this page.