perlsOfLondon

 

play parsing

Page history last edited by dmlond 2 yrs ago

Download the file at the following URL into a place on your machine where you will be able to find it from the file system (UNIX home is a good place).

nm007770.gb

Now we will play around with this modified GenBank file (some of the more complex hierarchical aspects, such as features and a number of references, have been removed).

1. First read it into a perl program, and simply 'parrot' the lines back to STDOUT.

2. Now modify this program to remove the newlines from the lines as they are being read in. Then manually add the newline back in when you print each line out (if you are successful, it should be a 'parrot' again, and you will know that you have removed the newline in memory if you don't get extra blank lines in your output).

3. Substitute the whitespace in each line with the '@' sign. Play around with the difference between s/\s/@/, s/\s+/@/, s/\s/@/g, and s/\s+/@/g.

4. Split the contents of each line into a @parts array. Again, play around with the difference between split /\s/, $line and split /\s+/, $line. You can use the following print statement to help you see the difference:
print join("@", @parts), "\n";

5. Now print out the first string on each line (this may be an empty string, which is actually important). The printout should look something like "I GOT FIRST STRING $firstString\n".

6. Now skip every line with a blank first string (hint: use the 'next' command with an 'unless' test; an empty string is still defined, so test that $firstString is both defined and has a non-zero length), and print the contents of each line as if it were a key $firstString and a value of everything else concatenated together, e.g. print "KEY $firstString VALUE $rest\n"; (hint: you will want to use the join function on the array elements other than the first (0) element; there is more than one way to do this).
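The steps above can be sketched roughly as follows. This is only one possible approach, not a model answer. To stay self-contained it works on a few in-memory sample lines rather than reading the file; the sample lines are made up and are not copied from nm007770.gb, and the names demo_variants and key_value are invented for this sketch. With a real file you would loop over a filehandle instead, e.g. while (my $line = <$fh>).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines in a GenBank-like layout (not from nm007770.gb).
my @lines = (
    "LOCUS       XX_000001   1234 bp\n",
    "DEFINITION  An example record with a\n",
    "            continued description.\n",
);

# Steps 2 to 4: apply each substitution and split variant to a copy of the line.
sub demo_variants {
    my ($line) = @_;
    chomp $line;                                 # step 2: drop the newline
    my %out;
    ($out{first_ws}  = $line) =~ s/\s/@/;        # first whitespace character only
    ($out{first_run} = $line) =~ s/\s+/@/;       # first run of whitespace only
    ($out{each_ws}   = $line) =~ s/\s/@/g;       # every whitespace character
    ($out{each_run}  = $line) =~ s/\s+/@/g;      # every run of whitespace
    $out{split_one}  = join('@', split /\s/,  $line);
    $out{split_run}  = join('@', split /\s+/, $line);
    return \%out;
}

# Steps 5 and 6: first string as the key, the remaining strings as the value.
# Returns an empty list for lines that should be skipped.
sub key_value {
    my ($line) = @_;
    chomp $line;
    my @parts = split /\s+/, $line;
    my $firstString = $parts[0];
    # An indented line gives an empty (but defined) first string;
    # a completely blank line gives no parts at all, so undef.
    return unless defined $firstString && length $firstString;
    my $rest = join(' ', @parts[1 .. $#parts]);
    return ($firstString, $rest);
}

my $variants = demo_variants($lines[0]);
print "$_: $variants->{$_}\n" for sort keys %$variants;

for my $line (@lines) {
    my ($firstString, $rest) = key_value($line);
    next unless defined $firstString;            # skipped line
    print "KEY $firstString VALUE $rest\n";
}
```

An array slice is used for $rest here; shifting the first element off @parts and joining what is left would work just as well.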

extra credit

If you want to continue 'parsing' this file on your own outside of class, it will be available on my public_html space here for a while. Think about how you would handle the contents of the lines that you skipped above. This is somewhat complicated. In some cases the skipped line was a continuation of the previous line, which you would append to the text that you printed out as $rest in step 6 (this would require a state variable that is scoped outside of the while loop where you are reading in the contents of the file). In other cases the skipped lines represented a hierarchical subsection of REFERENCE, with some 'attributes' of that reference that you might want to store and manipulate (this would involve a state hash variable that, again, is scoped outside of the while loop where you are reading the contents of the file).

Ideally, you might want to be able to parse this file into a hash keyed by the LOCUS ID, pointing to another hash with keys like 'LOCUS' -> $rest where $firstString eq LOCUS but $rest does not include the LOCUS ID, 'DEFINITION' -> $rest where $firstString eq DEFINITION, etc. You would want REFERENCE to point to an array of hashref entries with keys like REFERENCE, AUTHORS, TITLE (this one will have a continuation), etc.

Note that this is not an easy thing to do, but if you can do it, you have really begun to master the use of Perl to parse things, and you will have started to think about a computational algorithm involving things like 'state' and data structures. This is the kind of thing that would be covered in a more advanced Perl class.
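One way the state variables described above might fit together is sketched below. It is a rough starting point under stated assumptions, not a complete GenBank parser. It assumes that sub-keys such as AUTHORS are indented by a couple of columns and that continuation lines are indented by at least 12 columns; check this against the actual file before relying on it. The sample lines are made up, and the names parse_record, $current and $in_reference are invented for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record. The indentation rules used below are an assumption
# about the layout, not something verified against nm007770.gb.
my @lines = (
    "LOCUS       XX_000001   1234 bp\n",
    "DEFINITION  An example record with a\n",
    "            continued description.\n",
    "REFERENCE   1\n",
    "  AUTHORS   Doe,J. and Roe,R.\n",
    "  TITLE     A title that runs\n",
    "            onto a second line\n",
    "REFERENCE   2\n",
    "  AUTHORS   Poe,E.\n",
);

sub parse_record {
    my @input = @_;
    # State, declared outside the loop so it survives from line to line.
    my %record;            # the data structure being built
    my $current;           # reference to the value a continuation appends to
    my $in_reference = 0;  # true while inside a REFERENCE block

    for my $line (@input) {
        chomp $line;
        next unless $line =~ /\S/;                  # ignore blank lines
        my ($indent, $text) = $line =~ /^(\s*)(.*)$/;

        if (length($indent) >= 12) {                # continuation line
            $$current .= " $text" if $current;
            next;
        }

        my ($key, $value) = split /\s+/, $text, 2;
        $value = '' unless defined $value;

        if (length $indent) {                       # indented sub-key
            if ($in_reference) {
                $record{REFERENCE}[-1]{$key} = $value;
                $current = \$record{REFERENCE}[-1]{$key};
            }
            else {
                $current = undef;                   # sub-keys elsewhere: ignored here
            }
        }
        elsif ($key eq 'REFERENCE') {               # start a new reference entry
            push @{ $record{REFERENCE} }, { REFERENCE => $value };
            $current      = \$record{REFERENCE}[-1]{REFERENCE};
            $in_reference = 1;
        }
        else {                                      # ordinary top-level key
            $record{$key} = $value;
            $current      = \$record{$key};
            $in_reference = 0;
        }
    }
    return \%record;
}

my $record = parse_record(@lines);

# Key the record by its LOCUS ID and keep the remainder under 'LOCUS'.
my ($locus_id, $locus_rest) = split /\s+/, $record->{LOCUS}, 2;
$record->{LOCUS} = $locus_rest;
my %by_locus = ($locus_id => $record);

print "$locus_id DEFINITION: $by_locus{$locus_id}{DEFINITION}\n";
print "$locus_id TITLE: $by_locus{$locus_id}{REFERENCE}[0]{TITLE}\n";
```

Keeping a reference to the most recently stored value means a continuation line can be appended without caring whether that value lives at the top level or inside a REFERENCE entry. Sub-keys outside a REFERENCE block are simply dropped in this sketch, which is a simplification you would want to revisit.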
