perlsOfLondon

 

RepeatMasking_with_Bio::Tools::Run

Page history last edited by Darin London 1 yr ago

Parsing RepeatMasker Output with Bio::Tools::RepeatMasker

Bio::Tools::RepeatMasker (part of the bioperl-core download) can be used to parse the output from RepeatMasker.  It is a simple system designed to parse each result within the RepeatMasker output into a Bio:SeqFeature::FeaturePair object.  The Bio::SeqFeature::FeaturePair object represents a feature on one sequence which is itself a feature on another sequence, e.g. a blast hit between two sequences.  It contains methods for getting at attributes of both features, either as attributes calls on the object itself ($fp->start is the start of the feature on its first feature, or feature1; while $fp->hstart is the start of the feature on the second feature, or feature2, sometimes called the 'h'it); or you can get the $fp->feature1 or $fp->feature2 Bio::SeqFeatureI compliant objects to further parse.  For large files that you are repeat masking, this is the most efficitent way to use BioPerl to access RepeatMasker output.

 

Running RepeatMasker on Individual Bio::PrimarySeqI objects

With Bio::Tools::Run::RepeatMasker, you can run the RepeatMasker program on Bio::PrimarySeqI objects that you have pulled out of some database, e.g. Bio::SeqIO, Bio::DB::*, Ensembl, etc.    The Bio::Tools::Run::RepeatMasker object is essentially a factory which you can use to 'run' RepeatMasker on an input Bio::PrimarySeqI (remember all Bio::SeqI and Bio::RichSeqI objects are Bio::PrimarySeqI compliant) object to get an array of Bio::SeqFeature::FeaturePair objects.  Note, you can also get access to the masked sequence for a result using the masked_seq method on each Bio::SeqFeature::FeaturePair object returned.  The constructor to Bio::Tools::Run::RepeatMasker takes all of the standard arguments that you would pass to the RepeatMasker program itself.  Note, this can be less efficient than using Bio::Tools::RepeatMasker to parse existing RepeatMasker output, as it calls the RepeatMasker program separately for each sequence you run it on.  For small numbers of sequences, especially if downloaded from web or remote databases, this should be fine.

 

Examples

Here are two versions of a script I wrote for Julie Horvath.  She was wanting to catalogue repeats in primate sequences into a set of useful markers that she could use in further genomic research.  She used the James Watson short read sequences available from the Sanger Institute, quality masked them, and then searched for Simple_repeat class repeats within the quality_masked sequences with a start > 30bp of non-repeat sequence, and end greater than 30bp of non-repeat sequence.

This requires a directory of quality masked fasta files to be indexed using Bio::DB::Fasta, either to search against using the output of RepeatMasker parsed through Bio::Tools::RepeatMasker, or to actually RepeatMask using Bio::Tools:Run::RepeatMasker.   In either case, the repeat_masked interesting fasta records are output to one file, and a summary of the RepeatMasker output is output to another file.

 

 

1. Parses RepeatMasker output

#!/usr/bin/perl

use strict;

use warnings;

use Bio::Tools::RepeatMasker;

use Bio::DB::Fasta;

use Bio::SeqIO;

use Getopt::Long;

# parse @ARGV

my $usage = 'read_repeats.pl --rm_out path_to_repeat_masker_out [--source_fasta_dir path_to_dir_with_fasta]'

            .' --min_start number --min_end number --repeat_class class [--output_filename name]'

            ."\n\n"

            .'  --source_fasta_dir defaults to .'

            ."\n"

            .'  --output_filename is the name of both the summary and the fasta file, defaults to interesting'

            ."\n";

my ($rm_out, $min_start, $min_end, $repeat_class);

my $source_fasta_dir = '.';

my $output_filename = 'interesting';

GetOptions(

        'rm_out=s' => \$rm_out,

        'source_fasta_dir:s' => \$source_fasta_dir,

        'min_start=i' => \$min_start,

        'min_end=i' => \$min_end,

        'repeat_class=s' => \$repeat_class,

        'output_filename:s' => \$output_filename

);

die ($usage)

unless($rm_out && $min_start && $min_end && $repeat_class);

# create a fasta indexed database of all fasta files in the source_fasta_dir

my $db      = Bio::DB::Fasta->new($source_fasta_dir);

my $fasta_out = Bio::SeqIO->new(

        '-file' => '>'.$output_filename.'.fa',

        '-format' => 'Fasta',

);

open (my $summary_out, '>'.$output_filename.'_summary');

my $parser = new Bio::Tools::RepeatMasker(-file => $rm_out);

while( my $result = $parser->next_result ) {

        next unless (

                $result->primary_tag eq $repeat_class

                && $result->start > $min_start

        );

        # have to get the actual sequence, and subtract the feature->end from the sequence length to recalculate the left, as BioPerl doesnt store it

    my $seqid = $result->seq_id;

    my $seq = $db->get_Seq_by_id($seqid) or die "Couldnt find $seqid\n";

        my $end = $seq->length - $result->end;

    next unless ($end > $min_end);

        $fasta_out->write_seq($seq);

    print $summary_out join("\t", $seqid, $result->start, $end, $result->primary_tag)."\n";

}

$fasta_out->close;

$summary_out->close;

exit;

 

2. Runs RepeatMasker on the inpout fasta records.

#!/usr/bin/perl

use strict;
use warnings;

use Bio::Tools::Run::RepeatMasker;
use Bio::DB::Fasta;
use Bio::SeqIO;
use Getopt::Long;


# parse @ARGV
my $usage = 'run_repeats.pl --min_start number --min_end number --repeat_class class [--output_filename name]'
            .' [--source_fasta_dir path_to_fasta_files]'
            ."\n\n"
            .'  --source_fasta_dir defaults to .'
            ."\n"
            .'  --output_filename is the name of both the summary and the fasta file, defaults to interesting'
            ."\n"
            .'  --species species to repeat_mask against, default is human'
            ."\n";
my ($min_start, $min_end, $repeat_class);
my $source_fasta_dir = '.';
my $output_filename = 'interesting';
my $species = 'human';

GetOptions(
        'source_fasta_dir:s' => \$source_fasta_dir,
        'min_start=i' => \$min_start,
        'min_end=i' => \$min_end,
        'repeat_class=s' => \$repeat_class,
        'output_filename:s' => \$output_filename
);
die ($usage)
unless($min_start && $min_end && $repeat_class);


# create a fasta indexed database of all fasta files in the source_fasta_dir
my $db      = Bio::DB::Fasta->new($source_fasta_dir);
my $stream  = $db->get_PrimarySeq_stream;

my $fasta_out = Bio::SeqIO->new(
        '-file' => '>'.$output_filename.'.fa',
        '-format' => 'Fasta',);
open (my $summary_out, '>'.$output_filename.'_summary');

my @params = (
        'species' => $species
);
my $factory = Bio::Tools::Run::RepeatMasker->new(@params);
#$factory->verbose(1);
SEQ: while( my $seq = $stream->next_seq ) {
        my @feats = eval {$factory->run($seq)};
    next SEQ if ($@);

    RESULT: foreach my $result (@feats) {
          next unless (
                $result->primary_tag eq $repeat_class
                && $result->start > $min_start
          );

          my $end = $seq->length - $result->end;
      next RESULT unless ($end > $min_end);

          $fasta_out->write_seq($result->masked_seq);
      print $summary_out join("\t", $seq->id, $result->start, $end, $result->primary_tag)."\n";
    }
}
$fasta_out->close;
$summary_out->close;
exit;

Comments (0)

You don't have permission to comment on this page.