bioperl(3).pdf

(98 KB) Pobierz

Using bioperl

#!/usr/bin/perl –w

use Bio::Perl;

Bioperl provides software modules for many of the typical tasks of bioinformatics programming. These include:

•

Accessing sequence data from local and remote databases

Transforming formats of database/ file records

Manipulating individual sequences

Searching for similar sequences

Creating and manipulating sequence alignments

Searching for genes and other structures on genomic DNA

Developing machine readable sequence annotations

Not covered in this tutorial

Accessing sequence data from local and remote databases

Much of bioperl is focused on sequence manipulation. However, before bioperl can manipulate sequences, it needs to

have access to sequence data. Now one can directly enter data sequence data into a bioperl Seq object, eg:

$seq = Bio::Seq->new(-seq

-desc

-display_id

-accession_number

-alphabet

'actgtggcgtcaact',

'Sample Bio::Seq object',

'something',

'accnum',

'dna' );

However, in most cases, you will probably be accessing sequence data from some online data file or database.

Bioperl supports accessing remote databases as well as creating indices for accessing local databases. There are two

general approaches to accomplishing this. If you know what kind of database the sequences are stored in (i.e. flat file,

local relational database or a database accessed remotely over the internet), you can write a script that specifically

accesses data from that kind of database. To explicitly access sequence data from a local relational database requires

installing and setting up the modules in the bioperl-db library and the BioSQL schema.

The other approach is to use the recently developed

OBDA

(Open Bioinformatics Data Access) Registry system. Using

OBDA it is possible to import sequence data from a database without your needing to know whether the required

database is flat-file or relational or even whether it is local or accessible only over the net. Descriptions of how to set up

the necessary registry configuration file and access sequence data with the registry in described at

http://bioperl.org/HOWTOs/html/OBDA_Access.html.

Accessing remote databases (Bio::DB::GenBank, etc)

I am putting this nifty box around items that will be required as part of today’s practice assignment.

Write a bioperl script that will retrieve the GenBank files from a list of GenBank ID’s given on the

command line. For example:

>perl RetrieveGenBank.pl 62857694 16741142 4506102

Accessing sequence data from the principal molecular biology databases is straightforward in bioperl. Data can be

accessed by means of the sequence's accession number or id. Batch mode access is also supported to facilitate the

efficient retrieval of multiple sequences. For retrieving data from genbank, for example, the code could be as follows:

$gb = new Bio::DB::GenBank();

# this returns a Seq object :

$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');

# this also returns a Seq object :

$seq2 = $gb->get_Seq_by_acc('AF303112');

# this returns a SeqIO object, which can be used to get a Seq object :

$seqio = $gb->get_Stream_by_id(["J00522","AF303112","2981014"]);

$seq3 = $seqio->next_seq;

The SeqIO object is explained later.

Bioperl currently supports sequence data retrieval from:

•

genbank

Bio::DB::GenBank manpage

•

genpept

Bio::DB::GenPept manpage

•

RefSeq

Bio::DB::SwissProt manpage

•

swissprot

Bio::DB::RefSeq manpage

•

EMBL

Bio::DB::EMBL manpage

The retrieval of NCBI RefSeqs sequences is supported through a special module called Bio::DB::RefSeq which actually

queries an EBI server. Please see

the Bio::DB::RefSeq manpage

before using it as there are some caveats with RefSeq

retrieval. RefSeq ids in Genbank begin with ``NT_'', ``NC_'', ``NG_'', ``NM_'', ``NP_'', ``XM_'', ``XR_'', or ``XP_''.

Bio::DB::GenBank can be used to retrieve entries corresponding to these ids but bear in mind that these are not

Genbank entries, strictly speaking. See

the Bio::DB::GenBank manpage

for special details on retrieving entries

beginning with ``NT_'', these are specially formatted ``CONTIG'' entries.

Bioperl also supports retrieval from a remote Ace database. This capability requires the presence of the external AcePerl

module. The Lincoln Stein Modules can be found at:

http://stein.cshl.org/AcePerl/.

An additional module is available for accessing remote databases, BioFetch, which queries the dbfetch script at EBI.

The available databases are EMBL, GenBank, or SWALL, and the entries can be retrieved in different formats as

objects or streams (SeqIO objects), or as temporary files. See

the Bio::DB::BioFetch manpage

for the details.

Indexing and accessing local databases (Bio::Index::*, bp_index.pl, bp_fetch.pl, Bio::DB::*)

Alternately, bioperl permits indexing local sequence data files by means of the Bio::Index or Bio::DB::Fasta objects.

The following sequence data formats are supported by Bio::Index:

genbank, swissprot, pfam, embl

and

fasta.

Once

the set of sequences have been indexed using Bio::Index, individual sequences can be accessed using syntax very

similar to that described above for accessing remote databases. For example, if one wants to set up an indexed flat-file

database of fasta files, and later wants then to retrieve one file, one could write scripts like:

# script 1: create the index

use Bio::Index::Fasta;

# using fasta file format

use strict; # some users have reported that this is necessary

my $Index_File_Name = shift;

my $inx = Bio::Index::Fasta->new(

-filename => $Index_File_Name,

-write_flag => 1);

$inx->make_index(@ARGV);

# script 2: retrieve some files

use Bio::Index::Fasta;

use strict; # some users have reported that this is necessary

my $Index_File_Name = shift;

my $inx = Bio::Index::Fasta->new($Index_File_Name);

foreach my $id (@ARGV) {

my $seq = $inx->fetch($id); # Returns Bio::Seq object

# do something with the sequence

}

To facilitate the creation and use of more complex or flexible indexing systems, the bioperl distribution includes two

sample scripts in the scripts/index directory, bp_index.PLS and bp_fetch.PLS. These scripts can be used as templates to

develop customized local data-file indexing systems.

Bioperl also supplies Bio::DB::Fasta as a means to index and query Fasta format files. It's similar in spirit to

Bio::Index::Fasta but offers more methods, e.g.

use Bio::DB::Fasta;

use strict;

my $db = Bio::DB::Fasta->new($file);

my $seqstring = $db->seq($id);

my $seqobj = $db->get_Seq_by_id($id);

my $desc = $db->header($id);

one

get

file or many files

a sequence as string

a PrimarySeq obj

the header, or description line

See

the Bio::DB::Fasta manpage

for more information on this fully-featured module.

Both modules also offer the user the ability to designate a specific string within the fasta header as the desired id, such

as the gi number within the string ``gi|4556644|gb|X45555''. Consider the following fasta-formatted sequence, in

``test.fa'':

>gi|523232|emb|AAC12345|sp|D12567 titin fragment

MHRHHRTGYSAAYGPLKJHGYVHFIMCVVVSWWASDVVTYIPLLLNNSSAGWKRWWWIIFGGE

GHGHHRTYSALWWPPLKJHGSKHFILCVKVSWLAKKERTYIPKKILLMMGGWWAAWWWI

By default Bio::Index::Fasta and Bio::DB::Fasta will use the first ``word'' they encounter in the fasta header as the

retrieval key, in this case ``gi|523232|emb|AAC12345|sp|D12567''. What would be more useful as a key would be a

single id. The code below will index the ``test.fa'' file and create an index file called ``test.fa.idx'' where the keys are the

Swissprot, or ``sp'', identifiers.

$ENV{BIOPERL_INDEX_TYPE} = "SDBM_File";

# look for the index in the current directory

$ENV{BIOPERL_INDEX} = ".";

my $file_name = "test.fa";

my $inx = Bio::Index::Fasta->new( -filename

=> $file_name . ".idx",

-write_flag => 1 );

# pass a reference to the critical function to the Bio::Index object

$inx->id_parser(\&get_id);

# make the index

$inx->make_index($file_name);

# here is where the retrieval key is specified

sub get_id {

my $header = shift;

$header =~ /^>.*\bsp\|([A-Z]\d{5}\b)/;

$1;

}

Here is how you would retrieve the sequence, as a Bio::Seq object:

my $seq = $inx->fetch("D12567");

print $seq->seq;

What if you wanted to retrieve a sequence using either a Swissprot id or a gi number and the fasta header was actually a

concatenation of headers with multiple gi's and Swissprots?

>gi|523232|emb|AAC12345|sp|D12567|gi|7744242|sp|V11223 titin fragment

Modify the function that's passed to the id_parser method:

sub get_id {

my $header = shift;

my (@sps) = $header =~ /^>.*\bsp\|([A-Z]\d{5})\b/g;

my (@gis) = $header =~ /gi\|(\d+)\b/g;

return (@sps,@gis);

}

The Bio::DB::Fasta module uses the same principle, but the syntax is slightly different, for example:

my $db = Bio::DB::Fasta->new('test.fa', -makeid=>\&make_my_id);

my $seqobj = $db->get_Seq_by_id($id);

sub make_my_id {

my $description_line = shift;

$description_line =~ /gi\|(\d+)\|emb\|(\w+)/;

($1,$2);

}

The core bioperl installation does not support accessing sequences and data stored in relational databases. However, this

capability is available with the auxiliary bioperl-db library. See section

IV.3

for more information.

Transforming sequence files (SeqIO)

A common - and tedious - bioinformatics task is that of converting sequence data among the many widely used data

formats. Bioperl's SeqIO object, however, makes this chore a breeze. SeqIO can read a stream of sequences - located in

a single or in multiple files - in a number of formats including

Fasta, EMBL, GenBank, Swissprot, SCF, phd/phred,

Ace, fastq, exp, chado, or raw (plain sequence).

SeqIO can also parse tracefiles in alf, ztr, abi, ctf, and ctr format

Once the sequence data has been read in with SeqIO, it is available to bioperl in the form of Seq, PrimarySeq, or

RichSeq objects, depending on what the sequence source is. Moreover, the sequence objects can then be written to

another file using SeqIO in any of the supported data formats making data converters simple to implement, for example:

use Bio::SeqIO;

$in = Bio::SeqIO->new(-file => "inputfilename",

-format => 'Fasta');

$out = Bio::SeqIO->new(-file => ">outputfilename",

-format => 'EMBL');

while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }

In addition, the perl ``tied filehandle'' syntax is available to SeqIO, allowing you to use the standard <> and print

operations to read and write sequence objects, eg:

$in

= Bio::SeqIO->newFh(-file => "inputfilename" ,

-format => 'fasta');

$out = Bio::SeqIO->newFh(-format => 'embl');

print $out $_ while <$in>;

If the ``-format'' argument isn't used then Bioperl will try to determine the format based on the file's suffix, in a case-

insensitive manner. If there's no suffix available then SeqIO will attempt to guess the format based on actual content. If

it can't determine the format then it will assume ``fasta''. A complete list of formats and suffixes can be found in the

SeqIO HOWTO (http://bioperl.org/HOWTOs/html/SeqIO.html).

Output the files from

(1)

in FASTA format to a file named [yourname]_bioperl.out

Transforming alignment files (AlignIO)

Data files storing multiple sequence alignments also appear in varied formats. AlignIO is the bioperl object for

conversion of alignment files. AlignIO is patterned on the SeqIO object and its commands have many of the same

names as the commands in SeqIO. Just as in SeqIO the AlignIO object can be created with ``-file'' and ``-format''

options:

use Bio::AlignIO;

my $io = Bio::AlignIO->new(-file

=> "receptors.aln",

-format => "clustalw" );

If the ``-format'' argument isn't used then Bioperl will try and determine the format based on the file's suffix, in a case-

insensitive manner. Here is the current set of suffixes:

Format

bl2seq

clustalw

emboss*

fasta

maf

mase

mega

meme

metafasta

msf

nexus

pfam

phylip

prodom

psi

selex

stockholm

Suffixes

aln

water|needle

fasta|fast|seq|fa|fsa|nt|aa

maf

Seaview

meg|mega

meme

msf|pileup|gcg

nexus|nex

pfam|pfm

phylip|phlp|phyl|phy|phy|ph

GCG

Comment

interleaved

POA

PSI-BLAST

HMMER

psi

selex|slx|selx|slex|sx

*water, needle, matcher, stretcher, merger, and supermatcher.

Unlike SeqIO AlignIO cannot create output files in every format. AlignIO currently supports output in these 7 formats:

bioperl(3).pdf

Plik z chomika:

Inne pliki z tego folderu:

Inne foldery tego chomika: