bioperl(3).pdf

(98 KB) Pobierz
Using bioperl
#!/usr/bin/perl –w
use Bio::Perl;
Bioperl provides software modules for many of the typical tasks of bioinformatics programming. These include:
Accessing sequence data from local and remote databases
Transforming formats of database/ file records
Manipulating individual sequences
Searching for similar sequences
Creating and manipulating sequence alignments
Searching for genes and other structures on genomic DNA
Developing machine readable sequence annotations
Not covered in this tutorial
Accessing sequence data from local and remote databases
Much of bioperl is focused on sequence manipulation. However, before bioperl can manipulate sequences, it needs to
have access to sequence data. Now one can directly enter data sequence data into a bioperl Seq object, eg:
$seq = Bio::Seq->new(-seq
-desc
-display_id
-accession_number
-alphabet
=>
=>
=>
=>
=>
'actgtggcgtcaact',
'Sample Bio::Seq object',
'something',
'accnum',
'dna' );
However, in most cases, you will probably be accessing sequence data from some online data file or database.
Bioperl supports accessing remote databases as well as creating indices for accessing local databases. There are two
general approaches to accomplishing this. If you know what kind of database the sequences are stored in (i.e. flat file,
local relational database or a database accessed remotely over the internet), you can write a script that specifically
accesses data from that kind of database. To explicitly access sequence data from a local relational database requires
installing and setting up the modules in the bioperl-db library and the BioSQL schema.
The other approach is to use the recently developed
OBDA
(Open Bioinformatics Data Access) Registry system. Using
OBDA it is possible to import sequence data from a database without your needing to know whether the required
database is flat-file or relational or even whether it is local or accessible only over the net. Descriptions of how to set up
the necessary registry configuration file and access sequence data with the registry in described at
http://bioperl.org/HOWTOs/html/OBDA_Access.html.
Accessing remote databases (Bio::DB::GenBank, etc)
I am putting this nifty box around items that will be required as part of today’s practice assignment.
1)
Write a bioperl script that will retrieve the GenBank files from a list of GenBank ID’s given on the
command line. For example:
>perl RetrieveGenBank.pl 62857694 16741142 4506102
Accessing sequence data from the principal molecular biology databases is straightforward in bioperl. Data can be
accessed by means of the sequence's accession number or id. Batch mode access is also supported to facilitate the
efficient retrieval of multiple sequences. For retrieving data from genbank, for example, the code could be as follows:
$gb = new Bio::DB::GenBank();
# this returns a Seq object :
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
# this also returns a Seq object :
$seq2 = $gb->get_Seq_by_acc('AF303112');
# this returns a SeqIO object, which can be used to get a Seq object :
$seqio = $gb->get_Stream_by_id(["J00522","AF303112","2981014"]);
$seq3 = $seqio->next_seq;
The SeqIO object is explained later.
Bioperl currently supports sequence data retrieval from:
genbank
Bio::DB::GenBank manpage
genpept
Bio::DB::GenPept manpage
RefSeq
Bio::DB::SwissProt manpage
swissprot
Bio::DB::RefSeq manpage
EMBL
Bio::DB::EMBL manpage
The retrieval of NCBI RefSeqs sequences is supported through a special module called Bio::DB::RefSeq which actually
queries an EBI server. Please see
the Bio::DB::RefSeq manpage
before using it as there are some caveats with RefSeq
retrieval. RefSeq ids in Genbank begin with ``NT_'', ``NC_'', ``NG_'', ``NM_'', ``NP_'', ``XM_'', ``XR_'', or ``XP_''.
Bio::DB::GenBank can be used to retrieve entries corresponding to these ids but bear in mind that these are not
Genbank entries, strictly speaking. See
the Bio::DB::GenBank manpage
for special details on retrieving entries
beginning with ``NT_'', these are specially formatted ``CONTIG'' entries.
Bioperl also supports retrieval from a remote Ace database. This capability requires the presence of the external AcePerl
module. The Lincoln Stein Modules can be found at:
http://stein.cshl.org/AcePerl/.
An additional module is available for accessing remote databases, BioFetch, which queries the dbfetch script at EBI.
The available databases are EMBL, GenBank, or SWALL, and the entries can be retrieved in different formats as
objects or streams (SeqIO objects), or as temporary files. See
the Bio::DB::BioFetch manpage
for the details.
Indexing and accessing local databases (Bio::Index::*, bp_index.pl, bp_fetch.pl, Bio::DB::*)
Alternately, bioperl permits indexing local sequence data files by means of the Bio::Index or Bio::DB::Fasta objects.
The following sequence data formats are supported by Bio::Index:
genbank, swissprot, pfam, embl
and
fasta.
Once
the set of sequences have been indexed using Bio::Index, individual sequences can be accessed using syntax very
similar to that described above for accessing remote databases. For example, if one wants to set up an indexed flat-file
database of fasta files, and later wants then to retrieve one file, one could write scripts like:
# script 1: create the index
use Bio::Index::Fasta;
# using fasta file format
use strict; # some users have reported that this is necessary
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new(
-filename => $Index_File_Name,
-write_flag => 1);
$inx->make_index(@ARGV);
# script 2: retrieve some files
use Bio::Index::Fasta;
use strict; # some users have reported that this is necessary
my $Index_File_Name = shift;
my $inx = Bio::Index::Fasta->new($Index_File_Name);
foreach my $id (@ARGV) {
my $seq = $inx->fetch($id); # Returns Bio::Seq object
# do something with the sequence
}
To facilitate the creation and use of more complex or flexible indexing systems, the bioperl distribution includes two
sample scripts in the scripts/index directory, bp_index.PLS and bp_fetch.PLS. These scripts can be used as templates to
develop customized local data-file indexing systems.
Bioperl also supplies Bio::DB::Fasta as a means to index and query Fasta format files. It's similar in spirit to
Bio::Index::Fasta but offers more methods, e.g.
use Bio::DB::Fasta;
use strict;
my $db = Bio::DB::Fasta->new($file);
my $seqstring = $db->seq($id);
my $seqobj = $db->get_Seq_by_id($id);
my $desc = $db->header($id);
#
#
#
#
one
get
get
get
file or many files
a sequence as string
a PrimarySeq obj
the header, or description line
See
the Bio::DB::Fasta manpage
for more information on this fully-featured module.
Both modules also offer the user the ability to designate a specific string within the fasta header as the desired id, such
as the gi number within the string ``gi|4556644|gb|X45555''. Consider the following fasta-formatted sequence, in
``test.fa'':
>gi|523232|emb|AAC12345|sp|D12567 titin fragment
MHRHHRTGYSAAYGPLKJHGYVHFIMCVVVSWWASDVVTYIPLLLNNSSAGWKRWWWIIFGGE
GHGHHRTYSALWWPPLKJHGSKHFILCVKVSWLAKKERTYIPKKILLMMGGWWAAWWWI
By default Bio::Index::Fasta and Bio::DB::Fasta will use the first ``word'' they encounter in the fasta header as the
retrieval key, in this case ``gi|523232|emb|AAC12345|sp|D12567''. What would be more useful as a key would be a
single id. The code below will index the ``test.fa'' file and create an index file called ``test.fa.idx'' where the keys are the
Swissprot, or ``sp'', identifiers.
$ENV{BIOPERL_INDEX_TYPE} = "SDBM_File";
# look for the index in the current directory
$ENV{BIOPERL_INDEX} = ".";
my $file_name = "test.fa";
my $inx = Bio::Index::Fasta->new( -filename
=> $file_name . ".idx",
-write_flag => 1 );
# pass a reference to the critical function to the Bio::Index object
$inx->id_parser(\&get_id);
# make the index
$inx->make_index($file_name);
# here is where the retrieval key is specified
sub get_id {
my $header = shift;
$header =~ /^>.*\bsp\|([A-Z]\d{5}\b)/;
$1;
}
Here is how you would retrieve the sequence, as a Bio::Seq object:
my $seq = $inx->fetch("D12567");
print $seq->seq;
What if you wanted to retrieve a sequence using either a Swissprot id or a gi number and the fasta header was actually a
concatenation of headers with multiple gi's and Swissprots?
>gi|523232|emb|AAC12345|sp|D12567|gi|7744242|sp|V11223 titin fragment
Modify the function that's passed to the id_parser method:
sub get_id {
my $header = shift;
my (@sps) = $header =~ /^>.*\bsp\|([A-Z]\d{5})\b/g;
my (@gis) = $header =~ /gi\|(\d+)\b/g;
return (@sps,@gis);
}
The Bio::DB::Fasta module uses the same principle, but the syntax is slightly different, for example:
my $db = Bio::DB::Fasta->new('test.fa', -makeid=>\&make_my_id);
my $seqobj = $db->get_Seq_by_id($id);
sub make_my_id {
my $description_line = shift;
$description_line =~ /gi\|(\d+)\|emb\|(\w+)/;
($1,$2);
}
The core bioperl installation does not support accessing sequences and data stored in relational databases. However, this
capability is available with the auxiliary bioperl-db library. See section
IV.3
for more information.
Transforming sequence files (SeqIO)
A common - and tedious - bioinformatics task is that of converting sequence data among the many widely used data
formats. Bioperl's SeqIO object, however, makes this chore a breeze. SeqIO can read a stream of sequences - located in
a single or in multiple files - in a number of formats including
Fasta, EMBL, GenBank, Swissprot, SCF, phd/phred,
Ace, fastq, exp, chado, or raw (plain sequence).
SeqIO can also parse tracefiles in alf, ztr, abi, ctf, and ctr format
Once the sequence data has been read in with SeqIO, it is available to bioperl in the form of Seq, PrimarySeq, or
RichSeq objects, depending on what the sequence source is. Moreover, the sequence objects can then be written to
another file using SeqIO in any of the supported data formats making data converters simple to implement, for example:
use Bio::SeqIO;
$in = Bio::SeqIO->new(-file => "inputfilename",
-format => 'Fasta');
$out = Bio::SeqIO->new(-file => ">outputfilename",
-format => 'EMBL');
while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
In addition, the perl ``tied filehandle'' syntax is available to SeqIO, allowing you to use the standard <> and print
operations to read and write sequence objects, eg:
$in
= Bio::SeqIO->newFh(-file => "inputfilename" ,
-format => 'fasta');
$out = Bio::SeqIO->newFh(-format => 'embl');
print $out $_ while <$in>;
If the ``-format'' argument isn't used then Bioperl will try to determine the format based on the file's suffix, in a case-
insensitive manner. If there's no suffix available then SeqIO will attempt to guess the format based on actual content. If
it can't determine the format then it will assume ``fasta''. A complete list of formats and suffixes can be found in the
SeqIO HOWTO (http://bioperl.org/HOWTOs/html/SeqIO.html).
2)
Output the files from
(1)
in FASTA format to a file named [yourname]_bioperl.out
Transforming alignment files (AlignIO)
Data files storing multiple sequence alignments also appear in varied formats. AlignIO is the bioperl object for
conversion of alignment files. AlignIO is patterned on the SeqIO object and its commands have many of the same
names as the commands in SeqIO. Just as in SeqIO the AlignIO object can be created with ``-file'' and ``-format''
options:
use Bio::AlignIO;
my $io = Bio::AlignIO->new(-file
=> "receptors.aln",
-format => "clustalw" );
If the ``-format'' argument isn't used then Bioperl will try and determine the format based on the file's suffix, in a case-
insensitive manner. Here is the current set of suffixes:
Format
bl2seq
clustalw
emboss*
fasta
maf
mase
mega
meme
metafasta
msf
nexus
pfam
phylip
po
prodom
psi
selex
stockholm
Suffixes
aln
water|needle
fasta|fast|seq|fa|fsa|nt|aa
maf
Seaview
meg|mega
meme
msf|pileup|gcg
nexus|nex
pfam|pfm
phylip|phlp|phyl|phy|phy|ph
GCG
Comment
interleaved
POA
PSI-BLAST
HMMER
psi
selex|slx|selx|slex|sx
*water, needle, matcher, stretcher, merger, and supermatcher.
Unlike SeqIO AlignIO cannot create output files in every format. AlignIO currently supports output in these 7 formats:
fasta, mase, selex, clustalw, msf/gcg, phylip (interleaved), and po.
Zgłoś jeśli naruszono regulamin