Bioinformatics using Python for Biologists
9.1 Biopython - Introduction
Biopython (http://biopython.org/wiki/Main_Page) is a collection of modules for
computational molecular biology, which allows performing most of the basic (and in
many cases also advanced) tasks required in a bioinformatics project.
The most common tasks that can be performed by using Biopython include:
− parsing (i.e. extracting information from) the most common file formats for gene
and protein sequences, protein structures, PubMed records, etc.;
− downloading files from repositories such as NCBI, ExPASy, etc.;
− running (locally or remotely) popular bioinformatics algorithms such as BLAST,
ClustalW, etc.;
− running Biopython implementations of algorithms for clustering, machine learning,
data analysis, and data visualization.
It also provides classes that the Python programmer can use to handle data (such as
sequences) and methods to perform operations on them (such as translation,
complement, etc.). The Biopython Tutorial and Cookbook
(http://biopython.org/DIST/docs/tutorial/Tutorial.html) is a good starting point to get
a grasp of what Biopython can do for you. More advanced tutorials can be found for
specific packages (for example the Structural Biopython package,
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf).
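To make the idea of "operations on sequences" concrete, here is a minimal plain-Python sketch of the complement and reverse complement of a DNA string. It is only an illustration of what such an operation computes, not Biopython's implementation; the names DNA_COMPLEMENT, complement and reverse_complement are chosen here for the example and handle only the four unambiguous DNA letters.

```python
# Plain-Python illustration; this is not Biopython's implementation.
# Watson-Crick pairing for the four unambiguous DNA letters.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return "".join(DNA_COMPLEMENT[base] for base in dna.upper())


def reverse_complement(dna):
    """Return the complementary strand, reversed so it reads 5' to 3'."""
    return complement(dna)[::-1]


example = "AGCATCGTAGCATGCAC"  # the sequence used later in this chapter
print(complement(example))
print(reverse_complement(example))
```

Biopython sequence objects offer methods with the same purpose, so in practice you would rarely write this yourself.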
9.1.1 You can use Biopython in several ways
Biopython is not a program itself; it is a collection of tools for Python bioinformatics
programming. In most cases, the Python programmer can build a research pipeline
completely by using Biopython, or by writing new code for some specific tasks and
using Biopython for more standard operations, or can modify Biopython open-source
code to better adapt it to their own needs. Python excels as a "glue language" that
can stick together other people's programs, functions, classes, etc.
Write your own code when:
− The algorithm implementation or coding is interesting to you;
− Biopython data structure mapping is too complex for your task;
− Biopython does not provide tools for your specific task;
− You want to have complete control of what your code does.
Use Biopython when:
− Its modules and/or methods fit your needs;
− Your task is unchallenging or boring: Why waste your time? Don't "re-invent
the wheel" unless you're doing it as a learning project;
− Your task will take you a lot of effort to write.
Extend Biopython (i.e. modify the Biopython source code) when:
− The Biopython modules and/or methods almost do what you need, but do not
exactly fit your needs;
− But: it might be challenging for the beginner. It can be difficult to read and
understand someone else's code.
Remember:
− When doing bioinformatics, keep Biopython in mind; it is very powerful;
− Browse the documentation; become familiar with its capabilities;
− Use help(), type(), dir() and other built-in features to explore Biopython
modules.
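The built-in exploration tools mentioned in the last point can be tried on any module. The sketch below uses the standard-library json module as a stand-in, so that it runs even where Biopython is not installed; the helper public_names is a hypothetical name introduced for this example.

```python
import json  # stand-in module; any imported module, such as Bio, works the same way


def public_names(module):
    """List the names in a module that do not start with an underscore."""
    return [name for name in dir(module) if not name.startswith("_")]


print(type(json))                          # the kind of object we are looking at
print(public_names(json))                  # what the module offers
print(json.dumps.__doc__.splitlines()[0])  # first line of one function's documentation
```

In an interactive session, help(json.dumps) shows the full documentation page for the same function.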
9.1.2 Installing Biopython
Biopython is not part of the official Python distribution, and as such must be
downloaded (from http://biopython.org/wiki/Download) and installed independently.
A prerequisite for Biopython is the NumPy package (downloadable from
http://numpy.scipy.org/). While Biopython installation is usually painless (follow
the instructions at http://biopython.org/DIST/docs/install/Installation.html), NumPy
can be more problematic, especially on Macs, for which specially prepared installers
exist (at http://stronginference.com/scipy-superpack/). Be careful: Biopython and
NumPy versions are coordinated, meaning that specific Biopython releases must be
installed for specific NumPy releases.
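Since the two packages have to match, it can help to check which versions are actually installed before going further. The following is a minimal sketch, assuming Python 3.8 or later (where importlib.metadata is part of the standard library) and assuming the packages were installed under the distribution names "numpy" and "biopython"; the helper installed_version is a hypothetical name.

```python
# Assumes Python 3.8 or later, where importlib.metadata is in the standard library.
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


for name in ("numpy", "biopython"):
    version = installed_version(name)
    print(name, version if version is not None else "not installed")
```

The sketch only reports versions; which combinations are compatible still has to be looked up in the installation instructions.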
Additional packages can optionally be installed to allow Biopython to perform
additional tasks (mainly for graphical outputs and plots of various kinds).
9.1.3 Let's get started
To show the basic logic behind most of the Biopython modules, let's create a
sequence object and see the methods associated with it. The module Bio is the main
Biopython module.
>>> import sys
>>> sys.path.append("/Users/fabrizio/source/biopython-1.57")
>>> import Bio
>>> dir(Bio)
['BiopythonDeprecationWarning', 'MissingExternalDependencyError',
'MissingPythonDependencyError', '__builtins__', '__doc__',
'__file__', '__name__', '__package__', '__path__', '__version__']
>>> Bio.__version__
'1.57'
>>> help(Bio)
Help on package Bio:
NAME
    Bio - Collection of modules for dealing with biological data in Python.
FILE
    /Users/fabrizio/source/biopython-1.57/Bio/__init__.py
DESCRIPTION
    The Biopython Project is an international association of developers
    of freely available Python tools for computational molecular biology.
    http://biopython.org
PACKAGE CONTENTS
    Affy (package)
    Align (package)
    AlignIO (package)
    Alphabet (package)
    Application (package)
    Blast (package)
    CAPS (package)
    Clustalw (package)
    Cluster (package)
    Compass (package)
    Crystal (package)
    Data (package)
    DocSQL
    Emboss (package)
:
If this is not done automatically during the installation, you must set the PYTHONPATH
environment variable to point at the installation folder, or alternatively you can
append the folder to sys.path (as in the previous example).
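When the sys.path route is used inside a script that may be run several times in one session, it is tidy to avoid adding the same folder repeatedly. A minimal sketch follows; the helper add_to_path is a hypothetical name, and "/path/to/biopython" is a placeholder for wherever the Bio package actually lives on your machine.

```python
import sys


def add_to_path(folder):
    """Append folder to sys.path unless it is already listed."""
    if folder not in sys.path:
        sys.path.append(folder)


# Hypothetical location; replace it with the folder that contains the Bio package.
add_to_path("/path/to/biopython")
add_to_path("/path/to/biopython")  # the second call changes nothing
print(sys.path.count("/path/to/biopython"))
```

Changes to sys.path last only for the current Python session, whereas PYTHONPATH is read every time the interpreter starts.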
9.2 The Seq class
The Seq module of Bio provides high-level data structures to store and process
sequence objects. Let's import Seq from Bio:
>>> from Bio import Seq
>>> dir(Seq)
['Alphabet', 'CodonTable', 'IUPAC', 'MutableSeq', 'Seq',
'UnknownSeq', '__builtins__', '__doc__', '__docformat__',
'__file__', '__name__', '__package__', '_dna_complement_table',
'_maketrans', '_rna_complement_table', '_test', '_translate_str',
'ambiguous_dna_complement', 'ambiguous_rna_complement', 'array',
'back_transcribe', 'reverse_complement', 'string', 'sys',
'transcribe', 'translate']
>>> my_seq = Seq.Seq("AGCATCGTAGCATGCAC")
>>> my_seq
Seq('AGCATCGTAGCATGCAC', Alphabet())
>>> print(my_seq)
AGCATCGTAGCATGCAC
>>> my_seq.alphabet
Alphabet()
>>> dir(my_seq)
['__add__', '__class__', '__cmp__', '__contains__',
'__delattr__', '__dict__', '__doc__', '__format__',
'__getattribute__', '__getitem__', '__hash__', '__init__',
'__len__', '__module__', '__new__', '__radd__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__',
'__str__', '__subclasshook__', '__weakref__', '_data',
'_get_seq_str_and_check_alphabet', 'alphabet', 'back_transcribe',
'complement', 'count', 'data', 'endswith', 'find', 'lower',
'lstrip', 'reverse_complement', 'rfind', 'rsplit', 'rstrip',
'split', 'startswith', 'strip', 'tomutable', 'tostring',
'transcribe', 'translate', 'ungap', 'upper']
Seq.Seq objects are associated with an alphabet object that specifies the kind
of sequence stored in the object. In our example the alphabet is not set, meaning that
it is not specified whether the sequence is DNA or protein. Biopython contains a set
of predefined alphabets that cover all biological sequence types. The alphabets defined
by IUPAC (http://www.chem.qmw.ac.uk/iupac) are the most commonly used. They are:
− IUPACUnambiguousDNA (basic GATC letters)
− IUPACAmbiguousDNA (+ ambiguity letters)
− ExtendedIUPACDNA (+ modified bases)
− IUPACUnambiguousRNA
− IUPACAmbiguousRNA
− IUPACProtein (IUPAC standard AA)
− ExtendedIUPACProtein (+ selenocysteine, X, etc.)
>>> from Bio.Alphabet import IUPAC
>>> dir(IUPAC)
['Alphabet', 'ExtendedIUPACDNA', 'ExtendedIUPACProtein',
'IUPACAmbiguousDNA', 'IUPACAmbiguousRNA', 'IUPACData',
'IUPACProtein', 'IUPACUnambiguousDNA', 'IUPACUnambiguousRNA',
'__builtins__', '__doc__', '__file__', '__name__', '__package__',
'ambiguous_dna', 'ambiguous_rna', 'extended_dna',
'extended_protein', 'protein', 'unambiguous_dna',
'unambiguous_rna']
>>> IUPAC.unambiguous_dna.letters
'GATC'
>>> IUPAC.unambiguous_rna.letters
'GAUC'
>>> IUPAC.ambiguous_dna.letters
'GATCRYWSMKHBVDN'
>>> IUPAC.extended_dna.letters
'GATCBDSW'
>>> IUPAC.protein.letters
'ACDEFGHIKLMNPQRSTVWY'
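The letter sets printed above are enough to sketch what "belonging to an alphabet" means. The plain-Python example below copies three of them as ordinary strings and checks a sequence against each; it does not use Biopython, and fits_alphabet is a hypothetical helper written for this illustration.

```python
# Letter sets copied from the interpreter output above.
UNAMBIGUOUS_DNA = "GATC"
UNAMBIGUOUS_RNA = "GAUC"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"


def fits_alphabet(sequence, letters):
    """Return True if every character of sequence is one of letters."""
    return set(sequence.upper()) <= set(letters)


seq = "AGCATCGTAGCATGCAC"
print(fits_alphabet(seq, UNAMBIGUOUS_DNA))
print(fits_alphabet(seq, UNAMBIGUOUS_RNA))
print(fits_alphabet(seq, PROTEIN))
```

The same string passes both the DNA check and the protein check, because A, G, C and T are also valid amino-acid letters. This is why the kind of sequence has to be stated explicitly rather than guessed from its letters.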
Now we can create a new instance of my_seq, this time specifying that it is indeed a
DNA sequence:
>>> my_seq = Seq.Seq("AGCATCGTAGCATGCAC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGCATCGTAGCATGCAC', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()
Methods associated with the my_seq object allow basic string manipulation. For
example, we can index, slice, split, convert the sequence to upper or lower case, count
occurrences of characters, and so on:
>>> my_seq[0]
'A'
>>> my_seq[5]
'C'
>>> my_seq[5:10]
Seq('CGTAG', IUPACUnambiguousDNA())
>>> my_seq.split("A")
[Seq('', IUPACUnambiguousDNA()), Seq('GC',
IUPACUnambiguousDNA()), Seq('TCGT', IUPACUnambiguousDNA()),
Seq('GC', IUPACUnambiguousDNA()), Seq('TGC',
IUPACUnambiguousDNA()), Seq('C', IUPACUnambiguousDNA())]
>>> my_seq.count("A")
5
>>> my_seq.count("A")/float(len(my_seq))
0.29411764705882354
Note that when you slice or split a Seq object, the methods return not just
strings but other Seq objects.
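The single-letter count shown earlier extends naturally to the whole composition of a sequence. The sketch below works on a plain Python string (for a Seq object, str(my_seq) gives one) and uses only the standard library; base_fractions is a hypothetical helper name.

```python
from collections import Counter


def base_fractions(sequence):
    """Map each letter in sequence to its fraction of the total length."""
    counts = Counter(sequence)
    total = float(len(sequence))
    return {letter: count / total for letter, count in counts.items()}


fractions = base_fractions("AGCATCGTAGCATGCAC")
for letter in sorted(fractions):
    print(letter, round(fractions[letter], 3))
```

The value reported for "A" corresponds to the my_seq.count("A")/float(len(my_seq)) result above.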
Sequence objects can also be concatenated by adding them, but only if their alphabets
are compatible (unless the sequences are assigned generic alphabets):