Bioinformatics using Python for Biologists

9.1 Biopython - Introduction

Biopython (http://biopython.org/wiki/Main_Page) is a collection of modules for

computational molecular biology, which allows performing most of the basic (and in

many cases also advanced) tasks required in a bioinformatics project.

The most common tasks that can be performed by using Biopython include:

−

parsing (i.e. extracting information) of the most common file formats for gene

and protein sequences, protein structures, PubMed records, etc.;

−

downloading files from repositories such as NCBI, ExPASy, etc.;

−

running (locally or remotely) popular bioinformatics algorithms such as Blast,

Clustalw, etc.;

−

running Biopython implementations of algorithms for clustering, machine learning,

data analysis, and data visualization.

It also provides classes that the Python programmer can use to handle data (such as

sequences) and methods to perform operations on them (such as translation,

complement, etc.). The Biopython Tutorial and Cookbook

(http://biopython.org/DIST/docs/tutorial/Tutorial.html) is a good starting point to get

a grasp of what Biopython can do for you. More advanced tutorials can be found for

specific packages (for example the Structural Biopython package,

http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf).
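To make "operations such as complement" concrete, here is a plain-Python sketch of what such a method computes for an unambiguous DNA string. It is an illustration only, written for Python 3, and is not Biopython's implementation; the names `complement` and `reverse_complement` below are stand-alone helpers defined just for this example.

```python
# Plain-Python illustration of DNA complementation (not Biopython code).
# Only the four unambiguous DNA letters are mapped; any other character
# is left unchanged by str.translate.
_COMPLEMENT_TABLE = str.maketrans("GATCgatc", "CTAGctag")


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return dna.translate(_COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Return the complementary strand, read in the opposite direction."""
    return complement(dna)[::-1]


print(complement("AGCATCGTAGCATGCAC"))
print(reverse_complement("AGCATCGTAGCATGCAC"))
```

In practice the Biopython sequence classes already provide these operations, so a hand-written version like this is mainly useful as a learning exercise.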

9.1.1 You can use Biopython in several ways

Biopython is not a program itself; it is a collection of tools for Python bioinformatics

programming. In most cases, the Python programmer can build a research pipeline

completely by using Biopython, or by writing new code for some specific tasks and

using Biopython for more standard operations, or can modify Biopython open-source

code to better adapt it to his/her own needs. Python excels as a "glue language" which

can stick together other people's programs, functions, classes, etc.

Write your own code when:

−

The algorithm implementation or coding is interesting to you;

−

Biopython data structure mapping is too complex for your task;

−

Biopython does not provide tools for your specific task;

−

You want complete control over what your code does.

Use Biopython when:

−

Its modules and/or methods fit your needs;

−

Your task is unchallenging or boring: Why waste your time? Don't "re-invent

the wheel" unless you're doing it as a learning project;

−

Your task will take you a lot of effort to write.

Extend Biopython (i.e. modify the Biopython source code) when:

−

The Biopython modules and/or methods almost do what you need, but do not fit your needs exactly;

−

Be aware, however, that this can be challenging for a beginner: it can be difficult to read and understand someone else’s code.

Remember:

−

When doing bioinformatics, keep Biopython in mind: it is very powerful;

−

Browse the documentation; become familiar with its capabilities;

−

Use help(), type(), dir() and other built-in features to explore Biopython

modules.
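The exploration advice above can be tried on any module. The sketch below uses the standard-library `string` module as a stand-in so that it runs even where Biopython is not installed; the helper `public_names` is a hypothetical convenience function, not part of Python or Biopython.

```python
import string


def public_names(obj):
    """Return the names listed by dir() that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


print(type(string))             # what kind of object this is
print(public_names(string))     # what the module offers
print(string.capwords.__doc__)  # the docstring that help() would display
```

Replacing `string` with `Bio`, or with any Biopython submodule you have imported, gives the same kind of overview.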

9.1.2 Installing Biopython

Biopython is not part of the official Python distribution, and as such must be downloaded (from http://biopython.org/wiki/Download) and installed independently.

A prerequisite for Biopython is the installation of the NumPy package (downloadable from http://numpy.scipy.org/). While Biopython installation is usually pain-free (follow the instructions at http://biopython.org/DIST/docs/install/Installation.html), NumPy can be more problematic, especially on Macs, for which specially prepared installers exist (at http://stronginference.com/scipy-superpack/). Be careful: Biopython and NumPy versions are coordinated, meaning that a given Biopython release must be installed with a compatible NumPy release.
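Because the two versions have to match, it helps to check what is actually installed before troubleshooting. The following is a minimal sketch that relies on the `__version__` attribute both packages expose; the helper name `installed_version` is made up for this example.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, "unknown" if the module
    has no such attribute, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


for name in ("Bio", "numpy"):
    version = installed_version(name)
    if version is None:
        print(name, "is not installed")
    else:
        print(name, "version:", version)
```

The printed versions can then be compared against the requirements stated in the Biopython installation instructions.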

Additional packages can optionally be installed to allow Biopython to perform additional tasks (mainly graphical output and plots of various kinds).

9.1.3 Let's get started

To show the basic logic behind most of the Biopython modules, let's create a sequence object and see the methods associated with it. The module Bio is the main Biopython module.

>>> import sys

>>> sys.path.append("/Users/fabrizio/source/biopython-1.57")

>>> import Bio

>>> dir(Bio)

['BiopythonDeprecationWarning', 'MissingExternalDependencyError',

'MissingPythonDependencyError', '__builtins__', '__doc__',

'__file__', '__name__', '__package__', '__path__', '__version__']

>>> Bio.__version__

'1.57'

>>> help(Bio)

Help on package Bio:

NAME

Bio - Collection of modules for dealing with biological data

in Python.

FILE

/Users/fabrizio/source/biopython-1.57/Bio/__init__.py

DESCRIPTION

The Biopython Project is an international association of

developers

of freely available Python tools for computational molecular

biology.

http://biopython.org

PACKAGE CONTENTS

Affy (package)

Align (package)

AlignIO (package)

Alphabet (package)

Application (package)

Blast (package)

CAPS (package)

Clustalw (package)

Cluster (package)

Compass (package)

Crystal (package)

Data (package)

DocSQL

Emboss (package)

If this is not done automatically during the installation, you must set the PYTHONPATH environment variable to point at the installation folder, or alternatively you can add the folder to sys.path (as in the previous example).
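A slightly more defensive variant of the sys.path approach is sketched below. The folder used here is a hypothetical location built for illustration, and `add_to_sys_path` is a helper defined only for this example; substitute the folder where your own copy of Biopython lives.

```python
import os
import sys


def add_to_sys_path(folder):
    """Append folder to sys.path unless it is already listed.

    Returns True if the folder was added, False if it was already present.
    """
    folder = os.path.abspath(folder)
    if folder in sys.path:
        return False
    sys.path.append(folder)
    return True


# Hypothetical location of an unpacked Biopython source tree.
biopython_dir = os.path.join(os.path.expanduser("~"), "source", "biopython-1.57")
add_to_sys_path(biopython_dir)
```

Changes to sys.path last only for the current Python session, whereas PYTHONPATH is read each time the interpreter starts.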

9.2 The Seq class

The Bio module Seq provides high-level data structures to store and process sequence objects. Let's import Seq from Bio:

>>> from Bio import Seq

>>> dir(Seq)

['Alphabet', 'CodonTable', 'IUPAC', 'MutableSeq', 'Seq',

'UnknownSeq', '__builtins__', '__doc__', '__docformat__',

'__file__', '__name__', '__package__', '_dna_complement_table',

'_maketrans', '_rna_complement_table', '_test', '_translate_str',

'ambiguous_dna_complement', 'ambiguous_rna_complement', 'array',

'back_transcribe', 'reverse_complement', 'string', 'sys',

'transcribe', 'translate']

>>> my_seq = Seq.Seq("AGCATCGTAGCATGCAC")

>>> my_seq

Seq('AGCATCGTAGCATGCAC', Alphabet())

>>> print(my_seq)

AGCATCGTAGCATGCAC

>>> my_seq.alphabet

Alphabet()

>>> dir(my_seq)

['__add__', '__class__', '__cmp__', '__contains__',

'__delattr__', '__dict__', '__doc__', '__format__',

'__getattribute__', '__getitem__', '__hash__', '__init__',

'__len__', '__module__', '__new__', '__radd__', '__reduce__',

'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__',

'__str__', '__subclasshook__', '__weakref__', '_data',

'_get_seq_str_and_check_alphabet', 'alphabet', 'back_transcribe',

'complement', 'count', 'data', 'endswith', 'find', 'lower',

'lstrip', 'reverse_complement', 'rfind', 'rsplit', 'rstrip',

'split', 'startswith', 'strip', 'tomutable', 'tostring',

'transcribe', 'translate', 'ungap', 'upper']

The Seq.Seq objects are associated with an alphabet object that specifies the kind of sequence stored in the object. In our example the alphabet is not set, meaning that it is not specified whether the sequence is DNA or protein. Biopython contains a set of predefined alphabets that cover all biological sequence types. The alphabets defined by IUPAC (http://www.chem.qmw.ac.uk/iupac) are usually the most used. They are:

−

IUPACUnambiguousDNA (basic GATC letters)

−

IUPACAmbiguousDNA (+ ambiguity letters)

−

ExtendedIUPACDNA (+ modified bases)

−

IUPACUnambiguousRNA

−

IUPACAmbiguousRNA

−

IUPACProtein (IUPAC standard AA)

−

ExtendedIUPACProtein (+ selenocysteine, X, etc.)

>>> from Bio.Alphabet import IUPAC

>>> dir(IUPAC)

['Alphabet', 'ExtendedIUPACDNA', 'ExtendedIUPACProtein',

'IUPACAmbiguousDNA', 'IUPACAmbiguousRNA', 'IUPACData',

'IUPACProtein', 'IUPACUnambiguousDNA', 'IUPACUnambiguousRNA',

'__builtins__', '__doc__', '__file__', '__name__', '__package__',

'ambiguous_dna', 'ambiguous_rna', 'extended_dna',

'extended_protein', 'protein', 'unambiguous_dna',

'unambiguous_rna']

>>> IUPAC.unambiguous_dna.letters

'GATC'

>>> IUPAC.unambiguous_rna.letters

'GAUC'

>>> IUPAC.ambiguous_dna.letters

'GATCRYWSMKHBVDN'

>>> IUPAC.extended_dna.letters

'GATCBDSW'

>>> IUPAC.protein.letters

'ACDEFGHIKLMNPQRSTVWY'
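The letters strings shown in the session can be used to check whether a raw sequence fits a given alphabet. The following is a hand-written sketch in plain Python, not a Biopython function; the letter sets are copied from the output above and `invalid_letters` is a name chosen for this example.

```python
# Letter sets copied from the interpreter session.
UNAMBIGUOUS_DNA = "GATC"
UNAMBIGUOUS_RNA = "GAUC"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"


def invalid_letters(sequence, letters):
    """Return the sorted list of characters in sequence that are not in letters."""
    return sorted(set(sequence.upper()) - set(letters))


print(invalid_letters("AGCATCGTAGCATGCAC", UNAMBIGUOUS_DNA))  # prints []
print(invalid_letters("AGCAUCGUAG", UNAMBIGUOUS_DNA))         # prints ['U']
```

An empty list means every character belongs to the alphabet; here the second sequence contains U, which belongs to the RNA alphabet but not to the DNA one.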

Now we can create a new instance of my_seq, this time specifying that it is indeed a DNA sequence:

>>> my_seq = Seq.Seq("AGCATCGTAGCATGCAC", IUPAC.unambiguous_dna)

>>> my_seq

Seq('AGCATCGTAGCATGCAC', IUPACUnambiguousDNA())

>>> my_seq.alphabet

IUPACUnambiguousDNA()

Methods associated with the my_seq object allow basic string manipulation. For example, we can index, slice, split, convert the sequence to upper or lower case, count occurrences of characters, and so on:

>>> my_seq[0]

'A'

>>> my_seq[5]

'C'

>>> my_seq[5:10]

Seq('CGTAG', IUPACUnambiguousDNA())

>>> my_seq.split("A")

[Seq('', IUPACUnambiguousDNA()), Seq('GC',

IUPACUnambiguousDNA()), Seq('TCGT', IUPACUnambiguousDNA()),

Seq('GC', IUPACUnambiguousDNA()), Seq('TGC',

IUPACUnambiguousDNA()), Seq('C', IUPACUnambiguousDNA())]

>>> my_seq.count("A")

5

>>> my_seq.count("A")/float(len(my_seq))

0.29411764705882354
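The count-and-divide pattern generalises to any group of letters, for instance the GC content of a sequence. The sketch below uses a plain Python string so that it runs without Biopython; since it relies only on count() and len(), both of which appear in the session above, it should also accept a Seq object, but that is an assumption rather than something tested here. The name `letter_fraction` is made up for this example.

```python
def letter_fraction(sequence, letters):
    """Fraction of positions in sequence occupied by any of the given letters."""
    if len(sequence) == 0:
        return 0.0
    hits = sum(sequence.count(letter) for letter in letters)
    return hits / float(len(sequence))


dna = "AGCATCGTAGCATGCAC"
print(letter_fraction(dna, "A"))   # fraction of A, as in the session
print(letter_fraction(dna, "GC"))  # GC content
```

Dividing by float(len(sequence)) keeps the result a real number under both Python 2 and Python 3.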

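Note that when you slice or split a Seq object, the result is not a plain string: slicing returns another Seq object, and split returns a list of Seq objects. Indexing a single position, as in my_seq[0], returns a one-character string.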

Sequence objects can also be concatenated by adding them, but only if their alphabets

are compatible (unless the sequences are assigned generic alphabets):
