Freitag, 29. Juni 2012

Python 005: The so called "sliding window"

Say you find yourself required to locate the line number of the last atom of residue '2' in this PDB (following the PDB internal numbering scheme, of course we assume you don't know a priori its called H01):

ATOM      1  N   ALA     2      -0.677  -1.230  -0.491  1.00  0.00           N
ATOM      2  CA  ALA     2      -0.001   0.064  -0.491  1.00  0.00           C
ATOM      3  C   ALA     2       1.499  -0.110  -0.491  1.00  0.00           C
ATOM      4  O   ALA     2       2.065  -0.922   0.251  1.00  0.00           O
ATOM      5  CB  ALA     2      -0.509   0.856   0.727  1.00  0.00           C
ATOM      6  H   ALA     2      -0.131  -2.162  -0.491  1.00  0.00           H
ATOM      7  HA  ALA     2      -0.269   0.603  -1.418  1.00  0.00           H
ATOM      8 1HB  ALA     2      -1.605   1.006   0.691  1.00  0.00           H
ATOM      9 2HB  ALA     2      -0.285   0.342   1.681  1.00  0.00           H
ATOM     10 3HB  ALA     2      -0.053   1.861   0.784  1.00  0.00           H
ATOM     11  H01 ALA     2      -1.261  -1.244  -1.314  1.00  0.00           H
ATOM     12  N   ASN     3       2.311   0.711  -1.400  1.00  0.00           N
ATOM     13  CA  ASN     3       3.700   0.321  -1.173  1.00  0.00           C
ATOM     14  C   ASN     3       4.606   1.530  -1.169  1.00  0.00           C
ATOM     15  O   ASN     3       4.515   2.417  -2.017  1.00  0.00           O
ATOM     16  CB  ASN     3       4.140  -0.693  -2.267  1.00  0.00           C
ATOM     17  CG  ASN     3       3.309  -1.974  -2.400  1.00  0.00           C
ATOM     18  ND2 ASN     3       3.497  -2.939  -1.540  1.00  0.00           N
ATOM     19  OD1 ASN     3       2.462  -2.114  -3.270  1.00  0.00           O
ATOM     20  H   ASN     3       1.953   1.454  -2.096  1.00  0.00           H
ATOM     21  HA  ASN     3       3.774  -0.146  -0.174  1.00  0.00           H
ATOM     22 2HB  ASN     3       5.198  -0.981  -2.124  1.00  0.00           H
ATOM     23 3HB  ASN     3       4.123  -0.199  -3.258  1.00  0.00           H
ATOM     24 1HD2 ASN     3       2.793  -3.677  -1.623  1.00  0.00           H
ATOM     25 2HD2 ASN     3       4.145  -2.756  -0.775  1.00  0.00           H
ATOM     26  N   PRO     4       5.629   1.649  -0.120  1.00  0.00           N
ATOM     27  CA  PRO     4       6.360   2.889  -0.340  1.00  0.00           C
ATOM     28  C   PRO     4       6.031   3.505  -1.710  1.00  0.00           C
ATOM     29  O   PRO     4       5.228   2.956  -2.465  1.00  0.00           O
ATOM     30  CB  PRO     4       7.819   2.420  -0.227  1.00  0.00           C
ATOM     31  CG  PRO     4       7.769   0.962  -0.705  1.00  0.00           C
ATOM     32  CD  PRO     4       6.440   0.451  -0.143  1.00  0.00           C
ATOM     33  HA  PRO     4       6.122   3.617   0.439  1.00  0.00           H
ATOM     34 2HB  PRO     4       8.119   2.445   0.823  1.00  0.00           H
ATOM     35 3HB  PRO     4       8.509   3.035  -0.810  1.00  0.00           H
ATOM     36 2HG  PRO     4       8.619   0.375  -0.350  1.00  0.00           H
ATOM     37 3HG  PRO     4       7.737   0.935  -1.796  1.00  0.00           H
ATOM     38 2HD  PRO     4       6.564   0.075   0.875  1.00  0.00           H
ATOM     39 3HD  PRO     4       5.991  -0.316  -0.778  1.00  0.00           H
ATOM     40  H01 PRO     4       6.512   4.435  -2.015  1.00  0.00           H


As you can see, it's the Atom 11 (H01) of an Alanine. How would you do that? Choose a for-loop, choose a nicely indented block of if and else statements. Choose a number of different counters, each with their very own meaningful name. Choose endless meetings with IndexErrors, comparisons, type castings and list slicings. I say no, I choose a generator. In fact, a pair wise sliding window.

def make_sliding_window(val):
    ln_cnt = 0
    while ln_cnt < len(val):
        yield ln_cnt, ln_cnt-1
        ln_cnt += 1 
The 'yield' statement characterizes this as a generator. What does that mean? First, a generator is *not* a list, or array or something like that. It's a function. With special properties. When called through 'next(generator_function)', it executes until it reaches a 'yield' statement, from where it returns. When called again (with a new 'next(generator_function)') it resumes from where it left of. Yes, and it returns a generator object, when you call it as such (no 'next'), but then it does not execute.
So lets instantiate the object.
res=make_sliding_window(val)

'res' is now the generator. Only when you call 'next(res)', the generator executes, up until the 'yield' statement, from where it returns. In addition, it can be used directly in a for-loop. The for-loop takes care of implicitly calling the 'next' method. Here's an example:
def get_line_num_last_atom_res_2(val):
    # Form a generator 'res'.
    res = make_sliding_window(val)
    for a, b in res:
        if val[a][25] == '3':
            return val[b], a

print get_line_num_last_atom_res_2(open('randompdbfile.pdb', 'r').readlines())
As you can see, we're now using it as-if it were a list. But notice, if you think it's a list, you'll not understand what is happening. It has a 'next' method, that's the important point and it's being called, even if we dont see it explicitly. So anyway, upon reaching 'yield', we get apparently two line numbers back, where it's two times the same one, but just offset by one. Now we can easily check if the upcoming atom already is part of the next residue, and if not, return the current atom. And its line number.

Kommentare:

  1. You could also have used enumerate:

    lines = open('randompdbfile.pdb', 'r').readlines()
    for i, line in enumerate(lines):
    if (int(line[i][25]) == int(line[i+1][25]) - 1):
    print line

    AntwortenLöschen
  2. As far as I know, enumerate is just a built-in generator.

    AntwortenLöschen