Freitag, 29. Juni 2012

Python 005: The so called "sliding window"

Say you find yourself required to locate the line number of the last atom of residue '2' in this PDB (following the PDB internal numbering scheme, of course we assume you don't know a priori its called H01):

ATOM      1  N   ALA     2      -0.677  -1.230  -0.491  1.00  0.00           N
ATOM      2  CA  ALA     2      -0.001   0.064  -0.491  1.00  0.00           C
ATOM      3  C   ALA     2       1.499  -0.110  -0.491  1.00  0.00           C
ATOM      4  O   ALA     2       2.065  -0.922   0.251  1.00  0.00           O
ATOM      5  CB  ALA     2      -0.509   0.856   0.727  1.00  0.00           C
ATOM      6  H   ALA     2      -0.131  -2.162  -0.491  1.00  0.00           H
ATOM      7  HA  ALA     2      -0.269   0.603  -1.418  1.00  0.00           H
ATOM      8 1HB  ALA     2      -1.605   1.006   0.691  1.00  0.00           H
ATOM      9 2HB  ALA     2      -0.285   0.342   1.681  1.00  0.00           H
ATOM     10 3HB  ALA     2      -0.053   1.861   0.784  1.00  0.00           H
ATOM     11  H01 ALA     2      -1.261  -1.244  -1.314  1.00  0.00           H
ATOM     12  N   ASN     3       2.311   0.711  -1.400  1.00  0.00           N
ATOM     13  CA  ASN     3       3.700   0.321  -1.173  1.00  0.00           C
ATOM     14  C   ASN     3       4.606   1.530  -1.169  1.00  0.00           C
ATOM     15  O   ASN     3       4.515   2.417  -2.017  1.00  0.00           O
ATOM     16  CB  ASN     3       4.140  -0.693  -2.267  1.00  0.00           C
ATOM     17  CG  ASN     3       3.309  -1.974  -2.400  1.00  0.00           C
ATOM     18  ND2 ASN     3       3.497  -2.939  -1.540  1.00  0.00           N
ATOM     19  OD1 ASN     3       2.462  -2.114  -3.270  1.00  0.00           O
ATOM     20  H   ASN     3       1.953   1.454  -2.096  1.00  0.00           H
ATOM     21  HA  ASN     3       3.774  -0.146  -0.174  1.00  0.00           H
ATOM     22 2HB  ASN     3       5.198  -0.981  -2.124  1.00  0.00           H
ATOM     23 3HB  ASN     3       4.123  -0.199  -3.258  1.00  0.00           H
ATOM     24 1HD2 ASN     3       2.793  -3.677  -1.623  1.00  0.00           H
ATOM     25 2HD2 ASN     3       4.145  -2.756  -0.775  1.00  0.00           H
ATOM     26  N   PRO     4       5.629   1.649  -0.120  1.00  0.00           N
ATOM     27  CA  PRO     4       6.360   2.889  -0.340  1.00  0.00           C
ATOM     28  C   PRO     4       6.031   3.505  -1.710  1.00  0.00           C
ATOM     29  O   PRO     4       5.228   2.956  -2.465  1.00  0.00           O
ATOM     30  CB  PRO     4       7.819   2.420  -0.227  1.00  0.00           C
ATOM     31  CG  PRO     4       7.769   0.962  -0.705  1.00  0.00           C
ATOM     32  CD  PRO     4       6.440   0.451  -0.143  1.00  0.00           C
ATOM     33  HA  PRO     4       6.122   3.617   0.439  1.00  0.00           H
ATOM     34 2HB  PRO     4       8.119   2.445   0.823  1.00  0.00           H
ATOM     35 3HB  PRO     4       8.509   3.035  -0.810  1.00  0.00           H
ATOM     36 2HG  PRO     4       8.619   0.375  -0.350  1.00  0.00           H
ATOM     37 3HG  PRO     4       7.737   0.935  -1.796  1.00  0.00           H
ATOM     38 2HD  PRO     4       6.564   0.075   0.875  1.00  0.00           H
ATOM     39 3HD  PRO     4       5.991  -0.316  -0.778  1.00  0.00           H
ATOM     40  H01 PRO     4       6.512   4.435  -2.015  1.00  0.00           H


As you can see, it's the Atom 11 (H01) of an Alanine. How would you do that? Choose a for-loop, choose a nicely indented block of if and else statements. Choose a number of different counters, each with their very own meaningful name. Choose endless meetings with IndexErrors, comparisons, type castings and list slicings. I say no, I choose a generator. In fact, a pair wise sliding window.

def make_sliding_window(val):
    ln_cnt = 0
    while ln_cnt < len(val):
        yield ln_cnt, ln_cnt-1
        ln_cnt += 1 
The 'yield' statement characterizes this as a generator. What does that mean? First, a generator is *not* a list, or array or something like that. It's a function. With special properties. When called through 'next(generator_function)', it executes until it reaches a 'yield' statement, from where it returns. When called again (with a new 'next(generator_function)') it resumes from where it left of. Yes, and it returns a generator object, when you call it as such (no 'next'), but then it does not execute.
So lets instantiate the object.
res=make_sliding_window(val)

'res' is now the generator. Only when you call 'next(res)', the generator executes, up until the 'yield' statement, from where it returns. In addition, it can be used directly in a for-loop. The for-loop takes care of implicitly calling the 'next' method. Here's an example:
def get_line_num_last_atom_res_2(val):
    # Form a generator 'res'.
    res = make_sliding_window(val)
    for a, b in res:
        if val[a][25] == '3':
            return val[b], a

print get_line_num_last_atom_res_2(open('randompdbfile.pdb', 'r').readlines())
As you can see, we're now using it as-if it were a list. But notice, if you think it's a list, you'll not understand what is happening. It has a 'next' method, that's the important point and it's being called, even if we dont see it explicitly. So anyway, upon reaching 'yield', we get apparently two line numbers back, where it's two times the same one, but just offset by one. Now we can easily check if the upcoming atom already is part of the next residue, and if not, return the current atom. And its line number.

Mittwoch, 27. Juni 2012

Python 004: My first closure

Depending on the presence of a number of files, a couple of functions had to be called. But depending on the kind of file, a different function was required. However, for each file, the logic to determine if the corresponding function should be called or not is the same. For technical reasons, the respective functions are in different modules (executed through some Ajax).
So it seemed to me I would have to write the logic part specifically for each case, i.e. for each case where a file is available or not.
Something as below:

if file_exists('-ini'):
    function_one()
if not file_exists('-ini'):
    generate_ini_file()
    function_one()
if file_exits('-fin'):
    function_two()
if not file_exists('-fin'):
    generate_fin_file()
    function_two()

and so on. However, I remembered, in Python everything is an object. Even functions. So, why not prepare a function, which takes care of the logic and the execution of the function which I want to apply to each of the corresponding files. But then, where should the applied function come from? Exactly, from a closure. This basically is a function which generates functions at runtime. What kind of arguments does it take? Precicely, the function you actually want to call. Huh? Yes, of course the applied function needs to be defined somewhere, but thats another story, lets assume we have it. It could look something like this:

def translate_com(target):
    """Calling VMD and returning the output directly."""
    vmd_src = 'mol load pdb %s-fix.pdb\n' % (bio_lib.pdb_base_path + target)\
            + 'set HOME ' + bio_lib.vmd_base_path + '\n'\
            + 'molinfo 0 get center\n'\
            + 'env\n'
    comp = Popen([bio_lib.vmd_cmd_path, '-dispdev', 'text', '-eofexit'],
                  stdout=PIPE, stdin=PIPE, stderr=PIPE, shell=False)
    comp.stdin.write(vmd_src)
    vmd_com_result = comp.communicate()[0]

    for line in vmd_com_result.split('\n'):
        if len(line) > 0 and line[0] == '{':
            com_xyz = [float(i) for i in line[1:-1].split()]
    com_xyz = [i*(-1) for i in com_xyz] 
    # Calling the VMD reorientation script.
    reo_src = bio_lib.get_reo_src(com_xyz, target)
    vmdp = Popen([bio_lib.vmd_cmd_path, '-dispdev', 'text', '-eofexit'],
                  stdin=PIPE, stdout=PIPE, stderr=PIPE, shell=False)
    vmdp.stdin.write(reo_src)
    print vmdp.communicate()[0]

It does something, other functions are doing similar things. This is, e.g., the 'function_two' above. So for each of these, there would be a logic block as this:

def check_availability_and_translate(target):
    # Use upload.
    if uploaded:
        # Force overwrite.
        if overwrite:
            translate_com(target)
        # Does not exist.
        if not os.path.exists(bio_lib.pdb_base_path + target + '-reo.pdb'):
            translate_com(target)
    # Download.
    else:
        # Use existing.
        if os.path.exists(bio_lib.pdb_base_path + target + '-reo.pdb'):
            print "Structure available." 
        # Fixed structure not available.
        else:
            reo.translate_com(target)

In the above, there's also a bit of user interaction coming in, but that's not the point. The point is, all this is the same for each of my functions. So, lets generate a function that returns this logic, but takes the 'translate_com' function as an argument and calls it whenever *a* function is called:

def closure(state):
    """
        'state': '-ini', '-fin', '-reo'
    """
    def check_availability_and_apply(target, uploaded, overwrite, func):
        # Use upload.
        if uploaded:
            print "Uploaded"
            # Force overwrite.
            if overwrite:
                print "Overwriting"
                func(target)
            # Does not exist.
            if not os.path.exists(pdb_base_path + target + state):
                print "Not found, fixing."
                func(target)
            # Use existing.
        # Download.
        else:
            # Use existing.
            if os.path.exists(pdb_base_path + target + state):
                print "Structure available."
            # Structure not available.
            else:
                func(target)
    return check_availability_and_apply

The fun is in the last line. As one can see, a function is returned which was built within the closure. So how can I now call (or maybe better "apply") the 'translate_com' function?

closure('-ini.pdb')(target, uploaded, overwrite, translate_com)


And how about another function?

closure('-fix.pdb')(target, uploaded, overwrite, fix_pdb)


This is really cool, because now file checks are in one place and application is in a separate place. So only one place to edit.
I would really like to hear my boss asking me now: "So how many lines of code did you write yesterday?" Then I could answer: "Well, actually like -200."