Tuesday, November 27, 2007

Re: the EMBOSS bioinformatics toolset, and the IEP utility in particular

The iep.c codebase is supposed to predict Isoelectric Point for an amino acid sequence. And it does this, but it also manages to serve as an object lesson in bad C code:

  • AjPFile, AjPStr, and ajint types: your very own custom integer type?? Really?
  • Put everything in global variables! Thus, you wind up w/ void functions that take 0 parameters and do something incredibly important; the challenge is to track down what that important thing might be! (Every programmer should be required to write code in a purely functional language for at least 3 months. By law.) Take void embIepPkRead() , for example.
  • Functions that simply wrap other functions. Code as treasure hunt!

Biologists shouldn't code.

If anybody is interested in the basic algorithm used by EMBOSS IEP to calculate Pi, it basically calculates the charge of the protein iteratively across potential pH's until it hits 0 charge, or the upper and lower bounds converge. The primary function can be found in embIepPhConverge. ajint *c holds a histogram of amino acid counts, double *K is an array of dissociation constants for each amino acid, and double *pro is the moneymaker -- it gets updated w/ a proton count for each amino acid as different pH's are tried. The loop starts w/ max 14.0 and min 1.0, sets pro for the midpoint pH, then sets either the upper or lower bound to the midpoint, depending upon whether the calculated charge is positive or negative.

No comments: