Using Machine Learning to Analyze Biological Macromolecular Crystallization Data
Abstract:
The crystallization of a new macromolecule is still very much a trial-and-error process. In an effort to uncover useful trends in the crystallization of new macromolecules, Samudzi, Fivash and Rosenberg [12] performed a cluster analysis on the Biological Macromolecule Crystallization Database (BMCD) [7]. The crystallization parameters studied to differentiate among the experiments were a subset of the BMCD parameters: pH, temperature, molecular weight, macromolecular concentration, precipitant type and crystallization method. Samudzi et al. performed a purely statistical analysis of the data and reported the clusters by visual inspection of the results. We have attempted to recreate their clusters using two different methods: SAS clustering (the same method Samudzi et al. used) and COBWEB (a machine learning and discovery program). We then applied RL, an inductive learning program, to the clusters discovered by each method, and both verified and expanded on the Samudzi results. Apart from using clusters as the data input to RL, we also applied RL to the entire BMCD in an attempt to learn interesting correlations among the various crystallization parameters. From the point of view of crystallography, we have discovered possibly significant new empirical relationships. From a machine learning perspective, our work has led to the refinement of existing methods for incorporating detailed domain knowledge into inductive analysis techniques. In this paper we report these initial experiments and findings from applying RL to the BMCD as well as to the Samudzi and COBWEB clusters.

