Saturday, March 10, 2007

Fast 3D Similarity Search Using Descriptors

"Ultrafast Shape Recognition for Similarity Search in Molecular Databases" (Ballester, P.J. and Richards, W.G., DOI:10.1002/jcc.20681. Also published here)

3D similarity searching is a well-studied topic with a variety of methods available (1, 2, 3). Fundamentally there are two approaches to evaluating 3D similarity: superposition methods (Rush et al.), which require an alignment procedure, and descriptor-based methods, which characterize shape in terms of a small set of numerical values (essentially a dimension reduction procedure) and then evaluate similarity between molecules using these descriptors.

Recently there was a splash on Slashdot as well as New Scientist regarding the work described in this paper, which belongs to the second class of methods. I was a little surprised, since there has been much work in this area of 3D similarity, probably the most similar being Zauhar's Shape Signatures.

Ballester et al. propose using a set of descriptors derived from the distributions of distances from every atom to 4 reference points: the centroid, the atom closest to the centroid, the atom furthest from the centroid, and the atom furthest from that last atom. Each of the four distance distributions is then summarized by its first 3 moments (mean, variance and skewness), so the shape of each molecule is encoded by a 12-element vector (4 reference points x 3 moments). They then define a similarity measure based on the normalized Manhattan distance between these vectors.
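To make the recipe above concrete, here is a minimal sketch in plain Python. It is not the authors' code or the CDK implementation: the function names (`usr_descriptor`, `usr_similarity`) are made up for illustration, the moments are taken as population variance and standardized skewness (the paper's exact definitions may differ), and the similarity is assumed to take the form 1 / (1 + Manhattan distance / 12). Input is just a list of (x, y, z) atom coordinates for a single conformer.

```python
import math  # math.dist needs Python 3.8+


def _moments(dists):
    """Mean, population variance and standardized skewness of a list of distances.

    Assumption: these moment definitions are one reasonable choice, not
    necessarily the ones used in the paper.
    """
    n = len(dists)
    mean = sum(dists) / n
    var = sum((d - mean) ** 2 for d in dists) / n
    if var == 0.0:
        skew = 0.0  # all distances identical, skewness undefined; use 0
    else:
        skew = sum((d - mean) ** 3 for d in dists) / n / var ** 1.5
    return [mean, var, skew]


def usr_descriptor(coords):
    """Return a 12-element shape vector for a list of (x, y, z) tuples."""
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))

    # Distances to the centroid pick out the closest and furthest atoms.
    d_ctd = [math.dist(centroid, c) for c in coords]
    closest = coords[d_ctd.index(min(d_ctd))]
    furthest = coords[d_ctd.index(max(d_ctd))]

    # Fourth reference point: the atom furthest from the furthest atom.
    d_fct = [math.dist(furthest, c) for c in coords]
    far_from_furthest = coords[d_fct.index(max(d_fct))]

    vec = []
    for ref in (centroid, closest, furthest, far_from_furthest):
        vec.extend(_moments([math.dist(ref, c) for c in coords]))
    return vec


def usr_similarity(a, b):
    """Similarity in (0, 1]; 1.0 means identical descriptor vectors.

    Assumption: inverse of one plus the Manhattan distance divided by the
    vector length.
    """
    manhattan = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 / (1.0 + manhattan / len(a))
```

Since only interatomic distances are used, the vector does not change when the molecule is translated or rotated, which is why no alignment step is needed. Comparing a query against a database entry is then a 12-term sum, e.g. `usr_similarity(usr_descriptor(query_xyz), usr_descriptor(db_xyz))`, where `query_xyz` and `db_xyz` are hypothetical coordinate lists.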

The method is clearly quite fast (the authors report a 1500x speedup compared to MOE's EShape3D, a 2000x speedup compared to Shape Signatures and a 14000x speedup compared to ROCS) as well as space efficient. More importantly, however, is it accurate? This matters because the method is a low-dimensional representation of molecular shape, and thus some information is lost. To what extent does that loss affect the method's utility in similarity search?

The authors use a visual comparison to justify their claim that the method does indeed identify similar shapes (from a database of 2.4M compounds), essentially looking at the molecules that had the highest similarity to each of 5 query molecules. The figures do indicate a high degree of similarity, but it's not really quantitative. However, they also compare the most similar hits retrieved by their method with those retrieved by MOE's EShape3D descriptor method. They report that in most cases the two methods returned the same hits, though in some cases the hits returned by MOE were visually less similar. The authors also considered the issue of retrieving hits from a collection of conformations, and they indicate that their method retrieves conformers that are more similar to the query than those retrieved by EShape3D.

One thing that would be interesting to see is whether using a larger number of moments (beyond the first 3) would increase the efficacy of the method.

Overall an interesting method, especially due to its simplicity, though the validation could be a little more rigorous.

Sidenote: An implementation of this method that allows one to generate the 12-element vectors from SD files as well as query an SD file to find similar hits is available here as a standalone program. The main similarity code has been incorporated into the CDK.

