Sunday, December 06, 2009
Molecular Fingerprint Recombination: Generating Hybrid Fingerprints for Similarity Searching from Different Fingerprint Types
Authors: B. Nisius, J. Bajorath
Source: ChemMedChem, 2009, 4, 1859 – 1863
Although fingerprints are very popular in similarity searching and a variety of designs have been introduced over the years, the combination of different fingerprints into “hybrid fingerprints” is an unexplored design strategy. Because subsets of bits are often responsible for a fingerprint's search performance, Nisius and Bajorath explored whether such preferred bit subsets can be identified in fingerprints of different designs and recombined into “hybrid fingerprint” representations. In this way, structural fragments/patterns (MACCS; 166 bits) and 2D topological pharmacophoric features (TGD; 2-point, 420 bits representing atom pairs, and TGT; 3-point, 1704 bits representing atom triangles) were combined to increase search performance relative to the original fingerprints.
The individual fingerprint bit positions were selected by ranking them according to a Kullback-Leibler divergence analysis. This analysis measures the difference between the bit distributions of active and inactive/database compounds, so the top-ranked bit positions are those that best discriminate active compounds from background noise. The test set consisted of 27 different compound classes (30 – 160 actives each), with 3.7 million compounds from the ZINC collection as the background database. For each activity class, the 100 top-ranking bit positions of each fingerprint were selected (300 bits in total), and the resulting hybrid fingerprints were compared with the parent fingerprints as well as with their complete combination (2290 bits) using k-nearest neighbor analysis. In almost all cases these activity class-directed fingerprints produced higher recall rates than the parental representations, and the increases in recovery rate were often significant. Interestingly, the recall of the control fingerprint, the complete combination of the parental FPs, was overall much lower than that achieved by the much smaller hybrid FPs.
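The bit-ranking and recombination step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: each bit is treated as a Bernoulli variable, the divergence is computed per bit position from the on/off frequencies in the actives and in the background, and a small `eps` is used to avoid taking the logarithm of zero. The function names (`bit_kl_divergence`, `top_bits`, `build_hybrid`) and the smoothing are hypothetical choices made for this example.

```python
import numpy as np

def bit_kl_divergence(actives, background, eps=1e-6):
    """Per-bit KL divergence between active and background bit distributions.

    actives, background: 0/1 arrays of shape (n_compounds, n_bits).
    Assumption: each bit is modelled as an independent Bernoulli variable.
    """
    # Bit frequencies, clipped so that log() never sees 0 or 1 exactly.
    p = np.clip(actives.mean(axis=0), eps, 1 - eps)
    q = np.clip(background.mean(axis=0), eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def top_bits(actives, background, n_top):
    """Indices of the n_top bit positions with the largest divergence."""
    scores = bit_kl_divergence(actives, background)
    return np.argsort(scores)[::-1][:n_top]

def build_hybrid(fingerprints, selections):
    """Concatenate the selected bit columns of each parent fingerprint."""
    return np.hstack([fp[:, idx] for fp, idx in zip(fingerprints, selections)])
```

With three parent fingerprints and `n_top=100` for each, `build_hybrid` would return a 300-bit representation for one activity class, matching the bit counts quoted above.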
Because the bit positions or features from the separate FP designs are selected independently on the basis of their discriminatory power, using an information-theoretic approach, the capacity to categorize and the search performance of hybrid FPs are in most cases higher than those of their parents. The combination of the different fingerprint designs, which merges representations of substructure and pharmacophoric patterns, emphasizes compound class-specific molecular features, or yields a gain in chemical information, resulting in overall improved performance.
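To make the comparison of search performance concrete, a k-nearest neighbor search and recall calculation might look like the sketch below. The summary does not state the similarity coefficient or the value of k, so the Tanimoto coefficient, the averaging of the k highest similarities, and the names `tanimoto_matrix`, `knn_scores` and `recall_at` are assumptions made for illustration only.

```python
import numpy as np

def tanimoto_matrix(a, b):
    """Pairwise Tanimoto similarity between the rows of 0/1 arrays a and b."""
    a = a.astype(np.int64)
    b = b.astype(np.int64)
    common = a @ b.T  # number of bits set in both fingerprints
    union = a.sum(axis=1)[:, None] + b.sum(axis=1)[None, :] - common
    # Two all-zero fingerprints get similarity 0 instead of a division by zero.
    return np.where(union > 0, common / np.maximum(union, 1), 0.0)

def knn_scores(references, database, k):
    """Score each database compound by its mean similarity to its k nearest references."""
    sims = tanimoto_matrix(database, references)  # shape (n_db, n_ref)
    k = min(k, sims.shape[1])
    top_k = np.sort(sims, axis=1)[:, -k:]
    return top_k.mean(axis=1)

def recall_at(scores, is_active, n_selected):
    """Fraction of all actives found among the n_selected top-scoring compounds."""
    order = np.argsort(scores)[::-1][:n_selected]
    return is_active[order].sum() / is_active.sum()
```

Running the same search once with each parent fingerprint, once with the full concatenation and once with the hybrid fingerprint, and comparing the resulting recall values, mirrors the kind of comparison reported in the paper.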
This publication presents an interesting new approach to fingerprint design that could find significant application in virtual screening projects where the discriminatory power of a single fingerprint class is not sufficient to classify and enrich active compounds in a database selection for a given activity class.