Wednesday, December 15, 2010

Pocket Similarity: Are α Carbons Enough?

By Howard J Feldman and Paul Labute
Source: J. Chem. Inf. Model., 2010, 50(8), 1466–1475.
DOI: 10.1021/ci100210c
Publication Date (Web): August 6, 2010

The authors are from Chemical Computing Group, Inc. in Canada. In this paper, they describe a novel method for measuring protein pocket similarity that uses only the α carbon positions of the pocket residues.

Similarity between protein structures can be used to predict which ligands a protein under study is likely to bind. Proteins with similar sequences and structures often share similar functions. However, proteins with seemingly different structures can still have highly conserved binding sites and functions. In these cases, methods to identify and predict similar protein binding pockets are required.

Known methods of measuring similarity include those based on graph theory, geometry, physicochemical properties of the site atoms, and so on. This paper focuses on measuring pocket similarity from Cα positions and residue identities.

First, a pocket database had to be built for searching. Structures from the Protein Data Bank were used. A pocket was defined as the set of residues within 4.5 Å of a ligand heavy atom. A ligand here meant any nonprotein, non-nucleic acid molecule containing at least one torsion angle. The pocket database contained a total of 133,800 ligand pockets. The coordinates of the Cα atoms and the identities of the pocket residues were also stored in the database.
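The pocket definition can be sketched in a few lines of Python. This is only an illustration under my own assumptions: atoms are given as plain coordinate tuples, a residue counts as "within 4.5 Å" if any of its atoms is that close to a ligand heavy atom, and the function name `pocket_residues` and the toy coordinates are made up for the example.

```python
import math

CUTOFF = 4.5  # Å, the distance threshold quoted in the summary


def pocket_residues(protein_atoms, ligand_atoms, cutoff=CUTOFF):
    """Return sorted residue ids with any atom within `cutoff` Å of a ligand heavy atom.

    protein_atoms: list of (residue_id, (x, y, z))
    ligand_atoms:  list of (x, y, z), heavy atoms only
    """
    pocket = set()
    for res_id, coord in protein_atoms:
        if res_id in pocket:
            continue  # residue already accepted through another atom
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_atoms):
            pocket.add(res_id)
    return sorted(pocket)


# Toy coordinates (hypothetical, not from any PDB entry).
protein = [
    ("ALA10", (0.0, 0.0, 0.0)),
    ("ALA10", (1.5, 0.0, 0.0)),
    ("GLY11", (3.0, 0.0, 0.0)),
    ("SER12", (12.0, 0.0, 0.0)),
]
ligand = [(0.0, 3.0, 0.0)]
print(pocket_residues(protein, ligand))
```

In a real pipeline the coordinates would come from a PDB parser, and only the Cα position and residue identity of each selected residue would be kept.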

In the similarity search, an exhaustive three-dimensional (3D) Cα common subset search was performed. Only atom pairs that were within 1 Å of each other and shared the same residue class were defined as a ‘match’. A minimum of five matches was required for a pocket to be retrieved as a ‘hit’ for the search query.
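The match criterion itself is easy to illustrate. The sketch below assumes the two pockets have already been superposed into a common frame (the exhaustive superposition search is not shown), pairs residues greedily one-to-one, and uses a residue-class table that I invented for the example, since the summary does not list the classes used in the paper.

```python
import math

# Hypothetical residue classes; the paper's actual grouping is not given here.
RESIDUE_CLASS = {
    "ASP": "acidic", "GLU": "acidic",
    "LYS": "basic", "ARG": "basic",
    "LEU": "aliphatic", "ILE": "aliphatic", "VAL": "aliphatic",
    "PHE": "aromatic", "TYR": "aromatic", "TRP": "aromatic",
}


def same_class(a, b):
    ca, cb = RESIDUE_CLASS.get(a), RESIDUE_CLASS.get(b)
    return ca is not None and ca == cb


def count_matches(query, target, max_dist=1.0):
    """Count one-to-one Cα pairs within max_dist Å that share a residue class.

    Both pockets are lists of (residue_name, (x, y, z)) in a common frame.
    """
    used = set()
    matches = 0
    for q_name, q_xyz in query:
        for j, (t_name, t_xyz) in enumerate(target):
            if j in used:
                continue
            if same_class(q_name, t_name) and math.dist(q_xyz, t_xyz) <= max_dist:
                used.add(j)
                matches += 1
                break
    return matches


def is_candidate_hit(query, target, min_matches=5):
    """Five or more matches are needed before a pocket is scored at all."""
    return count_matches(query, target) >= min_matches


query = [("ASP", (0, 0, 0)), ("LYS", (5, 0, 0)), ("LEU", (0, 5, 0)),
         ("PHE", (0, 0, 5)), ("VAL", (5, 5, 0))]
target = [("GLU", (0.5, 0, 0)), ("ARG", (5, 0.5, 0)), ("ILE", (0, 5, 0.5)),
          ("TYR", (0.3, 0, 5)), ("ILE", (5, 5, 0.4))]
print(count_matches(query, target), is_candidate_hit(query, target))
```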

Beyond the number of matches, a more sensitive scoring function was needed, especially for weak hits with only five or six matches. In this paper, an extreme value or Gumbel distribution (EVD) was used to derive a statistics-based scoring function. An EVD is defined by only two parameters, μ and β, where μ is the mode of the distribution and β is proportional to the standard deviation. In simplified terms, for each query pocket, the μ and β of its EVD were computed first. Next, the pocket was superposed pairwise against every pocket in the database, trying to maximize the number of matched residues. If at least five residue matches were found, the score S was computed. A pocket was identified as a hit only when its S score exceeded the cutoff of 26.
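To get a feel for the two EVD parameters, here is a small sketch of the standard Gumbel distribution. The formula for the paper's score S is not given in this summary, so the "energy-like" value printed below (minus the log of the tail probability) is purely my own illustration, and the values of `mu` and `beta` are hypothetical.

```python
import math


def gumbel_cdf(x, mu, beta):
    """P(X <= x) for a Gumbel (type I extreme value) distribution."""
    return math.exp(-math.exp(-(x - mu) / beta))


def gumbel_tail(x, mu, beta):
    """P(X >= x); expm1 keeps precision when the tail is tiny."""
    return -math.expm1(-math.exp(-(x - mu) / beta))


def gumbel_std(beta):
    """Standard deviation, which is proportional to beta."""
    return beta * math.pi / math.sqrt(6)


# Hypothetical parameters for one query pocket (not taken from the paper).
mu, beta = 3.0, 0.8
for n_matches in (5, 6, 8, 10):
    p = gumbel_tail(n_matches, mu, beta)
    # Illustrative energy-like value only; NOT the paper's S.
    print(n_matches, p, -math.log(p))
```

The point of the sketch is that the chance of seeing many matches by accident falls off very quickly to the right of the mode μ, which is what makes a statistics-based score more discriminating than the raw match count.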

Several proteins, including tyrosyl−tRNA synthetase, chorismate mutase and protein kinase, were used to test the performance of the scoring function. The results indicated that the pocket search algorithm and the scoring function could identify similar pockets with high accuracy.

The method was further tested by clustering protein families. In clustering, the scores S were converted into distances, D, which were then clustered with the hierarchical agglomerative clustering method. The results indicated that the proteins clustered clearly into their respective families, with two exceptions.
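The clustering step can be sketched as follows. The summary does not say how S is converted to D or which linkage was used, so the conversion D = 1 / (1 + S), the average linkage, the distance cutoff, and the toy scores are all assumptions of mine.

```python
def scores_to_distances(scores):
    """Hypothetical conversion: high similarity score -> small distance."""
    n = len(scores)
    return [[0.0 if i == j else 1.0 / (1.0 + scores[i][j]) for j in range(n)]
            for i in range(n)]


def agglomerate(dist, labels, cutoff):
    """Average-linkage agglomerative clustering.

    Merges the closest pair of clusters until that pair is farther apart
    than `cutoff`, then returns the clusters as lists of labels.
    """
    clusters = [[i] for i in range(len(labels))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, ia, ib = min(
            (linkage(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        if d > cutoff:
            break
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return [sorted(labels[i] for i in c) for c in clusters]


# Toy pairwise scores (made up): two kinase pockets and two protease pockets.
labels = ["kinaseA", "kinaseB", "proteaseA", "proteaseB"]
scores = [
    [0, 40, 2, 1],
    [40, 0, 3, 2],
    [2, 3, 0, 35],
    [1, 2, 35, 0],
]
print(agglomerate(scores_to_distances(scores), labels, cutoff=0.1))
```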

In conclusion, this paper described a simple energy-like scoring function for protein pocket superposition, based on the extreme value distribution and an exhaustive search of all possible Cα superpositions. This model could accurately identify similar protein pockets in a large database. The method was also able to produce standard protein clustering, at both the family and subfamily levels. The search was fast and could be run over the whole PDB on a single CPU within a few minutes.

Thursday, December 09, 2010

Rational Approaches for the Design of Effective Human Immunodeficiency Virus Type 1 Nonnucleoside Reverse Transcriptase Inhibitors

Authors: Sergio R. Ribone, Mario A. Quevedo, Marcela Madrid, and Margarita C. Briñón

Journal: J. Chem. Inf. Model., Article ASAP

DOI: 10.1021/ci1001636

Publication Date (Web): December 6, 2010

Mutation in viral strains is becoming a major global health problem. Rapid mutation in viruses makes it difficult to find the right drugs for the new structure. It is vital to know what kinds of interactions are important in the complex between the target protein and the inhibitor. This understanding may help in finding better drug candidates or, if the target protein mutates, in knowing what physicochemical properties to look for in a new drug.

Human immunodeficiency virus type 1 (HIV-1) is one of the widely studied viruses. HIV-1 reverse transcriptases (RTs) are the enzymes responsible for the replication of the virus and hence have been considered one of the main therapeutic targets for anti-HIV drugs used in the treatment of AIDS. A variety of inhibitors that bind to these enzymes are known to date. In the present paper, Ribone et al. study the binding of several classes of nonnucleoside reverse transcriptase inhibitors (NNRTIs) to two forms of HIV-1 RT: the wild type (wtRT) and the K103N mutant (mRT). For this extensive study, they use molecular dynamics and energy decomposition methods. Through a careful comparison of properties of these NNRTIs, such as hydrogen bonding, quantitative free energy analyses, and molecular interactions in the binding pocket, they examine each drug's potency. The molecular basis of the interaction between NNRTIs and RT presented here provides a novel quantitative approach for the design of effective anti-HIV drugs and may serve as a general approach in other drug discovery efforts.

The authors also mention that one of the mRT-compound complexes they studied was in fact later resolved crystallographically and deposited in the RCSB Protein Data Bank (PDB). This illustrates how computational studies can help in finding new drugs. The authors study NNRTIs, which elicit RT inhibition by binding to a pocket identified as the nonnucleoside inhibitor binding pocket (NNIBP). The NNRTIs studied here range from the first-generation inhibitors, which are voluminous and rigid structures (e.g. nevirapine), to the imidoylthiourea (ITU) and subsequently to the more potent diaryltriazine (DATA) and diarylpyrimidine (DAPY) analogues. DATA and DAPY inhibitors are similar in volume but possess a central ring joining two flexible lateral rings. Crystallographic studies show that nevirapine binds in a butterfly-like conformation, while ITU, most DATA, and DAPY inhibitors bind in a horseshoe conformation. The DAPYs have increased flexibility inside the binding site and are very potent inhibitors of both wild-type and mutated RT. The internal surface of the NNIBP is predominantly hydrophobic. NNRTIs are known to establish extensive hydrophobic interactions with NNIBP residues, including van der Waals and π-π stacking interactions with aromatic amino acid side chains, and to form hydrogen bonds with the backbones of hydrophilic residues.

The experimental data available before this paper suggested that mutations cause drug resistance by different mechanisms, among them impairment of inhibitor accessibility to the NNIBP, modification of critical intermolecular interactions, or introduction of steric hindrance. In the present work, the authors present a detailed analysis of the structure, molecular interactions in the binding pocket, energetics, and dynamics of drugs ranging from first- to third-generation compounds, bound to wtRT and K103N mutated RT (mRT). The molecular dynamics simulations were followed by free energy decomposition analyses, aiming at a quantitative structure-activity model that correlates the binding energetics with the reported anti-HIV activities (EC50).

During the 3 ns molecular dynamics (MD) simulations, the authors studied the hydrogen bonding, either direct or mediated by solvent water, between each compound and the residues of the binding pocket. They also studied the energy decomposition along the MD trajectories and found that the antiviral potencies of the compounds against the enzymes were closely related to their van der Waals energetic components. The decrease in potency correlates with the increase in the van der Waals energy, which also agrees with the experimental findings. The authors also found that the K103N mutation in the binding pocket results in an increase in the electrostatic and van der Waals energies, in agreement with the marked loss of inhibitory activity against the mutated RT strain. They establish a strong correlation between the van der Waals energetic component and the reported antiviral activity against wtRT. The correlation suggests that higher potencies are expected for compounds that establish hydrophobic contacts with the residues in the binding pocket, in agreement with the high hydrophobicity of the NNIBP of wtRT. Finally, the authors studied the torsion angle distributions in detail and showed that, to enhance antiviral activity, the torsions at the central ring of the compounds have to maximize the π-π stacking interactions with the residues in the binding pocket.

In conclusion, the authors have studied in great detail the molecular interactions and the origins of the potency of anti-HIV drugs. They applied molecular docking and molecular dynamics techniques to study the binding interactions, flexibility, and free energies of binding of several NNRTIs complexed to wtRT and K103N mRT, and correlated these properties with the reported biological activities. The authors found that all ITU, DATA, and DAPY compounds studied show sustained hydrogen bonds with the Lys101 residue in the binding pocket and, in some cases, also with Glu138 (however, the existence of this additional hydrogen bond does not correlate with a higher biological activity against wtRT). These two residues can therefore be regarded as important for future drug discovery. The authors establish a linear relationship between the biological activity of all complexes and the van der Waals energetic component when bound to wtRT. In their view, this can be regarded as a predictive tool for the design of effective inhibitors of wtRT. The K103N mutation modifies the electrostatic properties of the binding pocket. The authors show that RT inhibitors with activity against both wild-type and K103N mutated HIV strains show (a) hydrophobic interactions with the NNIBP, and (b) specific intermolecular interactions with a hydrophilic region in the lower part of the binding pocket. According to the correlation observed in this work, a potent inhibitor of wtRT must maximize its van der Waals interactions in the binding pocket.

The study presented here provides a valuable methodology for the rational design of effective inhibitors with better therapeutic profiles for the treatment of AIDS and in general for other pandemic diseases.


Wednesday, December 08, 2010

Learning from the Data: Mining of Large High-Throughput Screening Databases


Authors: Frank Yan, Frederick J. King, Yun He, Jeremy S. Caldwell, and Yingyao Zhou.
Source: J. Chem. Inf. Model., 2006, 46 (6), pp 2381-2395

This paper was written by a group from Novartis based on mining of their internal knowledge base which included over 200 ultrahigh throughput screening campaigns, 74 of which were complete enough for the detailed mining. They report that over 1 million compounds were screened in each of their HTS campaigns.

The authors applied a custom algorithm called “Ontology-based Pattern Identification (OPI)”. The OPI algorithm identified approximately 1,500 scaffold families with significant structure-HTS activity profile relationships. Using OPI, they identified four types of compound scaffolds in their database: tumor cytotoxic, general toxic, potential reporter gene assay artifact, and target family specific. The algorithm was able to identify compounds that were structurally alike and also shared statistically significant biological activity profiles. Classifying the compounds allowed them to annotate their library on a scaffold basis, which may be a useful resource for further study of the data.

Their OPI data mining algorithm was based on research performed on gene function prediction. They adapted the algorithm to find large scale correlations between chemical scaffolds and biological profiles from the HTS data set. They were able to identify structurally similar compounds and demonstrated strong neighborhood behavior. They describe their approach as replacing HTS activities against individual targets by activity profiles against a battery of assays. They applied multigroup statistical validation tests from the realm of bioinformatics.

They summarized their OPI algorithm to identify core members of a compound family as:

1. For each compound family C

2. Construct a representative biological profile Qc

3. Score compound i based on the similarity Si = Sim(Qc, Qi)

4. Rank all compounds based on the score Si in descending order

5. For each possible similarity cutoff S

6. Calculate probability P = P(S)

7. The S* that leads to a minimum P is chosen, i.e. S* = argmin_S P(S)

8. Family members with Si > S* and i ∈ C are identified as the core
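The eight steps above can be sketched for a single family as follows. Several pieces are my own assumptions, because the summary does not define them: the representative profile Qc is taken as the mean of the family members' profiles, Sim is the Pearson correlation, P(S) is a hypergeometric tail probability (the chance of finding at least that many family members above the cutoff by accident), and each rank position is treated as a candidate cutoff. The compound names and activity profiles are toy data.

```python
from math import comb


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0


def hypergeom_tail(N, K, n, k):
    """P(at least k family members among the top n of N compounds, K in the family)."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def core_members(profiles, family):
    # Step 2: representative profile Qc (assumed: mean of family profiles).
    q_c = [sum(col) / len(col) for col in zip(*(profiles[c] for c in family))]
    # Steps 3-4: score every compound against Qc and rank in descending order.
    ranked = sorted(profiles, key=lambda c: pearson(q_c, profiles[c]), reverse=True)
    # Steps 5-7: try each rank position as the cutoff and keep the smallest P.
    N, K = len(profiles), len(family)
    best_p, best_n, hits = 1.0, 0, 0
    for n, c in enumerate(ranked, start=1):
        hits += c in family
        p = hypergeom_tail(N, K, n, hits)
        if p < best_p:
            best_p, best_n = p, n
    # Step 8: family members above the chosen cutoff form the core.
    return [c for c in ranked[:best_n] if c in family], best_p


# Toy activity profiles over four assays (made-up numbers).
profiles = {
    "a": [0.9, 0.1, 0.8, 0.2],
    "b": [0.8, 0.2, 0.9, 0.1],
    "c": [0.1, 0.9, 0.2, 0.8],  # family member with an off-pattern profile
    "d": [0.2, 0.8, 0.1, 0.9],
    "e": [0.5, 0.5, 0.4, 0.6],
    "f": [0.3, 0.7, 0.2, 0.7],
}
core, p_value = core_members(profiles, family={"a", "b", "c"})
print(core, p_value)
```

In this toy run the off-pattern member "c" is left out of the core even though it belongs to the same structural family, which is the behavior the algorithm is meant to provide.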

One of their goals was to allow their researchers to “fail early, fail cheap” by being able to mine their data to filter undesirable scaffolds during the lead identification phase.

The authors cite several specific examples of their results including:

· Two anticancer chemotypes were correctly annotated as tumor cytotoxic, falling into a cluster of six compounds including other known anticancer compounds and diastereoisomers. This six-member group also had a common anthracycline scaffold.

· A tellurium-containing scaffold showed general toxicity in proliferation and reporter gene assays; the other members of this compound group also contained tellurium.

· A spironolactone-like scaffold demonstrated target family specific activities against a nuclear receptor family.


Tuesday, December 07, 2010

Drug- and Lead-likeness, Target Class, and Molecular Diversity Analysis of Commercially Available Organic Compounds Provided by 29 Suppliers

Authors: A. Chuprina, O. Lukin, R. Demoiseaux, A. Buzko, and A. Shivanyuk
Source: J. Chem. Inf. Model, 2010, 50, pp. 470 – 479.

The purpose of this paper is to determine how many molecules in the chemical suppliers' catalogs are drug-like or lead-like. The dataset starts from 7.9 million commercially available compounds, of which 5.2 million were considered distinct. MySQL was the database used to store the structures, calculated parameters, and supplier information. The compounds were represented by SMILES strings that were converted using the JChem program. Physicochemical properties were calculated using LigPrep and QikProp, whereas biological activities were predicted with the PASS software, a predictive algorithm based on a structural similarity search against approximately 60,000 known biologically active compounds. JChem chemical fingerprint software was used to perform the cluster analysis with the "sphere exclusion" method. The 5.2 million compounds were filtered with specific criteria depending on whether drug-likeness or lead-likeness was being assessed.


The drug-like filters were based on the Lipinski and Veber rules. These include logP, logS, membrane permeability, molecular weight, number of hydrogen-bond donors and acceptors, lipophilicity, polar surface area, and rotatable bonds. Compounds with reactive or toxic functional groups were also filtered out. The suppliers' collections were characterized based on the molecular weight, log S, and ClogP of the filtered compounds. In other words, a comparison of each supplier's own collection showed a hierarchy based on the percentage of molecules with passable scores. Overall, the percentage of compounds with drug-like characteristics has increased since 2004, when a similar search was last reported.

The lead-like filters were chosen based on properties from the Hann and Oprea paper. The criteria are 200 < MW < 460, -4 < ClogP < 4.2, Hacc ≤ 9, Hdon ≤ 5, rotatable bonds ≤ 10, PSA ≤ 170, Caco-2 ≥ 100, and -5 < log S < 0.5. As with the drug-likeness criteria, compounds containing toxic or reactive functional groups were excluded. The cutoffs for the lead-like filters were stricter, leaving less room for variation than the drug-like filters. Chemical leads do not necessarily have significant biological activity and will undergo improvement of their physical properties through SAR (structure-activity relationship) studies in the lab. This strict screening of chemical attributes led to a smaller number of compounds.
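The lead-like criteria translate directly into a filter function. The sketch below assumes the descriptors have already been computed elsewhere (the paper used LigPrep and QikProp); the `Compound` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    mw: float          # molecular weight
    clogp: float       # calculated logP
    h_acc: int         # hydrogen-bond acceptors
    h_don: int         # hydrogen-bond donors
    rot_bonds: int     # rotatable bonds
    psa: float         # polar surface area
    caco2: float       # predicted Caco-2 permeability
    log_s: float       # predicted aqueous solubility
    has_reactive_group: bool = False  # toxic or reactive functional group flag


def is_lead_like(c):
    """Apply the lead-like cutoffs listed in the post."""
    return (
        200 < c.mw < 460
        and -4 < c.clogp < 4.2
        and c.h_acc <= 9
        and c.h_don <= 5
        and c.rot_bonds <= 10
        and c.psa <= 170
        and c.caco2 >= 100
        and -5 < c.log_s < 0.5
        and not c.has_reactive_group
    )


# Made-up descriptor values for two compounds.
good = Compound(mw=320.4, clogp=2.1, h_acc=4, h_don=2, rot_bonds=5,
                psa=75.0, caco2=450.0, log_s=-3.2)
too_heavy = Compound(mw=512.0, clogp=2.1, h_acc=4, h_don=2, rot_bonds=5,
                     psa=75.0, caco2=450.0, log_s=-3.2)
print(is_lead_like(good), is_lead_like(too_heavy))
```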

In conclusion, the 10 largest suppliers possess about 90% of the lead-like and drug-like compounds available. With respect to biological activities, the suppliers' molecules have a propensity to be active ones. It was evident that the suppliers also focused on providing more structural diversity in their stock. In addition, suppliers directed their efforts to producing more drug-like compounds rather than lead-like compounds. Having a wide and varied choice of readily available compounds to test will hopefully help researchers find appropriate leads more efficiently in new research projects.

Wednesday, December 09, 2009

Mining large heterogeneous datasets in drug discovery. Expert Opinion on Drug Discovery. 2009

Wild, D.J. Mining large heterogeneous datasets in drug discovery. Expert Opinion on Drug Discovery. 2009; 4(10), pp 995-1004

“Data mining is defined as the process of identifying valid, novel, potentially useful, and ultimately understandable patterns from large collections of data. The applications of data mining in drug discovery include pattern discovery in databases.” (Quoted from this paper).

This paper is a detailed review of data mining as applied in drug discovery. It reviews the publicly available large-scale databases relevant to drug discovery, describes the data mining approaches that can be applied to them, and discusses recent work in integrative data mining that looks for associations spanning multiple resources, including Semantic Web techniques. The author claims that data mining of large heterogeneous data sets requires intelligent web technologies and semantics.

After reading this paper, I summarize the information in terms of three aspects: publicly available large-scale information sources, data mining tools, and web-based technologies.

1) Publicly available large-scale information sources:

Chemical information (PubChem; ChemSpider, a chemistry search engine); protein information (sequence databases, e.g. UniProt; 3D protein structure databases, e.g. the Protein Data Bank); genomic and nucleotide information; disease-level information (less work has been done in this area, but it is significant because it bridges the gap between gene and patient, and is called ‘disease informatics’; resources: disease-gene databases, DrugBank, genotype-phenotype databases); scholarly publications (difficulties: few open access journals, and PDF is not machine-readable; resources: PubMed, PubMed Central).

2) Tools for data mining:

(a) Knowledge Discovery in Databases (KDD). Like data mining, it is usually defined as the process of identifying valid, novel, potentially useful and ultimately understandable patterns from large collections of data. The most common model of KDD is a 7-step process: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation. Knowledge discovery goals can be descriptive or predictive.

(b) Searching for a query or for items similar to a query. This can be classified into structure searching, substructure searching, and similarity searching. BLAST is a commonly used tool for calculating sequence similarity.

(c) Unsupervised learning (detecting data patterns without training data). The most common unsupervised learning method is clustering (K-means, hierarchical clustering). Unsupervised clustering has been widely applied to drug discovery databases, including the organization of chemical structures into series for analysis, visualization, or prediction; the organization of microarray data; and the analysis and display of genome-wide expression patterns, inter alia.

(d) Supervised learning (using data with known properties to train models or classifiers that can make predictions for data with unknown properties). Supervised learning methods include Bayesian inference, decision trees, etc.

(e) Association rule mining (ARM). Compared with other machine learning methods, ARM is less used in drug discovery, but it is becoming increasingly important. ARM aims to discover statistical relationships between data elements.

3) Web-based technologies are important in integrated data mining systems, including semantic and ontological languages (XML, OWL, RSS, RDF), web services, intelligent agents, and inference tools. XML is a markup language intended to convey metadata (i.e. information about data) and is thus useful for describing different kinds of data and databases. XML is particularly useful when used in conjunction with languages for describing the valid entities in a particular domain (ontologies) and rules about how these entities relate to each other. These languages include OWL and RDF. RSS is an XML-based format through which users can discover new information that may be of interest. This paper also discusses the advantages of intelligent agents, which offer the possibility of automated mediation of the large amounts of information that come from multiple databases. Intelligent agents exhibit four properties: autonomy, social ability (communicating with other agents or humans), reactivity (reacting to changes in their environment), and pro-activeness (they can take initiative in acting, not necessarily just as a response to external stimulus). Furthermore, the importance of the Semantic Web in conjunction with chemical informatics and drug discovery is discussed. Finally, the author proposes considering a new field of drug discovery informatics in order to maximize the use of electronic information and computation for the discovery of the next generation of therapies and medicines.


Linguistic feature analysis for protein interaction extraction

Authors: Timur Fayruzov, Martine De Cock, Chris Cornelis and Veronique Hoste
Source: BMC Bioinformatics, 2009


Most text mining approaches rely implicitly or explicitly on linguistic data extracted from text, but only a few attempts have been made to evaluate the contribution of the different feature types. In this article, the authors contribute to this evaluation by studying the relative importance of deep syntactic features, shallow syntactic features, and lexical features.


The authors use a dependency tree that represents the syntactic structure of a sentence. The nodes of the tree are the words of the sentence and the edges represent the dependencies between words. The part of the dependency tree most relevant to the relation between two proteins is the subtree corresponding to the shortest path between them. The authors also mention some related work: a kernel that naturally emerges from the subsequence kernel and obtains good results on the AIMed corpus; work that focuses on sentence structure and uses dependency trees to extract the local contexts of the protein names; a proposal to abstract from lexical features and use only syntactic information, giving a more general classifier that would be suitable for different data sets without retraining; the application of a structured kernel to the protein-protein interaction domain; and a proposal to use the whole dependency tree to build a classifier. The authors performed their experiments on five benchmark data sets: AIMed, BioInfer, IEPA, HPRD50 and LLL. In this work, the authors use two evaluation metrics, namely recall-precision and receiver operating characteristic curves, to provide a more comprehensive analysis of the achieved results.
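The shortest-path idea is simple to demonstrate with a breadth-first search over the dependency edges, treated as undirected. The sentence and its hand-written parse below are my own toy example; a real system would take the edges from a dependency parser.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over (head, dependent) edges, ignoring direction."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, []).append(dep)
        graph.setdefault(dep, []).append(head)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two words are not connected


# Hand-written parse of "ProteinA strongly activates ProteinB in yeast".
edges = [
    ("activates", "ProteinA"),
    ("activates", "strongly"),
    ("activates", "ProteinB"),
    ("activates", "in"),
    ("in", "yeast"),
]
print(shortest_path(edges, "ProteinA", "ProteinB"))
```

The words on this path, together with the dependency labels along it, would then be the raw material for the syntactic features.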


There are two important observations from the results. First, using only grammatical relations (the syntactic kernel) gives performance similar to that of an extended feature set (the lexical kernel). Second, when the training set is much smaller than the test set, the syntactic kernel performs better. The authors conclude that the syntactic kernel provides the best results, whereas the lexical kernel provides the worst results.

Sunday, December 06, 2009

Molecular Fingerprint Recombination: Generating Hybrid Fingerprints for Similarity Searching from Different Fingerprint Types

Authors: B. Nisius, J. Bajorath
Source: ChemMedChem, 2009, 4, 1859 – 1863

Even though fingerprints are very popular in similarity searching and a variety of different designs have been introduced over the years, the combination of different fingerprints into “hybrid fingerprints” is an unexplored design strategy. As subsets of bits are often responsible for the search performance of fingerprints, Nisius and Bajorath explored the potential to identify these preferred bit subsets in fingerprints of different designs and to recombine these bit segments into “hybrid fingerprint” representations. In this way, structural fragments/patterns (MACCS; 166 bits) as well as pharmacophoric features (2D topological: TGD, 2-point, 420 bits representing atom pairs; and TGT, 3-point, 1704 bits representing atom triangles) were combined to increase the search performance compared to the original fingerprints.
The individual fingerprint bit positions were selected by ranking the positions using Kullback-Leibler divergence analysis. This analysis yields a measure of the difference between the bit distributions of active and inactive/database compounds, so that bit positions discriminating between active compounds and the background are selected. The test set consisted of 27 different compound classes (30 – 160 actives each), with 3.7 million compounds from the ZINC collection as the background database. The 100 top-ranking bit positions of each fingerprint were selected for every activity class (300 bits), and their performance was then compared with that of the parent fingerprints as well as the complete combination (2290 bits), using k-nearest neighbor analysis. These activity class-directed fingerprints produced higher recall rates than the parental representations in almost all cases, and the increases in recovery rate were often significant. Interestingly, the recall of the control fingerprint, the combination of the parental fingerprints, was overall much smaller than that achieved by the much smaller hybrid fingerprints.
Because the different bit positions or features from the separate fingerprint designs are selected independently based on their discriminatory power, using an information-theoretic approach, the capacity to categorize and the search performance of hybrid fingerprints are in most cases higher than those of their parents. Combining the different fingerprint designs, merging representations of substructure and pharmacophoric patterns, emphasizes compound class-specific molecular features or yields a gain in chemical information, resulting in overall improved performance.
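The bit-ranking step can be sketched as follows. The exact formula used in the paper is not given in this summary, so I assume the usual per-bit Kullback-Leibler divergence between two Bernoulli distributions (bit frequency among actives versus the background database), with a small clipping constant `eps` to avoid log(0). The four-bit fingerprints are toy data.

```python
import math


def bit_frequencies(fingerprints, eps=1e-3):
    """Fraction of compounds with each bit set, clipped away from 0 and 1."""
    n = len(fingerprints)
    freqs = []
    for k in range(len(fingerprints[0])):
        f = sum(fp[k] for fp in fingerprints) / n
        freqs.append(min(max(f, eps), 1 - eps))
    return freqs


def kl_per_bit(p, q):
    """KL divergence between Bernoulli(p) for actives and Bernoulli(q) for background."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def top_bits(actives, database, n_top):
    """Indices of the n_top bit positions with the largest divergence."""
    p = bit_frequencies(actives)
    q = bit_frequencies(database)
    scores = [kl_per_bit(pk, qk) for pk, qk in zip(p, q)]
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return order[:n_top]


def project(fp, positions):
    """Keep only the selected bit positions of one fingerprint."""
    return [fp[k] for k in positions]


# Toy 4-bit fingerprints (made up).
actives = [[1, 0, 1, 0], [1, 0, 1, 1], [1, 1, 1, 0]]
database = [[0, 0, 1, 0], [0, 1, 1, 1], [1, 0, 1, 0], [0, 0, 1, 1]]
selected = top_bits(actives, database, n_top=2)
print(selected, project(actives[0], selected))
```

A hybrid fingerprint would then be the concatenation of the projected bits from each parent design (100 from each of MACCS, TGD and TGT in the paper). Bit 2 in the toy data is set in every compound, so it carries no discriminating information and ranks last.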

This publication shows an interesting new approach to the design of new fingerprints that could find significant application possibilities in virtual screening projects, where the discriminatory power of one fingerprint class is not sufficient to classify and enrich active compounds in an activity-class database selection.

Saturday, December 05, 2009

QSAR and Drug Design

By David R. Bevan

This journal article is about disciplines such as drug design and environmental risk assessment in which QSAR is currently being applied. QSAR attempts to correlate activities with structural descriptors or physicochemical properties such as hydrophobicity, electronic and steric effects, topology, etc., which are determined empirically or by computational methods. The activities used here include chemical measurements and biological assays.

Louis Hammett contributed to the development of QSAR by correlating the electronic properties of organic acids and bases with their equilibrium constants and reactivity. Difficulties arose when investigators attempted to apply Hammett-type relationships to biological systems, indicating that other structural descriptors were necessary. In this paper, the author gives some example reactions to illustrate the equation and graph for a linear free energy relationship. Later, Hansch recognized the importance of lipophilicity, expressed as the octanol-water partition coefficient, for biological activity. The author also gives an equation that correlates Hammett's electronic parameters with Hansch's measure of lipophilicity. QSARs are now developed using a variety of parameters as descriptors of the structural properties of molecules. Hammett sigma values are often used for electronic parameters, but quantum mechanically derived electronic parameters may also be used. Other descriptors that account for shape, size, lipophilicity, polarizability, and other structural properties have also been devised.
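For readers who have not seen them, the two relationships mentioned above are usually written as shown below. The post does not reproduce the article's own equations, so these are the common textbook forms, and the coefficient names k_1 to k_4 are just placeholders for the fitted regression constants.

```latex
% Hammett equation: K and K_0 are the equilibrium (or rate) constants of the
% substituted and unsubstituted compound, \sigma is the substituent constant,
% and \rho is the reaction constant.
\log\frac{K}{K_0} = \rho\,\sigma

% A common Hansch-type form: C is the concentration producing a standard
% biological response, P is the octanol-water partition coefficient, and
% k_1 \dots k_4 are fitted constants.
\log\frac{1}{C} = -k_1(\log P)^2 + k_2\log P + k_3\,\sigma + k_4
```

The squared log P term lets activity pass through an optimum in lipophilicity rather than increasing without limit.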

Researchers' early attempts to develop drugs based on QSAR consisted primarily of statistical correlations of structural descriptors with biological activities. However, with easy and high-speed access to computational resources, the field evolved into rational drug design, or computer-aided drug design (CADD), which attempts to find a ligand that interacts favorably with the target site on a receptor; the interactions may include hydrophobic, electrostatic, and hydrogen-bonding interactions, solvation energies, etc. But an optimized fit of a ligand in a target site neither guarantees that the desired activity of the drug will be enhanced or that undesired side effects will be diminished, nor takes the pharmacokinetics of the drug into account. There are two main approaches to CADD. First, the ligand-based approach is applicable when the structure of the receptor site is unknown and structurally similar compounds with high activity, with no activity, and with a range of intermediate activities have been identified. This requires conformational analysis, depending on the flexibility of the compounds under investigation, with a strategy of finding the lowest energy conformers of the most rigid compounds and superimposing them to generate the pharmacophore. This template may then be used to develop new compounds with functional groups in the desired positions, on the assumption that the minimum energy conformers will bind most favorably in the receptor site. Second, the receptor-based approach to CADD applies when a reliable model of the receptor site is available, as from X-ray diffraction, NMR, or homology modeling. The problem then lies in designing ligands that interact favorably at the site. Once potential drugs have been identified, other molecular modeling techniques may be applied; for example, geometry optimization may be used to relax the structures and to identify low-energy orientations of drugs in receptor sites. Molecular dynamics may assist in exploring the energy landscape, and free energy simulations can be used to compute the relative binding free energies of a series of putative drugs.
