<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-37358649</id><updated>2011-09-14T08:00:00.180-07:00</updated><category term='pharmacological network'/><category term='QSAR model validation applicability domain cross-validation'/><category term='I571 - Assignment #3'/><category term='I-571 coffee break'/><category term='Labels: I571 (Nov-2008) - Assignment #3'/><category term='Userscript'/><category term='–( 708 | SEPTEMBER 2009 | VOLUME 8 )'/><category term='chirality descriptor QSAR topological'/><category term='metabolism fingerprint prediction reaction xenobiotics'/><category term='ligand similarity'/><category term='Review'/><category term='Drug Discovery Today (2008)'/><category term='I571 - Homework 3'/><category term='cheminformatics'/><category term='Docking'/><category term='oi:10.1016/j.drudis.2009.09.001'/><category term='MQL SMARTS substructure query language'/><category term='QSAR applicability domain toxicology predictive model'/><category term='3D search similarity descriptors'/><category term='Drug Design'/><category term='scaffold hopping review'/><title type='text'>IU Chemical Informatics Journal Club</title><subtitle type='html'>For article summaries and other discussions for the Indiana University Chemical Informatics Journal Club in Bloomington, IN</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/'/><link rel='hub' 
href='http://pubsubhubbub.appspot.com/'/><author><name>David</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>80</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-37358649.post-3352204013420103082</id><published>2010-12-15T08:22:00.001-08:00</published><updated>2010-12-15T10:53:08.049-08:00</updated><title type='text'>Pocket Similarity: Are α Carbons Enough?</title><content type='html'>By Howard J Feldman and Paul Labute&lt;br /&gt;Source: J. Chem. Inf. Model., 2010, 50(8), 1466–1475.&lt;br /&gt;DOI: 10.1021/ci100210c&lt;br /&gt;Publication Date (Web): August 6, 2010&lt;br /&gt;&lt;br /&gt;The authors are from the Chemical Computing Group, Inc. in Canada. In this paper, they describe a novel method for measuring protein pocket similarity that uses only the α carbon positions of the pocket residues.&lt;br /&gt;&lt;br /&gt;Similarity between protein structures can be used to predict which ligands a protein under study is likely to bind. Proteins with similar sequences and structures often share similar functions. However, proteins with seemingly different structures can still have highly conserved binding sites and functions. In these cases, methods to identify and predict similar protein binding pockets are required.&lt;br /&gt;&lt;br /&gt;Known methods of measuring pocket similarity include approaches based on graph theory, geometry, and the physicochemical properties of the site atoms. In this paper, the work focused on measuring pocket similarity from Cα positions and residue identities.&lt;br /&gt;&lt;br /&gt;First, a pocket database had to be built for searching. Structures from the Protein Data Bank were used. 
A pocket was defined as the set of residues within 4.5 Å of a ligand heavy atom. A ligand here meant any nonprotein, non-nucleic acid molecule containing at least one torsion angle. The pocket database contained a total of 133,800 ligand pockets. The coordinates of the Cα atoms and the identities of the pocket residues were also stored in the database.&lt;br /&gt;&lt;br /&gt;In the similarity search, an exhaustive three-dimensional (3D) Cα common subset search was performed. Only atom pairs that were within 1 Å of each other and shared the same residue class were counted as a ‘match’. A minimum of five matches was required for a pocket to be retrieved as a ‘hit’ for the search query.&lt;br /&gt;&lt;br /&gt;Beyond the number of matches, a more sensitive scoring function was needed, especially for weak hits with only five or six matches. In this paper, an extreme value (Gumbel) distribution (EVD) was used to derive a statistics-based scoring function. An EVD is defined by only two parameters, μ and β, where μ is the mode of the distribution and β is proportional to the standard deviation. In simplified terms, the μ and β of the EVD were first computed for each query pocket. Next, the pocket was superposed pairwise against every pocket in the database, maximizing the number of matched residues. If at least five residue matches were found, the score S was computed. A pocket was identified as a hit only when its score S exceeded the cutoff of 26.&lt;br /&gt;&lt;br /&gt;Several proteins, including tyrosyl−tRNA synthetase, chorismate mutase, and protein kinase, were used to test the performance of the scoring function. 
The results indicated that the pocket search algorithm and the scoring function could identify similar pockets with high accuracy.&lt;br /&gt;&lt;br /&gt;The method was further tested by clustering protein families. For clustering, the scores S were converted into distances D, which were then clustered with a hierarchical agglomerative method. The proteins clustered clearly into their respective families, with two exceptions.&lt;br /&gt;&lt;br /&gt;In conclusion, the paper describes a simple energy-like scoring function for protein pocket superposition, based on the extreme value distribution and an exhaustive search of all possible Cα superpositions. The model could accurately identify similar protein pockets in a large database. The method was also able to reproduce standard protein clustering at both the family and subfamily levels. The search is fast and can be run over the whole PDB on a single CPU within a few minutes.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-3352204013420103082?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/3352204013420103082/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=3352204013420103082' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3352204013420103082'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3352204013420103082'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2010/12/pocket-similarity-are-carbons-enough.html' title='Pocket Similarity: Are α Carbons 
Enough?'/><author><name>YZ</name><uri>http://www.blogger.com/profile/14677336996665816779</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-4036467991367491652</id><published>2010-12-09T20:59:00.000-08:00</published><updated>2010-12-09T21:02:16.478-08:00</updated><title type='text'>Rational Approaches for the Design of Effective Human Immunodeficiency Virus Type 1 Nonnucleoside Reverse Transcriptase Inhibitors</title><content type='html'>&lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt;Authors: Sergio R. Ribone, Mario A. Quevedo, Marcela Madrid, and Margarita C. Briñón&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;i&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt;Journal: J. Chem. Inf. 
Model.&lt;/span&gt;&lt;/i&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt;, Article ASAP&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;b&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt;DOI: &lt;/span&gt;&lt;/b&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt;10.1021/ci1001636&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt;Publication Date (Web): December 6, 2010&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt; Mutation in viral strains is becoming a major global health problem. Rapid mutation makes it difficult to find drugs that remain effective against the altered target structure. It is therefore vital to know which interactions are important in the complex between the target protein and the inhibitor. This understanding may help in finding better drug candidates or, if the target protein mutates, in knowing which physicochemical properties to seek in a new drug.&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt; Human immunodeficiency virus type 1 (HIV-1) is one of the most widely studied viruses. HIV-1 reverse transcriptases (RTs) are the enzymes responsible for replication of the virus and have therefore been considered one of the main therapeutic targets for anti-HIV drugs used in the treatment of AIDS. 
A variety of inhibitors that bind to these enzymes are known to date. In the present paper, Ribone et al. study the binding of several classes of nonnucleoside reverse transcriptase inhibitors (NNRTIs) to two forms of HIV-1 RT: the wild type (wtRT) and the K103N mutant (mRT). For this extensive study, they use molecular dynamics and energy decomposition methods. Through a careful comparison of the properties of these NNRTIs, such as hydrogen bonding, quantitative free energy analyses, and molecular interactions in the binding pockets, they examine each drug’s potency. The molecular basis of the interaction between NNRTIs and RT presented here provides a quantitative approach for the design of novel, effective anti-HIV drugs and may serve as a general approach for other drug discovery efforts.&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt; The authors also mention that one of the mRT-compound complexes they studied was later resolved crystallographically and deposited in the RCSB Protein Data Bank (PDB). This illustrates how computational studies can help in finding new drugs. The authors study the NNRTIs, which elicit RT inhibition by binding to a pocket known as the nonnucleoside inhibitor binding pocket (NNIBP). The NNRTIs studied here range from the first-generation inhibitors, which are voluminous and rigid structures (e.g. nevirapine), to the imidoylthiourea (ITU) compounds and subsequently to the more potent diaryltriazine (DATA) and diarylpyrimidine (DAPY) analogues. DATA and DAPY inhibitors are similar in volume but possess a central ring joining two flexible lateral rings. 
Crystallographic studies show that nevirapine binds in a &lt;i&gt;butterfly&lt;/i&gt;-like conformation, while ITU, most DATA, and DAPY inhibitors bind in a &lt;i&gt;horseshoe &lt;/i&gt;conformation. The DAPYs have increased flexibility inside the binding site and are very potent inhibitors of both wild-type and mutated RT. The internal surface of the NNIBP is predominantly hydrophobic. NNRTIs are known to establish extensive hydrophobic interactions with NNIBP residues, including van der Waals and &lt;i&gt;π&lt;/i&gt;-&lt;i&gt;π &lt;/i&gt;stacking interactions with aromatic amino acid side chains, and to form hydrogen bonds with the backbones of hydrophilic residues.&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt; The experimental data available before this paper suggested that mutations cause drug resistance by different mechanisms, among them impairment of inhibitor access to the NNIBP, modification of critical intermolecular interactions, or introduction of steric hindrance. In the present work, the authors present a detailed analysis of the structure, molecular interactions in the binding pocket, energetics, and dynamics of drugs ranging from first- to third-generation compounds, bound to wtRT and K103N mutated RT (mRT). 
The molecular dynamics simulations were followed by free energy decomposition analyses, aiming at a quantitative structure-activity model correlating the binding energetics with the reported anti-HIV activities (EC50).&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt; During the 3 ns molecular dynamics (MD) simulation, the authors studied the hydrogen bonding, either direct or water-mediated, between each compound and the residues of the binding pocket. They also studied the energy decomposition along the MD trajectories and found that the antiviral potencies of the compounds against the enzymes were closely related to their van der Waals energetic components. The decrease in potency correlates with the increase in the van der Waals energy, which is also in agreement with the experimental findings. The authors also found that the K103N mutation in the binding pocket results in an increase in the electrostatic and van der Waals energies, in agreement with the marked loss in inhibitory activity against the mutated RT strain. They establish a strong correlation between the van der Waals energetic component and the reported antiviral activity against wtRT. The correlation suggests that higher potencies are expected for those compounds that establish hydrophobic contacts with the residues in the binding pocket, and it is in agreement with the high hydrophobicity of the NNIBP of wtRT. Finally, the authors studied the torsion angle distributions in detail and showed that, to enhance antiviral activity, the torsions at the central ring of the compounds have to maximize the &lt;i&gt;π&lt;/i&gt;-&lt;i&gt;π &lt;/i&gt;stacking interactions with the residues in the binding pocket. 
&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt; In conclusion, the authors have studied in great detail the molecular interactions and the origins of the potency of the anti-HIV drugs. They applied molecular docking and molecular dynamics techniques to study the binding interactions, flexibility, and free energies of binding of several NNRTIs complexed to wtRT and K103N mRT, and correlated these properties with their reported biological activities. The authors found that all ITU, DATA, and DAPY compounds studied show sustained hydrogen bonds with the Lys101 residue in the binding pocket and, in some cases, also with the Glu138 residue (however, the existence of this additional hydrogen bond does not correlate with a higher biological activity against wtRT). These two residues can therefore be regarded as important for future drug discovery. They establish a linear relationship between the biological activity of all complexes and the van der Waals energetic component when bound to wtRT. In their view, this can be regarded as a predictive tool for the design of effective inhibitors against wtRT. The K103N mutation modifies the electrostatic properties of the binding pocket. The authors show that RT inhibitors with activity against both wild-type and K103N mutated HIV strains show (a) hydrophobic interactions with the NNIBP, and (b) specific intermolecular interactions with a hydrophilic region in the lower part of the binding pocket. According to the correlation observed in this work, a potent inhibitor of wtRT must maximize its van der Waals interactions in the binding pocket. 
&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style="font-size: 12pt; font-family: &amp;quot;Times New Roman&amp;quot;,&amp;quot;serif&amp;quot;;"&gt; The study presented here provides a valuable methodology for the rational design of effective inhibitors with better therapeutic profiles for the treatment of AIDS and in general for other pandemic diseases.&lt;/span&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-4036467991367491652?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/4036467991367491652/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=4036467991367491652' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/4036467991367491652'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/4036467991367491652'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2010/12/rational-approaches-for-design-of.html' title='Rational Approaches for the Design of Effective Human Immunodeficiency Virus Type 1 Nonnucleoside Reverse Transcriptase Inhibitors'/><author><name>Harshad.Joshi</name><uri>http://www.blogger.com/profile/17646431964257375196</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-5695432433334782103</id><published>2010-12-08T23:32:00.000-08:00</published><updated>2010-12-08T23:33:54.207-08:00</updated><title type='text'>Learning from the Data: 
Mining of Large High-Throughput Screening Databases</title><content type='html'>&lt;p class="MsoNormal"&gt;&lt;b style="mso-bidi-font-weight:normal"&gt;&lt;span style="font-size:12.0pt;mso-bidi-font-size:11.0pt;line-height:115%;mso-ascii-font-family: Arial;mso-hansi-font-family:Arial;mso-bidi-font-family:Arial"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family:Arial; mso-bidi-font-family:Arial"&gt;Authors: Frank Yan, Frederick J. King, Yun He, Jeremy S. Caldwell, and Yingyao Zhou.&lt;br /&gt;Source: J. Chem. Inf. Model., 2006, 46 (6), pp 2381-2395&lt;br /&gt;&lt;br /&gt;This paper was written by a group from Novartis based on mining of their internal knowledge base which included over 200 ultrahigh throughput screening campaigns, 74 of which were complete enough for the detailed mining.&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;They report that over 1 million compounds were screened in each of their HTS campaigns.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;The authors applied a custom algorithm called “Ontology-based Pattern Identification (OPI)”.&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;Their OPI algorithm identified approximately 1,500 scaffold families with significant structure-HTS activity profile relationships. 
Using their OPI algorithm, they identified four types of compound scaffolds in their database: tumor cytotoxic, general toxic, potential reporter gene assay artifact, and target family specific.&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;Their OPI algorithm was able to identify compounds that were structurally alike and also shared statistically significant biological activity profiles.&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;Classifying their compounds allowed them to annotate their library on a scaffold basis, which may be a useful resource for further study of the data.&lt;span style="mso-spacerun:yes"&gt;     &lt;/span&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;Their OPI data mining algorithm was based on research performed on gene function prediction.&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;They adapted the algorithm to find large-scale correlations between chemical scaffolds and biological profiles from the HTS data set. 
&lt;span style="mso-spacerun:yes"&gt; &lt;/span&gt;They were able to identify structurally similar compounds and demonstrated strong neighborhood behavior.&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;They describe their approach as replacing HTS activities against individual targets by activity profiles against a battery of assays.&lt;span style="mso-spacerun:yes"&gt;   &lt;/span&gt;They applied multigroup statistical validation tests from the realm of bioinformatics.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;They summarized their OPI algorithm to identify core members of a compound family as:&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpFirst" style="margin-left:1.0in;mso-add-space:auto; text-indent:-.25in;mso-list:l0 level1 lfo2"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-fareast-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;&lt;span style="mso-list:Ignore"&gt;1.&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family:Arial;mso-bidi-font-family: Arial"&gt;For each compound family &lt;b style="mso-bidi-font-weight:normal"&gt;C&lt;/b&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpMiddle" style="margin-left:1.0in;mso-add-space: auto;text-indent:-.25in;mso-list:l0 level1 lfo2"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-fareast-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;&lt;span style="mso-list:Ignore"&gt;2.&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family:Arial;mso-bidi-font-family: Arial"&gt;Construct a representative biological 
profile &lt;b style="mso-bidi-font-weight: normal"&gt;Q&lt;sub&gt;c&lt;/sub&gt;&lt;/b&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpMiddle" style="margin-left:1.0in;mso-add-space: auto;text-indent:-.25in;mso-list:l0 level1 lfo2"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-fareast-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;&lt;span style="mso-list:Ignore"&gt;3.&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family:Arial;mso-bidi-font-family: Arial"&gt;Score each compound i based on the similarity S&lt;sub&gt;i&lt;/sub&gt; = Sim(Q&lt;sub&gt;c&lt;/sub&gt;,Q&lt;sub&gt;i&lt;/sub&gt;)&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpMiddle" style="margin-left:1.0in;mso-add-space: auto;text-indent:-.25in;mso-list:l0 level1 lfo2"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-fareast-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;&lt;span style="mso-list:Ignore"&gt;4.&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family:Arial;mso-bidi-font-family: Arial"&gt;Rank all compounds based on the score S&lt;sub&gt;i&lt;/sub&gt; in descending order&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpMiddle" style="margin-left:1.0in;mso-add-space: auto;text-indent:-.25in;mso-list:l0 level1 lfo2"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-fareast-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;&lt;span style="mso-list:Ignore"&gt;5.&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family:Arial;mso-bidi-font-family: Arial"&gt;For each possible 
similarity cutoff S&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpMiddle" style="margin-left:1.0in;mso-add-space: auto;text-indent:-.25in;mso-list:l0 level1 lfo2"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-fareast-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;&lt;span style="mso-list:Ignore"&gt;6.&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family:Arial;mso-bidi-font-family: Arial"&gt;Calculate the probability P = P(S)&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpMiddle" style="margin-left:1.0in;mso-add-space: auto;text-indent:-.25in;mso-list:l0 level1 lfo2"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-fareast-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;&lt;span style="mso-list:Ignore"&gt;7.&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family:Arial;mso-bidi-font-family: Arial"&gt;The cutoff S&lt;sup&gt;*&lt;/sup&gt; that leads to the minimum P is chosen, i.e. 
S&lt;sup&gt;*&lt;/sup&gt; = arg min&lt;sub&gt;S&lt;/sub&gt; P(S)&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpMiddle" style="margin-left:1.0in;mso-add-space: auto;text-indent:-.25in;mso-list:l0 level1 lfo2"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-fareast-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;&lt;span style="mso-list:Ignore"&gt;8.&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family:Arial;mso-bidi-font-family: Arial"&gt;Family members with S&lt;sub&gt;i&lt;/sub&gt; &gt; S&lt;sup&gt;*&lt;/sup&gt; and i ∈ &lt;b style="mso-bidi-font-weight:normal"&gt;C&lt;/b&gt; are identified as the core&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpLast" style="margin-left:1.0in;mso-add-space:auto"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family:Arial;mso-bidi-font-family: Arial"&gt;&lt;o:p&gt; &lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;One of their goals was to allow their researchers to “fail early, fail cheap” by mining their data to filter out undesirable scaffolds during the lead identification phase.&lt;span style="mso-spacerun:yes"&gt;   &lt;/span&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal"&gt;&lt;span style="mso-ascii-font-family:Arial;mso-hansi-font-family: Arial;mso-bidi-font-family:Arial"&gt;&lt;span style="mso-spacerun:yes"&gt; &lt;/span&gt;The authors cite several specific examples of their results, including:&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpFirst" style="text-indent:-.25in;mso-list:l1 level1 lfo1"&gt;&lt;span style="font-size:11.0pt;font-family:Symbol;mso-fareast-font-family:Symbol; 
mso-bidi-font-family:Symbol"&gt;&lt;span style="mso-list:Ignore"&gt;·&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;         &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:11.0pt;mso-ascii-font-family: Arial;mso-hansi-font-family:Arial;mso-bidi-font-family:Arial"&gt;Two anticancer chemotypes were correctly annotated as tumor cytotoxic, falling into a cluster of six compounds including other known anticancer compounds and diastereoisomers.&lt;span style="mso-spacerun:yes"&gt;  &lt;/span&gt;This 6 member group also had a common anthracycline scaffold.&lt;/span&gt;&lt;span style="font-size:11.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpMiddle" style="text-indent:-.25in;mso-list:l1 level1 lfo1"&gt;&lt;span style="font-size:11.0pt;font-family:Symbol;mso-fareast-font-family:Symbol; mso-bidi-font-family:Symbol"&gt;&lt;span style="mso-list:Ignore"&gt;·&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;         &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:11.0pt;mso-ascii-font-family: Arial;mso-hansi-font-family:Arial;mso-bidi-font-family:Arial"&gt;A tellurium containing scaffold showed general toxicity in proliferation and reporter gene assays, other members of this compounds group contained tellurium.&lt;/span&gt;&lt;span style="font-size:11.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoListParagraphCxSpLast" style="text-indent:-.25in;mso-list:l1 level1 lfo1"&gt;&lt;span style="font-size:11.0pt;font-family:Symbol;mso-fareast-font-family:Symbol; mso-bidi-font-family:Symbol"&gt;&lt;span style="mso-list:Ignore"&gt;·&lt;span style="font:7.0pt &amp;quot;Times New Roman&amp;quot;"&gt;         &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:11.0pt;mso-ascii-font-family: Arial;mso-hansi-font-family:Arial;mso-bidi-font-family:Arial"&gt;A spironolactone-like scaffold demonstrated target family specific activities against a 
nuclear receptor family.&lt;/span&gt;&lt;span style="font-size:11.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-5695432433334782103?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/5695432433334782103/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=5695432433334782103' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5695432433334782103'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5695432433334782103'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2010/12/learning-from-data-mining-of-large-high.html' title='Learning from the Data: Mining of Large High-Throughput Screening Databases'/><author><name>AIT</name><uri>http://www.blogger.com/profile/07821544509281116832</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-8587656662790602812</id><published>2010-12-07T05:58:00.000-08:00</published><updated>2010-12-07T06:08:42.518-08:00</updated><title type='text'>Drug- and Lead-likeness, Target Class, and Molecular Diversity Analysis of Commercially Available Organic Compounds Provided by 29 Suppliers</title><content type='html'>&lt;span style="font-family:georgia;"&gt;&lt;span style="color:#000000;"&gt;&lt;strong&gt;Authors&lt;/strong&gt;:  A. Chuprina, O. Lukin, R. Demoiseaux, A. Buzko, and A. Shivanyuk&lt;br /&gt;&lt;strong&gt;Source:&lt;/strong&gt;  &lt;em&gt;J. 
Chem. Inf. Model&lt;/em&gt;, &lt;strong&gt;2010&lt;/strong&gt;, &lt;em&gt;50&lt;/em&gt;, pp. 470 – 479.&lt;br /&gt;&lt;br /&gt;       &lt;span style="font-size:85%;"&gt;    &lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:85%;color:#000000;"&gt; The purpose of this paper is to determine how many molecules in the chemical suppliers' catalogs are drug-like or lead-like.  The dataset started from 7.9 million commercially available compounds, of which 5.2 million were considered distinct.  A MySQL database was used to store the structures, calculation parameters, and supplier information.  The compounds were represented by SMILES strings that were converted using the JChem program.  Physicochemical properties were calculated using LigPrep and QikProp, whereas biological activities were predicted using the PASS software.  PASS is a predictive algorithm based on a structural similarity search against approximately 60,000 known biologically active compounds.  JChem chemical fingerprint software was used to perform the cluster analysis using the "sphere exclusion" method.  The 5.2 million compounds were filtered with specific criteria depending on whether drug-likeness or lead-likeness was being assessed.  &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;&lt;span style="font-size:85%;color:#000000;"&gt;&lt;br /&gt;            The drug-like filters were based on the Lipinski and Veber rules.  These include logP, logS, membrane permeability, molecular weight, number of hydrogen-bond donors and acceptors, lipophilicity, polar surface area, and number of rotatable bonds.  Compounds with reactive or toxic functional groups were also filtered out.  The suppliers' collections were characterized by the molecular weight, log S, and Clog P of the filtered compounds.  In other words, comparing each supplier's own collection showed a hierarchy based on the percentage of molecules with passable scores.  
Overall, the percentage of compounds with drug-like characteristics has increased since 2004, which was the last time a similar survey was reported. &lt;br /&gt;&lt;br /&gt;            The lead-like filters were chosen based on properties from the Hann and Oprea paper.  The criteria are 200 &lt; MW &lt; 460, -4 &lt; ClogP &lt; 4.2, Hacc ≤ 9, Hdon ≤ 5, rotatable bonds ≤ 10, PSA ≤ 170, Caco-2 ≥ 100, and -5 &lt; log S &lt; 0.5.   As with the drug-likeness criteria, compounds containing toxic or reactive functional groups were excluded as well.  Unlike the drug-like filters, the lead-like cutoffs left no room for variation.  Chemical leads do not necessarily have significant biological activity and will undergo improvement of physical properties through SAR (structure-activity relationship) studies in the lab.  This strict screening of chemical attributes led to a smaller number of compounds.  &lt;br /&gt;&lt;br /&gt;            In conclusion, the 10 largest suppliers possess about 90% of the lead-like and drug-like compounds available.  With respect to biological activities, the suppliers' molecules have a propensity to be active.  It was evident that the suppliers also focused on providing more structural diversity in their stock.  In addition, suppliers directed their efforts toward producing drug-like rather than lead-like compounds.  Access to a wide and varied choice of readily available compounds to test will hopefully help researchers find appropriate leads more efficiently in new research projects. 
&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-8587656662790602812?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/8587656662790602812/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=8587656662790602812' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8587656662790602812'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8587656662790602812'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2010/12/drug-and-lead-likeness-target-class-and.html' title='Drug- and Lead-likeness, Target Class, and Molecular Diversity Analysis of Commercially Available Organic Compounds Provided by 29 Suppliers'/><author><name>Rowena</name><uri>http://www.blogger.com/profile/10319022086827212091</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-8131574177127977016</id><published>2009-12-09T08:39:00.000-08:00</published><updated>2009-12-09T08:45:49.780-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Review'/><title type='text'>Mining large heterogenous datasets in drug discovery. Expert Opinion on Drug Discovery. 
2009</title><content type='html'>   &lt;meta name="Title" content=""&gt; &lt;meta name="Keywords" content=""&gt; &lt;meta equiv="Content-Type" content="text/html; charset=utf-8"&gt; &lt;meta name="ProgId" content="Word.Document"&gt; &lt;meta name="Generator" content="Microsoft Word 2008"&gt; &lt;meta name="Originator" content="Microsoft Word 2008"&gt; &lt;link rel="File-List" href="file://localhost/Users/huinmao/Library/Caches/TemporaryItems/msoclip/0clip_filelist.xml"&gt; &lt;!--[if gte mso 9]&gt;&lt;xml&gt;  &lt;o:documentproperties&gt;   &lt;o:template&gt;Normal.dotm&lt;/o:Template&gt;   &lt;o:revision&gt;0&lt;/o:Revision&gt;   &lt;o:totaltime&gt;0&lt;/o:TotalTime&gt;   &lt;o:pages&gt;1&lt;/o:Pages&gt;   &lt;o:words&gt;29&lt;/o:Words&gt;   &lt;o:characters&gt;170&lt;/o:Characters&gt;   &lt;o:company&gt;Informatics/Indiana University&lt;/o:Company&gt;   &lt;o:lines&gt;1&lt;/o:Lines&gt;   &lt;o:paragraphs&gt;1&lt;/o:Paragraphs&gt;   &lt;o:characterswithspaces&gt;208&lt;/o:CharactersWithSpaces&gt;   &lt;o:version&gt;12.0&lt;/o:Version&gt;  &lt;/o:DocumentProperties&gt;  &lt;o:officedocumentsettings&gt;   &lt;o:allowpng/&gt;  &lt;/o:OfficeDocumentSettings&gt; &lt;/xml&gt;&lt;![endif]--&gt;&lt;!--[if gte mso 9]&gt;&lt;xml&gt;  &lt;w:worddocument&gt;   &lt;w:zoom&gt;0&lt;/w:Zoom&gt;   &lt;w:trackmoves&gt;false&lt;/w:TrackMoves&gt;   &lt;w:trackformatting/&gt;   &lt;w:punctuationkerning/&gt;   &lt;w:drawinggridhorizontalspacing&gt;18 pt&lt;/w:DrawingGridHorizontalSpacing&gt;   &lt;w:drawinggridverticalspacing&gt;18 pt&lt;/w:DrawingGridVerticalSpacing&gt;   &lt;w:displayhorizontaldrawinggridevery&gt;0&lt;/w:DisplayHorizontalDrawingGridEvery&gt;   &lt;w:displayverticaldrawinggridevery&gt;0&lt;/w:DisplayVerticalDrawingGridEvery&gt;   &lt;w:validateagainstschemas/&gt;   &lt;w:saveifxmlinvalid&gt;false&lt;/w:SaveIfXMLInvalid&gt;   &lt;w:ignoremixedcontent&gt;false&lt;/w:IgnoreMixedContent&gt;   &lt;w:alwaysshowplaceholdertext&gt;false&lt;/w:AlwaysShowPlaceholderText&gt;   
&lt;w:compatibility&gt;    &lt;w:breakwrappedtables/&gt;    &lt;w:dontgrowautofit/&gt;    &lt;w:dontautofitconstrainedtables/&gt;    &lt;w:dontvertalignintxbx/&gt;    &lt;w:usefelayout/&gt;   &lt;/w:Compatibility&gt;  &lt;/w:WordDocument&gt; &lt;/xml&gt;&lt;![endif]--&gt;&lt;!--[if gte mso 9]&gt;&lt;xml&gt;  &lt;w:latentstyles deflockedstate="false" latentstylecount="276"&gt;  &lt;/w:LatentStyles&gt; &lt;/xml&gt;&lt;![endif]--&gt; &lt;style&gt; &lt;!--  /* Font Definitions */ @font-face 	{font-family:宋体; 	mso-font-charset:80; 	mso-generic-font-family:auto; 	mso-font-pitch:variable; 	mso-font-signature:1 0 16778254 0 262144 0;} @font-face 	{font-family:Calibri; 	panose-1:2 15 5 2 2 2 4 3 2 4; 	mso-font-charset:0; 	mso-generic-font-family:auto; 	mso-font-pitch:variable; 	mso-font-signature:3 0 0 0 1 0;} @font-face 	{font-family:Cambria; 	panose-1:2 4 5 3 5 4 6 3 2 4; 	mso-font-charset:0; 	mso-generic-font-family:auto; 	mso-font-pitch:variable; 	mso-font-signature:3 0 0 0 1 0;}  /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal 	{mso-style-parent:""; 	margin:0in; 	margin-bottom:.0001pt; 	mso-pagination:widow-orphan; 	font-size:12.0pt; 	font-family:"Times New Roman"; 	mso-ascii-font-family:Cambria; 	mso-ascii-theme-font:minor-latin; 	mso-fareast-font-family:宋体; 	mso-hansi-font-family:Cambria; 	mso-hansi-theme-font:minor-latin; 	mso-bidi-font-family:"Times New Roman"; 	mso-bidi-theme-font:minor-bidi;} @page Section1 	{size:8.5in 11.0in; 	margin:1.0in 1.25in 1.0in 1.25in; 	mso-header-margin:.5in; 	mso-footer-margin:.5in; 	mso-paper-source:0;} div.Section1 	{page:Section1;} --&gt; &lt;/style&gt; &lt;!--[if gte mso 10]&gt; &lt;style&gt;  /* Style Definitions */ table.MsoNormalTable 	{mso-style-name:"Table Normal"; 	mso-tstyle-rowband-size:0; 	mso-tstyle-colband-size:0; 	mso-style-noshow:yes; 	mso-style-parent:""; 	mso-padding-alt:0in 5.4pt 0in 5.4pt; 	mso-para-margin:0in; 	mso-para-margin-bottom:.0001pt; 	mso-pagination:widow-orphan; 	font-size:12.0pt; 	
font-family:"Times New Roman"; 	mso-ascii-font-family:Cambria; 	mso-ascii-theme-font:minor-latin; 	mso-fareast-font-family:"Times New Roman"; 	mso-fareast-theme-font:minor-fareast; 	mso-hansi-font-family:Cambria; 	mso-hansi-theme-font:minor-latin;} &lt;/style&gt; &lt;![endif]--&gt;  &lt;!--StartFragment--&gt; &lt;p class="MsoNormal" style="text-align: justify; text-indent: 27pt;"&gt;&lt;b style=""&gt;&lt;i style=""&gt;&lt;span style="font-family:Calibri;"&gt;Wild, D.J.  &lt;a href="http://www.informahealthcare.com/doi/abs/10.1517/17460440903233738"&gt;&lt;span style="text-decoration: none;color:#000000;" &gt;Mining large heterogenous datasets in drug discovery&lt;/span&gt;&lt;/a&gt;. Expert Opinion on Drug Discovery. 2009; 4(10), pp 995-1004&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;span style=";font-family:Calibri;font-size:14pt;"  &gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="text-align: justify; text-indent: 27pt;"&gt;&lt;span style=";font-family:Calibri;font-size:14pt;"  &gt;“Data mining is defined as the process of identifying valid, novel, potentially useful, and ultimately understandable patterns from large collections of data. 
The applications of data mining in drug discovery include pattern discovery in databases.” (Quoted from the paper.)&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="text-align: justify;"&gt;&lt;span style=";font-family:Calibri;font-size:14pt;"  &gt;This paper is a detailed review of data mining applied to drug discovery. It surveys the publicly available large-scale databases relevant to drug discovery, describes the data mining approaches that can be applied to them, and discusses recent work in integrative data mining that looks for associations spanning multiple resources, including Semantic Web techniques. The author argues that mining large heterogeneous data sets requires intelligent web technologies and semantics. &lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="text-align: justify;"&gt;&lt;span style=";font-family:Calibri;font-size:14pt;"  &gt;After reading this paper, I summarize its content under three headings: large-scale public information sources, data mining tools, and web-based technologies. 
&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="text-align: justify;"&gt;&lt;span style=";font-family:Calibri;font-size:14pt;"  &gt;1) Large-scale public information sources:&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="text-align: justify;"&gt;&lt;span style=";font-family:Calibri;font-size:14pt;"  &gt;Chemical information (PubChem; ChemSpider, a chemistry search engine); protein information (sequence databases, e.g. UniProt; 3D protein structure databases, e.g. the Protein Data Bank); genomic and nucleotide information; disease-level information (less work has been done in this area, but it is significant because it bridges the gap between gene and patient, an area called ‘disease informatics’; resources: disease-gene databases, DrugBank, genotype-phenotype databases); scholarly publications (difficulties: few open-access journals, and PDF is not machine-readable; resources: PubMed, PubMed Central).&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="text-align: justify;"&gt;&lt;span style=";font-family:Calibri;font-size:14pt;"  &gt;2) Tools for data mining: (a) Knowledge Discovery in Databases (KDD). Like data mining, it is usually defined as the process of identifying valid, novel, potentially useful and ultimately understandable patterns in large collections of data. The most common model of KDD is a seven-step process: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. Knowledge discovery goals can be descriptive or predictive. (b) Searching for a query or for items similar to a query, which can be classified into structure searching, substructure searching, and similarity searching. 
BLAST is a commonly used tool for calculating sequence similarity. (c) Unsupervised learning (detecting patterns in data without training data). The most common unsupervised learning method is clustering (k-means, hierarchical clustering). Unsupervised clustering has been widely applied to drug discovery databases, including the organization of chemical structures into series for analysis, visualization or prediction; the organization of microarray data; and the analysis and display of genome-wide expression patterns, inter alia. (d) Supervised learning (using data with known properties to train models or classifiers that can make predictions for data with unknown properties). Supervised learning methods include Bayesian inference, decision trees, etc. (e) Association rule mining (ARM). Compared with other machine learning methods, ARM is less used in drug discovery, but it is becoming increasingly important. ARM discovers statistical relationships between data elements. &lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="text-align: justify;"&gt;&lt;span style=";font-family:Calibri;font-size:14pt;"  &gt;3) Web-based technologies are important in integrated data mining systems, including semantic and ontological languages (XML, OWL, RSS, RDF), web services, intelligent agents and inference tools. XML is a markup language intended to convey metadata (i.e. information about data) and is thus useful for describing different kinds of data and databases. XML is particularly useful when used in conjunction with languages for describing the valid entities in a particular domain (ontologies) and rules about how these entities relate to each other. These languages include OWL and RDF. RSS is an XML-based format that can alert users to new information that may be of interest to them. 
This paper also discusses the advantages of intelligent agents, which offer the possibility of automated mediation of the large amounts of information that come from multiple databases. Intelligent agents exhibit four properties: autonomy, social ability (they communicate with other agents or with humans), reactivity (they react to changes in their environment), and pro-activeness (they can take the initiative in acting, not just respond to external stimuli). The paper also discusses the importance of the Semantic Web in conjunction with chemical informatics and drug discovery. Finally, the author proposes a new field of drug discovery informatics in order to maximize the use of electronic information and computation for the discovery of the next generation of therapies and medicines.&lt;/span&gt;&lt;/p&gt;  &lt;!--EndFragment--&gt; &lt;p&gt;&lt;/p&gt;  &lt;!--EndFragment--&gt; &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-8131574177127977016?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/8131574177127977016/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=8131574177127977016' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8131574177127977016'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8131574177127977016'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/12/mining-large-heterogenous-datasets-in.html' title='Mining large heterogenous datasets in drug discovery. Expert Opinion on Drug Discovery. 
2009'/><author><name>Huina Mao</name><uri>http://www.blogger.com/profile/10311907471962195255</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2229792290072954113</id><published>2009-12-09T08:34:00.000-08:00</published><updated>2009-12-09T08:36:14.312-08:00</updated><title type='text'>Linguistic feature analysis for protein interaction extraction</title><content type='html'>Authors: Timur Fayruzov, Martine De Cock, Chris Cornelis and Veronique Hoste&lt;br /&gt;Source: BMC Bioinformatics, 2009&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;         Most text mining approaches rely implicitly or explicitly on linguistic data extracted from text, but only a few attempts have been made to evaluate the contribution of the different feature types. In this article, the authors contribute to this evaluation by studying the relative importance of deep syntactic features, shallow syntactic features, and lexical features.&lt;br /&gt;        &lt;br /&gt;&lt;br /&gt;       The authors use a dependency tree that represents the syntactic structure of a sentence. The nodes of the tree are the words of the sentence and the edges represent the dependencies between words. The part of the dependency tree most relevant for collecting information about the relation between two proteins is the subtree corresponding to the shortest path between them.  
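&lt;br /&gt;&lt;br /&gt;As a rough illustration of the shortest-path idea (a minimal sketch, not the authors' implementation: the toy sentence, its dependency edges and the function name shortest_dependency_path are all made up for this example), the path between two protein mentions can be found with a breadth-first search over the dependency edges:&lt;br /&gt;
&lt;pre&gt;
```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the shortest path between two words in a dependency tree.

    edges is a list of (head, dependent) pairs. Treating the edges as
    undirected is an assumption of this sketch.
    """
    # Build an undirected adjacency map from the dependency edges.
    neighbours = {}
    for head, dependent in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)

    # Breadth-first search, remembering each word's predecessor.
    previous = {source: None}
    queue = deque([source])
    while queue:
        word = queue.popleft()
        if word == target:
            break
        for nxt in neighbours.get(word, []):
            if nxt not in previous:
                previous[nxt] = word
                queue.append(nxt)

    if target not in previous:
        return []  # the two words are not connected

    # Walk back from the target to the source, then reverse.
    path = []
    word = target
    while word is not None:
        path.append(word)
        word = previous[word]
    return path[::-1]


# Hypothetical parse of "ProteinA strongly interacts with ProteinB".
toy_edges = [
    ("interacts", "ProteinA"),
    ("interacts", "strongly"),
    ("interacts", "with"),
    ("with", "ProteinB"),
]
print(shortest_dependency_path(toy_edges, "ProteinA", "ProteinB"))
```
&lt;/pre&gt;
In this toy sentence the path runs ProteinA, interacts, with, ProteinB; the adverb "strongly" hangs off the path and is dropped, which is the sense in which the shortest path keeps the part of the tree that links the two proteins.&lt;br /&gt;&lt;br /&gt;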
The authors also mention some related work, such as a kernel that naturally emerges from the subsequence kernel and obtains good results on the AIMed corpus. The authors focus on sentence structure and use dependency trees to extract the local contexts of the protein names. They propose to abstract from lexical features and use only syntactic information, to obtain a more general classifier that would be suitable for different data sets without retraining; they apply a structured kernel to the protein-protein interaction domain and propose to use the whole dependency tree to build a classifier. The authors performed their experiments on five benchmark data sets: AIMed, BioInfer, IEPA, HPRD50 and LLL. They use two evaluation metrics, recall-precision and receiver operating characteristic (ROC) curves, to provide a more comprehensive analysis of the results. &lt;br /&gt;      &lt;br /&gt;      &lt;br /&gt;      Two important observations emerge from the results. First, using only grammatical relations (the syntactic kernel) gives performance similar to that of an extended feature set (the lexical kernel). Second, when the training set is much smaller than the test set, the syntactic kernel performs better. 
The authors conclude that the syntactic kernel provides the best results, whereas the lexical kernel provides the worst results.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2229792290072954113?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2229792290072954113/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2229792290072954113' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2229792290072954113'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2229792290072954113'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/12/linguistic-feature-analysis-for-protein.html' title='Linguistic feature analysis for protein interaction extraction'/><author><name>Kuochung Peng</name><uri>http://www.blogger.com/profile/04963167070928354912</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-999255790195008599</id><published>2009-12-06T06:38:00.000-08:00</published><updated>2009-12-06T06:41:00.609-08:00</updated><title type='text'>Molecular Fingerprint Recombination: Generating Hybrid Fingerprints for Similarity Searching from Different Fingerprint Types</title><content type='html'>&lt;span style="font-family:arial;"&gt;Author: B. Nisius, J. 
Bajorath&lt;br /&gt;Source: ChemMedChem, 2009, 4, 1859 – 1863&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;div align="justify"&gt;&lt;span style="font-family:arial;"&gt;Even though fingerprints are very popular in similarity searching and a variety of designs have been introduced over the years, the combination of different fingerprints into “hybrid fingerprints” is an unexplored design strategy. Because subsets of bits are often responsible for the search performance of a fingerprint, the authors explored whether these preferred bit subsets can be identified in fingerprints of different designs and recombined into “hybrid fingerprint” representations. In this way, structural fragments/patterns (MACCS; 166 bits) and pharmacophoric features (2D topological; TGD: 2-point, 420 bits representing atom pairs; TGT: 3-point, 1704 bits representing atom triangles) were combined to increase search performance relative to the original fingerprints.&lt;br /&gt;The individual bit positions were selected by ranking them by Kullback-Leibler divergence. This analysis measures the difference between the bit distributions of active and inactive/database compounds, and so selects bit positions that discriminate between active compounds and background noise. The test set consisted of 27 compound classes (30 – 160 actives each), with 3.7 million compounds from the ZINC collection as the background database. The 100 top-ranking bit positions of each fingerprint were selected for every activity class (300 bits in total), and the resulting hybrid fingerprints were compared with the parent fingerprints and with the complete combination (2290 bits) using k-nearest neighbor analysis. These activity class-directed fingerprints produced higher recall rates than the parental representations in almost all cases, and the increases in recovery rate were often significant. 
Interestingly, the recall of the control fingerprint (the combination of the parental fingerprints) was overall much lower than that of the much smaller hybrid fingerprints.&lt;br /&gt;Because the bit positions or features from the separate fingerprint designs are selected independently on the basis of discriminatory power, rather than by information-theoretic approaches, the capacity to categorize and the search performance of hybrid fingerprints are in most cases higher than those of their parents. Combining the different fingerprint designs, and so merging representations of substructure and pharmacophore patterns, emphasizes compound class-specific molecular features, or adds chemical information, resulting in overall improved performance.&lt;br /&gt;&lt;br /&gt;This publication shows an interesting new approach to fingerprint design that could find significant application in virtual screening projects where the discriminatory power of one fingerprint class is not sufficient to classify and enrich active compounds in an activity-class database selection. 
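&lt;br /&gt;&lt;br /&gt;To make the bit-selection step more concrete, here is a minimal sketch under stated assumptions: fingerprints are represented as sets of on-bit indices, each bit is treated as an independent Bernoulli variable, and a small count correction keeps frequencies away from 0 and 1. The function names and the toy data are hypothetical, and the paper's exact formulation may differ.&lt;br /&gt;
&lt;pre&gt;
```python
import math


def bit_frequencies(fingerprints, n_bits):
    """Fraction of fingerprints that set each bit.

    The +1/+2 correction is an assumption of this sketch; it keeps every
    frequency strictly between 0 and 1 so the logarithms stay defined.
    """
    counts = [0] * n_bits
    for fp in fingerprints:
        for bit in fp:
            counts[bit] += 1
    total = len(fingerprints)
    return [(c + 1) / (total + 2) for c in counts]


def kl_divergence_per_bit(p, q):
    """KL divergence between two Bernoulli distributions with rates p and q."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def top_bits(actives, database, n_bits, k):
    """Rank bit positions by divergence (actives vs. database), keep the top k."""
    p = bit_frequencies(actives, n_bits)
    q = bit_frequencies(database, n_bits)
    scores = [kl_divergence_per_bit(pi, qi) for pi, qi in zip(p, q)]
    ranked = sorted(range(n_bits), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy data: each fingerprint is the set of bit positions that are switched on.
toy_actives = [{0, 1}, {0, 2}, {0, 1}]
toy_database = [{1, 3}, {2, 3}, {1}, {3}]
print(top_bits(toy_actives, toy_database, n_bits=4, k=2))
```
&lt;/pre&gt;
On the toy data, bit 0 (set in every active, never in the database) and bit 3 (never set in an active, common in the database) rank highest. A hybrid fingerprint in the spirit of the paper would repeat this selection for each parent fingerprint and concatenate the selected bits.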
&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-999255790195008599?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/999255790195008599/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=999255790195008599' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/999255790195008599'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/999255790195008599'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/12/molecular-fingerprint-recombination.html' title='Molecular Fingerprint Recombination: Generating Hybrid Fingerprints for Similarity Searching from Different Fingerprint Types'/><author><name>Stefan Furrer</name><uri>http://www.blogger.com/profile/05065895922703854994</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-5160625106981901977</id><published>2009-12-05T17:26:00.000-08:00</published><updated>2009-12-08T17:02:53.228-08:00</updated><title type='text'>QSAR and Drug Design</title><content type='html'>By David R. 
Bevan&lt;br /&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;span style="color:#333333;"&gt;This article is about disciplines, such as drug design and environmental risk assessment, in which QSAR is currently applied. QSAR attempts to correlate activities with structural descriptors or physicochemical properties (hydrophobicity, electronic and steric effects, topology, etc.) that are determined empirically or by computational methods. The activities used here include chemical measurements and biological assays. &lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;span style="color:#333333;"&gt;Louis Hammett contributed to the development of QSAR by correlating the electronic properties of organic acids and bases with their equilibrium constants and reactivity. 
The Hammett equation &lt;/span&gt;&lt;a href="http://2.bp.blogspot.com/_WClVaWA9d5k/SxvWfP9YywI/AAAAAAAAALQ/hkcvHeRe9wY/s1600-h/Untitled.jpg"&gt;&lt;span style="color:#333333;"&gt;&lt;img id="BLOGGER_PHOTO_ID_5412155209564080898" style="FLOAT: left; MARGIN: 0px 10px 10px 0px; WIDTH: 76px; CURSOR: hand; HEIGHT: 31px" alt="" src="http://2.bp.blogspot.com/_WClVaWA9d5k/SxvWfP9YywI/AAAAAAAAALQ/hkcvHeRe9wY/s320/Untitled.jpg" border="0" /&gt;&lt;/span&gt;&lt;/a&gt;&lt;span style="color:#333333;"&gt;encountered difficulties when investigators attempted to apply Hammett-type relationships to biological systems, indicating that other structural descriptors were necessary. In the paper, the author gives some example reactions to illustrate the equation and the graph of a linear free energy relationship. Later, Hansch recognized the importance for biological activity of lipophilicity, expressed as the octanol-water partition coefficient. The author also gives&lt;/span&gt;&lt;a href="http://1.bp.blogspot.com/_WClVaWA9d5k/SxvXPIY7A5I/AAAAAAAAALY/aVw-MAC-zGY/s1600-h/Untitled.jpg"&gt;&lt;span style="color:#333333;"&gt;&lt;img id="BLOGGER_PHOTO_ID_5412156032165806994" style="FLOAT: left; MARGIN: 0px 10px 10px 0px; WIDTH: 183px; CURSOR: hand; HEIGHT: 34px" alt="" src="http://1.bp.blogspot.com/_WClVaWA9d5k/SxvXPIY7A5I/AAAAAAAAALY/aVw-MAC-zGY/s320/Untitled.jpg" border="0" /&gt;&lt;/span&gt;&lt;/a&gt;&lt;span style="color:#333333;"&gt; the correlation between Hammett's electronic parameters and Hansch's measure of lipophilicity using this equation. QSARs are now developed using a variety of parameters as descriptors of the structural properties of molecules. 
Hammett sigma values are often used for electronic parameters, but quantum mechanically derived electronic parameters may also be used. Other descriptors that account for shape, size, lipophilicity, polarizability, and other structural properties have also been devised.&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;span style="color:#333333;"&gt;Researchers’ attempts to develop drugs based on QSAR consisted primarily of statistical correlations of structural descriptors with biological activities. With easy, high-speed access to computational resources, however, the field evolved into rational drug design, or computer-aided drug design (CADD), which attempts to find a ligand that interacts favorably with a target site on a receptor; the interactions considered may include hydrophobic, electrostatic and hydrogen-bonding interactions, solvation energies, etc. But an optimized fit of a ligand in a target site guarantees neither that the desired activity of the drug will be enhanced nor that undesired side effects will be diminished, and it does not take the pharmacokinetics of the drug into account. There are two main approaches to CADD. First, the ligand-based approach is applicable when the structure of the receptor site is unknown and structurally similar compounds with high activity, with no activity, and with a range of intermediate activities have been identified. It requires conformational analysis, the extent of which depends on the flexibility of the compounds under investigation; one strategy is to find the lowest-energy conformers of the most rigid compounds and superimpose them to generate the pharmacophore. This template may then be used to develop new compounds with functional groups in the desired positions, on the assumption that the minimum-energy conformers will bind most favorably in the receptor site. Second, the receptor-based approach to CADD applies when a reliable model of the receptor site is available, as from X-ray diffraction, NMR, or homology modeling. 
The problem then lies in designing ligands that interact favorably at the site. Once potential drugs have been identified, other molecular modeling techniques may be applied; for example, geometry optimization may be used to stabilize the structures and to identify low-energy orientations of drugs in receptor sites. Molecular dynamics may assist in exploring the energy landscape, and free energy simulations can be used to compute the relative binding free energies of a series of putative drugs.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-5160625106981901977?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/5160625106981901977/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=5160625106981901977' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5160625106981901977'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5160625106981901977'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/12/qsar-and-drug-design.html' title='QSAR and Drug Design'/><author><name>Rohan Patil</name><uri>http://www.blogger.com/profile/15299634532963258412</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-KgOOWFrKiIA/TWfZi2IQXWI/AAAAAAAACXI/sgFoH2sDGsM/s220/rspatil_l.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_WClVaWA9d5k/SxvWfP9YywI/AAAAAAAAALQ/hkcvHeRe9wY/s72-c/Untitled.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-6116348538391285408</id><published>2009-11-18T07:16:00.000-08:00</published><updated>2009-11-18T07:20:18.833-08:00</updated><title type='text'>STITCH 2: an interaction network database for small molecules and proteins</title><content type='html'>Authors: Michael Kuhn, Damian Szklarczyk, Andrea Franceschini, Monica Campillos, Christian von Mering, Lars Juhl Jensen, Andreas Beyer and Peer Bork&lt;br /&gt;Source: Nucleic Acids Research, 2009, 1–5&lt;br /&gt;&lt;br /&gt;In this article, the authors present STITCH 2 (&lt;a href="http://stitch.embl.de/"&gt;http://stitch.embl.de&lt;/a&gt;), which contains interaction information for small molecules and proteins, presented as networks.&lt;br /&gt;The STITCH database aggregates various publicly accessible databases, such as the PDSP Ki Database, Protein Data Bank, KEGG, Reactome, NCI-Nature Pathway Interaction Database, DrugBank, MATADOR, GLIDA, PharmGKB, Comparative Toxicogenomics Database, and BindingDB.  According to the article, STITCH covers 650 organisms and 74,000 chemicals, including 2200 drugs. The STITCH network also contains "actions" information, such as whether a chemical acts as an activator or an inhibitor. These actions are derived by natural language processing (NLP) and from pathway and interaction databases.  For chemical structures, which are imported from PubChem, STITCH uses SMILES and InChI. Chemical structures in search results are linked to Google and ChemSpider.&lt;br /&gt;Lastly, users can not only search networks of chemicals and proteins interactively, but also download the complete set of interactions. 
In addition, STITCH provides an API for retrieving interaction networks.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;&lt;a href="http://2.bp.blogspot.com/_IJCpekjdhLQ/SwQO5RgWe7I/AAAAAAAADJE/jmgEb8JW1KE/s1600/gkp937v11.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://2.bp.blogspot.com/_IJCpekjdhLQ/SwQO5RgWe7I/AAAAAAAADJE/jmgEb8JW1KE/s200/gkp937v11.png" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-6116348538391285408?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/6116348538391285408/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=6116348538391285408' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6116348538391285408'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6116348538391285408'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/11/stitch-2-interaction-network-database.html' title='STITCH 2: an interaction network database for small molecules and proteins'/><author><name>JaeHong Shin</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://3.bp.blogspot.com/_IJCpekjdhLQ/SwQl3BHImII/AAAAAAAADJQ/gDR7Zr_kiTo/s1600-R/profile_picture-full%3Binit:.JPG'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_IJCpekjdhLQ/SwQO5RgWe7I/AAAAAAAADJE/jmgEb8JW1KE/s72-c/gkp937v11.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-112181367780030846</id><published>2009-11-04T06:54:00.000-08:00</published><updated>2009-11-04T06:55:38.989-08:00</updated><title type='text'>Finding Key Members in Compound Libraries by Analyzing Networks of Molecules Assembled by Structural-Similarity</title><content type='html'>&lt;div align="justify"&gt;Authors: Zsolt Lepp, Chunfei Huang, Takashi Okada&lt;br /&gt;Source: J. Chem. Inf. Model. ASAP&lt;br /&gt;&lt;br /&gt;This paper investigated the use of networks to analyze chemical libraries, with a view to finding central key members and assessing diversity. Central molecules in compound libraries are often found by clustering based on structural similarity, with the cluster centers corresponding to the central structures. In the same way, this publication explored molecular similarity networks to identify the central nodes, which correspond to the molecules in central positions of the chemical space. To do this, networks based on different types of molecular similarity (Russell-Rao, Tanimoto, Baroni-Urbani and Yule coefficients on fingerprints) were investigated, including Minimum Spanning Trees and Threshold Networks, along with various network layouts.&lt;br /&gt;Depending on whether similarity or dissimilarity was used as the distance between the nodes, the centers contained either central, similar compounds (similarity) or the most unique structures (dissimilarity).&lt;br /&gt;While the visualization capability of the networks/trees is the most evident advantage for analyzing molecular libraries, the minimum spanning tree method also has some drawbacks. In this network, only the minimum number of connections is drawn, leading to a loss of information, as all the network edges might contain important information. A solution to this problem is to create a threshold network, drawing only the edges whose weight reaches a certain threshold. 
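&lt;br /&gt;&lt;br /&gt;As a rough illustration of the threshold idea, here is a minimal sketch in Python. It is not the authors' implementation: the fingerprints are made-up sets of "on" bit positions, the function names tanimoto, threshold_network and degree_ranking are assumptions chosen for this example, and node degree is used only as a simple stand-in for centrality.&lt;br /&gt;
&lt;pre&gt;
```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return len(fp_a.intersection(fp_b)) / union


def threshold_network(fingerprints, cutoff):
    """Return edges (name_a, name_b, similarity) whose similarity reaches the cutoff."""
    edges = []
    for (name_a, fp_a), (name_b, fp_b) in combinations(sorted(fingerprints.items()), 2):
        sim = tanimoto(fp_a, fp_b)
        if sim >= cutoff:
            edges.append((name_a, name_b, sim))
    return edges


def degree_ranking(edges):
    """Rank nodes by their number of edges (most connected first)."""
    degree = {}
    for name_a, name_b, _ in edges:
        degree[name_a] = degree.get(name_a, 0) + 1
        degree[name_b] = degree.get(name_b, 0) + 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy library: each molecule is a set of fingerprint bit positions.
library = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 5},
    "mol_c": {1, 2, 3, 4, 5},
    "mol_d": {7, 8, 9},
}
edges = threshold_network(library, cutoff=0.7)
ranking = degree_ranking(edges)
```
&lt;/pre&gt;
With this toy data only the two edges to mol_c survive the cutoff, so mol_c comes out as the most connected node, and mol_d, which shares no bits with the others, stays isolated. Lowering the cutoff adds edges and raising it prunes them.&lt;br /&gt;&lt;br /&gt;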
Such use of molecular networks could be applied to clustering and virtual screening, library analysis and the selection of compound series.&lt;br /&gt;Using adenosine antagonists as an example, it was shown that network-based clustering can be used effectively for clustering compound libraries, with a performance comparable to or better than Ward clustering.&lt;br /&gt;In conclusion, the paper offers an interesting approach of applying graph layouts of molecular networks to visualize relationships among compounds and the hierarchical structure of a library. The studies were performed in the open-source Cytoscape software, and a plugin was developed to integrate MarvinView for visualizing structures (see the supporting information). &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-112181367780030846?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/112181367780030846/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=112181367780030846' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/112181367780030846'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/112181367780030846'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/11/finding-key-members-in-compound.html' title='Finding Key Members in Compound Libraries by Analyzing Networks of Molecules Assembled by Structural-Similarity'/><author><name>Stefan Furrer</name><uri>http://www.blogger.com/profile/05065895922703854994</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-3699651971230610464</id><published>2009-10-21T06:27:00.000-07:00</published><updated>2009-11-04T06:45:54.706-08:00</updated><title type='text'>Combining docking with pharmacophore filtering for improved virtual screening</title><content type='html'>Authors: Megan L Peach and Marc C Nicklaus&lt;br /&gt;Source: Journal of Cheminformatics 2009, 1:6 doi:10.1186/1758-2946-1-6&lt;br /&gt;&lt;br /&gt;The main idea of this article is to evaluate "pharmacophore filtering" as a method for post-processing docking results (ranking molecules) in virtual screening.&lt;br /&gt;&lt;br /&gt;One of the main issues in virtual screening is that it still gives too many false positives. Virtual screening aims to select active molecules from a large database, but molecules predicted to be active are often not active in real experiments.  According to the article, there are two explanations for the shortcomings of docking-based virtual screening. One is that docking programs are not accurate enough to generate correct binding poses, and the other is that the scoring functions fail to rank the molecules obtained from docking correctly.&lt;br /&gt;&lt;br /&gt;The authors evaluated the "pharmacophore filtering" method and compared it with scoring functions as a post-processing step.  First, they performed docking with Gold and Glide and combined all the resulting poses into one structure file. Second, they post-processed the poses with both the scoring functions and "pharmacophore filtering".  The pharmacophore models for "pharmacophore filtering" were determined in MOE. GScore in Glide, GoldScore and ChemScore in Gold, and the Affinity score were used as scoring functions.  Three target proteins with different binding characteristics were used for the evaluation: Neuraminidase A, CDK2, and the PKC C1 domain. 
The authors also tested the performance of the "pharmacophore filtering" method itself with different pharmacophore models, using active molecules and decoys.  &lt;br /&gt;&lt;br /&gt;In the comparison with the scoring functions, "pharmacophore filtering" gave better results for all three target proteins. In addition, the authors argued that visual inspection and manual intervention in filtering are essential in docking and virtual screening.&lt;br /&gt;&lt;br /&gt;In conclusion, the "pharmacophore filtering" method improved the discrimination of active molecules in a large database. The authors also pointed out that docking served as a conformation generator, in place of the pre-generated conformer database used in conventional pharmacophore-based virtual screening. In addition, using pharmacophore filtering as a post-processing step has the advantage of reducing the complexity of the docking process, since it bypasses several known difficulties of scoring functions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-3699651971230610464?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/3699651971230610464/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=3699651971230610464' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3699651971230610464'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3699651971230610464'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/combining-docking-with-pharmacophore.html' title='Combining docking with pharmacophore filtering for improved virtual screening'/><author><name>JaeHong 
Shin</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://3.bp.blogspot.com/_IJCpekjdhLQ/SwQl3BHImII/AAAAAAAADJQ/gDR7Zr_kiTo/s1600-R/profile_picture-full%3Binit:.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-6163840171080133594</id><published>2009-10-20T19:55:00.000-07:00</published><updated>2009-10-20T19:57:30.561-07:00</updated><title type='text'>Alpha Shapes Applied to Molecular Shape Characterization Exhibit Novel Properties Compared to Established Shape Descriptors</title><content type='html'>&lt;span class="Apple-style-span" style="font-family: Helvetica; font-size: medium; "&gt;&lt;p class="MsoNormal"&gt;&lt;b&gt;Authors:&lt;/b&gt; J. Anthony Wilson, Andreas Bender, Taner Kaya and Paul A. Clemons&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b&gt;Source:&lt;/b&gt; J. Chem. Inf. Model., Article ASAP, September 23, 2009&lt;/p&gt;&lt;p class="MsoNormal" style="text-align: justify;"&gt;Even though multiple molecular shape descriptions have been devised over the past 20 years, describing molecular shape is still largely an unsolved problem. From early descriptions such as CoMFA, through the more recent ROCS (Rapid Overlay of Chemical Structures), to “ultrafast shape recognition” (USR), none of these descriptions of molecular shape has been ideal, in terms of either predictive power or ease of handling. This is in spite of efforts to create hybrid descriptors, such as MACCS/USR, a hybrid of the conventional MACCS keys and the 3D USR descriptor, which outperformed the pure USR descriptors in a series of virtual screening experiments.&lt;/p&gt;&lt;p class="MsoNormal" style="text-align: justify;"&gt;The authors therefore propose a novel application of alpha shapes to the description of small molecules. 
Alpha shapes are parametrized (by alpha) generalizations of the convex hull. As alpha approaches infinity, the alpha shape becomes identical to the convex hull. As alpha decreases, on the other hand, the shape shrinks and develops dents and voids, and when alpha corresponds to the spheres of a space-filling model, the alpha shape is said to be the geometric dual of the space-filling model. Applications of alpha shapes include the visualization of irregular shape boundaries of clusters in 3D, the characterization of simple properties of Brownian motion and the study of protein structures, pockets, surface area and packing.&lt;/p&gt;&lt;p class="MsoNormal" style="text-align: justify;"&gt;The behavior of alpha shapes for the description of molecular shape was investigated using 3 sets of compounds: multiple conformers of structural isomers of octane, a collection of 388 biologically active compounds and a diverse set of 22,831 compounds from ChemBank. The descriptors for comparison were calculated using commercially available software: the normalized principal moments-of-inertia (PMI) ratio method, Ultrafast Shape Recognition (USR), functional class fingerprints (FCFP6, Pipeline Pilot) and descriptors from MOE.&lt;/p&gt;&lt;p class="MsoNormal" style="text-align: justify;"&gt;After the alpha shape indices were calculated from the 3D coordinates in the SDF file and the facets of the surface were resolved for handedness, the Euclidean distance between facets was calculated as the distance between facet centroids. Together with the angle between the facet normals, this was used to generate a joint probability function. A method termed Earth-Mover’s distance was used to assess the similarity/dissimilarity among the joint probability functions. 
These distances were then used for a pairwise comparison with the calculated reference descriptors.&lt;/p&gt;&lt;p class="MsoNormal" style="text-align: justify;"&gt;In general, the method is sensitive enough to resolve constitutional isomers and enantiomers. Even though the proposed method does share information with existing descriptors, the authors also found considerable disagreement for many pairs of molecules, indicating that previously neglected shape information is captured. The proposed method is nearly size independent and more sensitive to changes in the overall shape than to the number of atoms. Future optimization will involve hybrid descriptors, combining the alpha shape score with other descriptors that determine molecular size. The authors also plan to investigate how such a method could be applied to the characterization of protein properties, such as the shape of potential binding sites.&lt;/p&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-6163840171080133594?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/6163840171080133594/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=6163840171080133594' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6163840171080133594'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6163840171080133594'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/alpha-shapes-applied-to-molecular-shape.html' title='Alpha Shapes Applied to Molecular Shape Characterization Exhibit Novel Properties Compared to Established Shape 
Descriptors'/><author><name>Stefan Furrer</name><uri>http://www.blogger.com/profile/05065895922703854994</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2961285774901321604</id><published>2009-10-20T15:48:00.000-07:00</published><updated>2009-10-21T07:49:19.459-07:00</updated><title type='text'>Molecular similarity including chirality</title><content type='html'>&lt;div align="justify"&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-size:78%;"&gt;By M. Stuart Armstrong, Garrett M. Morris, Paul W. Finn, Raman Sharma, W. Graham Richards&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;This paper discusses the usefulness of molecular similarity in computer-aided drug discovery, and highlights a recently developed non-superpositional approach, &lt;em&gt;USR (Ultra-fast Shape Recognition)&lt;/em&gt;. The paper introduces a novel method, &lt;em&gt;CSR (Chiral Shape Recognition)&lt;/em&gt;, which positions the centroids in such a way that it clearly distinguishes between &lt;a href="http://en.wikipedia.org/wiki/Enantiomers"&gt;enantiomers&lt;/a&gt;. The individual and combined performance of the methods, a comparison of their computational complexity, and their generalization are also discussed in brief.&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;Molecular similarity has been used to explore large databases of molecular structures to find compounds that resemble a given natural product, and also to search for compounds of similar activity without the side effects, based on the idea that structurally similar compounds possess similar biological properties. A wide variety of algorithms have been constructed, including approaches comparing molecules at the 2D (chemical connectivity) and 3D (shape) levels. 
At the 3D level, which is of interest here, the authors present three distinct methods, namely Superposition, USR and CSR.&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;em&gt;&lt;u&gt;Superposition&lt;/u&gt;&lt;/em&gt;: The molecules being compared are structurally aligned and superposed in a common coordinate framework. The difficulties lie in finding the global optimum in the search space and in the very high computational requirements, which make the approach quite slow.&lt;/div&gt;&lt;div align="justify"&gt;&lt;em&gt;&lt;u&gt;USR&lt;/u&gt;&lt;/em&gt;: (1) Assigning four centroids to the molecule, (2) Computing distance distributions from the centroids, (3) Computing moment information from the distributions, (4) Comparing molecules by determining the distance between the resulting twelve-dimensional vectors. However, USR cannot distinguish between enantiomers.&lt;/div&gt;&lt;div align="justify"&gt;&lt;em&gt;&lt;u&gt;CSR&lt;/u&gt;&lt;/em&gt;: (1) Assigning three centroids such that the first is the geometric centre and the rest are the furthest from the previous ones, (2) Assigning the fourth centroid from the cross product (direction and magnitude) of the vectors defined by the centroids already assigned, (3) When the molecule is replaced with its mirror image (via the central symmetry, up to translations and rotations), all the atom positions are inverted but the position of the fourth centroid is not, and hence the fourth distance distribution differs for a given molecule and its enantiomer.&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;The inference is that both USR and CSR are most useful for ranking and filtering molecules. 
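&lt;br /&gt;&lt;br /&gt;The four USR steps listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code: the function names are made up for this example, the four reference points are taken to be the centroid, the atom closest to it, the atom furthest from it, and the atom furthest from that one, the three moments are taken to be the mean, the standard deviation and the cube root of the third central moment (implementations differ on the exact definitions), and the similarity is assumed to be the inverse of one plus the mean absolute difference of the twelve values. The coordinates are arbitrary toy values.&lt;br /&gt;
&lt;pre&gt;
```python
import math


def centroid(coords):
    """Geometric centre of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def moments(values):
    """Mean, standard deviation and signed cube root of the third central moment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return [mean, math.sqrt(var), math.copysign(abs(third) ** (1.0 / 3.0), third)]


def usr_descriptor(coords):
    """Twelve numbers: three moments of the distances to each of four reference points."""
    ctd = centroid(coords)
    d_ctd = [math.dist(ctd, c) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # atom closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # atom furthest from the centroid
    d_fct = [math.dist(fct, c) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # atom furthest from fct
    descriptor = []
    for ref in (ctd, cst, fct, ftf):
        descriptor.extend(moments([math.dist(ref, c) for c in coords]))
    return descriptor


def usr_similarity(desc_a, desc_b):
    """Score in (0, 1]; 1 means identical descriptors."""
    diff = sum(abs(a - b) for a, b in zip(desc_a, desc_b))
    return 1.0 / (1.0 + diff / len(desc_a))


# Arbitrary toy coordinates (not a real molecule) and their mirror image.
molecule = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.2, 0.0),
            (0.0, 0.0, 0.9), (-1.1, -0.7, -0.4)]
mirror = [(-x, y, z) for x, y, z in molecule]
desc = usr_descriptor(molecule)
mirror_desc = usr_descriptor(mirror)
same = usr_similarity(desc, mirror_desc)
```
&lt;/pre&gt;
Because mirroring preserves every interatomic distance, the mirrored coordinates give the same twelve numbers and the score stays at 1, which is the limitation that CSR addresses with its cross-product centroid.&lt;br /&gt;&lt;br /&gt;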
Compared with USR, CSR requires only the additional computation and renormalisation of a single cross product, and hence the two should be considered to be of roughly comparable speed.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2961285774901321604?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2961285774901321604/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2961285774901321604' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2961285774901321604'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2961285774901321604'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/molecular-similarity-including.html' title='Molecular similarity including chirality'/><author><name>Rohan Patil</name><uri>http://www.blogger.com/profile/15299634532963258412</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-KgOOWFrKiIA/TWfZi2IQXWI/AAAAAAAACXI/sgFoH2sDGsM/s220/rspatil_l.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-1556445184117366506</id><published>2009-10-20T12:10:00.001-07:00</published><updated>2009-10-20T12:10:29.337-07:00</updated><title type='text'>Review: Kano, Y. et al. U-Compare: share and compare text mining tools with UIMA.</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;This is a short application note published in Oxford Bioinformatics. 
The paper describes U-Compare, a system based on UIMA (Unstructured Information Management Architecture) that offers out-of-the-box text mining tools and evaluation platforms. The U-Compare workflow management tool allows easy creation of workflows by drag-and-drop. It also provides a special parallel flow component, with which workflows can be compared. Evaluation is a common task in text mining research, and U-Compare runs the evaluation process automatically after a workflow has been executed. Results can also be visualized using various visualization components, including some complex ones. In addition, U-Compare provides programming interfaces (APIs) that give developers access to workflows via I/O streams. Developers' native tools can be integrated into a U-Compare workflow via the APIs. &lt;br /&gt;&lt;br /&gt;The paper itself is not very interesting, yet I found that U-Compare and UIMA could be very good tools for bio/chem-related text mining projects. Right now our research on text mining is project based and lacks a good platform for everyone to work on. I have tried other scientific workflow tools such as Taverna, but because those tools are not aimed at text mining research, the barrier to using them is high. 
I am planning to install UIMA and U-Compare, together with other UIMA components, locally to facilitate our research in text mining.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-1556445184117366506?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/1556445184117366506/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=1556445184117366506' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1556445184117366506'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1556445184117366506'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/review-kano-y-et-al-u-compare-share-and.html' title='Review: Kano, Y. et al. U-Compare: share and compare text mining tools with UIMA.'/><author><name>djiao</name><uri>http://www.blogger.com/profile/06486651456900333839</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-221698097964859294</id><published>2009-10-20T07:28:00.000-07:00</published><updated>2009-10-20T07:35:35.856-07:00</updated><title type='text'>Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical–gene–disease networks.</title><content type='html'>Authors: Allan Peter Davis, Cynthia G. Murphy, Cynthia A. Saraceni-Richards, Michael C. Rosenstein, Thomas C. Wiegers and Carolyn J. Mattingly&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In this paper, the authors discuss the Comparative Toxicogenomics Database (CTD). 
The authors have developed the CTD (&lt;a href="http://ctd.mdibl.org/"&gt;http://ctd.mdibl.org/&lt;/a&gt;) as a unique tool to provide connections between chemicals, genes/proteins and diseases and to provide the basis for testable hypotheses about the mechanisms underlying the etiology of environmental diseases. CTD is a curated database that promotes understanding of the effects of environmental chemicals on human health. Biocurators at CTD manually curate chemical-gene interactions, chemical-disease relationships and gene-disease relationships from the literature.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;CTD is distinct from other databases in three ways: (1) it focuses on environmental chemicals; (2) it integrates curated and imported data, allowing users to explore connections between chemicals, genes, and diseases; and (3) it functions not only as a repository for information, but also as a resource for generating novel hypotheses about environmental diseases and chemical actions. Environmental chemicals can affect genes in multiple ways, including mutagenesis, altered methylation, physical interaction and influence on gene expression or protein function. CTD therefore focuses its manual curation effort on environmental chemicals (e.g. arsenic, heavy metals and dioxins), how those chemicals interact with genes or proteins in different species and how they relate to human diseases. CTD biocurators capture three types of core data from the literature: chemical-gene interactions, chemical-disease relationships and gene-disease relationships. These data are curated in a structured format using controlled vocabularies and are integrated to establish a triad of chemicals, genes and diseases. CTD enhances its core data pages with links to external resources. A powerful feature of CTD is the integration of curated chemical, gene and disease core data from the literature to generate new, putative discoveries. 
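&lt;br /&gt;&lt;br /&gt;One way to picture how such integration can yield putative links is as a join of curated chemical-gene pairs with curated gene-disease pairs on the shared gene. The Python sketch below is only an illustration under that assumption, not CTD's actual implementation: the function name infer_chemical_disease is made up, and the records are placeholder labels rather than real CTD data.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import defaultdict


def infer_chemical_disease(chemical_gene, gene_disease):
    """Join the two curated relations on the shared gene.

    Returns a dict mapping (chemical, disease) to the sorted list of genes
    that support the inferred relationship.
    """
    diseases_by_gene = defaultdict(set)
    for gene, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    support = defaultdict(set)
    for chemical, gene in chemical_gene:
        for disease in diseases_by_gene.get(gene, ()):
            support[(chemical, disease)].add(gene)
    return {pair: sorted(genes) for pair, genes in support.items()}


# Hypothetical placeholder records, not real CTD data.
chemical_gene = [
    ("chemical_A", "gene_B"),
    ("chemical_A", "gene_D"),
    ("chemical_X", "gene_Y"),
]
gene_disease = [
    ("gene_B", "disease_C"),
    ("gene_D", "disease_C"),
    ("gene_Z", "disease_W"),
]
inferred = infer_chemical_disease(chemical_gene, gene_disease)
```
&lt;/pre&gt;
Here the only inferred pair is chemical_A with disease_C, supported by gene_B and gene_D. Keeping the supporting genes alongside each inferred pair makes it possible to see why a link was proposed and how many independent genes back it.&lt;br /&gt;&lt;br /&gt;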
For example, if chemical A interacts with gene B (via a curated chemical-gene interaction) and, independently, gene B is associated with disease C (via a curated gene-disease relationship), then chemical A may have a relationship with disease C (inferred via gene B). This approach was recently supported by an analysis of the CTD arsenic data set, in which CTD correctly predicted types of diseases that may be associated with arsenic exposure, as well as a set of genes that may be involved in modulating arsenic-related diseases such as lung cancer and diabetes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Future development of CTD will aim to further expand the depth of its curated data and enhance the data query and visualization capabilities.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-221698097964859294?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/221698097964859294/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=221698097964859294' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/221698097964859294'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/221698097964859294'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/comparative-toxicogenomics-database.html' title='Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical–gene–disease networks.'/><author><name>Kuochung Peng</name><uri>http://www.blogger.com/profile/04963167070928354912</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-1930271142432334032</id><published>2009-10-19T17:04:00.000-07:00</published><updated>2009-10-19T17:21:35.162-07:00</updated><title type='text'>Bio2RDF: Towards a mashup to build bioinformatics knowledge systems</title><content type='html'>&lt;div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;Authors: François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault and Jean Morissette&lt;br /&gt;&lt;a href="http://dx.doi.org/10.1016/j.jbi.2008.03.004"&gt;full article&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The Semantic Web is an evolving development of the World Wide Web in which the meaning (semantics) of information and services on the web is defined, making it possible for the web to understand and satisfy the requests of people and machines to use the web content [from Wikipedia]. Instead of HTML, the Semantic Web proposes using a machine-readable format to present web content. The Resource Description Framework (RDF) is a general format for the conceptual description or modeling of information that is implemented in web resources. Ideally, any resource in the world can be represented as subject, predicate and object. This paper made the first endeavor to convert all bioinformatics-related data into RDF and demonstrated the power of semantics in data integration.&lt;br /&gt;&lt;br /&gt;Here is a description of the RDF representation. 
To state that the primary topic of the Wikipedia page is a "Person" whose name is &lt;a href="http://3.bp.blogspot.com/_jCRs95ht3Qc/St0CcPZDDtI/AAAAAAAAAAU/L-o8RRJ5bMQ/s1600-h/rdf.bmp"&gt;&lt;img style="MARGIN: 0px 10px 10px 0px; WIDTH: 320px; FLOAT: left; HEIGHT: 94px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5394470612850904786" border="0" alt="" src="http://3.bp.blogspot.com/_jCRs95ht3Qc/St0CcPZDDtI/AAAAAAAAAAU/L-o8RRJ5bMQ/s320/rdf.bmp" /&gt;&lt;/a&gt;"Tony Benn", the RDF can be written as:&lt;pre&gt;&amp;lt;rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn"&amp;gt;
  &amp;lt;foaf:primaryTopic&amp;gt;
    &amp;lt;foaf:Person&amp;gt;
      &amp;lt;foaf:name&amp;gt;Tony Benn&amp;lt;/foaf:name&amp;gt;
    &amp;lt;/foaf:Person&amp;gt;
  &amp;lt;/foaf:primaryTopic&amp;gt;
&amp;lt;/rdf:Description&amp;gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;The tags indicate the meaning of every element in the file in a way that a computer can recognize. For example, the computer can understand that Tony Benn is the name of a person, since it is annotated by foaf:name. The URI &lt;a href="http://en.wikipedia.org/wiki/Tony_Benn"&gt;http://en.wikipedia.org/wiki/Tony_Benn&lt;/a&gt; is the uniform resource identifier of Tony_Benn. As in the above example, Bio2RDF has built a system to convert all the biological data into RDF. It begins by building a list of namespaces (URIs) for the different data providers; it then retrieves the data from each provider and uses a parser to convert them into RDF automatically. 
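&lt;br /&gt;&lt;br /&gt;The conversion step described above can be pictured with a small Python sketch. It is a sketch under stated assumptions rather than Bio2RDF's actual code: the base URI http://example.org/, the namespace:id pattern, the vocabulary# prefix, the function names and the record itself are all hypothetical and chosen only for illustration.&lt;br /&gt;
&lt;pre&gt;
```python
def record_to_triples(namespace, record, base="http://example.org/"):
    """Turn a flat record (a dict with an "id" key) into subject-predicate-object triples."""
    subject = base + namespace + ":" + str(record["id"])
    triples = []
    for field, value in sorted(record.items()):
        if field == "id":
            continue
        predicate = base + "vocabulary#" + field
        triples.append((subject, predicate, value))
    return triples


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Hypothetical record, as a parser might produce it from a provider's flat file.
record = {"id": "0001", "label": "example gene", "organism": "example organism"}
triples = record_to_triples("gene", record)
labels = match(triples, predicate="http://example.org/vocabulary#label")
```
&lt;/pre&gt;
Once every provider's records share one triple shape and one URI scheme, a pattern query such as match can run across all of them at once; this is, in miniature, the role SPARQL plays over the integrated RDF.&lt;br /&gt;&lt;br /&gt;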
The data are scattered over the web in different formats, such as text files, XML or relational databases. The paper converted many bioinformatics databases, such as KEGG, PDB, MGI, HGNC and several of NCBI’s databases, and deposited the RDF on a local server. Finally, it applies SPARQL to search the integrated RDF source in order to answer specific questions in the study of Parkinson's disease. For example, which GO terms describe the four genes of interest (Rxr, Nurr1, Nur77, and Nor-1)? The results are shown in a dedicated semantic web browser. Ideally, providers like KEGG and PDB would publish their data as RDF themselves; no conversion effort would then be needed and the question could be answered directly after integration.&lt;br /&gt;&lt;br /&gt;Given a question, we generally have to turn to the web using a search engine like Google, or use a specialized database following its instructions. If we would like to mine or link the data, we have to download them into a local relational database. This is a very time-consuming process, particularly as recent advances in data generation in the biological domain confront us with more and more data sources. Meanwhile, a large amount of chemical data has also emerged in recent years; however, the integration of chemical data is still not well developed. First, this is because of the nature of chemicals themselves: no universal chemical identifier has been agreed upon. Currently we have CID, CAS number, InChI and so on. Second, no advanced knowledge representation has been developed to present the data efficiently. 
The Semantic Web might be a promising solution.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-1930271142432334032?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/1930271142432334032/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=1930271142432334032' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1930271142432334032'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1930271142432334032'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/bio2rdf-towards-mashup-to-build.html' title='Bio2RDF: Towards a mashup to build bioinformatics knowledge systems'/><author><name>Bin Chen</name><uri>http://www.blogger.com/profile/03736801389411661562</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_jCRs95ht3Qc/St0CcPZDDtI/AAAAAAAAAAU/L-o8RRJ5bMQ/s72-c/rdf.bmp' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-7537762647533538289</id><published>2009-10-11T23:41:00.000-07:00</published><updated>2009-10-19T23:49:06.541-07:00</updated><title type='text'>Review: Molecular dynamics simulation of S100B protein to explore ligand blockage of the interaction with p53 protein by Zhigang Zhou and Yumin Li</title><content type='html'>When p53 binds to S100B, its biological function as a tumor 
suppressor is disabled. In this paper, the authors simulate ligand binding at the interface of S100B with p53 using molecular dynamics. They found some ligands with the potential to block the S100-p53 interaction and so keep p53 functioning. Some notable molecular modeling methods used here are:&lt;br /&gt;&lt;br /&gt;1. Minimization to relax the added hydrogens (OPLS-AA force field, Glide protocol) when preparing protein coordinates from the S100B-p53 complex.&lt;br /&gt;2. Docking to screen a chemical library.&lt;br /&gt;3. An empirical score, GlideScore in particular, for selecting compounds.&lt;br /&gt;4. Molecular dynamics simulation (AMBER8) to explore the blocking potential.&lt;br /&gt;&lt;br /&gt;I don’t have much experience in molecular modeling. This paper details the methods and steps of the study, and the discussion sounds reasonable to me. The binding free energy calculations, carried out in AMBER using the MM-PBSA (Molecular Mechanics-Poisson Boltzmann Surface Area) approach, found that a few selected compounds pushed the two proteins apart compared with the unliganded complex, which sounds promising. 
Further study is needed to verify the findings.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-7537762647533538289?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/7537762647533538289/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=7537762647533538289' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/7537762647533538289'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/7537762647533538289'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/review-molecular-dynamics-simulation-of.html' title='Review: Molecular dynamics simulation of S100B protein to explore ligand blockage of the interaction with p53 protein by Zhigang Zhou and Yumin Li'/><author><name>Yi</name><uri>http://www.blogger.com/profile/01986022360906099290</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2700628482389586594</id><published>2009-10-06T21:14:00.000-07:00</published><updated>2009-10-07T07:13:30.831-07:00</updated><title type='text'>Exploiting Ordered Waters in Molecular Docking</title><content type='html'>Author: Niu Huang and Brian K. Shoichet&lt;br /&gt;Source: J. Med. Chem. 2008, 51, 4862–4865&lt;br /&gt;&lt;br /&gt;In this article, the authors investigated the importance of water-mediated protein-ligand interactions in docking studies. 
They tested all possible configurations of water molecules, with each water switched between a displaceable and a retained state, on 24 target proteins from the DUD database (http://dud.docking.org/). According to the paper, although water molecules play an important role in protein-ligand recognition, it is rarely clear which water molecules should be kept fixed and which should be displaceable in a protein-ligand docking study. The authors evaluated the docking screens in several ways. 1) They assessed the quality of docking screens against DUD decoys using the enrichment factor. 2) They measured the improvement in docking screens by comparing runs with no water molecules against runs with displaceable water molecules. 3) They also evaluated the computational cost, since the number of configurations grows exponentially with the number of water molecules.&lt;br /&gt;&lt;br /&gt;In the results, 12 out of 24 targets show substantial improvement in docking screens when water displacement is treated. 11 proteins are largely unaffected; for 7 of these 11, the enrichment factors were already high enough without water molecules. As for the computation time, it increases linearly with the number of water molecules rather than exponentially. For this reason, the approach is practical to apply.&lt;br /&gt;&lt;br /&gt;In conclusion, treating water displacement can improve docking screens, and the improvement is substantial in this study, although decoys also tend to show some improvement. 
One intriguing finding is that only a few water configurations dominate in ligand-protein docking, even though several ligand-water configurations exist. The authors suggest that modeling ordered and displaceable water molecules can sharpen our understanding of ligand specificity without greatly affecting decoy molecules.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2700628482389586594?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2700628482389586594/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2700628482389586594' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2700628482389586594'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2700628482389586594'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/exploiting-ordered-waters-in-molecular.html' title='Exploiting Ordered Waters in Molecular Docking'/><author><name>JaeHong Shin</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://3.bp.blogspot.com/_IJCpekjdhLQ/SwQl3BHImII/AAAAAAAADJQ/gDR7Zr_kiTo/s1600-R/profile_picture-full%3Binit:.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-8839339456102291601</id><published>2009-10-06T18:25:00.000-07:00</published><updated>2009-10-06T19:07:14.974-07:00</updated><title type='text'>Fragment-based drug discovery</title><content type='html'>&lt;div align="left"&gt;Author: Wendy A. 
Warr&lt;br /&gt;Source: Journal of Computer-Aided Molecular Design (2009) 23:453–458 DOI 10.1007/s10822-009-9292-1&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;br /&gt;This paper presents information about fragment-based drug discovery (FBDD) that the author received from Rob Hubbard of Vernalis and Mark Murcko of Vertex. The paper contains a list of selected FBDD companies such as Vernalis. It states that the low cost of market entry has led to the formation of many specialized companies, and that a few universities and research institutes have also adopted FBDD. The enthusiasm, abilities and requirements of big pharma companies are discussed. The screening methods for fragments, the examination of novel targets, the optimization of hits and, finally, future prospects are also covered briefly. &lt;/div&gt;&lt;div align="justify"&gt;&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;NMR and X-ray crystallography have been used for screening fragments. Both provide structural information about the binding site of a hit. Surface plasmon resonance (SPR) is a relatively recent technology for distinguishing good hits from ones that bind nonspecifically. SPR has the advantage of providing quantitative dynamic data on the binding interaction, such as binding constants, which complement the structural information from X-ray and NMR screens. Some companies have also used thermal methods (isothermal titration calorimetry or protein thermal unfolding), mass spectrometry (MS) and high-concentration bioassays. In some cases, however, the use of multiple methods will probably be necessary.&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;/div&gt;&lt;div align="justify"&gt;FBDD approaches enable the efficient development of novel leads because these technologies are design intensive rather than resource intensive, as compared to HTS. 
HTS sometimes fails to find hits that interfere with new targets, such as protein–protein interaction surfaces. FBDD can also find new binding modes and allosteric sites. The only problem of FBDD lies in ligand specificity: fragment-based approaches identify and characterize only “hot spots”, i.e. the regions of a protein surface that are major contributors to the ligand binding free energy. Unfortunately, many regions of the active site that are responsible for target specificity and/or selectivity are not included in these “hot spots”. The paper states that a fragment screen provides a rapid and reliable means of interrogating a protein target for druggability before investing in further discovery research. Structure-based drug design (SBDD) can then be used to optimize the fragment. Once a hit has been found, it is optimized into a lead by one of three approaches: linking (to find second-site fragments), growing (by structure-guided medicinal chemistry or by using the fragment binding motif to search for similar compounds that can be purchased), or merging. Efficiency and potency are the criteria used in optimization, but we also need to consider lipophilicity, polarity, charge, stability, etc.&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;/div&gt;&lt;div align="justify"&gt;The big pharmaceutical companies, which used to be critical, are now bringing fragment-based approaches to the fore; they have the required skill sets, and the work requires researchers from different disciplines. They are countering some problems by partnering with smaller biotechs. The complementary use of HTS and FBDD is a common theme in the larger companies. 
Here the author also gives the opinions of a few people from the FBDD research sector.&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;/div&gt;&lt;div align="justify"&gt;Looking ahead, there are many opportunities for further development: improving the novelty, structural diversity and physicochemical properties of fragment libraries; increasing the efficiency of fragment optimization, where there is plenty of scope; and further developments in modeling and computational chemistry, including methods for predicting binding modes and the examination of entropy and desolvation. One challenge is deciding which fragments to progress without relying only on the subjective decisions of a medicinal chemist. Most of the publications and presentations on FBDD are remarkably positive, but many are produced by advocates of the technology or by authors with a business bias.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-8839339456102291601?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/8839339456102291601/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=8839339456102291601' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8839339456102291601'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8839339456102291601'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/fragment-based-drug-discovery.html' title='Fragment-based drug discovery'/><author><name>Rohan Patil</name><uri>http://www.blogger.com/profile/15299634532963258412</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' 
height='32' src='http://1.bp.blogspot.com/-KgOOWFrKiIA/TWfZi2IQXWI/AAAAAAAACXI/sgFoH2sDGsM/s220/rspatil_l.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-1352407593570603947</id><published>2009-10-06T16:12:00.000-07:00</published><updated>2009-10-06T16:14:27.711-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='oi:10.1016/j.drudis.2009.09.001'/><category scheme='http://www.blogger.com/atom/ns#' term='Drug Discovery Today (2008)'/><title type='text'>Title: Open Innovation: share or die...</title><content type='html'>Author: Patrice Talaga&lt;br /&gt;&lt;br /&gt;The overall productivity gap of the pharmaceutical industry can be attributed to numerous factors, including: the drying up of the spring of “low-hanging fruit” (mainly in terms of targets); technologies (e.g. HTS) and methodologies (e.g. combinatorial chemistry) that have not delivered as originally expected; poor clinical validation of genomic targets; excessive reductionism in our drug discovery approaches (e.g. a focus on single targets to cure complex disorders such as Alzheimer’s or Parkinson’s disease); the inability to demonstrate significant efficacy; and poorly innovative management of an ever more complex drug discovery process.&lt;br /&gt;&lt;br /&gt;Moreover, increasing R&amp;D costs, decreasing revenues due to patent expirations, the impact of generic competition, reimbursement driven by medical and economic outcomes, higher regulatory hurdles and a rapidly evolving standard of care have forced drug discovery organizations to adopt new strategies and business models to improve the productivity and innovation of R&amp;D.&lt;br /&gt;&lt;br /&gt;The creation of mega-organizations will, in any case, not help to solve the problem of the so-called innovation gap. But what is actually needed? I guess innovative products rather than bigger and bigger companies. 
Biotechs are currently considered to be a major source of innovative products. Integrating a Biotech into a Pharma company may not be the best way to generate a successful outcome for both partners. On the contrary, by collaborating while maintaining autonomy, each partner could make the best use of its specific strengths: the innovation and fast decision-making of the Biotech; the financial resources and drug development experience of the Pharma.&lt;br /&gt;&lt;br /&gt;The future of Research and early Development will have to rely on new business models that emphasize capital efficiency, lean infrastructure, flexibility, financial risk and reward sharing, and Open Innovation (OI) as well. Open Innovation has been defined as a paradigm that assumes companies should use external as well as internal ideas and paths to market to bring innovative technologies and products to market, via a spectrum of traditional and new business models, e.g. licensing in or out, spinning in or out, joint venturing, setting up Pharma-Academic consortia, etc. [6]. More recently OI, in the field of drug discovery, has been increasingly implemented through “corporate venturing”, equity investments in university spinoffs, or governmental funding through Public Private Partnerships (PPPs).&lt;br /&gt;&lt;br /&gt;OI is based on networking, and a number of solution providers offer relevant tools to help partners increase their business connections. Examples include InnoCentive (www.innocentive.com), NineSigma (www.ninesigma.com) and YourEncore (www.yourencore.com). These open networks, harnessing the collective talent accessible through the Internet, enable organizations to search out the most appropriate partners for their projects more rapidly. 
For example, the business networking platform LinkedIn (www.linkedin.com) hosts an interesting network that allows more than 700 members to share their ideas and thoughts about OI.&lt;br /&gt;&lt;br /&gt;Lilly’s open collaboration platform “PD2” (Phenotypic Drug Discovery initiative) uses Lilly’s disease-state assays to evaluate the therapeutic potential of compounds coming from universities, Biotechs or CROs (www.pd2.lilly.com). Lilly provides the partners with a complete biological profile of the compounds tested in four phenotypic screening assays. In return for the data, Lilly has first rights to exclusively negotiate a collaboration with the submitters of compounds that demonstrate interesting biological activities.&lt;br /&gt;&lt;br /&gt;Transitioning to OI is not an easy task. OI is still in its infancy as far as the drug discovery business is concerned, and thus there is no real proof of benefits today. Collaborations among industry, biotechs, academia, CROs and drug discovery centers will continue to be of utmost importance in the discovery &amp; development of breakthrough therapies. OI alliances should be managed in a similar way to how Pharma manages its project portfolio, so that OI can be implemented successfully. 
The global integration of breakthrough ideas, knowledge and expertise from world-wide sources via Open Innovation will contribute significantly to solving the “innovation gap” issue, and thus enable the launch of breakthrough medicines that will be of real benefit to the entire healthcare system.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-1352407593570603947?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/1352407593570603947/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=1352407593570603947' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1352407593570603947'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1352407593570603947'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/title-open-innovation-share-or-die.html' title='Title: Open Innovation: share or die...'/><author><name>Hari</name><uri>http://www.blogger.com/profile/06039523722658462025</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-1220753747850415875</id><published>2009-10-06T15:36:00.000-07:00</published><updated>2009-10-06T15:43:46.662-07:00</updated><title type='text'>Making medicinal chemistry more effective-application of Lean Sigma to improve processes, speed and quality</title><content type='html'>Authors: Shalini Andersson, Alan Armstrong, Annika Björe, Sue Bowker, Steve Chapman, Rob Davies, Craig Donald, Bryan Egner, Thomas Elebring, Sara 
Holmqvist, Tord Inghardt, Petra Johannesson, Magnus Johansson, Craig Johnstone, Paul Kemmitt, Jan Kihlberg, Pernilla Korsgren, Malin Lemurell, Jane Moore, Jonas A. Pettersson, Helen Pointon, Fritiof Pontén, Paul Schofield, Nidhal Selmi and Paul Whittamore&lt;br /&gt;&lt;br /&gt;Examining the medicinal chemistry contributions to the iterative improvement process of design-make-test-analyse (DMTA) from a Lean Sigma perspective revealed that major improvements could be made. As a result, the cycle times of synthesis, as well as of compound analysis and purification, were reduced dramatically. The application of Lean Sigma in pharmaceutical R&amp;amp;D (research and development) focuses on the identification of common processes. In a drug discovery context, Lean Sigma also focuses on reducing variation and defects in these processes. Because Lean Sigma provides a structured and data-driven way to make improvements, it is well suited to the highly scientific environment of R&amp;amp;D, and usually leads to high engagement of co-workers.&lt;br /&gt;&lt;br /&gt;In this article, the authors examine the lead optimization (LO) process at a high level. LO can be described as consisting of two separate subphases, the first being the iterative process of improving lead compounds through the DMTA cycle. Once a quality compound has been identified in this iterative phase, it is assessed in more advanced models to identify any risks before proceeding to clinical development. How can the contribution of medicinal chemistry to the DMTA cycle be made more effective in terms of speed and quality? The authors give their views on each step of the DMTA cycle. 
Testing: improvements in high-throughput screening (HTS) technologies and reductions in cost per test have been delivered across the pharmaceutical industry. Analysis of data: we are now better able to handle large volumes of data. The remaining two steps are the design of new compounds and ‘make’, the synthesis of compounds.&lt;br /&gt;&lt;br /&gt;Within ‘make’, that is, the synthesis of novel compounds for biological evaluation, the main steps are: (1) deciding on the route for synthesis of the target compound, (2) ordering and assembling reagents and starting materials, (3) carrying out the synthetic sequence, and (4) final purification and analysis of the product. For example, historical synthesis lead times, recorded before the Lean Sigma analysis, often reached 35 working days or more. After implementing improvements that controlled the chemists' work in progress and increased teamwork, the synthesis lead times were reduced significantly: most synthetic targets were completed in less than 15 working days, and very few required more than 30 days. 
The authors also discuss other topics, such as improving analysis and purification, and additional considerations when ‘thinking lean’: customer focus, objective-setting, reward, engagement and motivation.&lt;br /&gt;&lt;br /&gt;Lean Sigma offers powerful tools and interventions that have given rise to dramatic and sustainable improvements in speed, consistency and quality of work.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-1220753747850415875?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/1220753747850415875/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=1220753747850415875' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1220753747850415875'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1220753747850415875'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/making-medicinal-chemistry-more.html' title='Making medicinal chemistry more effective-application of Lean Sigma to improve processes, speed and quality'/><author><name>Kuochung Peng</name><uri>http://www.blogger.com/profile/04963167070928354912</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-5068294007090762899</id><published>2009-10-06T14:02:00.001-07:00</published><updated>2009-10-06T14:02:42.564-07:00</updated><title type='text'>Drug Discovery Using Chemical Systems Biology: Repositioning the Safe Medicine Comtan to Treat Multi-Drug and 
Extensively Drug Resistant Tuberculosis</title><content type='html'>&lt;div xmlns='http://www.w3.org/1999/xhtml'&gt;This is the second paper published by the Bourne and Xie group under the heading "Drug Discovery Using Chemical Systems Biology". It's a very interesting topic: the authors explain how they used chemical systems biology to discover that entacapone and tolcapone, commercially available drugs for the treatment of Parkinson's disease, are good candidates against Multi-Drug and Extensively Drug Resistant Tuberculosis (MDR-TB and XDR-TB). These drugs can inhibit the enzyme InhA, whose binding site is similar to that of COMT, their primary target in the treatment of Parkinson's disease. InhA is essential for type II fatty acid biosynthesis and the subsequent synthesis of the bacterial cell wall. It is a common target of anti-tubercular drugs. Their findings are validated by &lt;i&gt;in vitro&lt;/i&gt; and InhA kinetic assays using tablets of Comtan, whose active component is entacapone. &lt;br /&gt;&lt;br /&gt;The authors describe their strategy as follows:&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;1. The binding site of a commercially available drug is extracted or predicted from a 3D structure or model of the target protein. &lt;br /&gt;2. Off-targets with similar ligand binding sites are identified across the proteome using an efficient and accurate functional site search algorithm. &lt;br /&gt;3. Atomic interactions between the putative off-targets and the drug are evaluated using protein-ligand docking. Only those off-targets that do not experience serious atomic clashes with the drug are selected for further analysis. &lt;br /&gt;4. The drug is further optimized to enhance its potency, selectivity and ADME properties by taking into account both the primary target and the off-targets across the genome. 
&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;In short, the authors try to find other targets of a drug across a whole proteome, based on binding site similarity and docking. Then they optimize the drug taking all the targets in the proteome into account. &lt;br /&gt;&lt;br /&gt;The result is interesting and inspiring. Using this strategy, they found what is described earlier: entacapone and tolcapone, drugs currently on the market for Parkinson's disease, can be good candidates for the treatment of TB. &lt;br /&gt;&lt;br /&gt;The authors elaborate on the details of their research. The primary target of these drugs is human catechol-&lt;i&gt;O&lt;/i&gt;-methyltransferase (COMT). Using the SOIPPA algorithm, developed by the same group, they detected binding sites shared among proteins. They then docked entacapone and tolcapone into these proteins and found that InhA was highly ranked. &lt;br /&gt;&lt;br /&gt;Interestingly, the authors point out that by 2D similarity these two drugs are not similar to the currently known InhA inhibitors. When 20K drug-like compounds were docked into InhA, entacapone ranked very low. In other words, conventional virtual screening would not have found them as potential InhA inhibitors. The logP of these two drugs violates Lipinski's rule of 5 and is very different from that of current anti-tubercular drugs. So these two drugs would be quite unlikely to be selected as lead compounds for the inhibition of InhA using common drug discovery methods. &lt;br /&gt;&lt;br /&gt;The authors also compared the binding poses of the compounds in COMT and InhA, and pointed out possible ways to optimize them so that they have weaker affinity for the original target, COMT. 
&lt;br /&gt;&lt;br /&gt;To summarize, this paper successfully presents a case study of using chemical systems biology, in particular using protein-ligand interactions, to assist drug discovery in a new multi-target-multi-drug paradigm.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-5068294007090762899?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/5068294007090762899/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=5068294007090762899' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5068294007090762899'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5068294007090762899'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/drug-discovery-using-chemical-systems_06.html' title='Drug Discovery Using Chemical Systems Biology: Repositioning the Safe Medicine Comtan to Treat Multi-Drug and Extensively Drug Resistant Tuberculosis'/><author><name>djiao</name><uri>http://www.blogger.com/profile/06486651456900333839</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-8491362841412726588</id><published>2009-10-06T08:27:00.000-07:00</published><updated>2009-10-06T08:31:56.559-07:00</updated><title type='text'>Drug Discovery Using Chemical Systems Biology: Identification of the Protein-Ligand Binding Network To Explain the Side Effects of CETP Inhibitors</title><content type='html'>&lt;div align="justify"&gt;Authors: 
Xie L, Li J, Xie L, Bourne PE.&lt;/div&gt;&lt;div align="justify"&gt;&lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/19436720"&gt;Full article&lt;/a&gt;&lt;/div&gt;&lt;div align="justify"&gt; &lt;/div&gt;&lt;div align="justify"&gt;Chemical systems biology, which studies how a chemical interacts with the biological system as a whole, is a new area that has emerged in recent years with the development of chemogenomics and proteomics. It has been widely applied in side-effect research, where drugs with different structures have the same side effect. The approaches are very similar: the targets of a drug are observed or predicted and then mapped onto biological systems (i.e. pathways) to see the effect when the targets are inhibited or activated. One of the critical parts is target fishing. This paper demonstrates a novel target prediction method using protein sequence alignment and docking, and explains at the molecular level why the drugs behave differently.&lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;The work started from the study of Torcetrapib, a CETP inhibitor withdrawn from phase 3 clinical trials due to a deadly off-target effect, viz. hypertension. Two other CETP inhibitors (Anacetrapib and JTT-705) that do not cause hypertension were compared against Torcetrapib. The approach proceeds as follows. 1) Four binding sites of CETP were predicted from 3D experimental structures or homology models. 2) 276 off-targets with a binding site similar to that of CETP were identified across all 5,985 proteins available in the Protein Data Bank using Sequence Order Independent Profile-Profile Alignment (SOIPPA). 3) The interactions between the putative off-targets and the inhibitors were characterized using docking methods including Surflex, eHiTS and AutoDock, and the high-ranking off-targets were further investigated. 
4) The identified off-targets were subjected to structural and functional cluster analysis and mapped onto biological systems such as metabolic, signal transduction and gene regulation pathways. For example, the effects of CETP inhibitors on blood pressure can be explained through their influence on the Renin-Angiotensin-Aldosterone System (RAAS), which has the positive regulators PPAR, RXR and LXR as well as the negative regulator VDR. The results show that Torcetrapib binds more strongly to the positive regulators, leading to increased blood pressure, whereas JTT-705 binds to both positive and negative regulators, balancing the positive and negative control over RAAS and lowering the chance of hypertension. Furthermore, the study suggests that JTT-705 has the potential to treat cancer and prevent inflammation.&lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;The combination of protein sequence alignment and docking to predict off-targets is of great interest. Machine learning, such as multi-category Bayesian models, is a common approach in chemogenomics, but such approaches ignore the issue of multiple binding sites on one protein: the underlying assumption is that all ligands interact at the same binding site, which is not true in reality. Docking might help on this point; however, docking generally requires prior knowledge of the binding site, and most of the time the binding sites are unknown, which makes prediction less reliable. Another issue I’d like to mention is the conclusion that JTT-705 gives better results than Torcetrapib mainly because of its promiscuity. Promiscuity does offer more potential to treat a disease, but it also implies that many undesirable proteins may be targeted, leading to unexpected side effects. 
This makes the system so uncertain that a quantitative evaluation of how the system will be affected is urgently needed; progress, however, depends on further exploration of chemogenomic and proteomic space.&lt;br /&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-8491362841412726588?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/8491362841412726588/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=8491362841412726588' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8491362841412726588'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8491362841412726588'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/drug-discovery-using-chemical-systems.html' title='Drug Discovery Using Chemical Systems Biology: Identification of the Protein-Ligand Binding Network To Explain the Side Effects of CETP Inhibitors'/><author><name>Bin Chen</name><uri>http://www.blogger.com/profile/03736801389411661562</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2130335078737492346</id><published>2009-10-05T17:02:00.000-07:00</published><updated>2009-10-05T17:07:16.861-07:00</updated><title type='text'></title><content type='html'>&lt;strong&gt;Searching Chemical Space with the Bayesian Idea Generator&lt;/strong&gt;,&lt;br /&gt;Authors: Willem P. van Hoorn and Andrew S. Bell&lt;br /&gt;Source: J. Chem. Inf. 
Model, ASAP&lt;br /&gt;&lt;br /&gt;&lt;div align="justify"&gt;This publication describes an application of Bayesian statistics to narrow down the search space of virtual chemical libraries. The Pfizer Global Virtual Library is defined as the compounds that could be prepared using the validated synthesis protocols and available monomers (~25,000). Although these add up to a virtual collection on the order of 10&lt;sup&gt;12&lt;/sup&gt; compounds that could be synthesized, only a tiny fraction has ever been made, leaving plenty of room for targeted follow-up libraries. This potential chemical space is so huge (and growing) that it cannot be searched directly for close analogues of HTS hits. To tackle this problem, a multicategory Bayesian model was designed, based on the observation that compounds from the same library series have more in common (template or bonds to disconnect) than compounds from different series. This Bayesian shortcut reduces the search space to a few library synthesis protocols, which can then be searched within minutes. The result is a list of library synthesis protocols, each with the most similar combinatorial compounds that have already been made with that protocol. This gives the scientist not only the ability to quickly evaluate the potential of each individual library, but also the possibility of jumping scaffolds through the diversity of the answers.&lt;br /&gt;The multicategory Bayesian learning approach was then applied to the screenable Pfizer compounds, based on fingerprints and using Pipeline Pilot. This model, available to Pfizer scientists as a Web service, makes 16 predictions of the library protocol a query molecule could have come from and then, for each library, returns the 6 closest compounds that have been made, yielding 96 compounds ready for biological screening. In this way, the method identifies not the closest possible analogues but readily screenable compounds. 
If any of the compounds showed the desired activity, they would be followed up with a full library. The same approach was then applied to commercially available compounds from two vendors, which were grouped into clusters of 10 or more compounds. Again, a model was built to produce a ranked list of the clusters a compound most likely belongs to. This served as the test case for the virtual library, because here the quality of the Bayesian learning approach could still be compared with a Tanimoto nearest-neighbour search.&lt;br /&gt;As most chemists are interested in which part of the molecule contributes to the Bayesian score, a report was created that highlights the features contributing most to the score, which is the sum of the normalized probabilities of the fingerprint features. As an additional benefit, the Bayesian Idea Generator offers a greater chance of jumping scaffolds than a 2D Tanimoto search. Since 6 compounds are selected from each of 16 different libraries, the results are not dominated by compounds from the single library that is overall most similar to the query molecule.&lt;br /&gt;The approach was further illustrated by four project examples. In conclusion, the Bayesian Idea Generator offers a fast and accurate method to identify which library synthesis protocol could have been used to prepare a query molecule by applying Bayesian statistics. This approach makes it possible to search the vast virtual chemical space in a manageable time, without having to encode the chemical knowledge necessary to make the compounds. The result of the search allows the potential of a library to be tested quickly, as the compounds are already available and need not be synthesized. 
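&lt;br /&gt;The two-stage search summarized above (rank the libraries with a multicategory Bayesian score, then pick the nearest already-made compounds from the top libraries) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the library names, the fingerprint "features" and the Laplace-smoothed log-ratio score are all made up for the example and are not the Pipeline Pilot model or the data used in the paper.
&lt;pre&gt;
```python
import math
from collections import Counter

# Hypothetical toy data: each library is a list of compounds,
# each compound is a set of made-up fingerprint features.
LIBRARIES = {
    "amide_coupling": [
        {"amide", "phenyl", "piperidine"},
        {"amide", "phenyl", "morpholine"},
        {"amide", "pyridine", "piperidine"},
    ],
    "suzuki": [
        {"biaryl", "phenyl", "pyridine"},
        {"biaryl", "thiophene", "phenyl"},
        {"biaryl", "pyridine", "fluoro"},
    ],
    "reductive_amination": [
        {"amine", "benzyl", "piperidine"},
        {"amine", "benzyl", "morpholine"},
    ],
}
QUERY = {"amide", "phenyl", "fluoro"}


def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return len(a.intersection(b)) / union if union else 0.0


def rank_libraries(query, libraries):
    """Rank libraries by a smoothed log-ratio score summed over query features.

    Assumption: a simple Laplace-smoothed estimate stands in for the
    Bayesian score described in the review.
    """
    all_fps = [fp for fps in libraries.values() for fp in fps]
    background = Counter(f for fp in all_fps for f in fp)
    scores = {}
    for name, fps in libraries.items():
        counts = Counter(f for fp in fps for f in fp)
        total = 0.0
        for feature in query:
            p_lib = (counts[feature] + 1) / (len(fps) + 2)
            p_all = (background[feature] + 1) / (len(all_fps) + 2)
            total += math.log(p_lib / p_all)
        scores[name] = total
    return sorted(scores, key=scores.get, reverse=True)


def idea_generator(query, libraries, n_libs=2, k=2):
    """Return the k nearest made compounds from each of the n_libs top libraries."""
    hits = []
    for name in rank_libraries(query, libraries)[:n_libs]:
        nearest = sorted(
            libraries[name], key=lambda fp: tanimoto(query, fp), reverse=True
        )[:k]
        hits.extend((name, sorted(fp)) for fp in nearest)
    return hits


for library, compound in idea_generator(QUERY, LIBRARIES):
    print(library, compound)
```
&lt;/pre&gt;
The review reports 16 libraries and 6 compounds per library (96 compounds); the sketch uses 2 and 2 only to keep the toy output short. Taking a fixed number of compounds from each top-ranked library, rather than the globally nearest neighbours, is what keeps one library from dominating the answer.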
&lt;/div&gt;&lt;div align="justify"&gt; &lt;/div&gt;&lt;div align="justify"&gt;The method presented in this very interesting paper is a clever approach to limiting the cost of synthesis and biological screening in the drug discovery process, while also opening up the potential for scaffold hops and reducing computation time.&lt;br /&gt;(A Pipeline Pilot protocol to perform the described calculations is provided in the supporting information.)&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2130335078737492346?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2130335078737492346/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2130335078737492346' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2130335078737492346'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2130335078737492346'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/searching-chemical-space-with-bayesian.html' title=''/><author><name>Stefan Furrer</name><uri>http://www.blogger.com/profile/05065895922703854994</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-1846243595972156523</id><published>2009-10-04T14:15:00.000-07:00</published><updated>2009-10-04T14:29:01.891-07:00</updated><title type='text'>Review to: Shape: automatic conformation prediction of carbohydrates using algorithm</title><content type='html'>authors: Jimmy Rosen, Laurence 
Miguet, Serge Perez&lt;br /&gt;Source: Journal of Cheminformatics 2009, 1:16 doi:10.1186/1758-2946-1-16&lt;br /&gt;&lt;br /&gt;This paper addresses the prediction of the three-dimensional structure of biomolecules. It describes a software package, Shape, which uses a genetic algorithm to predict the conformation of carbohydrates, introduces the genetic algorithm itself, and tells chemists how to use Shape.&lt;br /&gt;&lt;br /&gt;There are two ways to determine the conformation of carbohydrates: experiment and molecular modeling. Many chemists find molecular modeling difficult to use for this purpose, so Shape, a software package that makes the prediction easy to perform, is important. Shape is flexible. It uses a genetic algorithm to search the conformational space of the biomolecule for low-energy regions. To use the package one needs a force field, such as MM3, to evaluate the energy. The software is implemented in Java and runs on Linux and Windows.&lt;br /&gt;&lt;br /&gt;The user of Shape supplies the structures of the molecules and the control parameters, and the genetic algorithm returns conformations of the molecule.&lt;br /&gt;&lt;br /&gt;In my opinion, to improve Shape, one could make the software run on more platforms than UNIX and Windows. 
Moreover, one could add functionality that lets the user write Perl code and integrate it with Shape.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-1846243595972156523?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/1846243595972156523/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=1846243595972156523' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1846243595972156523'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1846243595972156523'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/10/review-to-shape-automatic-conformation.html' title='Review to: Shape: automatic conformation prediction of carbohydrates using algorithm'/><author><name>Hongliu</name><uri>http://www.blogger.com/profile/01406705553116644059</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-6851409371858638662</id><published>2009-09-23T22:38:00.000-07:00</published><updated>2009-09-23T22:52:33.085-07:00</updated><title type='text'>Review to: The multiple roles of computational chemistry in fragment-based drug design</title><content type='html'>Fragment-based screening for lead generation has been gaining momentum in recent years. In August 2009, the Journal of Computer-Aided Molecular Design published a special issue on Fragment-Based Drug Design (FBDD). 
This paper is one of the articles in that issue; it reviews the roles computational chemistry plays in fragment-based screening and lead generation.&lt;br /&gt;&lt;br /&gt;In the paper, the authors present a generic FBDD workflow to show the multiple roles that computational chemistry (CC) plays. According to the authors, CC is often introduced at the beginning of an FBDD process, during fragment selection and fragment library creation, to help produce target-focused fragments and increase the chance of success. CC is also used at the post-screening stage, to prioritize the fragment hits and evolve them towards more drug-like molecules. This support is applied iteratively, combined with X-ray crystallography and orthogonal NMR techniques. Besides a detailed description of this CC-assisted FBDD workflow, the authors also review CC-assisted FBDD studies for HSP90, PDE10a, BCL-2 and BACE.&lt;br /&gt;&lt;br /&gt;Computational chemistry plays important roles in FBDD, both before and after screening. Since FBDD is a strategic shift from screening higher-molecular-weight compounds to screening lower-molecular-weight ones, which CC handles better, the importance of CC support in fragment selection and library creation is obvious. But it is the role CC plays after screening, in optimizing and expanding the fragment hits as they evolve from fragment to lead, that makes it necessary. 
It would be more useful if the paper had been more specific about the CC techniques used.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-6851409371858638662?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/6851409371858638662/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=6851409371858638662' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6851409371858638662'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6851409371858638662'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/review-to-multiple-roles-of.html' title='Review to: The multiple roles of computational chemistry in fragment-based drug design'/><author><name>Yi</name><uri>http://www.blogger.com/profile/01986022360906099290</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2601215254950610724</id><published>2009-09-23T08:09:00.000-07:00</published><updated>2009-09-23T08:11:06.194-07:00</updated><title type='text'>The impact of aromatic ring count on compound developability – are too many aromatic rings a liability in drug design?</title><content type='html'>Authors: Timothy J. Ritchie and Simon J.F. Macdonald&lt;br /&gt;Source: Drug Discovery Today, September 2009 (uncorrected proof as of Sep. 21 2009)&lt;br /&gt;&lt;br /&gt;The authors investigated the impact of the number of aromatic rings on compound developability for small-molecule drugs.  
The authors got the idea from their in-house data on the aromatic ring count of compounds in the GSK pipeline. The GSK pipeline data show that the mean aromatic ring count of compounds decreases as the compounds advance to later stages.&lt;br /&gt;&lt;br /&gt;Aromatic ring count is a simple property that is more intuitive to interpret than other, more sophisticated molecular descriptors. It is also easier to keep in mind during lead optimization, which is likely to be iterative.  Regarding the definition of aromaticity, they used Daylight's definition, and each ring of a fused aromatic system is counted individually; for example, indole is counted as two aromatic rings.  The authors analyzed the relationship of aromatic ring count to several molecular properties, such as aqueous solubility, lipophilicity (cLogP) and logD. In addition, they extended the analysis to human serum albumin binding, P450 3A4 inhibition and hERG inhibition.&lt;br /&gt;&lt;br /&gt;All the investigated properties give a consistent message. As the number of aromatic rings in a compound increases, the physico-chemical properties all change in the same direction: aqueous solubility decreases, while lipophilicity and logD increase. In addition, both pharmacological (serum albumin binding) and toxicological (CYP450 and hERG inhibition) properties tend to get worse as the number of aromatic rings increases. The authors found one interesting result: the number of aromatic rings affects aqueous solubility even among compounds with similar lipophilicity (cLogP). 
For instance, of two compounds with similar lipophilicity, the one with more aromatic rings has the lower aqueous solubility.&lt;br /&gt;&lt;br /&gt;In conclusion, the authors suggest that limiting the number of aromatic rings gives a compound a better chance of moving to the next stage of the drug discovery/development process, i.e. makes it more developable. They also comment on why aromatic and heteroaromatic rings are so prevalent in small-molecule drugs. First, medicinal chemists often introduce an aromatic ring when they need to improve the potency of a compound; because of its rigidity, a ring structure generally increases the ligand-receptor binding energy. Second, aryl-aryl coupling is a well-known synthetic method that makes it easy to transform a structure, and the required building blocks are readily available commercially.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2601215254950610724?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2601215254950610724/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2601215254950610724' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2601215254950610724'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2601215254950610724'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/impact-of-aromatic-ring-count-on.html' title='The impact of aromatic ring count on compound developability – are too many aromatic rings a liability in drug design?'/><author><name>JaeHong Shin</name><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://3.bp.blogspot.com/_IJCpekjdhLQ/SwQl3BHImII/AAAAAAAADJQ/gDR7Zr_kiTo/s1600-R/profile_picture-full%3Binit:.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-5741693981161052624</id><published>2009-09-22T08:35:00.000-07:00</published><updated>2009-09-22T08:41:52.836-07:00</updated><title type='text'>review of Supervised prediction of drug-target interactions using bipartite local models.</title><content type='html'>&lt;div align="justify"&gt;Authors: Bleakley K, Yamanishi Y.&lt;/div&gt;&lt;div align="justify"&gt;&lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/19605421"&gt;Full article&lt;/a&gt;&lt;/div&gt;&lt;div align="justify"&gt; &lt;/div&gt;&lt;div align="justify"&gt;I was attracted by the title at first glance. As far as I know, drug-target interactions are usually predicted by docking or by classic machine learning algorithms such as SVM, Bayesian models and random forests, so using a network modeling method from systems biology to predict drug-target relations is novel and of great interest. &lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;In a drug-target bipartite network, each node is either a drug or a target, and two nodes are linked if the drug interacts with the corresponding target. Basically, the bipartite local model tries to discover “unknown edges”, namely missing interactions. The paper uses two approaches, based on target information and drug information respectively, and finally combines the results into a single probability score for the interaction. Suppose we want to know whether there should be an edge between drug d(i) and target t(j). First we exclude target t(j) and make a list of all other known targets of d(i) in the bipartite network, labeled A, as well as a separate list of the targets not known to be targeted by d(i), labeled B. 
Then we can train a classifier such as an SVM, based on protein sequence similarity, to predict whether t(j) belongs to A or to B. If it belongs to A, there is an edge between d(i) and t(j). The same approach can be applied from the drug side.  We exclude drug d(i) and make a list of all other drugs known to target t(j) in the bipartite network, labeled A, as well as a separate list of the drugs not known to target t(j), labeled B. Then we can use a method based on drug structure similarity to decide whether d(i) belongs to A or to B. If it belongs to A, there is an edge between d(i) and t(j). If both approaches predict an edge between d(i) and t(j), we can say with high confidence that d(i) interacts with t(j). Since the method combines protein sequence and compound structure, it performs very well compared with other methods such as KRM and NN.&lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;Essentially, it still uses machine learning methods such as SVM in the classifier, but it does provide a new direction for predicting drug-target relations; moreover, it shows the potential of applying bioinformatics methods to cheminformatics. Overall, it is an interesting paper, although there are still some issues I’d like to mention. For example, it splits the network into four bipartite networks based on protein family (enzymes being one of them), so the scope of the prediction is limited to those families. We are therefore unable to predict the interaction between a drug and a heat shock protein, which is covered neither by the enzyme network nor by the other three bipartite networks. Besides, the paper states that amino acid sequence information is more predictive than chemical structure information, which may be a biased claim. 
Within a bipartite network where the targets are very similar but the drugs are very diverse, sequence similarity would certainly give better results, but that does not mean that sequence alignment performs better than structure similarity overall, since most of the time the interaction depends only on the binding site, which makes up a small portion of the whole sequence.&lt;br /&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-5741693981161052624?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/5741693981161052624/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=5741693981161052624' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5741693981161052624'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5741693981161052624'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/review-of-supervised-prediction-of-drug.html' title='review of Supervised prediction of drug-target interactions using bipartite local models.'/><author><name>Bin Chen</name><uri>http://www.blogger.com/profile/03736801389411661562</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-6744379423542567606</id><published>2009-09-22T02:36:00.000-07:00</published><updated>2009-09-22T02:40:38.932-07:00</updated><title type='text'>Conformational Landscape of the Human Immunodeficiency Virus Type 1 Reverse Transcriptase Non-Nucleoside 
Inhibitor Binding Pocket: Lessons for Inhibit</title><content type='html'>Authors: Kristina A. Paris, Omar Haq, Anthony K. Felts, Kalyan Das, Eddy Arnold and Ronald M. Levy;&lt;br /&gt;Source: J. Med. Chem., Article ASAP, DOI: 10.1021/jm900854h&lt;br /&gt;&lt;br /&gt;&lt;div align="justify"&gt;Current strategies for the treatment of HIV involve blocking different steps in the retrovirus’ life cycle. In this paper, the authors focus on the inhibition of the viral enzyme reverse transcriptase by non-competitive, non-nucleoside reverse transcriptase inhibitors. The binding pocket for the non-nucleoside inhibitors is very flexible and changes conformation when different inhibitors are bound. This allows inhibitors to have different shapes while sharing similar motifs: aromatic moieties (pi-pi stacking) and an H-bond donor that forms an H-bond with K101. The variety of the inhibitors suggests that the binding pocket can adopt different conformations.&lt;br /&gt;In this study, the authors investigated 99 available X-ray structures of HIV-1 reverse transcriptase, covering both wild type and mutants, with and without inhibitors. Each X-ray structure represents a point in the conformational landscape for binding inhibitors. The authors went on to characterize this conformational and free energy landscape through cluster analysis, to obtain information on the location and shape of the free energy basins. This yields information not only about the different ways inhibitors can bind, but also about the strain free energy required to adapt the conformation for binding. The analysis was based on the average root-mean-square deviation (rmsd) of all Cα atoms in the binding pocket within 15 Å of any inhibitor. 
The analysis of the binding pocket and side chains using hierarchical clustering techniques yielded 8 free energy basins: 1 large cluster and several small ones.&lt;br /&gt;The binding pocket undergoes a large structural rearrangement upon inhibitor binding, especially at Y181 and Y188, which opens a pocket that exists only when an inhibitor is present. The large cluster, however, suggests a smaller diversity of inhibitors than previously thought; it represents a sampling of a continuum of accessible states separated by relatively low energy barriers. The few smaller energy basins, on the other hand, are sparsely populated and are presumably surrounded by large energy barriers. Of these, the rilpivirine and capravirine complexes stood out in particular, showing additional, very specific interactions with the binding site (hydrophobic tunnel and H-bonds) that result in an overall higher affinity and greater activity against most mutants, even though the mutations seemed to have little effect on the binding pocket conformation.&lt;br /&gt;Although the data set used in this study does not represent a systematic coverage of the conformational landscape, as the inhibitors are based on SAR and modelling around previous blockers, the sparsely populated basins open up possibilities for optimizing and designing future ligands. In addition, the authors plan to publish another study in the near future on the construction of a free energy landscape through molecular simulation, using the X-ray structures as “landmarks”. 
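&lt;br /&gt;The rmsd-based hierarchical clustering described above can be illustrated with a minimal Python sketch. Several things here are assumptions made only for the example: the structure names and coordinates are invented toy values, the structures are taken to be already superimposed (no fitting step is shown), and average linkage with an arbitrary cutoff stands in for whatever clustering settings the authors actually used.
&lt;pre&gt;
```python
import math
from itertools import combinations

# Hypothetical toy "pockets": three C-alpha positions per structure,
# assumed to be already superimposed on a common frame.
POCKETS = {
    "wt_apo": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    "wt_inhA": [(0.0, 0.0, 0.2), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    "mut_inhA": [(0.0, 0.0, 0.0), (1.0, 0.2, 0.0), (0.0, 1.0, 0.0)],
    "wt_inhB": [(3.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
}


def rmsd(a, b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    squared = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(squared / len(a))


def average_linkage(structures, cutoff):
    """Agglomerative clustering: merge the two closest clusters until the
    closest pair is farther apart than cutoff (average pairwise rmsd)."""
    names = list(structures)
    dist = {
        frozenset(pair): rmsd(structures[pair[0]], structures[pair[1]])
        for pair in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) != 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Largest cluster first, mirroring "1 large and several small clusters".
    return sorted(clusters, key=len, reverse=True)


for members in average_linkage(POCKETS, cutoff=0.5):
    print(len(members), members)
```
&lt;/pre&gt;
With these toy coordinates the three near-identical pockets merge into one large cluster and the displaced one stays on its own, which is the same qualitative picture (one populated basin plus sparsely populated outliers) that the review reports for the 99 X-ray structures.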
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-6744379423542567606?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/6744379423542567606/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=6744379423542567606' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6744379423542567606'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6744379423542567606'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/conformational-landscape-of-human.html' title='Conformational Landscape of the Human Immunodeficiency Virus Type 1 Reverse Transcriptase Non-Nucleoside Inhibitor Binding Pocket: Lessons for Inhibit'/><author><name>Stefan Furrer</name><uri>http://www.blogger.com/profile/05065895922703854994</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-8802603792159371796</id><published>2009-09-21T17:10:00.000-07:00</published><updated>2009-09-21T17:14:11.792-07:00</updated><title type='text'>Computational analysis of membrane proteins: the largest class of drug targets</title><content type='html'>Authors: Yalini Arinaminpathy, Ekta Khurana, Donald M. Engelman, Mark B. Gerstein&lt;br /&gt;      &lt;br /&gt;     This article argues that it is necessary to understand the structures of integral membrane proteins (IMPs) because they represent ~30% of currently sequenced genomes. 
Thus, computational modeling has been used extensively to make crucial advances in understanding membrane protein structure and function.&lt;br /&gt;    &lt;br /&gt;     Despite the experimental challenges of studying these proteins, understanding them is crucial because they represent more than 60% of drug targets.  Several experimental methods have been used to determine their secondary structure or function.  Three methods are generally employed to determine high-resolution structures: electron microscopy, NMR and X-ray crystallography. Despite the difficulties involved in growing large and sufficiently well-ordered 3D crystals, X-ray crystallography is still the most successful and least difficult technique for obtaining high-resolution structures. Three major bottlenecks exist for obtaining structural information on membrane proteins. First, it is difficult to obtain the protein of interest because membrane proteins are usually present in the cell only at low concentrations. Second, membrane proteins are naturally embedded in the heterogeneous, dynamic environment of the mosaic lipid bilayer, and it is extremely difficult to apply high-resolution experimental techniques in their native environment. Third, membrane proteins are generally insoluble in aqueous solution; hence, detergents are required at concentrations above the critical micelle concentration. At the current pace, it will take approximately 30 years to obtain the 1700 membrane protein structures that are needed to account for each structural family. Hence, despite the increasing rate of structure determination, improved structure prediction methods using computational tools are important in studying membrane proteins.   &lt;br /&gt;     &lt;br /&gt;     In the absence of high-resolution 3D structures, computational methods are used for the structure prediction of membrane proteins. 
The authors focus on one intensively used technique, molecular dynamics (MD) simulation, because MD arguably provides the most detailed information. While X-ray structures of membrane proteins provide static snapshots, simulations enable us to explore the structural dynamics of the proteins in an attempt to bridge the gap between structure and function. Another approach, known as QM/MM, which combines quantum mechanics and molecular mechanics, has also been used to study membrane proteins. Many of the results from MD simulations based on realistic all-atom models have been consistent with the information obtained from high-resolution structural data, showing that MD simulations can provide crucial information when such data are absent. The time scales that limit MD are the strength of Brownian dynamics (BD); the drawback of BD is its comparatively poor description or parameterization of the simulated biological system. Although MD provides the most detailed information about the dynamics of ion channels, the currently accessible simulation times are its greatest limitation. 
However, this problem might be surmounted in the future with the doubling of computer speeds over the years.&lt;br /&gt;      &lt;br /&gt;        In conclusion, the use of computational tools such as protein simulation methods, in combination with experimental and structural genomics studies, is becoming increasingly valuable in studying the structure and function of membrane proteins.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-8802603792159371796?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/8802603792159371796/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=8802603792159371796' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8802603792159371796'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8802603792159371796'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/computational-analysis-of-membrane.html' title='Computational analysis of membrane proteins: the largest class of drug targets'/><author><name>Kuochung Peng</name><uri>http://www.blogger.com/profile/04963167070928354912</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-6085192706559835991</id><published>2009-09-21T13:34:00.000-07:00</published><updated>2009-09-21T13:57:27.204-07:00</updated><title type='text'>Fast automated placement of polar hydrogen atoms in protein-ligand complexes</title><content type='html'>&lt;div align="justify"&gt;By 
Tobias Lippert and Matthias Rarey&lt;br /&gt;&lt;br /&gt;                In this paper the authors describe a novel, fast method, 'Protoss', that calculates hydrogen atom positions based on optimal hydrogen bond networks. Besides programs that calculate the positions of hydrogen atoms by molecular dynamics minimization or place them solely by geometric criteria, the most frequently applied tools are WHAT IF, MolProbity, and the Hbuild procedure implemented in the X-PLOR package; two further, recently published methods, the Computational Titration algorithm and Protonate3D, use a force-field concept. The approach of Protoss differs from all these programs in two respects. On the one hand, the authors ensure the speed of the calculation by using an efficient dynamic programming approach with "memoization", i.e. storing partial solutions and combining them into globally optimal solutions. On the other hand, the authors wanted to model the protein-ligand interface with an established method. Their objective function is based on the hydrogen bonding term of the Boehm scoring function, which was designed to correctly reflect protein-ligand interactions and is used in the FlexX molecular docking program.&lt;br /&gt;                    The algorithm starts by identifying hydrogen bond networks in the protein-ligand interface. A hydrogen bond network is the maximal set of functional groups for which alternative modes exist and that are able to form hydrogen bonds among themselves. The networks are modeled as graphs: every modeled degree of freedom is represented by a node, i.e. one node for each amino acid for which rules exist, each included water molecule and each treated functional group of the ligand. Edges stand for interactions between amino acids. 
The problem is then to find the modes in each network that yield the best hydrogen bond network with respect to the objective function. This is done efficiently with a dynamic programming approach, a recursive procedure that the authors give in pseudocode. The Protoss algorithm is split into two phases: ‘initialization’, which creates modes for each residue for which rules exist, and ‘optimization’, which transforms the graph into a tree. The initialization is performed only once for a protein-ligand complex, and the generated information can be reused for alternative docking poses. To time the program, 900 active sites and their corresponding ligands from the PDBbind dataset were extracted (all amino acids within 6.5 Å of any ligand atom) and optimized.&lt;br /&gt;                   In the results section, a validation based on a dataset from Forrest and Honig is given. The authors demonstrate that they were able to reproduce the results with a quality comparable to that of the other programs in a fraction of the time, making the method suitable for high-throughput modeling applications. They have implemented a program that automatically places hydrogen atoms in protein structures, with a particular focus on protein-ligand interfaces. In conclusion, the important point is that the hydrogen atoms are placed facing in the right direction, making it possible to correctly identify any interactions in which they are involved. The predicted positions of hydrogen atoms are consistent with those in the test sets of Forrest and Honig, and the reported rate of amide chain flips is in line with the rates reported in the literature.&lt;br /&gt;                    There are three main limitations in this paper. The first is that the dynamic programming approach of Protoss does not work well on strongly interconnected graphs, which take much longer to optimize. 
This time could be reduced by ignoring side chain flips and by constraining when an edge is inserted into the graph, but this imposes a minimum quality threshold on the hydrogen bonds considered. The second limitation is that Protoss does not consider different protonation states or tautomeric forms of the ligand. Also, although Protoss can orient selected water molecules, it cannot yet predict the presence of water molecules in the protein-ligand interface. Third, instead of obtaining only the best-scoring solution, it would be desirable to obtain a set of stable conformations that are valid for the active site, but Protoss gains its speed from considering only the best-scoring solutions. Since Protoss is intended to be applied to different ligands and poses individually during molecular docking and scoring, alternative hydrogen bond networks are of less importance.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-6085192706559835991?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/6085192706559835991/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=6085192706559835991' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6085192706559835991'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6085192706559835991'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/fast-automated-placement-of-polar.html' title='Fast automated placement of polar hydrogen atoms in protein-ligand complexes'/><author><name>Rohan 
Patil</name><uri>http://www.blogger.com/profile/15299634532963258412</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-KgOOWFrKiIA/TWfZi2IQXWI/AAAAAAAACXI/sgFoH2sDGsM/s220/rspatil_l.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-317451979099670611</id><published>2009-09-21T12:09:00.000-07:00</published><updated>2009-09-21T13:46:29.866-07:00</updated><title type='text'>Review of A Model-Based Ensembling Approach for Developing QSARs</title><content type='html'>Authors: Qianyi Zhang, Jacqueline M. Hughes-Oliver, and Raymond T. Ng&lt;br /&gt;&lt;br /&gt;The paper introduces quantitative structure-activity relationship (QSAR) modeling and notes that ensembling can be used for it. It compares the different base learners that create the individual models, presents several performance measures for imbalanced data and uses the F-measure for evaluation. It also compares various strategies for aggregating base learners and presents the steps of model-based ensembling.&lt;br /&gt;&lt;br /&gt;The model-based ensembling method presented in the paper is used to classify compounds into two classes: active and inactive. Three factors determine the construction of an ensemble. The first is the selection of the base learner, the second is the manipulation of the data set for each learner, and the third is how the base learners are combined. The paper points out that decision trees (DT) can be used for binary classification. To manipulate the data set for each learner, the paper uses 10-fold cross-validation, and it uses probability averaging to combine the base learners.&lt;br /&gt;&lt;br /&gt;The paper gives the steps of the model-based ensembling method. 
The steps contain two loops, one loop using multiple decision trees, the other loop finding the best threshold.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-317451979099670611?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/317451979099670611/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=317451979099670611' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/317451979099670611'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/317451979099670611'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/model-based-ensembling-approach-for.html' title='Review of A Model-Based Ensembling Approach for Developing QSARs'/><author><name>Hongliu</name><uri>http://www.blogger.com/profile/01406705553116644059</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-340641732362767398</id><published>2009-09-20T11:47:00.000-07:00</published><updated>2009-09-20T11:49:03.845-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='–( 708 | SEPTEMBER 2009 | VOLUME 8 )'/><title type='text'>Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery</title><content type='html'>Michael R. Barnes, Lee Harland, Steven M. Foord, Matthew D. Hall, Ian Dix, &lt;br /&gt;Scott Thomas, Bryn I. Williams-Jones and Cory R. 
Brouwer.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;High-quality, open and accessible data are the foundation of pre-competitive research, and strong public–private partnerships have considerable potential to enhance public data resources, which would benefit everyone engaged in drug discovery. This paper discusses the background of these changes and proposes new areas of collaboration in computational biology and chemistry between the public domain and the pharmaceutical industry.&lt;br /&gt;&lt;br /&gt;The pharmaceutical industry is facing challenges: declining productivity, patent expiry and a downward trend in pricing are forcing it to rapidly adopt a survival mode. Regulatory requirements and the demand for new drugs are also increasing. Reduced internal funding is leading to an increased focus on data exploitation rather than data generation. Despite the growing body of high-quality data in the public domain, the authors suggest that neither the volume of data nor the computational platform used to present it can in itself transform drug discovery output.&lt;br /&gt;&lt;br /&gt;Public-domain services have matured. Historically, issues with the robustness of public-domain systems, including stability, accessibility, backwards compatibility and the lack of structured release updates, prevented the adoption of external tools within the industry. However, the public domain has matured and its user base has been firmly established. More stable and reliable services that meet all the needs of industry have become available. 
Some of the public resources that are very useful nowadays are PubChem, DrugBank, ChEBI and ChemBank.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The constraints of target tractability, validation studies, safety, efficacy, drug formulation and delivery still reside with industry, and the availability of information of this nature through public databases is presently limited.&lt;br /&gt;&lt;br /&gt;The authors describe the current pre-competitive public-private collaborations: the European Bioinformatics Institute (EBI), the Predictive Safety Testing Consortium, the Life Science Grid (Eli Lilly), Sage Bionetworks, Pistoia and the Innovative Medicines Initiative. The authors also discuss the tools and infrastructure that public-private partnerships need for drug discovery. Industry input could include providing data, supporting existing standards, defining needs and architectures, contributing algorithms, developing tools and enhancing existing approaches.&lt;br /&gt;&lt;br /&gt;Finally, the authors stress the issues that must be addressed to ensure successful pre-competitive collaborations, i.e. funding, effective coordination of international affairs, legal and ownership issues, organization, the nature of the collaboration and the technology to be used, as well as determining whether pre-competitive collaboration is suitable for the companies.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-340641732362767398?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/340641732362767398/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=340641732362767398' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/37358649/posts/default/340641732362767398'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/340641732362767398'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/lowering-industry-firewalls-pre.html' title='Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery'/><author><name>Hari</name><uri>http://www.blogger.com/profile/06039523722658462025</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-1436596869208652957</id><published>2009-09-18T12:49:00.001-07:00</published><updated>2009-09-18T18:24:03.676-07:00</updated><title type='text'>Multiple Classifier Integration for the Prediction of Protein Structure Class</title><content type='html'>This paper presents the authors' efforts to integrate multiple supervised classifiers for a single classification task: the prediction of protein structural classes from the protein sequence. Compared with using a single classification algorithm, the integrated methods achieved a large improvement in accuracy.&lt;br /&gt;&lt;br /&gt;Basically, the integrated approach combines the results of individual classifiers to make the prediction, using some voting scheme. The authors discuss four methods of integration: Simple Majority Voting System (SMVS), Weighted Majority Voting System (WMVS), Simple Majority Voting System with Algorithm Selection (SMVS_AS), and Weighted Majority Voting System with Algorithm Selection (WMVS_AS). SMVS simply counts the vote from each algorithm for a piece of data, and each algorithm is treated equally. WMVS takes the performance of the algorithms into account and assigns weights to the votes. These two methods use all available algorithms. 
The last two methods, SMVS_AS and WMVS_AS, use the same voting methods. However, they first filter the algorithms to get rid of redundant ones: those based on similar methods or producing similar results. Both SMVS and WMVS produced better results than the best single classification algorithm.&lt;br /&gt;&lt;br /&gt;The method used to select the algorithms is called mRMR (minimum Redundancy Maximum Relevance). This method was originally developed for feature selection. As its name suggests, it tries to find the set of features that have minimal redundancy among themselves and maximal relevance to the classification variables. It uses mutual information to estimate the relevance between features or between a feature and the classification variables. In this paper, mRMR is used only to rank the algorithms; the authors then choose the subset of algorithms that produces the best accuracy. The authors started from all 34 classifiers in Weka, and with algorithm selection only 11 are used. WMVS_AS achieves the best accuracy, which is slightly better than that of SMVS_AS and noticeably better than that of the two methods without algorithm selection.&lt;br /&gt;&lt;br /&gt;This paper shows that for problems where it is hard to reach high accuracy with only one algorithm, combining the results of multiple algorithms can be a good option. However, I'd assume that efficiency and resources could be concerns when using this method to solve some problems. 
The paper does not discuss this aspect, though.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-1436596869208652957?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/1436596869208652957/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=1436596869208652957' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1436596869208652957'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1436596869208652957'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/multiple-classifier-integration-for.html' title='Multiple Classifier Integration for the Prediction of Protein Structure Class'/><author><name>djiao</name><uri>http://www.blogger.com/profile/06486651456900333839</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-1419796841265538153</id><published>2009-09-18T01:21:00.000-07:00</published><updated>2009-09-18T01:26:24.871-07:00</updated><title type='text'>Predicting Multiple Ligand Binding Modes Using Self-Consistent Pharmacophore</title><content type='html'>&lt;div align="justify"&gt;&lt;/div&gt;&lt;p align="justify"&gt;by Izhar Wallach and Ryan Lilien&lt;/p&gt;&lt;p align="justify"&gt;          This paper presents a step towards improving the prediction of protein-ligand binding modes for a set of ligands known to interact with a common target protein.  
The authors also suggest simultaneously generating a pharmacophoric map, by first generating and ranking possible binding modes using virtual docking and then identifying a set of poses maximally consistent with their inferred underlying pharmacophoric models.&lt;br /&gt;          Their method uses the spatial arrangement of the chemical groups of the ligands to first generate a pharmacophoric map and then discriminate between candidate binding modes.  The work is divided into two main parts: “Data Set Generation” and “The binding mode prediction algorithm” (the main focus of the paper). For data set generation, they generated a large testing set and performed a range of experiments with varying assumptions and quality of the candidate binding modes. The sets of ligands capable of binding the target protein, as well as the correct binding mode for each compound, were identified using three main criteria: first, the bound ligand has more than six non-hydrogen atoms; second, the ligand binds the protein at the binding site defined by the reference complex (i.e. within 5 Å of a reference ligand’s atom); and third, the protein structure can be aligned to the target protein with an RMSD &lt; 1 Å for non-hydrogen atoms.&lt;br /&gt;            In binding mode prediction, the algorithm receives as input a set of ligands and their ranked lists of candidate binding modes.  The authors define a configuration to be a set of poses, each pose belonging to a different ligand. Predicting the binding mode for a set of ligands corresponds to finding a configuration in which all poses are near-native. The RMSD of a pose is defined with respect to its native binding mode, whereas the RMSD of a configuration is the average RMSD of the configuration’s poses. They then performed chemical group analysis using 47 chemical groups and their function types. They generated two 3D maps containing the location and type of the functional groups of each ligand, and then clustered and scored them. 
Using the virtual docking score, they assigned each pose a probability of correctness, with the probability of selecting a binding mode defined by the Boltzmann distribution. They also used Gibbs sampling to search for new configurations with an optimal score.&lt;br /&gt;            They used four protein target systems: thrombin, cyclin-dependent kinase 2 (CDK2), dihydrofolate reductase (DHFR) and HIV-1 protease (HIV-1P). Using the FlexX docking package, they generated ranked lists of candidate binding modes. They evaluated the configuration scoring function using sigmoid and Euclidean similarity measures under varied sets of input ligands, changing three parameters that can influence the relationship between a configuration’s score and its RMSD: (i) the number of ligands in the docking set, which affects the number of functional group clusters and their sizes, (ii) the similarity of the ligands, which affects the distribution of chemical groups in space and thus the compactness of the clusters, and (iii) the functional group similarity function, which affects the distribution of the clusters in space. They then ran 50 random prediction experiments for each parameter setting using the two similarity functions and varied sizes of the input sets and lengths of the ranked lists.&lt;br /&gt;              The authors conclude that they have presented a novel algorithm that improves the binding mode predictions of a docking algorithm for the case of multiple ligands known to bind a common target. The consistency was scored using an objective function that considers the alignment of similar chemical groups. Their objective function correlates well with the RMSD. The authors then demonstrated the expected performance of the algorithm over different sets of input ligands varied by the size, similarity and sparsity of their ranked lists. 
This algorithm seems to be of use to the pharmaceutical industry in the structural analysis of molecular hits resulting from High Throughput Screening (HTS) and Structure Activity Relationship (SAR) experiments. &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-1419796841265538153?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/1419796841265538153/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=1419796841265538153' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1419796841265538153'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1419796841265538153'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/predicting-multiple-ligand-binding.html' title='Predicting Multiple Ligand Binding Modes Using Self-Consistent Pharmacophore'/><author><name>Rohan Patil</name><uri>http://www.blogger.com/profile/15299634532963258412</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-KgOOWFrKiIA/TWfZi2IQXWI/AAAAAAAACXI/sgFoH2sDGsM/s220/rspatil_l.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2418391964537238259</id><published>2009-09-16T08:36:00.000-07:00</published><updated>2009-09-16T08:37:21.034-07:00</updated><title type='text'>Review of Virtual Screening of Abl Inhibitors from Large Compound Libraries by Support Vector Machines</title><content type='html'>Authors: X. H. Liu, X. H. Ma, C. Y. Tan, Y. Y. Jiang, M. L. Go, B. C. 
Low&lt;a href="http://pubs.acs.org/doi/full/10.1021/ci900135u#afn4"&gt;&lt;/a&gt; and Y. Z. Chen&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This paper discusses support vector machines (SVM) as a virtual screening tool for searching for Abl inhibitors in large compound libraries. SVM, a ligand-based virtual screening (VS) method, is explored with the aim of achieving high yields and low false-hit rates in searching large compound libraries for active agents of single and multiple mechanisms.&lt;br /&gt;&lt;br /&gt;SVM classifies active compounds based on differentiating physicochemical profiles between active and inactive compounds rather than on structural similarity to active compounds per se, which has the advantage of not relying on the accurate computation of structural flexibility, activity-related features, binding affinity and solvation effects. &lt;br /&gt;&lt;br /&gt;The authors developed an SVM VS model for identifying Abl inhibitors and evaluated its performance with a 5-fold cross-validation test and a large compound database screening test. Virtual-hit and false-hit rates in searching large libraries were evaluated using PubChem and MDDR compounds similar in structural and physicochemical properties to the known Abl inhibitors. The VS performance of SVM was also compared with that of two similarity-based VS methods, Tanimoto similarity searching and k-nearest neighbor (KNN), and with an alternative machine learning method, PNN. SVM performed well on most classification and regression tasks. Note, however, that the other methods also performed very well.&lt;br /&gt;&lt;br /&gt;The authors also compared SVM VS with various other VS tools in identifying the Abl inhibitors reported in MDDR and PubChem, and the results were comparable. 
However, a direct comparison of the reported performances of these VS tools is inappropriate because of differences in the type, composition and diversity of the compounds screened and in the molecular descriptors, VS tools and parameters used. The performance of SVM is primarily due to the SVM classification models rather than to the molecular descriptors used.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2418391964537238259?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2418391964537238259/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2418391964537238259' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2418391964537238259'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2418391964537238259'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/review-of-virtual-screening-of-abl_16.html' title='Review of Virtual Screening of Abl Inhibitors from Large Compound Libraries by Support Vector Machines'/><author><name>Hari</name><uri>http://www.blogger.com/profile/06039523722658462025</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-7057879267208445019</id><published>2009-09-16T08:10:00.000-07:00</published><updated>2009-09-16T08:10:08.481-07:00</updated><title type='text'>Predicting pKa</title><content type='html'>by Adam C. Lee and Gordon M. 
Crippen, in JCIM 2009.&lt;br /&gt;&lt;br /&gt;In this paper, the authors review available pKa prediction methods and data sources. They also discuss several concerns in pKa prediction, such as the scope, statistical validity, predictive power and accuracy of a method.&lt;br /&gt;&lt;br /&gt;The pKa is one of the most important physicochemical properties of small molecules and macromolecules, since it governs how molecules behave in the different environments of the human gastrointestinal tract, where the pH ranges from 1 to 8.  The main advantage of prediction is that a pKa can be obtained without physical samples, which benefits the understanding of the pharmacokinetics of pharmaceutical drugs and the properties of bio-organic molecules. However, predicting pKa is still a challenging topic in predictive model building because of the gap between macro-pKa and micro-pKas.  For instance, if a molecule has n protonatable sites, it can have 2&lt;sup&gt;n&lt;/sup&gt; microspecies, whose protonation equilibria are described by micro-pKas.  Although the macro-pKa is a macroscopic property of a molecule that combines its micro-pKas, the micro-pKas cannot be neglected at some concentrations.&lt;br /&gt;&lt;br /&gt;The authors divide the scope of pKa prediction into two categories: globular proteins and small molecules.  For proteins, most experimental data are obtained from NMR chemical shifts.  The protein pKa database (PD) is currently available and contains 1400 data points for ionizable amino acid side chains such as Asp, Glu and His.  The predictive methods for proteins are categorized as "the null model", "Electrostatic model" and "Empirical model". The authors mention that "the null model" could be a good starting point for prediction models.  The "Electrostatic model" is the best-known method and is based on solving the linearized Poisson-Boltzmann (PB) equation.  The PB equation is solved using partial atomic charges from a force field. 
The quality of an "Empirical model" depends on how diverse the training set is and on using as few parameters as possible.&lt;br /&gt;&lt;br /&gt;Predicting the pKa of small molecules is substantially different from predicting that of proteins. Small molecules have fewer micro-pKas than proteins, which might seem to make prediction simpler, but they span a huge variety of chemical structures.  Also, predicting the site-specific pKas of a molecule is important, since it gives a more realistic picture of the molecule's pharmacokinetic behavior. The availability of experimental data is limited because the reliability of data from the literature or from 'in house' sources is questionable. The Beilstein database is accessible but commercial, and Lange's Handbook of Chemistry also contains experimental pKa data. The authors also mention the differences among several experimental methods for measuring the pKa of small molecules. As for predictive models, there are three approaches: "linear free energy relationships", "quantitative structure-property relationship (QSPR)", and "quantum mechanical and continuum electrostatic" methods. Among these, QSPR is the most common technique. The authors also discuss database methods, neural networks and decision trees for predictive models.&lt;br /&gt;&lt;br /&gt;Lastly, the authors compare the available software for pKa prediction. However, they argue that there are no reliable benchmarks for pKa prediction, and that not even a standard training data set exists. The authors propose some guidelines for building predictive models: 1) the chemical diversity of the training data set, 2) the range of experimental values in the external test set, 3) the size of the test set, and 4) the lessons learned from the outliers (poor predictions).&lt;br /&gt;&lt;br /&gt;In conclusion, once a predictive model is built, pKa can be calculated in silico without synthesizing the compounds or running pKa experiments.
However, since neither a comprehensive database nor molecular information from predictive software is available, there is a need for consensus evaluation of predictive models. Moreover, obtaining appropriate data and freely available public information is still the initial challenge.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-7057879267208445019?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/7057879267208445019/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=7057879267208445019' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/7057879267208445019'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/7057879267208445019'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/predicting-pka.html' title='Predicting pKa'/><author><name>JaeHong Shin</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='31' height='32' src='http://3.bp.blogspot.com/_IJCpekjdhLQ/SwQl3BHImII/AAAAAAAADJQ/gDR7Zr_kiTo/s1600-R/profile_picture-full%3Binit:.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2613960226800658732</id><published>2009-09-15T20:09:00.000-07:00</published><updated>2009-09-15T21:00:44.296-07:00</updated><title type='text'>Similarity of Protein-RNA Interfaces Based on Motif Analysis</title><content type='html'>The authors of this paper propose a new method for measuring the similarity of protein-RNA interfaces - SIMA (Similarity by Identity and Motif Alignment).
The method combines the similarity of the interactions between the interacting elements (amino acids and nucleotides) with the sequence similarity of the nucleotides and the protein. It differs from previous methods in how it represents the geometric interactions. It reduces the properties of these interactions (such as aromatic stack, nonaromatic stack, hydrophobicity, H-bond, double H-bond, and vdW contact) to symbolic representations, or motifs, which are translated into single-digit codes. This way, the similarity of the interactions is measured by an alignment score of the motifs, and the similarity of the protein-RNA interfaces is based on the optimal alignments of the motifs and the nucleotide sequences. &lt;br /&gt;&lt;br /&gt;There are two issues I found worth mentioning. First, the scoring method in this paper doesn't have a strong theoretical basis and, as the authors said, "the calibration of the degree of similarity (based on the score) will require further exploration of the scoring system". They claim that the score can give a binary result: whether interfaces are similar or not. However, I feel this statement needs further validation. That leads to the second issue: the paper lacks validation of the method. The authors did give two case studies to show the ability of the SIMA method. However, that is not enough to support its validity. A comparison of the results from this method with those of previously established methods or with curated data would be more powerful proof.&lt;br /&gt;&lt;br /&gt;As the authors suggested, this method has a lot of potential. It can be used to compare complex protein-RNA interfaces, or to view similarity across multiple interfaces. It may be used in interface classification, and be applied in evolutionary analysis or the detection of important protein-RNA interactions.
To me, the most interesting part of this paper is that the reduction of molecular level interactions to symbolic representations can still successfully be used in the measurement of the similarities of the protein-RNA interfaces.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2613960226800658732?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2613960226800658732/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2613960226800658732' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2613960226800658732'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2613960226800658732'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/similarity-of-protein-rna-interfaces.html' title='Similarity of Protein-RNA Interfaces Based on Motif Analysis'/><author><name>djiao</name><uri>http://www.blogger.com/profile/06486651456900333839</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-5005140593263781603</id><published>2009-09-15T10:38:00.000-07:00</published><updated>2009-09-15T10:43:10.247-07:00</updated><title type='text'>How do Proteins Bind Carbon Dioxide?</title><content type='html'>Authors: Thomas R. Cundari, Angela K. Wilson, Michael L. Drummond, Hector Emanuel Gonzalez, Kameron R. Jorgensen, Stacy Payne, Jordan Braunfeld, Margarita De Jesus and Vanessa M. 
Johnson&lt;br /&gt;     &lt;br /&gt;&lt;br /&gt;       The purpose of this article, “How do Proteins Bind Carbon Dioxide?”, is to find directions for improved utilization, increased mitigation, and enhanced sequestration of this greenhouse gas. The authors have embarked on a study of CO2 binding in proteins using first-principles methods and bioinformatics. They seek an answer not for a specific enzyme but rather across as wide a part of the proteome as possible. The methods and data that the authors use include: 1) Molecular Mechanics (MM), 2) Density Functional Tight Binding (DFTB), 3) Density Functional Theory (DFT), 4) Correlation Consistent Composite Approach (ccCA), 5) Ligand Environment Analysis for CO2, 6) (MSD) Motif Database, 7) pI and Hydropathy Calculations, 8) PDB codes for the proteins included in this study and 9) Normalized Temperature Factors.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;According to a bioinformatics analysis of CO2 in proteins, the most basic amino acids, arginine, lysine and histidine, are those most commonly found in CO2 protein binding sites. The calculated isoelectric point (pI) highlights the importance of acid/base chemistry in the binding of CO2 in proteins. In addition, inspection of various CO2 binding pockets using tools such as LPC20 and Ligand Explorer implies that either or both oxygens of CO2 are generally tightly bound in the protein environment. CO2 binding in proteins was also analyzed in terms of the simple secondary structural elements (SSEs) β-sheet and α-helix, showing a marked preference for β-sheets over α-helices in protein-CO2 binding sites. The authors have considered four hypotheses to explain the preferential binding of CO2 to β-sheets over α-helices. First, β-sheets simply may be more prevalent in the ~20 CO2 binding proteins. Second, it is known that certain amino acids show a propensity for either α-helix or β-sheet formation. A third possibility is that the SSE preference might correspond to a marked difference in flexibility.
A final hypothesis concerns differing availabilities of hydrogen-bond donors and acceptors. The authors have also applied a parameter-free ab initio technique, the correlation consistent Composite Approach (ccCA), to quantify the thermodynamics of interaction between CO2 and the amino acid side chain motifs deemed significant in protein binding of CO2. The results indicate that the side chains of arginine and histidine bind more strongly to CO2 than do other side chain mimics, e.g., methane (Ala), phenol (Tyr), or benzene (Phe).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;It is concluded that acid/base interaction is the dominant chemical force by which proteins bind CO2. Also, of the two types of regular secondary structural elements, β-sheets are more amenable to CO2 binding than α-helices. The data support the inference that either or both oxygens of CO2 are generally tightly bound in the protein. The authors believe that methods appropriate for describing protein-CO2 interactions may thus open up new vistas for the design and analysis of protein-CO2 interactions via computational chemistry/biology.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-5005140593263781603?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/5005140593263781603/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=5005140593263781603' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5005140593263781603'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5005140593263781603'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/how-do-proteins-bind-carbon-dioxide.html'
title='How do Proteins Bind Carbon Dioxide?'/><author><name>Kuochung Peng</name><uri>http://www.blogger.com/profile/04963167070928354912</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-6917546440960385569</id><published>2009-09-15T08:21:00.000-07:00</published><updated>2009-09-15T08:43:54.860-07:00</updated><title type='text'>Review of Virtual Screening of Abl Inhibitors from Large Compound Libraries by Support Vector Machines</title><content type='html'>&lt;div align="justify"&gt;Authors: X. H. Liu, X. H. Ma, C. Y. Tan, Y. Y. Jiang, M. L. Go, B. C. Low&lt;a class="ref" href="http://pubs.acs.org/doi/full/10.1021/ci900135u#afn4"&gt;&lt;/a&gt; and Y. Z. Chen&lt;/div&gt;&lt;div align="justify"&gt;&lt;a href="http://pubs.acs.org/doi/full/10.1021/ci900135u"&gt;full article&lt;/a&gt;&lt;/div&gt;&lt;div align="justify"&gt;&lt;/div&gt;&lt;div align="justify"&gt;This paper is from &lt;a href="http://bidd.nus.edu.sg/group/research.htm"&gt;YZ Chen’s &lt;/a&gt;group, which has extensive experience in using machine learning methods, particularly SVM, for virtual screening. They have applied SVM to various datasets, resulting in many papers (e.g. &lt;a href="http://dx.doi.org/10.1016/j.jmgm.2007.03.003"&gt;Xa inhibitor&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1002/prot.20605"&gt;transporter&lt;/a&gt;). This is a typical paper of theirs, studying Abl, which is an interesting target in leukemia. As SVM itself is not a new method, their focus is on data preparation and validation. The results demonstrated that SVM has high yields and low false hit rates.
It is a nice paper for those who are interested in QSAR and virtual screening; here I would only like to highlight some of the issues.&lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;My first concern is the problem of adding other unproven compounds as inactives. As reported inactives are rare, it is fine to add putative noninhibitors. However, the distance between the active set and the inactive set, as well as the diversity of the active set, should be carefully analyzed. The paper does mention that possible false negatives won’t result in bias, but it never discusses how different the inactive set is from the active set. It’s highly possible that the high precision results from the active set being quite different from the inactive set rather than from the method’s ability to represent the structure-activity relation. The very low FP (only 1) might support this point.&lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;The virtual screening result, in which the yield is 50%, further increases my doubt. A good model should perform well on new data, but in this paper the number of true inhibitors identified by SVM is actually much lower than by similarity search. Does it mean SVM performs poorly? The paper further argues that SVM has a low false hit rate, which is also quite questionable. The paper assumes that very few undiscovered inhibitors are in PubChem, so if a method identifies quite few inhibitors from PubChem, it is considered to have a low false hit rate. I don’t think this is reliable, and it is not fair to compare it with similarity search on this basis.&lt;/div&gt;&lt;div align="justify"&gt; &lt;/div&gt;&lt;div align="justify"&gt;&lt;/div&gt;&lt;div align="justify"&gt;I do believe the paper attempted to address these issues (e.g., the artificial enrichment they mentioned), but it is still unclear to me after reading it a couple of times.
I think these are also general problems in QSAR, since the chemical space is hard to interpret.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-6917546440960385569?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/6917546440960385569/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=6917546440960385569' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6917546440960385569'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6917546440960385569'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/review-of-virtual-screening-of-abl.html' title='Review of Virtual Screening of Abl Inhibitors from Large Compound Libraries by Support Vector Machines'/><author><name>Bin Chen</name><uri>http://www.blogger.com/profile/03736801389411661562</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-3009974085056382334</id><published>2009-09-14T05:55:00.000-07:00</published><updated>2009-09-14T08:00:17.806-07:00</updated><title type='text'>Benchmark Data Set for in Silico Prediction of Ames Mutagenicity</title><content type='html'>&lt;div align="justify"&gt;&lt;span style="font-size:85%;"&gt;&lt;strong&gt;AUTHOR: Katja Hansen, Sebastian Mika, Timon Schroeter, Andreas Sutter, Antonius ter Laak, Thomas Steger-Hartmann, Nikolaus Heinrich, and Klaus-Robert Mueller&lt;/strong&gt;&lt;br /&gt;University of Technology, Berlin, Germany, idalab
GmbH, Berlin, Germany, and Bayer Schering Pharma AG, Berlin, Germany&lt;br /&gt;&lt;strong&gt;SOURCE: J. Chem. Inf. Model., ASAP, DOI: 10.1021/ci900161g&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The bacterial reverse mutation assay, or Ames test, is an important early alert system for potential carcinogenicity and/or teratogenicity in drug discovery and development. The Ames test detects frame-shift mutations or base pair substitutions by exposing histidine-dependent Salmonella strains to the test compound. A reverse mutation induced by a mutagen restores histidine synthesis and enables the growth of bacterial colonies (“revertants”) on a histidine-deficient medium, which marks the compound as Ames positive.&lt;br /&gt;Although commercial tools (DEREK, MultiCASE) for the prediction of Ames mutagenicity are available and several approaches have been described in the literature, the lack of a large, clearly defined benchmark data set does not allow a reasonable comparison of different models. In this paper, the authors propose a benchmark set of 6512 compounds, collected from different public sources and made available for future comparisons. This set was annotated with CAS numbers and WDI names, and cleaned of contradicting results with DEREK and MultiCASE data. All data sources used classification rules according to international standards, but did not indicate experimental details. &lt;/span&gt;&lt;/div&gt;&lt;div align="justify"&gt;&lt;span style="font-size:85%;"&gt;The potential of an adaptable, well-performing and technically feasible prediction model for mutagenicity was evaluated by comparing 4 non-commercial machine learning techniques with 3 commercial products (DEREK, MultiCASE and PipelinePilot). Molecular descriptors, ranging from constitutional, topological and fragment-based descriptors to molecular properties, were selected from DragonX, using 3D structures.
Different machine learning techniques, such as Support Vector Machines, Gaussian Process (probability of a compound being a mutagen), Random Forest (collection of decision trees) and k-nearest Neighbor (simplest model), were applied. Although one of the commercial tools, PipelinePilot, applies Bayesian learning, the DEREK and MultiCASE tools are based on fixed, static training sets for the generation of rules or QSAR models. The different models were evaluated in a five-fold cross-validation setting. The quality of the models was assessed by plotting the true positive rate against the false positive rate (a ROC curve).&lt;br /&gt;Overall, the commercial tools were outperformed by the machine learning models. This was especially the case for DEREK and MultiCASE, as they could not take advantage of the information-rich data set, leaving DEREK with the lowest selectivity and specificity. The strong performance of the machine learning algorithms indicates the power of this approach, as well as the high information content of the collected benchmark set. Although the simple k-nearest neighbor model yielded good results, indicating a strong influence of small local molecular changes on mutagenicity, more sophisticated methods resulted in a significant performance gain.&lt;br /&gt;In conclusion, the 5 machine learning methods described in the paper yield good results on the collected benchmark set. The authors propose that this publicly available data set be used as a benchmark for the optimization of future models, since it allows direct comparison of different methods. This will help to establish an adaptable and technically practical prediction model that statistically outperforms commercial tools, even though DEREK and MultiCASE remain justified as a way to obtain interpretable structural information on mutagenicity. 
&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-3009974085056382334?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/3009974085056382334/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=3009974085056382334' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3009974085056382334'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3009974085056382334'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2009/09/benchmark-data-set-for-in-silico.html' title='Benchmark Data Set for in Silico Prediction of Ames Mutagenicity'/><author><name>Stefan Furrer</name><uri>http://www.blogger.com/profile/05065895922703854994</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-7894529957028694338</id><published>2008-11-12T19:37:00.000-08:00</published><updated>2008-11-17T11:53:37.826-08:00</updated><title type='text'>Public chemical compound databases</title><content type='html'>&lt;b&gt;AUTHOR: &lt;span style="font-size:+0;"&gt;&lt;/span&gt;Williams, Antony J.&lt;?xml:namespace prefix = o /&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/b&gt; &lt;p class="MsoNormal"&gt;&lt;b&gt;SOURCE: &lt;span style="font-size:+0;"&gt;&lt;/span&gt;CURRENT OPINION IN DRUG DISCOVERY &amp;amp; DEVELOPMENT, (MAY 2008) &lt;o:p&gt;&lt;/o:p&gt;&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;b&gt;Vol. 11, No. 3, pp. 
393-404.&lt;/b&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;&lt;br /&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;This review summarizes several Open Access and Free Access chemical databases available on the Internet today.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;It also goes in to some detail about a paradigm shift that is occurring from Information professionals conducting chemical and literature searches for chemists using expensive commercial databases to chemists doing their own searches using free databases on the Internet.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;The benefits and drawbacks of this shift are also discussed.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;The author acknowledges that current free databases contain many errors and that manually curated commercial sources are currently the “gold standard” for quality of data.&lt;/p&gt;&lt;p class="MsoNormal"&gt;The author points out that the field of bioinformatics is far ahead of cheminformatics in the availability of free resources.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;GenBank and the Protein Data Bank have offered free data to scientists for over 20 years.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;It is suggested that this might be because publishers in the chemical fields have discouraged the open flow of chemical data and information.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;The recent explosion of free chemistry resources may eventually threaten the business models of the commercial &lt;/p&gt;&lt;p class="MsoNormal"&gt;Public chemistry databases can take many forms.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;These include aggregated chemical structures in single files available for download, databases that use the aggregated files to create their own databases, and commercial organizations that manually gather and curate data.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;Some of these commercial entities include Chemical Abstracts Service, InfoChem and 
Symyx.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;There are also a number of databases that offer access to data available from various outside resources via molecular connection tables.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;These are dubbed “linkbases”.&lt;/p&gt;&lt;p class="MsoNormal"&gt;The author goes on to describe the differences between “Open Access” and “Free Access”.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;Basically, Open Access information is freely available and users have the right to read, download, copy, distribute, print, search, link to, index crawl or use for any other lawful purpose.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;Free Access, on the other hand, removes price barriers to information but not permission barriers.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;The differences here are usually of no concern to a scientist accessing information on an as-needed basis.&lt;/p&gt;&lt;p class="MsoNormal"&gt;Proliferation of errors from original public data depositions by free information aggregators such as PubChem, ChemIDPlus and ChemFinder is a fairly large problem.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;Errors include problems with structure identifier pairs and inaccurate structure representations.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;Researchers have cautioned “users beware” when using information from these free resources.&lt;/p&gt;&lt;p class="MsoNormal"&gt;Reviews of six different public compound databases follow.&lt;/p&gt;&lt;p class="MsoNormal"&gt;PubChem is the highest profile online resource, offering three databases: Compound, Substance and Bio-Assay.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;Dubbed the “granddaddy of all free chemistry databases”, PubChem is large but contains many errors and users should be wary.&lt;/p&gt;&lt;p class="MsoNormal"&gt;The EPA Distributed Structure-Searchable Toxicity (DSSTox) database provides documented, standardized and structure-annotated files of
toxicity information.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;The DSSTox datasets are highly curated and of high quality.&lt;/p&gt;&lt;p class="MsoNormal"&gt;eMolecules primarily provides paths to vendors of particular chemical compounds though it was recently enhanced to also provide access to NMR, MS and IR spectra.&lt;/p&gt;&lt;p class="MsoNormal"&gt;DrugBank is another curated resource assembled from aggregated freely available data but is enhanced with data generated within the laboratories of the host organizations.&lt;/p&gt;&lt;p class="MsoNormal"&gt;NMRShiftDB is an open source collection of chemical structures and associated NMR shift assignments.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;The number of compounds represented is comparably small to the other databases.&lt;/p&gt;&lt;p class="MsoNormal"&gt;ChemSpider contains over 18 million unique chemical structures gathered from chemical vendors, commercial databases, members of the Open Notebook Science community and has also integrated the SureChem patent database.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;Structure/substructure searching is possible and virtual screening using the LASSO similarity search tool has been added. 
The author suggests that the focus of community “crowd source” or “wiki-like” curation in ChemSpider is the wave of the future to increase quality of data in free resources.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;Wikis, Weblogs and the Internet as a public compound database are also discussed.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;Blogs and wikis can contain much useful information but are crippled in that they are not searchable by structure.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;Use of InChI strings and InChIKeys can help with this problem but adoption and implementation of InChIKeys has been slow to materialize.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;The Royal Society of Chemistry has started to embed InChIs in their articles and some Bloggers have started to include InChIs in their postings.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;Some Wikis are also InChI enabled.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;The value of InChIs however is still in its infancy.&lt;/p&gt;&lt;p class="MsoNormal"&gt;In conclusion the author suggests that the new found availability of public chemical compound databases with their associated data is enabling scientists to access information both cheaply and in a time effective manner.&lt;/p&gt;&lt;p class="MsoNormal"&gt;Commercial chemical database providers might experience some challenging times as the proliferation of freely available chemistry resources increases in both quantity and quality.&lt;span style="font-size:+0;"&gt; &lt;/span&gt;The shift toward a more open community for science is ever expanding.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-7894529957028694338?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://cheminfoclub.blogspot.com/feeds/7894529957028694338/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=7894529957028694338' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/7894529957028694338'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/7894529957028694338'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2008/11/public-chemical-compound-databases.html' title='Public chemical compound databases'/><author><name>Paul Schulwitz</name><uri>http://www.blogger.com/profile/10580572258632254784</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://2.bp.blogspot.com/_t2eBBd4xpLM/SRpK8x3CoBI/AAAAAAAAAAM/OibrxThH738/S220/100_2939.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2072432908048983090</id><published>2008-11-12T10:48:00.000-08:00</published><updated>2008-11-12T11:05:51.815-08:00</updated><title type='text'>Review of “Automating Laboratory Operations by Integrating LIMS with Analytic Instruments and SDMS” by Jianyong Zhu</title><content type='html'>This &lt;a href="http://idea.iupui.edu/dspace/bitstream/1805/323/1/Thesis_Jay%20Zhu_Final.pdf"&gt;paper&lt;/a&gt; presents the result of the design of a product named "LabTechie Total Instrument Interfacing Solutions." 
An SDMS (Scientific Data Management System) is used for analyzing scientific data, and the author focuses on the creation of "LabTechie" as a middle layer that communicates with LIMS, instruments and SDMS, with the aim of increasing efficiency and reducing costs.&lt;br /&gt;&lt;br /&gt;Most general purpose LIMS (Laboratory Information Management Systems) lack the ability to automatically collect data from analytical instruments, so this project focuses on the design and implementation of the middle layer to capture instrument data and information.&lt;br /&gt;&lt;br /&gt;The conventional method of creating drivers between LIMS and instruments is mentioned, and the author emphasizes the failure to make generic drivers. Raw data often exists, and typical LIMS-to-instrument drivers don’t perform any analysis of it. These points are used to argue for the creation of the middle layer to be used for data collection, communication and storage of results.&lt;br /&gt;&lt;br /&gt;The author also makes an argument that XML can be used for long-term archiving to relieve many of the problems presented by drivers. Making the XML schema or DTD (Data Type Definitions) widely available is argued to be a means of ensuring correctness. Although creating or using such standards is a necessary step in moving the product closer to the goal of platform independence and data preservation, it does not necessarily ensure ease of use for future generations, because other companies can create many competing schemas. The author presents the benefits of the security features available through XML, but also concludes that XML has data storage and communication problems because of the size of XML files.&lt;br /&gt;&lt;br /&gt;To reduce complications, the middle layer described is customized to read data from one instrument, in this example a CDS (Chromatography Data System).
The middle layer can also write sample data to the CDS, but an operator is needed to run it. After reading data, the middle layer stores the original instrument files, displays them in a spreadsheet format, and saves the information in XML. It also archives SDMS results and makes everything available to the LIMS. I think a CDS is used as the example instrument in this thesis because it is a typical complex instrument in a chemistry laboratory. &lt;br /&gt;&lt;br /&gt;The paper concludes that a middle layer is possible, but takes a lot of resources and effort to make work. The conclusion also states that it is only efficient for instruments with similar data formats. However, I think that, working within the constraints of physical instruments, this paper and project present a good point of abstraction for dynamic functionality, even though LIMS companies often end up competing by making their own drivers for instrument communication. Staying on top of the different instruments and integrating different levels of communication with key pieces of software would therefore be essential for such a product to succeed. Each LIMS product would also need configuration options or drivers to communicate with the middle layer. We do find, however, that software such as Labware LIMS lets users build what is classified here as a middle layer inside the LIMS itself. &lt;br /&gt;&lt;br /&gt;I also think the user interface and design were overcomplicated and perhaps should have focused only on management of automated processes. 
With a better user interface, and in a market where competing vendors change constantly, this product could be marketable because it could let clients change their vendors and applications for LIMS, SDMS or instruments more easily.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2072432908048983090?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2072432908048983090/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2072432908048983090' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2072432908048983090'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2072432908048983090'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2008/11/review-of-automating-laboratory.html' title='Review of “Automating Laboratory Operations by Integrating LIMS with Analytic Instruments and SDMS” by Jianyong Zhu'/><author><name>Adam</name><uri>http://www.blogger.com/profile/12069560580951612376</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-204555640944355089</id><published>2008-11-12T08:34:00.000-08:00</published><updated>2008-11-12T08:46:53.801-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ligand similarity'/><category scheme='http://www.blogger.com/atom/ns#' term='pharmacological network'/><category scheme='http://www.blogger.com/atom/ns#' 
term='cheminformatics'/><title type='text'>“Quantifying the Relationships among Drug Classes”</title><content type='html'>&lt;div  style="text-align: justify; font-family: times new roman;font-family:times new roman;"&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size:70%;"&gt;A Synopsis of “Quantifying the Relationships among Drug Classes”&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span style="font-size:100%;"&gt;Jérôme Hert, Michael J. Keiser, John J. Irwin, Tudor I. Oprea, and Brian K. Shoichet&lt;br /&gt;&lt;/span&gt;&lt;span style="font-size:100%;"&gt;Journal of Chemical Information and Modeling 2008, 48, 755-765&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;span style="font-size:115%;"&gt;&lt;br /&gt;This paper attempts to quantify the differences between Bioinformatics- and Cheminformatics-based networks, using similarity measurements to investigate the relationships among pharmacological targets. Bioinformatics networks are obtained from sequence identities using PSI-BLAST, whereas Cheminformatics networks are obtained from ligand-set similarities using the Similarity Ensemble Approach (SEA) and Bayesian statistics.&lt;br /&gt;&lt;br /&gt;Similar chemical structures tend to have similar activities. In other words, similar molecular structures bind to similar target proteins. Although many Bioinformatics approaches have tried to build protein networks from sequence and structural information, these networks turn out to be of limited use for quantifying relationships among targets. For instance, some compounds active on the μ-opioid receptor show similar activity on M3 muscarinic receptors even though the two protein structures are dissimilar. Recently, Cheminformatics networks have been able to predict off-target effects of drugs, polypharmacology, side effects, and drug repositioning opportunities.&lt;br /&gt;&lt;br /&gt;Several tasks are conducted for this research. 
Ligand data sets are created from two different databases: the MDL Drug Data Report (MDDR) and the WOrld of Molecular BioAcTivity (WOMBAT). Seven different fingerprints are used to calculate similarities, and their performance is evaluated: Daylight, Unity, MDL Keys, FCFP_4, ECFP_4, CATS, and FEPOPS. The Tanimoto coefficient is used for the similarity calculation. The Similarity Ensemble Approach (SEA) is used to compare the similarity measurements with the sequence similarity values determined by PSI-BLAST. The SEA score is expressed as an expectation value (E-value) that corresponds to the one from PSI-BLAST. A Bayesian model is also used to quantitatively measure similarities between ligand sets.&lt;br /&gt;&lt;br /&gt;After preparing the data and conducting all possible similarity calculations, the authors measure the correlation between the Bioinformatics and Cheminformatics networks. As expected from previous work, the two show no significant correlation, even though both aim to map pharmacological targets into a formal network according to their similarities. Another similarity metric is calculated with different fingerprints in order to examine the relationships shared among the ligand sets. The authors introduce threshold networks and use them to compare ligand similarity-based Cheminformatics networks with sequence-based Bioinformatics networks. To investigate the stability and effectiveness of each fingerprint and of the SEA and Bayesian methods, 10-fold cross-validation is performed. Lastly, the properties of the threshold networks are examined: Cheminformatics networks are small-world networks with broad-scale properties, while Bioinformatics networks are not.&lt;br /&gt;&lt;br /&gt;In conclusion, this research converges on three key points. First, the similarity-based Cheminformatics networks are substantially different from sequence identity-based Bioinformatics networks.  
Sequence-based similarity is measured across the entire protein, whereas ligand-based similarity is measured locally. For example, GPCRs are conserved in overall sequence and share a common ancestry, but recognize diverse ligand scaffolds.  Second, the Cheminformatics networks represent and identify ligands well.  Third, Cheminformatics networks can represent pharmacological targets using network theory.  Although Cheminformatics networks are built from ligand information alone, the relationships between ligands and their receptors can be mapped well.  Lastly, even though this observation is arguably more relevant from a medicinal chemistry perspective, ligand-based similarity approaches are pharmacologically more informative for representing the associations in target networks than sequence- and structure-based approaches are.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-204555640944355089?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/204555640944355089/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=204555640944355089' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/204555640944355089'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/204555640944355089'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2008/11/quantifying-relationships-among-drug.html' title='“Quantifying the Relationships among Drug Classes”'/><author><name>JaeHong Shin</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='31' height='32' src='http://3.bp.blogspot.com/_IJCpekjdhLQ/SwQl3BHImII/AAAAAAAADJQ/gDR7Zr_kiTo/s1600-R/profile_picture-full%3Binit:.JPG'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-3479769239395962185</id><published>2008-11-12T04:31:00.000-08:00</published><updated>2008-11-12T04:56:59.521-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='I571 - Assignment #3'/><category scheme='http://www.blogger.com/atom/ns#' term='Userscript'/><title type='text'>I571- Userscripts for the Life Sciences</title><content type='html'>Published: 21 December 2007&lt;br /&gt;BMC Bioinformatics 2007, 8:487 doi:10.1186/1471-2105-8-487&lt;br /&gt;&lt;br /&gt;This paper introduces userscripts that integrate information from two or more web resources, with a focus on chemistry- and biology-related resources. It also gives examples that enrich web pages with information from other resources. From the paper I learned what a userscript does and how it lets the user modify the HTML content of a web page on the fly. I also learned about the sources of information used by the userscripts discussed: web databases are the primary source. Userscripts retrieve information from external web resources over HTTP, just as a web browser does. To simplify the process, userscripts often combine XMLHttpRequest, possibly via the Greasemonkey GM_xmlhttpRequest wrapper method, with the JavaScript Object Notation (JSON) format to represent data.&lt;br /&gt;&lt;br /&gt;The paper points out that identifiers simplify the recognition of biologically and chemically relevant information on web pages. Identifiers are used to make connections between databases and to identify a specific entry in a database.&lt;br /&gt;&lt;br /&gt;The paper also introduces microformats. 
Microformats are a lightweight specification that extends HTML to add semantic markup to web pages. However, the identifiers that simplify the recognition of biological and chemical information on web pages may or may not be marked up with microformats.&lt;br /&gt;&lt;br /&gt;The paper also addresses the problem that a userscript will fail if the external resource changes its API or the URL used to access it. To deal with this, each userscript checks once a day for a new version and prompts the user to install it if one is available. As a result, when a userscript is updated to handle a new API or URL, every user quickly gets the latest version.&lt;br /&gt;&lt;br /&gt;The authors created a userscript, ChemGM.user.js, that automatically runs OSCAR on a web page and provides inline hypertext links to PubChem for chemical structure names (including 2D structure depictions generated by another web service and PubChem searches). Using a web service interface, the authors also created a userscript, 3DStructureView.user.js, which lets users see the 3D structures from the authors' database when they visit the PubChem website.&lt;br /&gt;&lt;br /&gt;In conclusion, the paper's main idea is that userscripts are an effective way of integrating biology and chemoinformatics web resources because the scripts are simple and very useful.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-3479769239395962185?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/3479769239395962185/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=3479769239395962185' title='1 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/37358649/posts/default/3479769239395962185'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3479769239395962185'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2008/11/i571-userscripts-for-life-sciences.html' title='I571- Userscripts for the Life Sciences'/><author><name>Hongliu</name><uri>http://www.blogger.com/profile/01406705553116644059</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-771959941097472235</id><published>2008-11-11T21:15:00.000-08:00</published><updated>2008-11-11T21:23:31.785-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Drug Design'/><category scheme='http://www.blogger.com/atom/ns#' term='I571 - Assignment #3'/><title type='text'>ABZYMES  in Drug Design</title><content type='html'>&lt;p class="MsoNormal"&gt;ABZYMES - A Potential Cure for the “Incurable”.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;A review based on the paper &lt;i style=""&gt;“Toward antibody-directed “ABZYME” prodrug therapy, ADAPT: Carbamate prodrug activation by a catalytic antibody and its in vitro application to human tumor cell killing”&lt;/i&gt;&lt;/p&gt;&lt;p class="MsoNormal"&gt;Paul Wentworth, Anita Datta, David Blakey, Tom Boyle, Lynda Partridge, Michael Blackburn.&lt;/p&gt;  &lt;p class="MsoNormal" style="line-height: normal;"&gt;Proceedings of the National Academy of Sciences USA, Vol 93, pp. 
799-803, January 1996, Biochemistry.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;The most important shortcoming of cancer treatments to date has been their lack of site specificity. Chemotherapy has changed considerably and has now reached the stage of site-specific chemotherapy. The unwanted nonspecific toxicity associated with anticancer agents was previously addressed by ADEPT (antibody-directed enzyme prodrug therapy). The two components of such a therapy are:&lt;/p&gt;  &lt;ul&gt;&lt;li&gt;An antibody-enzyme conjugate&lt;/li&gt;&lt;li&gt;An anticancer prodrug with low toxicity&lt;/li&gt;&lt;/ul&gt;    &lt;p class="MsoNormal"&gt;The conjugate is administered first. Because the antibody binds to tumor-associated antigenic determinants, the conjugate accumulates predominantly near the tumor site. The prodrug is administered next. The enzyme acts as the cleavage agent in the cytotoxic process. Because of toxicity limitations, a bacterial enzyme was preferred to a human enzyme, but this proved to be a dose-limiting aspect. 
To overcome this limitation, Abzymes (antibodies with the characteristics of enzymes) were considered instead of enzymes. This drug design takes three aspects into account.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;1. Better performance of the antibody compared with the enzyme.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;2. Minimizing the prodrug alkylation of the Abzymes.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;3. Maximum catalytic rate enhancement.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;A series of immunological and biotechnological techniques was employed for hapten synthesis, hapten density determination, antibody generation, hybridoma generation and antibody isotyping. This was the wet-lab part of the experiment. The informatics part was determining the values for the Lineweaver-Burk plot, a linearized form of the Michaelis-Menten (MM) equation. These data were then used to determine the optimum activity of the Abzymes. Computer modeling shows that k&lt;sub&gt;cat&lt;/sub&gt; values of 1.0 s&lt;sup&gt;-1&lt;/sup&gt; can optimize selectivity and reduce peripheral toxicity.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Alongside chemotherapy, artificially induced apoptosis and similar approaches, Abzymes have been one of the more effective ideas in anticancer drug research. Abzymes are also thought to be useful in HIV treatment research. They are predicted to be able to attack the Achilles heel of HIV infection, which would essentially make treating HIV very simple. 
Abzymes show potential as a cure for most diseases now thought to be incurable.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-771959941097472235?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/771959941097472235/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=771959941097472235' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/771959941097472235'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/771959941097472235'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2008/11/abzymes-in-drug-design.html' title='ABZYMES  in Drug Design'/><author><name>Manasa</name><uri>http://www.blogger.com/profile/02983912282218994944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-8302138269422336682</id><published>2008-11-11T19:30:00.000-08:00</published><updated>2008-11-11T20:03:41.158-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Docking'/><title type='text'>Protein-Ligand Docking against Non-Native Protein Conformers</title><content type='html'>&lt;p  style="text-align: justify;font-family:arial;" class="MsoNormal"&gt;&lt;span style="font-size:100%;"&gt;&lt;span style=";font-family:times new roman;font-size:85%;"  &gt;Marcel L. Verdonk, Paul N. Mortenson, Richard J. Hall, Michael J. Hartshorn, and Christopher W. Murray&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p  style="text-align: justify;font-family:arial;" class="MsoNormal"&gt;&lt;span style="font-size:100%;"&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-style: italic;"&gt;J. Chem. Inf. Model.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoBodyText" style="margin-bottom: 0.0001pt; text-align: justify;"&gt;&lt;span style="font-size:85%;"&gt;This paper deals with building non-native protein conformers and comparing native protein-ligand docking with non-native protein-ligand docking. &lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoBodyText" style="margin-bottom: 0.0001pt; text-align: justify;"&gt;&lt;span style="font-size:85%;"&gt;Advances in techniques such as X-ray crystallography and NMR have helped to increase the number of protein structures in the PDB. Most drug targets are proteins. 
Protein-ligand docking plays a vital role in structure-based drug design and discovery, where it is used in virtual screening and interactive design. A range of docking programs is available to the community, e.g. DOCK, GOLD, FlexX, GLIDE and ICM.&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoBodyText" style="margin-bottom: 0.0001pt; text-align: justify;"&gt;&lt;span style="font-size:85%;"&gt;The performance of a docking program is measured against native protein conformers, but in real-life applications ligands are docked against a non-native conformation of the protein, i.e. the apo structure or the structure of a different protein-ligand complex. The authors had recently constructed the Astex Diverse Set, a new test set containing 85 protein-ligand complexes. This set i) only includes structures of targets that are of drug-discovery or agrochemical interest and have drug-like ligands; ii) is a diverse set of complexes with no target repeated; and iii) contains only high-quality structures.&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoBodyText" style="margin-bottom: 0.0001pt; text-align: justify;"&gt;&lt;span style="font-size:85%;"&gt;The Astex Non-native Set was built on the Astex Diverse Set. In brief, it contains structures of the same targets as the entries in the Astex Diverse Set, but each target structure is either an apo structure or a complex with a different ligand. The non-native docking validation then consisted of docking each ligand in the Astex Diverse Set against all non-native conformers of its target protein. Hence the Astex Non-native Set contains only non-native structures of the targets in the Diverse Set and no new ligands. The advantages of this design are that i) the results for the two sets can be compared directly, because both contain the same targets and ligands, and ii) ligand-induced changes in protein conformation can be studied through docking performance. 
&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoBodyText" style="margin-bottom: 0.0001pt; text-align: justify;"&gt;&lt;span style="font-size:85%;"&gt;To create the Astex Non-native Set, a series of filters was applied to discard unsuitable non-native structures (nns). i) Resolution: only structures determined at a resolution of ≤ 2.5 Å were kept (2340 nns). ii) Superimposition: each structure that passed the resolution filter was superimposed onto the native (reference) structure to obtain a reliable RMSD value. Superimposition was done using KFIT. Only one non-native conformation was kept for each PDB entry (1541 nns). iii) Ligand identification: after superimposition, only ligands in close proximity were considered, and each was classified as solvent, cofactor or ligand depending on its atom types (1287 nns). iv) Structure setup: each non-native structure was then set up for docking the reference ligand, with its binding site prepared in the same way as that of the reference structure. The protocol started from all residues within 6 Å of a heavy atom of the reference ligand. It also dealt with side-chain flips for asparagine, glutamine and histidine residues (1142 nns). v) Visual inspection: each non-native structure was inspected manually for disorder or other problems (1112 nns). The GOLD docking program was then applied to both sets and the results were compared. Docking performance is defined as the percentage of complexes for which the predicted binding site is within 2 Å of the experimental binding site. 
Sampling performance is defined as the percentage of cases for which the GOLD solution has rmsd &lt;&gt;  &lt;/span&gt;&lt;/p&gt;&lt;p class="MsoBodyText" style="margin-bottom: 0.0001pt; text-align: justify;"&gt;&lt;span style="font-size:85%;"&gt;The results showed a drop in docking performance for non-native docking (61%) compared with native docking (80%), and a drop in sampling performance for non-native docking (71%) compared with native docking (91%). This is because the docking protocol does not take account of protein flexibility. &lt;/span&gt;&lt;span style="font-size:85%;"&gt;Considering multiple conformers in non-native docking significantly improves both the docking performance (67%) and the sampling performance (86%). Docking against non-native structures that contain similar ligands gives significantly better performance.&lt;/span&gt;&lt;/p&gt;  &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-8302138269422336682?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/8302138269422336682/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=8302138269422336682' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8302138269422336682'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8302138269422336682'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2008/11/protein-ligand-docking-against-non.html' title='Protein-Ligand Docking against Non-Native Protein Conformers'/><author><name>Kuldeep 
Jariwala</name><uri>http://www.blogger.com/profile/11625149973471157567</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2022546511876194452</id><published>2008-11-11T06:14:00.000-08:00</published><updated>2008-11-11T06:22:41.604-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Labels: I571 (Nov-2008) - Assignment #3'/><title type='text'>Chemical structure indexing of toxicity data on the Internet: Moving toward a flat world</title><content type='html'>A.M. Richard, L.S. Gold, M.C. Nicklaus&lt;br /&gt;Current Opinion in Drug Discovery &amp;amp; Development 2006 9(3):314-325&lt;br /&gt;&lt;br /&gt;This review discusses public initiatives aimed at integrating diverse sets of toxicology data from a chemically indexed vantage point. Standardized chemical structure annotation of public toxicity databases and information resources plays an increasingly important role in the 'flattening' and integration of diverse sets of biological activity data on the Internet.&lt;br /&gt;&lt;br /&gt;The authors point out that, if long-term screening and prediction objectives are to be achieved, the data must be systematized, integrated and mined. The article builds on recent reviews by other authors and focuses particularly on the increasing role played by standardized chemical structure annotation as a top-level indexing and mining metric for public sources of toxicology data. 
According to the authors, the ability of users to broadly integrate toxicological information across biological information domains, using chemical structure as the primary annotation and search metric, will expand the possibilities and change the paradigms for toxicity data mining and screening.&lt;br /&gt;&lt;br /&gt;The review describes the role played by chemical content annotators, structure locator services, structure/data aggregator web sites, structure browsers and InChI codes in overcoming barriers to the integration of toxicity data and in bringing users closer to the reality of a mineable chemical Semantic Web. The increasing implementation of data models, data standardization and curation efforts, together with the increasing availability of large resources of biological profiling data, enriches the biological activity content that can be meaningfully associated with chemicals. The authors cite the example of a long-term collaboration between the EPA DSSTox Database project and the Carcinogenic Potency DataBase (CPDB) project, which has more recently been broadened to include the NCI/CADD group's public web services and the PubChem project, and which endeavors to break down the traditional barriers between disparate information domains in the study of the toxicity of chemicals.&lt;br /&gt;&lt;br /&gt;They emphasize that full structure annotation using mol file content, or the equivalent, is the ultimate goal for enabling fully functional structure and substructure searching capabilities. 
However, they agree that text-based representations of chemical structure do play a vital intermediary role in the primarily text-centric information world of the present.&lt;br /&gt;&lt;br /&gt;One of the primary requirements for chemical information resources on the Internet to be optimally useful for modeling and data mining applications is the free availability of chemical structure information in a variety of formats, including SD file, for either part or the entirety of a database or structure-index file.&lt;br /&gt;The authors describe two main types of data aggregation that occur with respect to chemical indexing and biological/toxicological activity. The first type of aggregation enables a user to survey the overall chemical space and restrict it to structure analog or similarity space, which requires sufficient standardized chemical structure and property representations across toxicology experiments. The second type of aggregation involves deriving intermediate aggregations and summarizations of the toxicity data. This is accomplished with a built-in hierarchical data structure, informed by toxicity domain experts, that captures essential experimental details (e.g., dose-response data) at the lowest level.&lt;br /&gt;&lt;br /&gt;The authors believe that data models such as these will provide an essential link to the toxicity domain experts, encouraging their active involvement and participation, and will serve as a sanctioned conduit for transferring unstructured data from internal databases, government archives and literature studies into a data format that can be integrated, mined and potentially modeled. The authors also draw attention to a major issue for chemical indexing technologies: the quality of the chemical information associated with toxicological data in the public domain. 
This issue has been addressed during the course of the EPA DSSTox Database project where chemical structure, chemical name and CAS RN information for greater than 8000 unique chemical records from a variety of toxicity databases were reviewed for accuracy and consistency through organic chemistry expertise, and verified using multiple public information sources.&lt;br /&gt;&lt;br /&gt;In conclusion the authors are optimistic that over time an increasingly interlinked, flat chemical world would facilitate the detection of inconsistencies and errors in chemical databases, resulting from more frequent cross-checks among data/structure collections.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2022546511876194452?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2022546511876194452/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2022546511876194452' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2022546511876194452'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2022546511876194452'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2008/11/chemical-structure-indexing-of-toxicity.html' title='Chemical structure indexing of toxicity data on the Internet: Moving toward a flat world'/><author><name>sunya</name><uri>http://www.blogger.com/profile/06365960181763745516</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-6701307374380005843</id><published>2008-11-09T19:07:00.000-08:00</published><updated>2008-11-09T19:32:36.266-08:00</updated><title type='text'>Chemical Descriptor Library (CDL): A Generic, Open Source Software Library for Chemical Informatics</title><content type='html'>&lt;div style="mso-element:para-border-div;border:none;border-bottom:solid #4F81BD 1.0pt; mso-border-bottom-themecolor:accent1;padding:0in 0in 4.0pt 0in"&gt;  &lt;p class="MsoTitle"&gt;&lt;span class="Apple-style-span" style="font-family: 'times new roman';"&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="line-height:normal"&gt;&lt;span style="font-size:14.0pt; mso-bidi-font-size:11.0pt"&gt;&lt;span style="mso-spacerun:yes"&gt;                    &lt;/span&gt;&lt;span style="mso-spacerun:yes"&gt;            &lt;/span&gt;&lt;span style="mso-spacerun:yes"&gt;        &lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:13.0pt; mso-bidi-font-size:11.0pt"&gt;Vladimir J. Sykora and David E. Leahy&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="line-height:normal"&gt;&lt;span style="font-size:13.0pt; mso-bidi-font-size:11.0pt"&gt;&lt;span style="mso-spacerun:yes"&gt;            &lt;/span&gt;&lt;span style="mso-spacerun:yes"&gt;                        &lt;/span&gt;&lt;i style="mso-bidi-font-style: normal"&gt;J.Chem. Inf. Model. 
2008, 48, 1931-1942&lt;/i&gt;&lt;/span&gt;&lt;i style="mso-bidi-font-style: normal"&gt;&lt;span style="font-size:16.0pt;mso-bidi-font-size:11.0pt"&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="line-height:normal"&gt;&lt;span style="font-size:12.0pt"&gt;The field of chemical informatics emerged from the aim of achieving data reliability and extracting knowledge from the vast amount of chemical data accumulated over the years, especially in drug discovery. Recently, much effort has been directed at developing algorithms that generate numeric information reflecting the physical properties of molecules. Several open source software libraries for chemical informatics are currently available, including OpenBabel, JOELib and CDK, and their algorithms can be used to solve many problems. This article is about another such open source library, the Chemical Descriptor Library (CDL).&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="line-height:normal"&gt;&lt;span style="font-size:12.0pt"&gt;CDL is written in standard C++. It represents molecules as mathematical graphs, where nodes represent atoms and edges represent bonds. CDL employs the C++ Boost Graph Library (BGL) as the underlying data structure of the molecular graph, using an undirected adjacency list specialization of the BGL graph data structure. Different sets of properties, containing information about an atom within a molecule or about the bonds in the molecule, are attached to the vertices and edges of the molecular graph. In addition, the properties of the whole molecule are held in a data structure. CDL provides a generic interface for accessing these properties, implemented through ‘accessors’: objects that provide the functions needed to access the properties. 
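As a rough illustration of the graph representation described above, the following Python sketch stores a molecule as an undirected adjacency list with properties attached to atoms and bonds, and reads those properties through small accessor functions. It is not the CDL API (CDL is written in C++ on top of BGL); the class name `MoleculeGraph` and the accessors `atom_symbol` and `bond_order` are hypothetical names chosen only for this example.

```python
from collections import defaultdict


class MoleculeGraph:
    """Undirected molecular graph: atoms are vertices, bonds are edges."""

    def __init__(self):
        self.atom_props = {}                # maps atom index to a property dict
        self.bond_props = {}                # maps frozenset({i, j}) to a property dict
        self.adjacency = defaultdict(set)   # maps atom index to neighbouring indices

    def add_atom(self, symbol):
        index = len(self.atom_props)
        self.atom_props[index] = {"symbol": symbol}
        return index

    def add_bond(self, i, j, order=1):
        self.bond_props[frozenset((i, j))] = {"order": order}
        self.adjacency[i].add(j)
        self.adjacency[j].add(i)

    def neighbours(self, i):
        """Iterate over the atoms bonded to atom i, in index order."""
        return iter(sorted(self.adjacency[i]))


# Hypothetical accessors: algorithms call these instead of touching the storage.
def atom_symbol(mol, i):
    return mol.atom_props[i]["symbol"]


def bond_order(mol, i, j):
    return mol.bond_props[frozenset((i, j))]["order"]


# Heavy atoms of formaldehyde: a carbon double-bonded to an oxygen.
mol = MoleculeGraph()
c = mol.add_atom("C")
o = mol.add_atom("O")
mol.add_bond(c, o, order=2)
neighbour_symbols = [atom_symbol(mol, n) for n in mol.neighbours(c)]
```

Because the algorithms only go through the accessors and the neighbour iterator, the storage behind them could be swapped without changing the calling code, which is the kind of flexibility the article attributes to this design.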
In addition to the interface for accessing properties, CDL also defines a generic interface for traversing the structure of a molecular graph, implemented using iterators. The &lt;i style="mso-bidi-font-style:normal"&gt;molecule &lt;/i&gt;class is the basic programming object in CDL. It encapsulates the graph data structure and its associated properties (atom/bond/molecule), and it is parameterized on the atom, bond and molecular properties. Some computationally intensive calculations, such as ring perception, bond order assignment, aromaticity perception and flag assignment, are performed when the molecule object is instantiated, and their results are stored in its internal properties; this step is called molecule initialization. It removes the need to recalculate them during calls to algorithms. The core algorithms implemented within CDL are molecular text parsers, molecular text writers, substructure search, the SMARTS language for molecular patterns and properties, two-dimensional binary fingerprints, topological pharmacophore fingerprints and atom type fingerprints.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="line-height:normal"&gt;&lt;span style="font-size:12.0pt"&gt;According to the article, three sets of experiments were used to quantify the performance of the algorithms mentioned above. The experiments were run under Linux on a computer equipped with two dual-core Intel Xeon processors with clock rates of 3 GHz and 4 GB of RAM. In the first set of experiments, the computation time of each algorithm was measured as the number of atoms in a molecule was increased. The chemical structures used were unbranched, single-bonded, linear carbon chains, starting with 3 carbon atoms and increasing by one carbon per structure to a maximum of 30 carbon atoms. 
The second set of experiments quantified the time taken for an algorithm to complete when applied to data sets of increasing size, i.e. data sets comprising 10000, 50000 and 100000 molecules. The third set of experiments quantified the performance of the substructure search algorithm and the SMARTS algorithm.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="line-height:normal"&gt;&lt;span style="font-size:12.0pt"&gt;The results showed that 2DBF (two-dimensional binary fingerprint) generation, SMILES canonicalization and writing, and full substructure search exhibit nonlinear running times with respect to the size of the input, whereas the parsers, SDfile writing, and topological pharmacophore and atom type fingerprint generation run linearly. To process a data set of 100000 molecules, the 2DBF generation algorithm took 244.68 s, whereas the atom type fingerprint generation algorithm took 0.61 s. For substructure search in the same data set, the full substructure search algorithm took over 25 min for the longest query and 12.67 s for the shortest query. 
In contrast, the SMARTS algorithm took 4 min for the longest query and 33.58 s for the shortest query.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="line-height:normal"&gt;&lt;span style="font-size:12.0pt"&gt;The article thus claims that CDL gains its flexibility from three modern programming techniques: the parameterization of the internal properties of the molecular graph, the use of iterators as a generic interface for traversing the underlying graph representation, and the use of property accessors as a generic interface to the parameterized internal properties of the graph. With these, CDL can either attach the internal properties that an implementation requires or allow external users to supply their own property data structures and still use all CDL code, as long as they provide the necessary property accessors. The resulting algorithms should therefore be modular, less error-prone and simple to incorporate into external software projects.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class="MsoNormal" style="line-height:normal"&gt;Here is the link to CDL: &lt;span class="Apple-style-span" style="font-family: Georgia; "&gt;&lt;a href="http://cdelib.sourceforge.net/doc/index.html"&gt;http://cdelib.sourceforge.net/doc/index.html&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-6701307374380005843?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/6701307374380005843/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=6701307374380005843' title='0 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6701307374380005843'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6701307374380005843'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2008/11/chemical-descriptor-library-cdl-generic.html' title='Chemical Descriptor Library (CDL): A Generic, Open Source Software Library for Chemical Informatics'/><author><name>Sashikiran Challa</name><uri>http://www.blogger.com/profile/17562002072193850430</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://3.bp.blogspot.com/_1FWV0pIa5ig/S5rqETn1DnI/AAAAAAAAC1A/I8QBB6z4QXM/S220/100_0241.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-3364976638193571208</id><published>2008-11-08T07:00:00.000-08:00</published><updated>2008-11-08T07:21:09.186-08:00</updated><title type='text'>Evaluation of machine-learning methods for ligand-based virtual screening</title><content type='html'>Chen, Beining; Harrison, Robert; Papadatos, George; Willett, Peter; Wood, David; Lewell, Xiao; Greenidge, Paulette; Stiefl, Nikolaus&lt;br /&gt;J Comput Aided Mol Des (2007) 21:53-62&lt;br /&gt;DOI: 10.1007/s10822-006-9096-5&lt;br /&gt;&lt;br /&gt;The k-nearest-neighbor (kNN) algorithm uses the Euclidean distance to define the distance between two molecules described by feature vectors in a multivariate real-valued space. A test molecule is then assigned to the most common category among its k nearest neighbors. In this study, the authors introduced a kNN algorithm (CKD) in which a Gaussian-based kernel is employed as a weighting scheme to increase the influence of the training molecules that are closest to the query molecule. 
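The kernel-weighting idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score is taken to be the ratio of the mean Gaussian weight to the training actives over the mean Gaussian weight to the training inactives, and the function names, the toy descriptor vectors and the bandwidth value `h` are made up for the example. The scoring function used in the paper may differ in detail.

```python
import math


def gaussian_weight(x, y, h):
    """Gaussian kernel on the Euclidean distance between two descriptor vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * h * h))


def kernel_score(query, actives, inactives, h):
    """Ratio of mean kernel weight to actives versus inactives (assumed form)."""
    active_density = sum(gaussian_weight(query, m, h) for m in actives) / len(actives)
    inactive_density = sum(gaussian_weight(query, m, h) for m in inactives) / len(inactives)
    return active_density / inactive_density


# Toy two-dimensional descriptors (made up for this example).
actives = [(1.0, 1.0), (1.2, 0.9)]
inactives = [(4.0, 4.0), (5.0, 3.5)]
near_active = kernel_score((1.1, 1.0), actives, inactives, h=1.0)
near_inactive = kernel_score((4.5, 3.8), actives, inactives, h=1.0)
```

A query close to the training actives gets a score above 1 and a query close to the inactives gets a score below 1, so sorting a test set by this score puts the likely actives first. A smaller `h` makes the nearest training molecules dominate even more strongly.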
In the first step, the bandwidth, h, of the Gaussians that are used to estimate the probability distributions is optimized using leave-one-out cross-validation. The test set molecules are then ranked by decreasing probability, calculated using a KD scoring function.&lt;br /&gt;&lt;br /&gt;Training sets and test sets were chosen from eleven activity classes and from the eight most diverse activity classes in the MDDR database, with diversity quantified by the mean pairwise similarity (MPS) using the Tanimoto coefficient. Molecules were characterized by three types of non-binary descriptors: Pipeline Pilot, Hologram and Molconn-Z. Screening performance was quantified by the percentage of test set actives retrieved in the top 1% of the ranked list, averaged over five training sets for each class. For the first eleven classes, Hologram consistently outperformed Pipeline Pilot, which in turn outperformed Molconn-Z. For the eight most structurally diverse classes, however, the difference was negligible. The authors concluded that, because it is structure-based, the Hologram representation depends much more strongly on the degree of structural similarity between training set and test set molecules than property-based descriptors such as Pipeline Pilot do. The authors further showed that the performance of CKD for molecules represented by ECFP4 (a kind of binary fingerprint) is comparable to that of the Aitchison and Aitken kernel-weighted kNN algorithm (BKD), which was developed specifically for molecules represented by binary descriptors. It could therefore be inferred that CKD might be the preferable data analysis tool, since it can handle both non-binary and binary data with satisfactory performance.&lt;br /&gt;&lt;br /&gt;In the second part, the authors evaluated the performance of the naïve Bayesian classifier (NBC) when the training set contains only a few active molecules. 
In the NBC, a weight is calculated for each bit from the numbers of active and inactive molecules in the training set that have that bit set to one. For a test set molecule, the weights of the bits that are set in its fingerprint are then summed to represent its overall probability of activity. The NBC results were compared with those obtained from the group fusion method, which uses multiple reference molecules and a single similarity coefficient to give a fused similarity score for each test set molecule and so rank the database.&lt;br /&gt;&lt;br /&gt;Fourteen activity classes identified in the MDDR were divided into three groups by structural diversity (high, medium, low), again using the MPS criterion. The training set actives for both the NBC and group fusion experiments were selected using three methods, with A1 yielding the most diverse sets, followed by A2 and then A3. In addition, three further methods were employed to select training set inactives for the NBC only, with I3 yielding the most diverse sets, followed by I2 and then I1. Molecules were represented by ECFP4 fingerprints. All experiments were run ten times, and performance was again evaluated by the mean percentage of actives retrieved in the top 1% of the ranked test set. The results show that the NBC works better for low-diversity classes while group fusion is superior for high-diversity classes; that A3 consistently outperformed A1 and A2 for both the NBC and group fusion; and that I2 and I3 gave better results than I1 for the NBC, except for the low-diversity activity classes, where the three selection methods performed comparably. 
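The bit-weighting idea behind the NBC can be sketched as follows. The exact weighting formula is not given in this summary, so the sketch assumes a Laplace-smoothed log ratio of each bit's frequency in the actives to its frequency in the inactives; the function names and the tiny four-bit fingerprints are illustrative only.

```python
import math


def bit_weights(actives, inactives, n_bits):
    """Per-bit weight: smoothed log ratio of bit frequency in actives vs inactives."""
    weights = []
    for bit in range(n_bits):
        active_hits = sum(1 for fp in actives if bit in fp)
        inactive_hits = sum(1 for fp in inactives if bit in fp)
        # Laplace smoothing (an assumption) keeps unseen bits from giving log(0).
        p_active = (active_hits + 1) / (len(actives) + 2)
        p_inactive = (inactive_hits + 1) / (len(inactives) + 2)
        weights.append(math.log(p_active / p_inactive))
    return weights


def score(fingerprint, weights):
    """Sum the weights of the bits that are set in a test molecule."""
    return sum(weights[bit] for bit in fingerprint)


# Fingerprints are sets of "on" bit positions (toy data, 4 bits).
actives = [{0, 1}, {0, 2}]
inactives = [{2, 3}, {3}, {1, 3}]
weights = bit_weights(actives, inactives, n_bits=4)
ranked = sorted([{0, 1}, {3}, {0, 3}], key=lambda fp: score(fp, weights), reverse=True)
```

In this toy data, bit 0 occurs only in actives and bit 3 only in inactives, so a test molecule with bit 0 set ranks above one with only bit 3 set. The sketch also hints at why few or very diverse actives are a problem: with little overlap between active fingerprints, no bit accumulates a strongly positive weight.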
The authors concluded that the reason for this behavior is that the NBC needs a high degree of structural commonality amongst the actives to learn the features that contribute to activity, and a high degree of structural diversity amongst the inactives to learn the features of inactive compounds.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-3364976638193571208?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/3364976638193571208/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=3364976638193571208' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3364976638193571208'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3364976638193571208'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2008/11/evaluation-of-machine-learning-methods.html' title='Evaluation of machine-learning methods for ligand-based virtual screening'/><author><name>Tian Xie</name><uri>http://www.blogger.com/profile/01160808784159693498</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-5329644680260728992</id><published>2008-11-02T09:56:00.000-08:00</published><updated>2008-11-02T09:57:49.934-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='I571 - Homework 3'/><title type='text'>Insights into Drug Metabolism by Cytochromes P450 from Modeling Studies of CYP2D6-drug Interactions</title><content type='html'>Marechal, J; Kemp, C; Roberts, G; Paine, M; 
Wolf, C; Sutcliffe, M&lt;br /&gt;British Journal of Pharmacology (2008) 153, S82-S89&lt;br /&gt;&lt;br /&gt;The cytochrome P450 (CYP) family of enzymes has become increasingly important in drug development.  Of all the pharmaceutical drugs on the market, 90% are metabolized by one of the seven major CYP isoforms.  Predicting the metabolism and biosynthesis of therapeutic compounds by CYP enzymes in humans can improve both the pharmaceutical development of compounds and the effectiveness of patient treatments.  Development of these compounds has been aided by the determination of crystal structures.  These structures can be used to obtain functional information such as predicted binding and metabolism.  Until recently, scientists had to rely on the crystal structures of CYPs from distantly related bacteria.  Recent advances have yielded crystal structures for several human CYP enzymes.  This study focuses on CYP2D6 and on how its structural model evolved into an important drug development tool.&lt;br /&gt;&lt;br /&gt;Early studies of the structure of CYP2D6 were done using homology models built from bacterial CYPs that shared less than 25% sequence identity.  Most scientists were pessimistic that the homology model would provide useful insight, but the models did yield useful information about the active-site residues Asp301 and Phe483.  Both findings were later confirmed by crystal structures.&lt;br /&gt;&lt;br /&gt;Later studies used a mammalian (rabbit) CYP to build more accurate CYP2D6 models, even though it shares only 40% sequence similarity.  This model was generated using the rabbit CYP structure in combination with four bacterial structures.  The predictive model was tested in a series of molecular docking studies of known CYP2D6 substrates using the program GOLD.  Substrates such as codeine, MPTP, dextromethorphan and spirosulphonamide confirmed the metabolism docking site and helped validate the model.  
This model helped scientists identify the substrate-binding residues Glu216, Asp301 and Phe120 and predict how they influence binding.  It also successfully predicted the binding characteristics of atypical substrates (those devoid of a basic nitrogen), as well as substrate inhibition with CYP2D6.&lt;br /&gt;&lt;br /&gt;When the crystal structure of CYP2D6 finally became available, the model based on the rabbit CYP structure was validated against it.  There was good agreement between the residues in the binding sites and reasonable agreement, in terms of the root mean square deviation of the alpha carbon atoms, between the crystal structure and the model.  The key difference was in the F-G loop region, but this region varies between CYPs, and the difference can be linked to the fact that homology modeling is limited to the structural space occupied by the template used.  These types of changes can be difficult to predict using molecular dynamics simulations.&lt;br /&gt;&lt;br /&gt;Overall, a CYP2D6 model developed using homology modeling, molecular docking, active site characterization and bioinformatics analysis can adequately predict compound interactions.  With an accurate amino-acid sequence and a good working knowledge of an enzyme, a high quality homology model can be developed.  The model can help predict how compounds are metabolized by the enzyme and offer insights into its functionality.  
This information is valuable for drug development by being able to screen compounds at an earlier stage and reduce time and money in pharmaceutical trials.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-5329644680260728992?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/5329644680260728992/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=5329644680260728992' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5329644680260728992'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5329644680260728992'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2008/11/insights-into-drug-metabolism-by.html' title='Insights into Drug Metabolism by Cytochromes P450 from Modeling Studies of CYP2D6-drug Interactions'/><author><name>Dwitty</name><uri>http://www.blogger.com/profile/12178856732093025265</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-5572293190569627055</id><published>2007-10-31T18:15:00.000-07:00</published><updated>2007-10-31T19:27:51.806-07:00</updated><title type='text'>Combining Unsupervised and Supervised Artificial Neural Networks to Predict Aquatic Toxicity</title><content type='html'>&lt;span style="font-family: times new roman; font-style: italic;"&gt;J.Chem.Inf.Comput.Sci.&lt;/span&gt;&lt;span style="font-style: italic;"&gt; &lt;/span&gt;&lt;span style="font-weight: bold; font-style: 
italic;"&gt;2004,&lt;/span&gt;&lt;span style="font-style: italic;"&gt; 44, &lt;/span&gt;&lt;span style="font-family: times new roman; font-style: italic;"&gt;1897-1902 DOI: 10.1021/ci0401219 &lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family: times new roman;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;div style="text-align: justify;"&gt;&lt;div style="text-align: justify;"&gt;&lt;span style="font-family: times new roman;"&gt;This paper deals with the use of supervised and unsupervised neural networks to predict the toxicity of a large set of different chemicals, following the QSAR postulates.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: times new roman;"&gt;Unsupervised neural networks learn to detect regularities and correlations in their input data and adapt their future responses to that input. The neurons of competitive networks learn to recognize groups of similar vectors and separate dissimilar ones (clustering). A supervised ANN learns from example input-output pairs to build a relationship between the input and the output. It therefore requires both input and output vectors to train the network until it approximates a mathematical function between them. 
ANNs can be used in a monolithic approach to develop models that predict toxicity, but the method used in this paper is a mixed one, which combines different individual networks, each built on a subproblem.&lt;br /&gt;&lt;br /&gt;Two data sets are considered in the study of aquatic toxicity: one based on the acute toxicity to the fathead minnow (&lt;span style="font-style: italic;"&gt;Pimephales promelas&lt;/span&gt;), and a second from the TETRATOX database, which contains information about the growth inhibition of a protozoan ciliate caused by chemical agents.&lt;br /&gt;The model built by the authors takes training and test data as input. These data are first clustered into groups of similar chemical structure or composition. The clustering is done with unsupervised ANNs, and each cluster is then given as input to supervised ANNs to arrive at a good predictive model. The results obtained with these models are comparable with those of the best published models. Chemical descriptors were used to provide the chemical information, while parallel computing was used for both the clustering and the predictive models. 
Thus the combination of local experts made the models more reliable, and this flexible architecture increased the precision of the submodels.&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: times new roman;"&gt;&lt;/span&gt;&lt;/div&gt;&lt;span style="font-family: times new roman;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family: times new roman; font-style: italic;"&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-5572293190569627055?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/5572293190569627055/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=5572293190569627055' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5572293190569627055'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5572293190569627055'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/combining-unsupervised-and-supervised.html' title='Combining Unsupervised and Supervised Artificial Neural Networks to Predict Aquatic Toxicity'/><author><name>Abhigna</name><uri>http://www.blogger.com/profile/14455147165539340627</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-205950265090899580</id><published>2007-10-24T11:14:00.001-07:00</published><updated>2007-10-24T11:15:39.292-07:00</updated><title 
type='text'></title><content type='html'>&lt;p class="MsoNormal" style="line-height: normal;"&gt;&lt;b&gt;&lt;span style=";font-family:&amp;quot;;font-size:14;"  &gt;I571- &lt;/span&gt;&lt;/b&gt;&lt;b&gt;&lt;span style=";font-family:&amp;quot;;font-size:14;"  &gt;Mapping of Activity-Specific Fragment Pathways Isolated from Random Fragment Populations Reveals the Formation of Coherent Molecular Cores&lt;/span&gt;&lt;/b&gt;&lt;b&gt;&lt;span style=";font-family:&amp;quot;;font-size:14;"  &gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;i&gt;J. Chem. Inf. Model.,&lt;/i&gt; &lt;b&gt;ASAP Article&lt;/b&gt; 10.1021/ci700251b S1549-9596(70)00251-6 &lt;span style=";font-family:&amp;quot;;font-size:12;"  &gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=""&gt;     &lt;/span&gt;This paper discusses a mapping technique that can be used to further classify and identify sets of substructures used to target selected activity classes. Although a repertoire of substructure classification technology is already available, activity-specific fragment pathways, with the aid of core regions, enable the derivation of compound-class-directed sets of structural descriptors. These descriptors therefore do not depend on hierarchical or retrosynthetic molecular fragmentation, which are the conventional design schemes. &lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style=";font-family:&amp;quot;;font-size:12;"  &gt;&lt;span style=""&gt;     &lt;/span&gt;Traditionally, molecular substructures have been noted for being information-rich and powerful descriptors for many applications. More recently, molecule-specific information has been captured and analyzed using randomly generated molecular fragment populations. 
MolBlaster is the method designed to generate these populations of random fragments for further testing. This method, along with previous tools, allowed for the detection of molecular similarity relationships and their respective active compounds using large database compounds and reference molecules. This led to the further study of mining the random fragment profiles for activity-class specific compounds. &lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style=";font-family:&amp;quot;;font-size:12;"  &gt;&lt;span style=""&gt;            &lt;/span&gt;Over 1000 active molecules, which were placed into 45 different activity classes, were systematically analyzed. The molecules were taken from various compound repositories or from the literature. Each molecule was used to create random fragment populations, generated by MolBlaster, using 3000 fragmentation iterations per molecule and randomized numbers of bond deletions per step. The resulting scheme produced characteristic fragment populations for the various classes of active compounds. The authors then took each compound class and defined a set of activity-class-characteristic substructures (ACCS). These sets of ACCS were then used for the mapping studies. &lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style=";font-family:&amp;quot;;font-size:12;"  &gt;&lt;span style=""&gt;            &lt;/span&gt;The mapping of characteristic substructures was done using PerlMol. Active molecules were analyzed for alternative core regions, defined as the set of all atoms of the molecule that had an ACCS match rate greater than the percentage relevant to the core level. Subsets of ACCS were created and compared by core sets, relative core size, core regularity, and core formation. 
Cluster analysis of ACCS fingerprints was performed using a modified Tanimoto coefficient and the Weighted Pair Group Method with Arithmetic Mean (WPGMA). &lt;/span&gt;&lt;span style=";font-family:&amp;quot;;font-size:12;"  &gt;In order to quantitatively compare the relationships between core sets in each activity class, bit string representations were generated, where each bit position accounted for the presence or absence of an individual fragment of the corresponding ACCS set. As a distance metric for cluster analysis of these prototypic fingerprints, a modified version of the Tanimoto coefficient was used that takes into account both bits that are set on and bits that are set off. &lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p class="MsoNormal" style="margin-bottom: 0.0001pt; line-height: normal;"&gt;&lt;span style=";font-family:&amp;quot;;font-size:12;"  &gt;&lt;span style=""&gt;            &lt;/span&gt;The analysis produced several interesting results. The authors found that 93% of active molecules form a core after fragment mapping, that coherent cores were formed when the fragment regions of different activity classes overlapped, that changing the background molecule set led to at most a 5% variation in core formation, that more than 79% of all molecules displayed only a single contiguously growing core region, and that distinct cores were rarely seen. &lt;span style=""&gt; &lt;/span&gt;Further systematic mapping revealed that substructures do in fact regularly form coherent molecular core regions. The goal of extending the repertoire of already available substructures and demonstrating the ability to identify sets of substructures to target selected activity classes was reached. 
&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-205950265090899580?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/205950265090899580/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=205950265090899580' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/205950265090899580'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/205950265090899580'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/i571-mapping-of-activity-specific.html' title=''/><author><name>AllenA</name><uri>http://www.blogger.com/profile/18155595070326503446</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-5315641094474745562</id><published>2007-10-24T10:20:00.000-07:00</published><updated>2007-10-24T10:22:39.597-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='I-571 coffee break'/><title type='text'>Text-based Information Extraction from Literature</title><content type='html'>&lt;p class="MsoNormal"&gt;&lt;span class="textbold"&gt;Automated Extraction of Information from the Literature on Chemical-CYP3A4 Interactions &lt;/span&gt;&lt;br /&gt;&lt;span class="text"&gt;Feng, C.; Yamashita, F.; Hashida, M.&lt;/span&gt;&lt;br /&gt;&lt;span class="textitalics"&gt;J. Chem. Inf. 
Model;&lt;/span&gt; &lt;span class="textbold"&gt;(Article); 2007&lt;/span&gt;; &lt;span class="text"&gt;ASAP Article&lt;/span&gt;;  &lt;span class="text"&gt;DOI: &lt;a href="http://dx.doi.org/10.1021/ci700091m"&gt;10.1021/ci700091m&lt;/a&gt;&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p style="text-align: justify;" class="MsoNormal"&gt;&lt;span class="text"&gt;In chemistry and biology, information on one specific research topic is usually spread across a large variety of publications. Collecting and organizing this literature information is a very time-consuming and labor-intensive task. In this paper, the authors discuss an NLP (natural language processing)-based system for extracting information from the literature on interactions between chemicals and CYP3A4, a member of the CYPs with broad substrate specificity that often leads to unexpected drug-drug interactions.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;&lt;div style="text-align: justify;"&gt;  &lt;/div&gt;&lt;p style="text-align: justify;" class="MsoNormal"&gt;&lt;span class="text"&gt;The strategy for extracting information in this work consists of three main steps: identification of chemical and CYP3A4 names, sentence processing, and interaction extraction.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;&lt;div style="text-align: justify;"&gt;  &lt;/div&gt;&lt;p style="text-align: justify;" class="MsoNormal"&gt;&lt;span class="text"&gt;The first step is to identify the names of chemicals and CYP3A4. Unlike the usual dictionary-based approach to name recognition, which often leads to partial matches and mismatches, the combination of dictionary- and context-based approaches introduced in this paper is used to avoid partial matches or mismatches and to identify chemical names not included in the dictionary. 
This approach was reported to give high recall and precision in chemical name identification.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;&lt;div style="text-align: justify;"&gt;  &lt;/div&gt;&lt;p style="text-align: justify;" class="MsoNormal"&gt;&lt;span class="text"&gt;Once the name identification step is complete, sentences containing both chemical names and CYP3A4 are subjected to the sentence processing step. First, noun phrases and verb phrases are created from each sentence following certain rules. Then a simple clause is reconstructed to express a single event involving chemicals and CYP3A4. The text is then ready for the final interaction extraction step.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;&lt;div style="text-align: justify;"&gt;  &lt;/div&gt;&lt;p style="text-align: justify;" class="MsoNormal"&gt;&lt;span class="text"&gt;The interaction extraction process described in this paper uses pattern matching based only on the sequence of keywords (chemical name, CYP3A4 name, and key verbs) in a clause when information on a chemical-CYP3A4 interaction is extracted. This feature gives the system high recall, and precision is improved by implementing several rules to filter temporal hits.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;&lt;div style="text-align: justify;"&gt;  &lt;/div&gt;&lt;p style="text-align: justify;" class="MsoNormal"&gt;&lt;span class="text"&gt;Although the system discussed in this paper has advantages over other systems in some respects, it also has flaws, which can be attributed to the part-of-speech tagging used to determine the boundaries between noun and verb phrases. 
Whether this system can be applied to interactions between chemicals and other functional proteins remains to be examined.&lt;o:p&gt;&lt;/o:p&gt;&lt;/span&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-5315641094474745562?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/5315641094474745562/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=5315641094474745562' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5315641094474745562'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5315641094474745562'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/text-based-information-extraction-from.html' title='Text-based Information Extraction from Literature'/><author><name>Qian</name><uri>http://www.blogger.com/profile/17839273512185384669</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-3728379409190524661</id><published>2007-10-24T10:13:00.000-07:00</published><updated>2007-10-24T10:23:36.381-07:00</updated><title type='text'>I571: Large-Scale Annotation of Small-Molecule Libraries Using Public Databases</title><content type='html'>Yingyao Zhou, Bin Zhou, Kaisheng Chen, S. Frank Yan, Frederick J. King, Shumei Jiang, and Elizabeth A. Winzeler&lt;br /&gt;J. Chem. Inf. Model. 
2007, 47, 1386-1394&lt;br /&gt;&lt;br /&gt;In medical research, few publicly accessible databases provide good &lt;a href="http://en.wikipedia.org/wiki/Annotation"&gt;annotation&lt;/a&gt; for small chemical compounds.  Part of the reason is that knowledge about small molecules is difficult to retrieve through text indexing services. Such retrieval is usually done by commercial data vendors at high cost. However, commercial data sources often fail to provide a friendly annotation interface for large numbers of compounds and are not widely available right now. Therefore, annotation information is presently used to select lead compounds from a high-throughput screening (HTS) campaign only on a very limited scale.&lt;br /&gt;&lt;br /&gt;In this paper, the authors propose a new way to automatically create a comprehensive local compound annotation database. They use an integrated cheminformatics and bioinformatics pipeline to combine PubChem with other contributing databases that carry annotation. The main part of the paper illustrates how such an annotation database can provide significant and substantial information to scientists about their screening compound collection at the very early stages of the drug discovery process.&lt;br /&gt;&lt;br /&gt;First, how was it implemented? The authors began by downloading all the compounds and substances from PubChem. Then, using the NCBI programming interface, they retrieved links from compounds to databases with annotation information, such as PMC and MeSH. Finally, they integrated chemical structures and their annotations using various public bioinformatics programming utilities. 
A web service has also been established for public users to submit compounds in batch and directly retrieve relevant compound knowledge from the annotation database.&lt;br /&gt;&lt;br /&gt;The paper then observes that the availability of compounds with annotation in the resulting database roughly parallels the availability of annotations in the CAS database. This demonstrates that the two knowledge bases, PubChem and CAS, are presently complementary. Integrating the annotations from both databases would eventually create a more complete description of the biological and pharmacological properties of the compounds of interest.&lt;br /&gt;&lt;br /&gt;Finally, the authors took MeSH-driven HTS data analysis and antimalarial assay validation as examples to show that annotation can be applied to in-house HTS databases to identify signature biological inhibition profiles of interest as well as to expedite the assay validation process. Considering the instrumental role the gene ontology database has played in systems biology studies, it is reasonable to expect that a compound mechanism-of-action database will lead to more interesting cheminformatics discoveries in the near future.&lt;br /&gt;&lt;br /&gt;Looking ahead, we can expect content providers to gradually contribute their catalogs to public databases such as PubChem and enable licensees to prescan their in-house chemical collections. Meanwhile, better methods for annotating chemistry-related literature should be explored, and the quality and formulation of annotations should be enhanced.&lt;br /&gt;&lt;br /&gt;In sum, this is one of the most important papers to use a public database such as PubChem to facilitate chemical research. 
I therefore think that how to make full use of public databases in our research will become an important topic in the near future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-3728379409190524661?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/3728379409190524661/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=3728379409190524661' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3728379409190524661'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3728379409190524661'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/i571-large-scale-annotation-of-small.html' title='I571: Large-Scale Annotation of Small-Molecule Libraries Using Public Databases'/><author><name>Bin</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-3231447089592610074</id><published>2007-10-24T09:56:00.000-07:00</published><updated>2007-10-24T10:01:43.981-07:00</updated><title type='text'>wednesday October 24, 2007</title><content type='html'>&lt;div align="justify"&gt;&lt;span style="font-size:130%;"&gt;I-571 Chemical Fragment Spaces for de novo Design&lt;/span&gt;&lt;br /&gt;Harald Mauser and Martin Stahl&lt;br /&gt;J. Chem. Inf. Model. 2007, 47, 318-324 DOI: 10.1021/ci6003652&lt;br /&gt;   De novo design is a method that focuses on positioning and connecting molecular fragments, or growing substituents on a core, in the binding site. 
The combinatorial library of analogs is created by appending up to several substituents from a list of drug fragments. All reasonable conformers of each analog are constructed and optimized. The final structures are then scored with consideration of the interaction energy, hydrogen bonding, and solvation energy change. Sometimes, however, the dataset is so large that only part of it can be used because of limited computing capacity.&lt;br /&gt;   This paper uses chemical fragment spaces, which are combinations of molecular fragments and connection rules, to reduce the dataset size. First, the authors generated the fragments using a simple iterative disconnection algorithm based on the Daylight toolkit. The fragmentation rules were coded as Reaction SMARTS and applied to the canonical SMILES representations of the compound collection. Links marking broken bonds were encoded as [m*], which serves as an identifier to differentiate between link types. Then, the authors used two methods to select fragments, emphasizing diversity and frequency of occurrence respectively. The selection based on diversity employed the maximum common substructure search (MCSS) method. In this set, a fragment with a larger number of links is more versatile during de novo design and should be preferred. Alternatively, the authors applied selections biased toward the fragments’ frequency of occurrence. By setting the frequency threshold to 3 for this dataset, they obtained the most versatile fragments. Finally, the Feature Tree fragment space program was run for each of the structures.&lt;br /&gt;The authors selected three methods to generate the fragment dataset, with the aim of evaluating whether a small set of fragments would still be able to efficiently cover sizable portions of chemical space. The first one started from known chemical structures and disconnected them into fragments to be recombined with a set of chemical rules. 
The second one lists synthetically tractable scaffolds and building blocks to be connected in a combinatorial chemistry fashion. The last one is based on the first but adds two generic rules. First, the authors compared the diversity of each of these sets and found that set I and random II have a high diversity, which shows that the diversity-oriented selection method yields good fragment diversity. Then, they analyzed the coverage of chemical space by each of these fragment spaces. A MOS-based similarity metric was employed to assess the similarity between the resulting structures and the query molecules. The results show that the frequency-biased and the diverse fragment subsets perform somewhat better than the random set, but they do not differ from each other. The combined set shows a significant improvement in performance, reproducing 400 more structures than a random set of equal size. Therefore, the combined set can maximize the coverage of known druglike chemical space with a strongly reduced set of fragments.&lt;br /&gt;    In conclusion, the authors find that the combination of the most frequently occurring fragments with a substructure-based diverse subset covers a significantly larger portion of druglike chemical space. 
We can use this method to encode an enormously large number of chemical structures in a very compact format.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-3231447089592610074?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/3231447089592610074/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=3231447089592610074' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3231447089592610074'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3231447089592610074'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/wednesday-october-24-2007.html' title='wednesday October 24, 2007'/><author><name>Jun Ma</name><uri>http://www.blogger.com/profile/10403377870524940821</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-367211624906612538</id><published>2007-10-24T09:54:00.000-07:00</published><updated>2007-10-24T10:18:29.500-07:00</updated><title type='text'>I571-Cluster analysis and 2D-QSAR of P. aeruginosa deacetylase LpxC inhibitors</title><content type='html'>by Rameshawar U. Kadam and Nilanjan Roy&lt;br /&gt;&lt;br /&gt;In this paper, the authors developed a predictive cluster analysis-based 2D-QSAR model that is applicable to a diverse set of molecules for Pseudomonas aeruginosa deacetylase LpxC inhibition.&lt;br /&gt;&lt;br /&gt;Step 1: determine the data set. 
On the basis of previous studies, the authors picked 51 LpxC inhibitors with specified 4-positions. (Is the size of the data set large enough? Is it representative?) The basic skeleton and conformation of the molecules were modeled and minimized using the PM3 Hamiltonian in MOPAC interfaced with SYBYL 6.9, and a single-point energy calculation was performed. (If the crystal structure of the enzyme is known, what kind of information will be necessary for setting up the molecule database?)  &lt;br /&gt;&lt;br /&gt;Step 2: specify the method. 480 molecular descriptors were calculated using DRAGON5.3; highly correlated descriptors were removed and the remaining descriptors were used to develop the QSAR model in Cerius 4.10 software. The authors set the equations to be linear polynomials with 5 terms plus a constant. The equations with the highest correlation coefficient were used for further study.  (Why did the authors restrict the equations to linear polynomials? Isn’t it kind of arbitrary?)&lt;br /&gt;&lt;br /&gt;Step 3: develop and compare the models. First the authors developed a statistically significant model to predict pIC50 for the complete data set (the conventional model). 5 descriptors were identified as related to the inhibitory activity using the 37 molecules in the training set; the correlation coefficients and the cross-validated correlation coefficients were recorded; outliers were identified. Then the authors divided both the training set and the testing set into two clusters using GFA, assuming that molecules with diverse structures should be more accurately described by separate QSAR models. 
Again the most relevant descriptors were identified and the correlation coefficients were recorded for comparison with those of the conventional model.&lt;br /&gt;&lt;br /&gt;The authors came to the conclusion that the cluster analysis based approach provides better predictability than conventional 2D-QSAR. &lt;br /&gt;&lt;br /&gt;The method seems straightforward…  How convincing is the result? 
This is a descriptor-based analysis, but it may shed light on the mechanism underlying the enzymatic activity if we can interpret the meaning of the descriptors properly.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-367211624906612538?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/367211624906612538/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=367211624906612538' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/367211624906612538'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/367211624906612538'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/i571-cluster-analysis-and-2d-qsar-of-p.html' title='I571-Cluster analysis and 2D-QSAR of P. aeruginosa deacetylase LpxC inhibitors'/><author><name>Nan</name><uri>http://www.blogger.com/profile/06074354900881647312</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-4384910066266074838</id><published>2007-10-24T09:42:00.000-07:00</published><updated>2007-10-24T09:54:16.264-07:00</updated><title type='text'>I571- 4D Fingerprints in QSAR</title><content type='html'>Iyer, M. and Hopfinger, A.J. (2007) Treating chemical diversity in QSAR analysis: Modeling diverse HIV-1 integrase inhibitors using 4D fingerprints. J. Chem. Inf. Model., 47: 1945-1960.&lt;br /&gt;&lt;br /&gt;In this paper, the authors apply a methodology for generation of 4D fingerprints (J. Chem. Inf. 
Comput. Sci., 41: 1367-1387, 2001) to clustering and QSAR modeling of a structurally diverse set of HIV Integrase inhibitors. The development of this methodology has been driven by the need to model ADMET (Absorption, Distribution, Metabolism, Elimination and Toxicity) properties of diverse molecules for which the target structure, binding geometry and/or mechanism of action is unknown. The Integrase inhibitors were selected as a demonstration case for this methodology because they combine the attributes of a good QSAR data set with the ambiguities of ADMET modeling.&lt;br /&gt;&lt;br /&gt;The data set consisted of 213 compounds comprising 12 structural classes. Each structural class was divided into training and test sets with a total of 148 and 65 compounds, respectively. The endpoint modeled in the QSAR assessment was IC50 for the 3’-processing activity of HIV Integrase. To generate 4D molecular fingerprints, each compound was subjected to a molecular dynamics simulation to generate an ensemble of 1000 energetically accessible conformations. This sampling of conformational space constitutes the 4th dimension of the molecular fingerprints. Each molecule was then divided into functional pieces referred to as interacting pharmacophore elements (IPEs). Eight different IPE types were defined, including both non-specific (any atom, hydrogen-suppressed) and specific (HB donors, aromatic atoms, polar negative atoms) types. For each pair of IPE types within a given molecule, an absolute molecular similarity main distance-dependent matrix (MDDM) was constructed that captures the intrinsic size, shape and conformational flexibility of the molecule over the conformational space sampled. The eigenvalues of these matrices constitute the 4D fingerprint. In addition to the 4D fingerprint descriptors, a number of traditional QSAR descriptors (e.g. HOMO, LogP) were also included. 
Compounds were then clustered using 2 different approaches: Partitioning around Medoids (PAM) and divisive hierarchical clustering. QSAR models were constructed for each training set using multiple linear regression and optimized using a genetic algorithm. The correlation coefficient of fit (R2) and the cross-validated correlation coefficient (Q2) were calculated for each model.&lt;br /&gt;&lt;br /&gt;A single QSAR model constructed from the entire set of HIV Integrase inhibitors was not statistically significant, suggesting that no set of structural features common to all structural classes could describe the inhibitory activities of the entire data set. Clustering of the data set using either of the 2 methods employed segregated the compounds into the same 3 major clusters. The divisive hierarchical clustering approach further divided the largest of these clusters into 3 subgroups. QSAR models constructed for each of the clusters or subclusters were significantly improved over the original global model. The model derived from the largest cluster was only marginally significant; however, the models from the further hierarchical clustering of these compounds all had reasonable correlation coefficients (R2 ≥ 0.78, Q2 ≥ 0.71). Interestingly, all of the models were either entirely or overwhelmingly composed of 4D descriptors. When applied to the test set for each cluster, the individual QSAR models performed reasonably well (R2pred ≥ 0.62) across structural classes and quite well (R2pred = 0.80) for individual clusters. 
The models also provided some limited information on the pharmacophore types involved in pharmacologic activity and their spatial distribution within the inhibitors of various structural classes.&lt;br /&gt;&lt;br /&gt;Overall, the application of 4D fingerprints was shown to be successful for both compound clustering and QSAR modeling, and may provide an advantage over traditional QSAR descriptors alone by incorporating more information about molecular flexibility and conformational states.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-4384910066266074838?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/4384910066266074838/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=4384910066266074838' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/4384910066266074838'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/4384910066266074838'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/i571-4d-fingerprints-in-qsar.html' title='I571- 4D Fingerprints in QSAR'/><author><name>RAKemper</name><uri>http://www.blogger.com/profile/06832485813728392472</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-8990847395438813112</id><published>2007-10-24T08:52:00.000-07:00</published><updated>2007-10-24T09:22:14.509-07:00</updated><title type='text'>I571-Application of Molecular Docking to Protein Folding Structure Prediction and Drug 
Discovery</title><content type='html'>By: Chun-Yen Chung, Shih-Ching Ou, Chia-Chih Tsai, Chin-Ch Chien, and Da-Yu Su&lt;br /&gt;&lt;br /&gt;&lt;p class="MsoNormal"&gt;The process of docking is widely used in molecular modeling for target-based design. One of the main challenges of drug design is finding small molecules that will associate with proteins. Through this binding, the small molecule acts as, for example, an enzyme inhibitor or a receptor agonist.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Computational methods are being used to predict the formation of intermolecular complexes. The concept that drug activity can be achieved by the binding of the ligand to a protein is well accepted. A successful drug needs to exhibit geometric and chemical complementarity to its target.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Docking is a computational method that searches for a ligand that fits the binding site of a receptor both geometrically and energetically. Docking’s energy evaluations are carried out using a scoring function. Molecular modeling consists of calculating the energy of conformations and interactions.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;By using docking, one tries to develop the physical understanding and analytical tools to identify configurations in which two molecules interact and the molecular design capabilities needed to make drugs that interact with predetermined molecular sites. Popular docking programs include AutoDock, DOCK, and FlexX.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Molecular modeling is the combination of computational chemistry and computer graphics. It is widely used for energy evaluation 
and minimization, conformational analysis, and dynamic simulations of the structures. Potential binding modes of ligands at a particular region of the receptor can be modeled and superimposed on one another to form a consensus binding mode. Such an alignment forms a pharmacophore of potential binding, giving a chemist a guide for potential molecular design and synthesis.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;The protein folding problem is that of figuring out how proteins spontaneously fold into their correct structures. Proteins are formed by chains of amino acids and have certain chemical properties that could be used to manually fold the protein, producing a well-defined 3-D structure.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Molecular mechanics (MM) and molecular dynamics (MD) describe molecular structures and properties. They are applied to organic molecules, peptides, and oligonucleotides, just to mention a few. MM will predict the energy associated with a given conformation of a molecule, but this value has no absolute meaning by itself. One needs to find the differences in energy between two or more conformations. To do so, one needs this MM energy equation: &lt;span style="color: rgb(51, 153, 153);"&gt;Energy (E) = E Stretch + E Bending + E Torsion + E Non-bonded Interactions&lt;/span&gt;.  
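&lt;/p&gt;  &lt;p class="MsoNormal"&gt;The energy sum above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the functional forms (harmonic stretch and bend, a cosine torsion term, a Lennard-Jones 12-6 non-bonded term with electrostatics left out) are common textbook choices rather than anything specified in the paper, and every parameter value and geometry below is hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
import math


def stretch_energy(r, r0, k):
    """Harmonic bond stretch: 0.5 * k * (r - r0)^2."""
    return 0.5 * k * (r - r0) ** 2


def bend_energy(theta, theta0, k):
    """Harmonic angle bend: 0.5 * k * (theta - theta0)^2, angles in radians."""
    return 0.5 * k * (theta - theta0) ** 2


def torsion_energy(phi, barrier, periodicity):
    """Cosine torsion term: 0.5 * barrier * (1 + cos(periodicity * phi))."""
    return 0.5 * barrier * (1.0 + math.cos(periodicity * phi))


def nonbonded_energy(r, epsilon, sigma):
    """Lennard-Jones 12-6 term for one atom pair (electrostatics omitted)."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


def total_energy(conf, params):
    """E = E_stretch + E_bending + E_torsion + E_nonbonded for a toy conformation."""
    return (
        stretch_energy(conf["bond"], params["r0"], params["k_bond"])
        + bend_energy(conf["angle"], params["theta0"], params["k_angle"])
        + torsion_energy(conf["dihedral"], params["barrier"], params["n"])
        + nonbonded_energy(conf["contact"], params["epsilon"], params["sigma"])
    )


# Hypothetical parameters and geometries (arbitrary but self-consistent units).
PARAMS = {"r0": 1.53, "k_bond": 600.0, "theta0": math.radians(112.0),
          "k_angle": 120.0, "barrier": 3.0, "n": 3, "epsilon": 0.1, "sigma": 3.4}
STAGGERED = {"bond": 1.53, "angle": math.radians(112.0),
             "dihedral": math.radians(60.0), "contact": 3.8}
ECLIPSED = {"bond": 1.53, "angle": math.radians(112.0),
            "dihedral": math.radians(0.0), "contact": 3.8}

# Only the difference between two conformations is meaningful.
delta_e = total_energy(ECLIPSED, PARAMS) - total_energy(STAGGERED, PARAMS)
```
&lt;/pre&gt;
&lt;p class="MsoNormal"&gt;With these made-up numbers the two conformations differ only in the torsion term, so the energy difference equals the assumed torsional barrier; the absolute energies themselves carry no meaning.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;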
A force field is the collection of parameters describing the behavior of different types of atoms and bonds, from which the MM energy is calculated. Some force fields also describe additional deformations, such as coupling between bending and stretching in adjacent bonds.&lt;/p&gt;  &lt;p class="MsoNormal"&gt;Drug-likeness prediction consists of identifying the molecular properties and structural features that determine whether a molecule is a drug or a non-drug. Such properties include hydrophobicity, electronic distribution, hydrogen bonding characteristics, and molecule size, among others. There is a wide variety of possible drug targets, and each requires a combination of matching molecular characteristics. To distinguish drugs from non-drugs, one can use the molecular weight, logP, and the numbers of hydrogen-bond donors and acceptors. Molinspiration offers a strategy for identifying active drug molecules using a method based on sophisticated Bayesian statistics; for more information, go to: &lt;a href="http://www.molinspiration.com/docu/miscreen/druglikeness.html"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;http://www.molinspiration.com/docu/miscreen/druglikeness.html&lt;/span&gt;&lt;/a&gt;. 
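&lt;/p&gt;  &lt;p class="MsoNormal"&gt;A minimal Python sketch of this kind of property-based filter is given here. It assumes Lipinski's rule-of-five cutoffs (molecular weight over 500, logP over 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors), a widely used heuristic that the summary above does not name; it is not the Bayesian method used by Molinspiration. The Molecule class, the function names, and the example property values are hypothetical and only illustrate the idea.&lt;/p&gt;
&lt;pre&gt;
```python
from dataclasses import dataclass


@dataclass
class Molecule:
    """Hypothetical container for precomputed molecular properties."""
    name: str
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol-water partition coefficient
    h_donors: int       # number of hydrogen-bond donors
    h_acceptors: int    # number of hydrogen-bond acceptors


def rule_of_five_violations(mol):
    """Count how many of the four rule-of-five cutoffs the molecule exceeds."""
    checks = [
        mol.mol_weight > 500,
        mol.logp > 5,
        mol.h_donors > 5,
        mol.h_acceptors > 10,
    ]
    return sum(checks)


def is_druglike(mol, max_violations=1):
    """Treat a molecule as drug-like if it exceeds at most max_violations cutoffs."""
    return max_violations >= rule_of_five_violations(mol)


# Hypothetical property values, for illustration only.
small = Molecule("small_polar", 180.2, 1.2, 1, 4)
large = Molecule("large_greasy", 850.0, 7.5, 6, 12)

for mol in (small, large):
    print(mol.name, rule_of_five_violations(mol), is_druglike(mol))
```
&lt;/pre&gt;
&lt;p class="MsoNormal"&gt;In practice the property values would come from a cheminformatics toolkit rather than being entered by hand. A filter like this only flags molecules that are unlikely to be orally available drugs; it says nothing about activity against a particular target.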
&lt;/p&gt;  &lt;p class="MsoNormal" style=""&gt;Computer-assisted drug design (CADD) supports the discovery of new drugs through the visualization, characterization, manipulation, and analysis of drug candidates and target receptors using molecular modeling programs.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-8990847395438813112?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/8990847395438813112/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=8990847395438813112' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8990847395438813112'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8990847395438813112'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/i571-application-of-molecular-docking.html' title='I571-Application of Molecular Docking to Protein Folding Structure Prediction and Drug Discovery'/><author><name>Carlos</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-8888967901632802703</id><published>2007-10-23T19:10:00.001-07:00</published><updated>2007-10-23T19:11:44.336-07:00</updated><title type='text'>I571:  Pharmacogenomics and molecular design</title><content type='html'>&lt;span&gt;Bailey D "Site-specific molecular design and its relevance to pharmacogenomics and chemical biology" The Pharmacogenomics Journal (2001) 1, 38-47. 
&lt;/span&gt;&lt;br /&gt;&lt;span&gt;&lt;br /&gt;As information technology has evolved and improved, more and more information has become available to the public.  These improvements have increased productivity and driven economic growth, and the technology is now maturing to the point where it can be used for drug discovery in the pharmaceutical industry.&lt;br /&gt;&lt;br /&gt;Building on this computational power, chemoinformatics was developed to determine the properties of chemical entities by examining their molecular structures and empirically calculating their physical properties.  Virtual screening is used to select compounds, and high-throughput processes then screen the thousands of selected compounds to narrow them down to lead compounds for further research and development.&lt;br /&gt;&lt;br /&gt;Virtual screening uses software to calculate the properties of chemical compounds.  Because it depends solely on software algorithms and computing power, the rate at which compounds can be screened is remarkable.  The calculated properties are, however, derived from geometry and mathematical models, which leads to some deviations from the actual properties.  Virtual screening can nevertheless be used to screen potential chemical entities initially, before they are taken to a high-throughput system, where they are analyzed experimentally.&lt;br /&gt;&lt;br /&gt;Similarity search algorithms for chemical compounds are maturing, and a great deal of data has been generated and compiled in databases.  Using such a search, a chemical compound known to be active at the targeted site can lead to new chemical entities that could be more effective therapeutically.  
Also, using RECAP (retrosynthetic combinatorial analysis procedure), active binding fragments in the database can be searched and assembled to increase the interaction energy of an existing molecule or to create a new molecule.&lt;br /&gt;&lt;br /&gt;This reverse-engineering approach to designing drug compounds can be used in de novo drug design.  Increasing the interaction energy between the site and the ligand can be done computationally, based on hydrogen bonding between site and ligand, hydrophobic interactions, electrostatic interactions, steric interactions, and interactions mediated by water molecules in the site.&lt;br /&gt;&lt;br /&gt;There are still challenges in designing drugs computationally, and the designs still need to be tested experimentally.  But the computational approach enables searching databases and creating new molecules in ways that were not possible using HTS systems because of high cost and time limitations.  Maturing methods for protein structure determination, homology modeling, and affinity selection could be combined to analyze the interaction energy between the site and the ligand, and will further improve the de novo design of drugs. 
&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-8888967901632802703?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/8888967901632802703/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=8888967901632802703' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8888967901632802703'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/8888967901632802703'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/i571-pharmacogenomics-and-molecular.html' title='I571:  Pharmacogenomics and molecular design'/><author><name>Won Hong</name><uri>http://www.blogger.com/profile/00023377302731313600</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-4275706132182212354</id><published>2007-10-23T16:24:00.000-07:00</published><updated>2007-10-23T17:07:29.765-07:00</updated><title type='text'>I573 - Structure Activity Relationship by NMR and by Computer: A Comparative Study</title><content type='html'>J. Am. Chem. Soc. 2002, 124, 11073-11084 DOI: &lt;a href="http://dx.doi.org/10.1021/ja0265658"&gt;http://dx.doi.org/10.1021/ja0265658&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;   In this paper, two strategies are discussed for identifying small core compounds that can be used to create larger, working ligands when lead compounds cannot be found by screening large libraries.  
Two ways to identify small core compounds, as stated in this paper, are to use biophysical methods such as NMR or to use computational methods for ligand docking.  The purpose of this paper is to evaluate the performance of a physics-based method for ligand docking and, most importantly, to compare the results of the computational method with data obtained using NMR.&lt;br /&gt;&lt;br /&gt;    The structures of three protein/small-ligand complexes are determined using both NMR and the computational method (in this case, a refinement of MCSS, the multiple copy simultaneous search method).  The data from the computational method are then compared to those from NMR.  The protein chosen for this study is FKBP12, a peptidyl-prolyl cis-trans isomerase.  Three ligands, (2S)-1-acetylprolinemethylester (ACPM), 1-formylpiperidine (FOPI), and 1-piperidinecarboxamide (PICA), were chosen because they have features in common with FK506 (they all have a five- or six-membered heterocycle), which is known to bind to FKBP12 with high affinity.&lt;br /&gt;&lt;br /&gt;Using NMR, resonances were assigned for each ligand from 2D TOCSY and 2D NOESY spectra.  The NMR experiments also allowed for detection of weak intermolecular NOE interactions.  From these data, intermolecular restraints between each ligand and the protein could be determined.   In the computational method, each ligand was assessed using MCSS, which determines energetically favorable positions and orientations of functional groups on the surface of a protein.  This was used to determine possible positions and orientations of each ligand in the binding site of FKBP12.&lt;br /&gt;&lt;br /&gt;The authors of the paper go into more detail explaining and describing their methods and the data that were obtained.  
From their study they came to the conclusion that using a computation method such as MCSS followed by ranking by preprocessing can be used as the computational alternative to determining structure activity relationship of ligands and their binding sites by use of NMR.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-4275706132182212354?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/4275706132182212354/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=4275706132182212354' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/4275706132182212354'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/4275706132182212354'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/i573-structure-activity-relationship-by.html' title='I573 - Structure Activity Relationship by NMR and by Computer: A Comparative Study'/><author><name>kilsoo</name><uri>http://www.blogger.com/profile/01139816994132222380</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2506906431811664872</id><published>2007-10-22T09:14:00.000-07:00</published><updated>2007-10-22T09:29:00.797-07:00</updated><title type='text'>I571:Predict Ligand-Binding Sites Based On Amino Acid Composition</title><content type='html'>&lt;div align="justify"&gt;Use of Amino Acid Composition to Predict Ligand-Binding Sites Soga, S.; Shirai, H.; Kobori, M.; Hirayama, N.J. Chem. Inf. 
Model; (Article); 2007; 47(2); 400-406. DOI: &lt;a class="link" href="http://dx.doi.org/10.1021/ci6002202"&gt;10.1021/ci6002202&lt;/a&gt; &lt;/div&gt;&lt;div align="justify"&gt; &lt;/div&gt;&lt;div align="justify"&gt;For medicinal chemists trying to find a suitable drug molecule for a target enzyme, it is very important to locate the ligand-binding site of each specific protein. This paper describes a novel method for predicting the binding sites for druglike compounds on the surface of proteins based on the specific amino acid composition observed at the ligand-binding sites of ligand-protein complexes, as determined by high-quality X-ray structures from the Protein Data Bank (PDB). Considering that the binding of a drug to its target is usually highly specific, the authors assume that the binding site should have a distinct character, significantly different from that of other concavities on the target (detected in this paper by Alpha Site Finder, implemented in the software system MOE): certain specific amino acid residues should appear with higher probability than others.&lt;br /&gt;At the beginning of the paper, previous algorithms for predicting ligand-binding sites are categorized into three major types: (1) geometric algorithms, (2) probe-mapping algorithms, and (3) physical potential algorithms. The authors then point out that an important factor, the clustering of specific amino acids at the ligand-binding sites, was ignored in this previous work. The proposed method takes advantage of the information about the amino acid compositions at the ligand-binding sites of the ligand-protein complexes. 
Specifically, the authors investigate the frequency of appearance of each of the 20 standard amino acids and the specific amino acid compositions at the binding sites.&lt;br /&gt;Profiles (SA, CA, RA; see the paper for definitions) calculated for the 20 standard amino acids surrounding each ligand for all proteins in the training set show that the amino acid composition at each ligand-binding site is highly specific, validating the assumption of this work. The authors then use these characteristic profiles to predict ligand-binding site locations on the basis of the amino acid compositions around the concavities. In particular, PLB (propensity for ligand binding) indices are calculated for 15232 concavities found in 756 proteins; a concavity with a higher PLB index is expected to be more likely to be a binding site. Concavities with the highest PLB index match 79% of the true ligand-binding sites in the test data set, and if the concavities with the two highest PLB indices are included, the hit percentage rises to 86%. After excluding nondruglike ligands from the test data set, the concavities with the two highest PLB indices cover 95% of the true binding sites. The authors therefore suggest that taking the concavities with the two highest PLB indices is a very fast but accurate selection. Finally, two examples are given to illustrate that this method can distinguish the true binding site from other concavities in a protein.&lt;br /&gt;No complex computational work is needed to predict the binding site, so this is a cheap and efficient “information-based” method. The method proposed in this paper can serve as a very powerful tool for medicinal chemists, since high-resolution X-ray structures and detailed information about the positions of residues are not required to predict the binding site, and it can predict the binding site for unknown ligands of novel targets. 
The flaw of this work is that when the authors chose the training and test sets, if a protein was multimeric, only the monomer with the smallest atomic displacement parameters for its non-hydrogen atoms was considered, so the method fails to identify binding sites in other subunits and at the interfaces between subunits, which can occur in reality. &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2506906431811664872?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2506906431811664872/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2506906431811664872' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2506906431811664872'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2506906431811664872'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/predict-ligand-binding-sites-based-on.html' title='I571:Predict Ligand-Binding Sites Based On Amino Acid Composition'/><author><name>Jingyu</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-6971917281561006957</id><published>2007-10-20T11:49:00.000-07:00</published><updated>2007-10-20T11:58:35.026-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='I-571 coffee break'/><category scheme='http://www.blogger.com/atom/ns#' term='3D search similarity descriptors'/><title type='text'>I-571:  Surface-based Similarity Searching</title><content type='html'>&lt;p 
style="margin-bottom: 0in;"&gt;Surface similarity-based molecular query-retrieval&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;R. Singh,  BMC Cell Biology 2007, 8(Suppl 1):S6 doi:10.1186/1471-2121-8-S1-S6  &lt;/p&gt; &lt;p style="margin-bottom: 0in;"&gt;&lt;a href="http://www.biomedcentral.com/1471-2121/8/S1/S6"&gt;http://www.biomedcentral.com/1471-2121/8/S1/S6&lt;/a&gt;&lt;/p&gt;   &lt;p style="margin-bottom: 0in;"&gt;&lt;br /&gt;&lt;/p&gt;&lt;p style="margin-bottom: 0in;"&gt;This paper describes a method for determining similarity among molecules by making use of a three-dimensional system that encodes properties of the molecule as a field which is mapped onto the molecular surface.  The author reasons that biological activity is strongly dependent on the three-dimensional character of the molecule and on the distribution of physicochemical properties such as hydrogen bond donors and acceptors.  The paper starts with an overview of methods for representing molecules and descriptors used for characterizing molecules such as SMILES, fingerprints, 2-D and 3-D graphs, and 3-D surfaces.  He then comments on some of the difficulties in using two-dimensional representations to describe three-dimensional entities, and the computer resources needed to search and compare three-dimensional graphs efficiently.  The surface-based method proposed in this paper captures information not well represented by 2-D methods and provides for more efficient searching than other three-dimensional search methods.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;The author then goes on to describe the results of applying his method to molecular searches.  First he tests the accuracy of the method in retrieving structures from a database on a selection of 5000 molecules from the MDDR database available from MDL, and compares the results to the same search using MDL's commercial 2-D database, ISIS.  
In this comparison, the surface-based method gave largely the same results as were obtained with ISIS.  The second test addressed the speed of retrieval compared to the Molecular Hashkeys algorithm, which is also based on molecular surfaces.  This test used 30 molecules from the MDDR collection and compared the time required to retrieve them using the two methods.  The Molecular Hashkeys method was considerably slower, matching one conformer every two seconds compared to 120 conformers per second for the proposed method.  The third test sought to validate the ability of the method to predict biological activity.  The validation used absorption data for 30 compounds from the MDDR collection and took place in three stages.  In the first stage, 20 of the 30 compounds were used as the training set for constructing a model with a neural network.  Next, the performance of the model was tested in a leave-one-out cross-validation sequence, in which the proposed method performed much better than Molecular Hashkeys at predicting biological activity.  Finally, the proposed method was tested against the 10 compounds that were not used by the neural net to construct the model, and again it performed quite well.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;The proposed surface-based model and the search algorithm are described in some detail in the final section of the paper.  Three surface properties are calculated for each molecule: in addition to the molecular surface itself, the hydrogen bond donor strength and hydrogen bond acceptor strength at each point on the surface are calculated.  These properties are then mapped onto the surface of a unit sphere.  The search algorithm then compares the resulting unit spheres using a Histogram Intersection method.  I didn't quite follow the description of the Histogram Intersection method, but a search on Google turned up a lot of pages on image analysis.   
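&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;For reference, the generic histogram intersection from the image-analysis literature can be sketched in a few lines of Python. This is the textbook form, not necessarily the variant used in the paper: each pair of corresponding bins contributes its smaller count, and here the sum is divided by the total of the smaller histogram, which is one common normalization and an assumption on my part. The function name and the example bin counts are hypothetical; in the paper's setting the bins would presumably hold binned surface-property values.&lt;/p&gt;
&lt;pre&gt;
```python
def histogram_intersection(h1, h2):
    """Return the normalized intersection of two histograms with the same bins.

    Each bin contributes min(h1[i], h2[i]); the sum is divided by the total
    count of the smaller histogram, so the score lies between 0 and 1.
    """
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    overlap = sum(min(a, b) for a, b in zip(h1, h2))
    smaller_total = min(sum(h1), sum(h2))
    if smaller_total == 0:
        return 0.0
    return overlap / smaller_total


# Hypothetical bin counts for a query molecule and a database molecule.
query = [4, 2, 0, 2]
candidate = [1, 3, 2, 2]
score = histogram_intersection(query, candidate)
```
&lt;/pre&gt;
&lt;p style="margin-bottom: 0in;"&gt;Identical histograms score 1 and histograms with no overlapping bins score 0, so the value can be used directly to rank database molecules against a query.&lt;/p&gt;  &lt;p style="margin-bottom: 0in;"&gt;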
Histogram Intersection seems to be a well-established method for doing the sort of thing described in this paper.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-6971917281561006957?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/6971917281561006957/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=6971917281561006957' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6971917281561006957'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/6971917281561006957'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/10/i-571-surface-based-similarity.html' title='I-571:  Surface-based Similarity Searching'/><author><name>Steve W</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='26' src='http://2.bp.blogspot.com/_ZoHHoUVFBn8/SRmqFv-UNxI/AAAAAAAAAD4/NZMZ9Z-m9M8/S220/norbornane.JPG'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-5909014815912730086</id><published>2007-05-22T14:44:00.001-07:00</published><updated>2007-05-23T14:47:56.376-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='QSAR model validation applicability domain cross-validation'/><title type='text'>QSAR Model Validation</title><content type='html'>"Principles of QSAR models validation: internal and external" (Gramatica, P., &lt;a href="http://dx.doi.org/10.1002/qsar.200610151"&gt;10.1002/qsar.200610151&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;The development of QSAR models has become increasingly easy with the development of both 
cheminformatics tools for descriptor generation (&lt;a href="http://www.talete.mi.it/dragon_net.htm"&gt;DRAGON&lt;/a&gt;, &lt;a href="http://almost.cubic.uni-koeln.de/cdk/cdk_top"&gt;CDK&lt;/a&gt;, &lt;a href="http://joelib.sourceforge.net/wiki/index.php/Main_Page"&gt;JOELib&lt;/a&gt;, &lt;a href="http://www.niss.org/PowerMV/"&gt;PowerMV&lt;/a&gt;) and modeling tools (&lt;a href="http://cran.r-project.org/"&gt;R&lt;/a&gt;). However, simply developing a regression equation (as one example) is not sufficient for declaring that one has a &lt;i&gt;good&lt;/i&gt; QSAR model. Since the goal of such a model is to predict properties for molecules that it hasn't seen before, the key feature of the model that must be analysed is its predictive ability. The procedure by which this is done is termed model validation.&lt;br /&gt;&lt;br /&gt;In general, one must validate a model for two reasons. First, we want to ensure that a model actually encodes a meaningful relationship as opposed to a random relationship. Second, we must ensure that the model can actually provide reasonable predictions for a molecule that it has not been trained on. The topic of model validation has been discussed by a number of authors such as &lt;a href="http://dx.doi.org/10.1016/S1093-3263(01)00123-1"&gt;Tropsha&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci0342472"&gt;Hawkins&lt;/a&gt; and &lt;a href="http://dx.doi.org/10.1002/1521-3838(200210)21:4&lt;348::AID-QSAR348&gt;3.0.CO;2-D"&gt;Kubinyi&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The current paper is a useful summary of three aspects of validating a QSAR model. The first topic discussed is the requirement for an &lt;b&gt;unambiguous algorithm&lt;/b&gt;. Now, in general the modeling algorithms used (linear regression, neural nets, PLS etc) have a well defined implementation. Equally important are the descriptors that are input to these algorithms. 
As Gramatica points out, different versions of descriptor-generating software can produce different descriptor values (due to bug fixes etc). As a result, a model developed using one version of the software may not be valid if a new version is used which generates different values for the same descriptor. This problem is exacerbated when different software packages, supposedly implementing the same descriptor, are used. Due to small differences in the implementations one can end up with different descriptor values. Clearly, some form of standardization is required in this area.&lt;br /&gt;&lt;br /&gt;The second aspect that Gramatica tackles is the issue of &lt;b&gt;domain applicability&lt;/b&gt;. Put simply, domain applicability tries to identify what types of molecules a model can provide a reliable prediction for. Note that it does not try to determine which molecules a model will predict &lt;i&gt;correctly&lt;/i&gt;. Rather, the question it tries to answer is: whether a model predicts a molecular property correctly or wrongly, how sure is it of the answer? Clearly, a model built on a set of straight chain alkanes is probably not meant to be applied to, say, aromatics. The thing is, when asked to predict a property of benzene, it will return a numerical value. But given our knowledge of the situation, this value is, in all likelihood, not very reliable.&lt;br /&gt;Gramatica focuses on the use of &lt;a href="http://en.wikipedia.org/wiki/Mahalanobis_distance"&gt;leverage&lt;/a&gt; to measure the domain applicability. A comprehensive overview of methods to measure domain applicability is given by &lt;a href="http://www.frame.org.uk/atlafn/atlacontents/33(2).htm"&gt;Netzeva&lt;/a&gt; et al (article 868).&lt;br /&gt;&lt;br /&gt;An interesting point raised by Gramatica relates to the issue of descriptor selection using a GA. 
She points out that &lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;a population of MLR models of similar good quality, developed by variable selection performed with GA, can include a hundred different models developed on the same training set but based on different descriptors&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;The point is that by using different descriptors these models will exhibit different applicability domains. However, some recent work I've been doing seems to suggest that the variability in the descriptors being selected in such a stochastic process might be indicative of the reliability of the underlying model. I have focussed on the use of random forests and this observation may or may not be transferable to OLS+GA, but one would think that if different types of descriptors are being selected, they should end up being quite similar to each other (though not necessarily in an obvious manner). Something to look at.&lt;br /&gt;&lt;br /&gt;The final topic Gramatica addresses is the importance of &lt;b&gt;external validation&lt;/b&gt; of a QSAR model. That is, after building a model using a training set, one must obtain predictions for an external prediction set that has not been used during training at all. One alternative that is used by a number of people is cross-validation (CV). Though useful for small datasets (as well as when one does not have access to new data), CV results are not necessarily indicative of good predictive ability. As has been shown by &lt;a href="http://dx.doi.org/10.1016/S1093-3263(01)00123-1"&gt;Golbraikh&lt;/a&gt; et al., a low q&lt;sup&gt;2&lt;/sup&gt; usually indicates a poorly predictive model, but a high q&lt;sup&gt;2&lt;/sup&gt; does not necessarily indicate a good predictive model.&lt;br /&gt;&lt;br /&gt;She goes on to point out a number of authors who believe that CV alone is sufficient to characterize a model's predictive ability. 
I have always thought this position to be unsatisfactory, unless there is no way out (small dataset, lack of new data etc.). Gramatica notes, and I agree, that CV is a &lt;i&gt;necessary&lt;/i&gt; measure of predictive ability but is not &lt;i&gt;sufficient&lt;/i&gt;. Dave Stanton, in a personal communication, points out that CV (via q&lt;sup&gt;2&lt;/sup&gt;) really measures the homogeneity of the dataset and not necessarily the predictive ability.&lt;br /&gt;&lt;br /&gt;Overall a useful paper that has a good list of references highlighting various aspects of QSAR model validation.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-5909014815912730086?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/5909014815912730086/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=5909014815912730086' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5909014815912730086'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/5909014815912730086'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/05/qsar-model-validation.html' title='QSAR Model Validation'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-3158875632504428248</id><published>2007-05-14T19:25:00.000-07:00</published><updated>2007-05-14T20:13:20.500-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='scaffold hopping review'/><title 
type='text'>Short Review on Scaffold Hopping</title><content type='html'>"Scaffold-Hopping: How Far Can You Jump?" (Schneider, G. et al., &lt;a href="http://dx.doi.org/10.1002/qsar.200610091"&gt;10.1002/qsar.200610091&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;Scaffold hopping is a common technique in the field of drug design where one starts from a certain chemical structure and attempts to identify other structures which perform or exhibit similar functionality. Given that one of the fundamental assumptions of computational drug design is that similar structures will exhibit similar behavior, one might think that scaffold hopping is simply a matter of similarity searching.&lt;br /&gt;&lt;br /&gt;However, though scaffold hopping does involve similarity searching, the real value of scaffold hopping techniques is their ability to identify &lt;i&gt;novel&lt;/i&gt; chemotypes - i.e., structural features that may not be very similar to the starting structure but exhibit properties that are similar to the starting structure. There is a large body (&lt;a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&amp;DB=pubmed"&gt;PubMed&lt;/a&gt;, &lt;a href="http://scholar.google.com/scholar?q=%22scaffold+hopping%22&amp;hl=en&amp;lr=&amp;btnG=Search"&gt;Google&lt;/a&gt;) of work related to this topic discussing a wide variety of techniques.&lt;br /&gt;&lt;br /&gt;The review by Schneider succinctly presents the underlying principles of scaffold hopping and highlights a number of successful &lt;i&gt;scaffold hops&lt;/i&gt; - compounds which have been identified using this technique and subsequently shown to exhibit good activity - using a variety of techniques.&lt;br /&gt;&lt;br /&gt;Schneider et al. classify scaffold hopping techniques into 3 classes: ligand based (alignment free and alignment based), receptor based and combined approaches, noting that the complexity and intensiveness of the approaches increase from left to right. 
Examples of the methods include substructure searches [&lt;a href="http://dx.doi.org/10.1021/jm050792d"&gt;1&lt;/a&gt;], a variety of pharmacophores [&lt;a href="http://dx.doi.org/10.1002/cbic.200400376"&gt;1&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1016/j.bmcl.2005.11.046"&gt;2&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci034296e"&gt;3&lt;/a&gt;], shape similarity [&lt;a href="http://dx.doi.org/10.1021/jm040163o"&gt;1&lt;/a&gt;] and docking [&lt;a href="http://dx.doi.org/10.1021/jm050499d"&gt;1&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/jm030605g"&gt;2&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/jm050837a"&gt;3&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/jm051129s"&gt;4&lt;/a&gt;].&lt;br /&gt;&lt;br /&gt;The authors also note that though scaffold hopping applications mainly use ligand based approaches due to speed and simplicity, situations where ligand induced fits are observed might also be amenable to scaffold hopping. &lt;br /&gt;&lt;br /&gt;The authors end the review by summarizing the issues underlying similarity as well as suggesting possible areas for further investigation such as "how low should one set a similarity cutoff to retrieve novel chemotypes?", "what types of descriptors and similarity metrics/scoring functions should be employed?" 
as well as suggesting the need for more statistical modeling of chemical similarity measures.&lt;br /&gt;&lt;br /&gt;Overall a useful review and a good complement to the article by &lt;a href="http://dx.doi.org/10.1002/qsar.200610097"&gt;Glen and Adams&lt;/a&gt; in the same issue.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-3158875632504428248?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/3158875632504428248/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=3158875632504428248' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3158875632504428248'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3158875632504428248'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/05/short-review-on-scaffold-hopping.html' title='Short Review on Scaffold Hopping'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-207707758741359957</id><published>2007-04-09T18:29:00.000-07:00</published><updated>2007-04-09T20:00:13.936-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='metabolism fingerprint prediction reaction xenobiotics'/><title type='text'>Predicting Metabolic Reaction Centers</title><content type='html'>"Reaction Site Mapping of Xenobiotic Biotransformations" (Boyer, S. et al. 
&lt;a href="http://dx.doi.org/10.1021/ci600376q"&gt;10.1021/ci600376q&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;The computational modeling of metabolism plays an important role in the drug development process. This area of modeling can be tricky due to the many and diverse metabolic processes that occur in living systems. The authors of this paper put it succinctly:&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;Diversity, Degeneracy and Duplication&lt;br /&gt;&lt;/blockquote&gt;, indicating that metabolic enzymes are diverse in the nature of the reactions that they catalyze, many enzymes can work on multiple substrates and many enzymes can take over the jobs performed by other enzymes.&lt;br /&gt;&lt;br /&gt;One common system of study is the Cytochrome P450 (CYP) family of enzymes and a variety of modeling strategies have been used (&lt;a href="http://dx.doi.org/10.1021/ci600561v"&gt;1&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci6002619"&gt;2&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci0500536"&gt;3&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/jm030972s"&gt;4&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci030283p"&gt;5&lt;/a&gt;). This paper considers an alternate approach which is based on an analysis of the occurrence of different reaction types in a large database of reactions.&lt;br /&gt;&lt;br /&gt;More specifically the authors considered the &lt;a href="http://www.mdl.com/products/predictive/metabolite/index.jsp"&gt;Metabolite&lt;/a&gt; database from MDL. Essentially, they analyze the database to identify substructures where reactions occur and then for a new molecule, identify any such substructures. Based on the frequency of occurrence in the database, the method is able to assign a quantitative measure of a metabolic transformation occurring at the matching atom in the query molecule. 
The authors use the &lt;a href="http://www.molprint.com/"&gt;MOLPRINT-2D&lt;/a&gt; fingerprints to characterize the molecules in the database as well as to characterize the reaction centers. To identify reaction centers in structures from the database, they made use of the supplied annotations. In addition, they also used a &lt;a href="http://en.wikipedia.org/wiki/Maximum_common_subgraph_isomorphism_problem"&gt;maximum common substructure&lt;/a&gt; approach to identify possible reaction centers.&lt;br /&gt;&lt;br /&gt;Given a set of fingerprints for the database of structures and the set of reaction centers, they describe an &lt;i&gt;exact match&lt;/i&gt; and &lt;i&gt;distance&lt;/i&gt; operator which are used in combination to identify hits for a given query molecule. The result of their approach is that for a query molecule, a number of atoms will be identified as possible reaction centers and assigned an &lt;i&gt;occurrence ratio&lt;/i&gt;, which essentially indicates the propensity for a reaction to occur at that atom. These atom-wise values can then be visualized (they do this using RasMol).&lt;br /&gt;&lt;br /&gt;Their benchmark results indicate that the MOLPRINT-2D fingerprints always identify those atoms at which reactions occur. In quantitative terms, the number of reactions occurring at an atom can be overestimated, but is never underestimated. Finally, they developed the procedure using the 2004 version of the database and then selected 30 compounds from the 2005 version of the database and were able to identify 87% of the experimental reaction centers as one of the top three reaction centers.&lt;br /&gt;&lt;br /&gt;Overall an interesting approach to reaction center mapping that does not involve approximations (as in predictive modeling). It's also different from the rule based approaches (such as Talafous, J. et al., J. Chem. Inf. Comput. 
Sci., 1994, 34, 1326 and &lt;a href="http://www.lhasalimited.org/index.php?cat=2&amp;sub_cat=64"&gt;DEREK&lt;/a&gt;) in that it doesn't rely on hard coded rulesets. &lt;br /&gt;&lt;br /&gt;One thing that is missing from the approach is the consideration of chirality, which, AFAIK, the fingerprints don't take into account. The authors note this and mention it as a work in progress. Also it seems that the approach can be skewed by the composition of the database (or subsets of the database) especially if there are few examples of a reaction within the database.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-207707758741359957?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/207707758741359957/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=207707758741359957' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/207707758741359957'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/207707758741359957'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/04/predicting-metabolic-reaction-centers.html' title='Predicting Metabolic Reaction Centers'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-3382314231486762489</id><published>2007-04-05T19:58:00.000-07:00</published><updated>2007-04-05T20:49:53.467-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='chirality descriptor QSAR 
topological'/><title type='text'>Topological Descriptors For Chirality</title><content type='html'>"Novel Approach for the Numerical Characterization of Molecular Chirality" (Natarajan, R. et al., &lt;a href="http://dx.doi.org/10.1021/ci600542b"&gt;10.1021/ci600542b&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;Chirality is a molecular property that can have far reaching effects. Examples include chiral drug molecules, of which one enantiomer may exhibit the drug-like property whereas the other enantiomer may have no effect (or even an undesirable effect, of which thalidomide is a famous example). So approaches to quantifying chirality are useful and a number of people have developed numerical measures of molecular chirality for a variety of scenarios such as 2D and 3D QSAR (&lt;a href="http://dx.doi.org/10.1021/ci0505574"&gt;1&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci025516b"&gt;2&lt;/a&gt;) and predicting enantiomeric excess (&lt;a href="http://dx.doi.org/10.1021/cc049961q"&gt;1&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci000125n"&gt;2&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;In this paper Natarajan et al. described a topological descriptor that aims to characterize chirality in terms of a &lt;i&gt;scale&lt;/i&gt;. This is noted to be in contrast to other methods where&lt;br /&gt;&lt;blockquote&gt;chirality is treated as an either-or property&lt;/blockquote&gt;I was a little surprised at this as it seems that there are a number of methods that lead to a numerical measure of the degree of chirality, such as the one by &lt;a href="http://dx.doi.org/10.1021/ci970460k"&gt;Moreau&lt;/a&gt; which characterizes the chirality of the environment of specific atoms. 
The current method also generates numerical values for each chiral center, based on properties of the groups connected to the chiral center.&lt;br /&gt;&lt;br /&gt;The method is based on the use of the Kier and Hall group delta formulation to provide the property attributes for each of the groups attached to a chiral center. The method then uses the group delta values and the &lt;a href="http://en.wikipedia.org/wiki/Cahn_Ingold_Prelog_priority_rules"&gt;CIP&lt;/a&gt; priority rules to generate a numerical value for a chiral center. In addition to the group delta values, they also used the valence delta values as well as molecular weight for the property values. The net result is 3 descriptor values for a given molecule.&lt;br /&gt;&lt;br /&gt;Their results indicate that the new descriptor does indeed differentiate between R/S isomers for a variety of molecules. However they also indicate that sometimes one of the descriptors exhibits degeneracy and that this can be fixed by a &lt;i&gt;correction factor&lt;/i&gt;. However they don't mention what this factor would be.&lt;br /&gt;&lt;br /&gt;They also used these descriptors to rank the mosquito repellent activity of &lt;a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=125098"&gt;picaridin&lt;/a&gt; (InChI=1/C12H23NO3/c1-3-10(2)16-12(15)13-8-5-4-6-11(13)7-9-14/h10-11,14H,3-9H2,1-2H3) and AI3-37220. It is interesting to note that of the 3 descriptors they calculate, in one case the repellency is negatively correlated with the descriptors and in the other case positively correlated. For some of the descriptors, the rank correlation is quite low. They note that this inconsistency may be due to the chemical nature of the molecule.&lt;br /&gt;&lt;br /&gt;Overall a somewhat interesting, but not entirely new, approach to quantifying chirality, but as with most topological descriptors, the quantification is rather arbitrary. 
My main issue with this approach is this: what does it mean to say that one molecule is &lt;i&gt;more&lt;/i&gt; chiral than another? What are the physical implications of this?&lt;br /&gt;&lt;br /&gt;This is not clearly explained in the paper&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-3382314231486762489?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/3382314231486762489/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=3382314231486762489' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3382314231486762489'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/3382314231486762489'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/04/topological-descriptors-for-chirality.html' title='Topological Descriptors For Chirality'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-955548307690295811</id><published>2007-03-29T07:38:00.000-07:00</published><updated>2007-03-29T07:57:28.916-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MQL SMARTS substructure query language'/><title type='text'>Progress in Substructure Query Languages</title><content type='html'>"Molecular Query Language (MQL)-A Context-Free Grammar for Substructure Matching" (Proschak, E. 
et al., &lt;a href="http://dx.doi.org/10.1021/ci600305h"&gt;10.1021/ci600305h&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html"&gt;SMARTS&lt;/a&gt; query language has been the workhorse of substructure matching in the chemical informatics community. Though it's certainly useful, it's been interesting that there hasn't been much development in this area. However Proschak and coworkers recently described a query language that has nearly all of the ability of SMARTS, but in addition, provides a high degree of flexibility and extensibility.&lt;br /&gt;&lt;br /&gt;The Molecular Query Language (MQL) is a substructure query language that is based on a context free grammar. The grammar definition (in BNF) is available and an implementation based on the CDK is also available. &lt;br /&gt;&lt;br /&gt;The MQL shares a number of features with SMILES and SMARTS, though there are differences. One example of such a difference is that single bonds must be explicitly defined. However this is made up for by the ability to query on a large number of properties (which can be numerical or categorical) using relational operators. Thus one can write a query using "order&lt;=2" where order is the name of some property. Examples of these types of properties include &lt;i&gt;implicitHydrogens, allHydrogens, explicitConnections&lt;/i&gt;. An example of a categorical property is &lt;i&gt;belongsToSSSR&lt;/i&gt;, so one can write something like &lt;i&gt;C[belongsToSSSR=2]&lt;/i&gt; which will match a carbon atom that occurs in a ring system of 2 rings (which represent the SSSR in this case).&lt;br /&gt;&lt;br /&gt;Clearly, this is a much more expressive and straightforward approach to defining substructure queries. 
As the authors point out, it is quite easy to extend the grammar to include other arbitrary properties - as long as the underlying toolkit can be used to generate a value of the property in question it can be used in a query.&lt;br /&gt;&lt;br /&gt;This is certainly an extremely useful and impressive solution and kudos to the authors for opening up the specification and implementation. Hopefully this will lead to the MQL having a big impact in the community.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-955548307690295811?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/955548307690295811/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=955548307690295811' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/955548307690295811'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/955548307690295811'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/03/progress-in-substructure-query.html' title='Progress in Substructure Query Languages'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-2926622320195563547</id><published>2007-03-15T11:49:00.000-07:00</published><updated>2007-03-29T07:58:43.870-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='QSAR applicability domain toxicology predictive model'/><title type='text'>The Applicability Domain of a Toxicological QSAR 
Model</title><content type='html'>"Assessing Applicability Domains of Toxicological QSARs: Definition, Confidence in Predicted Values, and the Role of Mechanisms of Action" (Schultz, T.W. et al., &lt;a href="http://dx.doi.org/10.1002/qsar.200630020"&gt;10.1002/qsar.200630020&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;2D QSAR models have been in use for more than 30 years. With regulatory agencies tending towards avoiding animal testing and other forms of expensive in vivo tests, the use of a QSAR model has gained in importance. However, before a model can be used in a production environment, one of the aspects that must be verified is the &lt;i&gt;domain applicability&lt;/i&gt; - that is, can we use the model to predict the property of a molecule that was not in the training set for the model? A number of approaches to measuring domain applicability have been described (&lt;a href="http://dx.doi.org/10.1021/ci0500381"&gt;1&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci025604w"&gt;2&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci0497511"&gt;3&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1016/j.drudis.2006.06.013"&gt;4&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Schultz et al describe an approach to answering this problem for a specific (2 descriptor) QSAR model which predicted the toxicity of aromatic compounds to &lt;i&gt;T. pyriformis&lt;/i&gt;. Essentially they consider two aspects of a model that dictate its applicability - its structural domain and its descriptor domain. They then choose two test sets, one of which lies within the domain of the model such that the test set descriptor values lie within the ranges shown in the training set and the compounds have structural features that were found in the training set. The second test set was selected so as to lie outside the structural domain of the training set. They then correlate the absolute residuals to the individual descriptor values as well as to the distance from the centroid of the descriptor space. 
They also categorize each of the test set compounds by mechanisms of action based on structural criteria.&lt;br /&gt;&lt;br /&gt;The authors draw a number of conclusions:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Absolute residuals should not be related to the centroid of a descriptor space if the descriptors do not contribute equally to the predictive ability of the model (they consider a weighted distance to the centroid to take into account this effect)&lt;br /&gt;&lt;li&gt;Obtaining predictions for a molecule outside the structural domain of the training set leads to erroneous predictions&lt;br /&gt;&lt;li&gt;A &lt;i&gt;mechanistic&lt;/i&gt; domain should also be considered for a QSAR model in addition to the descriptor and structural domains. That is, does the test molecule act in the same way that the training set molecules do?&lt;br /&gt;&lt;li&gt; Toxicity for the more electrophilic compounds is predicted quite poorly. They also correlate the compounds that were well and poorly predicted to modes of action so that narcotics and respiratory uncouplers were well predicted but soft-electrophiles were not.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;Many of these conclusions are not surprising, though the importance of a mechanistic domain is well described and is something that I believe should be taken into account. The problem with this is that it seems that it would be difficult for a 2D QSAR model to identify a possible mode of action. In the current paper the authors correlated the predictive ability with the mode of action retrospectively. Doing this prospectively would be a difficult job.&lt;br /&gt;&lt;br /&gt;The paper does have a number of problems.  Firstly, &lt;a href="http://en.wikipedia.org/wiki/Mean_squared_error"&gt;RMSE&lt;/a&gt; is &lt;b&gt;not&lt;/b&gt; the same thing as absolute residual. All the graphs are labeled with RMSE but the text talks about absolute residuals. Second are the graphs, especially Figs. 2 and 3. 
The authors claim that the absolute residuals for the training set are correlated to electrophilicity but not logP. They even add a regression line to the plot of absolute residual vs electrophilicity (Fig 3). But they don't add one to the plot of absolute residual vs logP. Frankly, both graphs look relatively random to me. It would have been nice to have a regression line on both graphs as well as some R^2 values.&lt;br /&gt;&lt;br /&gt;One of the biggest problems is that much of it was described previously by &lt;a href="http://dx.doi.org/10.1021/ci0497511"&gt;Guha et al.&lt;/a&gt;, in more general terms.  Thus the authors show that the distance of a test set compound to the centroid of the training set descriptor space does not correlate with predictive ability. This was shown previously for a number of datasets. In this regard there aren't any really new conclusions.&lt;br /&gt;&lt;br /&gt;Finally, what is the need for including huge tables of numbers in a paper? Isn't that sort of stuff meant to go into Supplementary Information?&lt;br /&gt;&lt;br /&gt;Overall I think the paper has value due to the careful construction of training and test sets to highlight the issues related to domain applicability. The idea of a mechanistic domain is definitely a good idea and should be considered. 
But the bulk of the conclusions aren't really that great (or unknown)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-2926622320195563547?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/2926622320195563547/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=2926622320195563547' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2926622320195563547'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/2926622320195563547'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/03/applicability-domain-of-toxicological.html' title='The Applicability Domain of a Toxicological QSAR Model'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-959260844584891977</id><published>2007-03-10T06:23:00.000-08:00</published><updated>2007-03-29T07:59:16.682-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='3D search similarity descriptors'/><title type='text'>Fast 3D Similarity Search Using Descriptors</title><content type='html'>"Ultrafast Shape Recognition for Similarity Search in Molecular Databases" (Ballester, P.J and Richards, W.G. DOI:&lt;a href="http://dx.doi.org/10.1002/jcc.20681"&gt;10.1002/jcc.20681&lt;/a&gt;. 
Also published &lt;a href="http://dx.doi.org/10.1098/rspa.2007.1823"&gt;here&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;3D similarity searching is a well studied topic with a variety of methods available (&lt;a href="http://dx.doi.org/10.1021/ci960108r"&gt;1&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci010068d"&gt;2&lt;/a&gt;, &lt;a href="http://www.springerlink.com/content/m2844362m5637656/"&gt;3&lt;/a&gt;). Fundamentally there are two approaches to evaluating 3D similarity, viz., superposition methods (&lt;a href="http://dx.doi.org/10.1021/jm040163o"&gt;Rush et al.&lt;/a&gt;) which require an alignment procedure and descriptor based methods which aim to characterize shape in terms of a small set of numerical values (essentially a &lt;a href="http://en.wikipedia.org/wiki/Dimensionality_reduction"&gt;dimension reduction&lt;/a&gt; procedure) and then evaluate similarity between molecules by using these descriptors. &lt;br /&gt;&lt;br /&gt;Recently there was a splash on Slashdot as well as New Scientist regarding the work described in this paper which belongs to the second class of methods. I was a little surprised since there has been much work in this area of 3D similarity, probably the most similar being Zauhar's &lt;a href="http://dx.doi.org/10.1021/jm030242k"&gt;Shape Signatures&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;Ballester et al propose using a set of descriptors derived from the distribution of distances between atomic coordinates to 4 specific points: the centroid, the atom closest to the centroid, the atom furthest from the centroid and the atom furthest from the preceding atom. The distance distributions are then summarized using the first 3 moments (mean, variance and skewness). So the shape of each molecule is encoded by a 12-element vector. 
They then define a similarity measure based on the normalized &lt;a href="http://www.nist.gov/dads/HTML/manhattanDistance.html"&gt;Manhattan&lt;/a&gt; distance.&lt;br /&gt;&lt;br /&gt;The method is clearly quite fast (the authors report a 1500x speedup compared to MOE's EShape3D, a 2000x speedup compared to Shape Signatures and a 14000x speedup compared to &lt;a href="http://www.eyesopen.com/products/applications/rocs.html"&gt;ROCS&lt;/a&gt;) as well as space efficient. However, more importantly, is it accurate? This is important since the method is a low dimensional representation of the molecular shape and thus some information is lost. To what extent does this loss affect the method's utility in similarity search?&lt;br /&gt;&lt;br /&gt;The authors use a visual comparison approach to justify their claim that the method does indeed identify similar shapes (from a database of 2.4M compounds) - essentially looking at the molecules that had the highest similarity for 5 query molecules. The figures do indicate a high degree of similarity - but it's not really quantitative. However they also compare the most similar hits retrieved by their method with those retrieved by MOE's EShape3D descriptor method. They report that in most cases, the methods got the same hits, though in some the hits returned by MOE were &lt;i&gt;visually&lt;/i&gt; less similar. 
The authors also considered the issue of retrieving hits from a collection of conformations - and they indicate that their method retrieves conformers that are more similar to the query than does EShape3D.&lt;br /&gt;&lt;br /&gt;One aspect that would be interesting to see is whether using a larger number of moments would increase the efficacy of the method.&lt;br /&gt;&lt;br /&gt;Overall an interesting method, especially due to its simplicity, though the validation could be a little more rigorous.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Sidenote&lt;/b&gt;: An implementation of this method that allows one to generate the 12-element vectors from SD files as well as query an SD file to find similar hits is available &lt;a href="http://cheminfo.informatics.indiana.edu/~rguha/code/java/#momsim"&gt;here&lt;/a&gt; as a standalone program. The main similarity code has been &lt;a href="http://cheminfo.informatics.indiana.edu/~rguha/code/java/nightly/api/org/openscience/cdk/similarity/DistanceMoment.html"&gt;incorporated&lt;/a&gt; into the &lt;a href="http://almost.cubic.uni-koeln.de/cdk/cdk_top"&gt;CDK&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-959260844584891977?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/959260844584891977/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=959260844584891977' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/959260844584891977'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/959260844584891977'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/03/fast-3d-similarity-search-using.html' 
title='Fast 3D Similarity Search Using Descriptors'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-1567420813781993893</id><published>2007-03-01T09:32:00.000-08:00</published><updated>2007-03-01T09:44:41.046-08:00</updated><title type='text'>A lightweight but thought provoking paper about visualization</title><content type='html'>Data Reduction and Representation in Drug Discovery, Drug Discovery Today, 12, 1/2, Jan 2007&lt;br /&gt;&lt;br /&gt;This paper points out that even a modest drug discovery project can involve an extremely large amount of data when you factor in compounds, screening results, descriptors, ADME/Tox predictions, and so on. It relates this to the NIH MLSCN where 3,000,000 compounds are expected to be in the screening library by 2009. The paper discusses how data reduction and compression are inextricably linked with visualization, as the human mind can only interpret a limited number of data dimensions. The authors discuss a method called VlaaiVis developed at J&amp;amp;J, which allows many dimensions of data to be visualized in a pie or radial pattern, as well as the better-known heatmap method in Spotfire. The biological sciences are also discussed, including very 'arty' visualizations of homology relationships across a proteome. In conclusion, the authors assert that the drug discovery industry is just waking up to the opportunities for visualization, in the same way that cartography emerged in the 1930s. Opinion: there is not much 'meat' in the paper, but it highlights the fact that we need to think imaginatively about visualization. Everything is very Spotfire-based at the moment. 
And why don't they refer to Edward Tufte?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-1567420813781993893?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/1567420813781993893/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=1567420813781993893' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1567420813781993893'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/1567420813781993893'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/03/lightweight-but-thought-provoking-paper.html' title='A lightweight but thought provoking paper about visualization'/><author><name>David</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-117044783462999433</id><published>2007-02-02T12:12:00.000-08:00</published><updated>2007-02-02T14:23:11.493-08:00</updated><title type='text'>An Anti-Climactic Comparison of QSAR Classifiers</title><content type='html'>"Contemporary QSAR Classifiers Compared" (Bruce, C.L. et al., &lt;a href="http://dx.doi.org/10.1021/ci600332j"&gt;10.1021/ci600332j&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;I came across this paper expecting a comparison of many of the classification algorithms used in QSAR modeling. 
The paper is somewhat interesting but in the end only really compares &lt;a href="http://en.wikipedia.org/wiki/Support_vector_machine"&gt;SVM&lt;/a&gt;s and &lt;a href="http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm"&gt;Random Forests&lt;/a&gt; (RF). So it's not as comprehensive as I would have liked.&lt;br /&gt;&lt;br /&gt;None of the conclusions are very striking and indeed many of them follow naturally from the theory (especially regarding the RF classification models). However one aspect of this work that is generally not seen in comparative studies is the use of statistical tests to indicate the significance of differences in classifier performance (they use the &lt;a href="http://www.graphpad.com/articles/interpret/ANOVA/friedmans.htm"&gt;Friedman statistic&lt;/a&gt; and the Nemenyi test).&lt;br /&gt;&lt;br /&gt;Some of the problematic issues with this paper are described below:&lt;br /&gt;&lt;br /&gt;They considered 8 datasets but three of them have only 66 to 88 observations. These seem quite small datasets on which to train an SVM or RF. Furthermore the descriptor sets used for these datasets contain on the order of 550 descriptors. Now for an RF model, with implicit feature selection, this is not a huge issue. However, they used all ~550 descriptors for the SVMs - i.e., no feature selection was performed. I think this is a gross mistake!&lt;br /&gt;&lt;br /&gt;They investigated the predictive performance with varying numbers of trees in the RF model and concluded that 100 trees are optimal for cheminformatics applications. A reading of &lt;a href="http://www.amazon.com/Classification-Regression-Trees-Leo-Breiman/dp/0412048418/sr=8-2/qid=1170447957/ref=pd_bbs_sr_2/105-4287350-5880412?ie=UTF8&amp;s=books"&gt;CART&lt;/a&gt; by Breiman et al., would indicate that RFs are &lt;i&gt;provably&lt;/i&gt; resistant to overfitting. 
What this means is that if 100 trees (for which the authors observed convergence) are replaced with 500 or 1000 trees, one would get the same results (though there would be a speed and memory penalty). Thus rather than iterating over different numbers of trees, one can simply use a large number right away and not have to worry about overfitting. This portion of the authors' analysis seems redundant.&lt;br /&gt;&lt;br /&gt;They also consider interpretation of the RF models. Though interpretation of these models is trickier than that of single decision trees, there is a body of work related to this issue (&lt;a href="http://citeseer.ist.psu.edu/urbanek02exploring.html"&gt;here&lt;/a&gt; and &lt;a href="http://citeseer.ist.psu.edu/107477.html"&gt;here&lt;/a&gt;). The authors did provide an interesting analysis of the most frequent descriptors that showed up in the trees for a random forest. They then used the occurrence frequency as a measure of each descriptor's importance. Though this is interesting, the authors could have also considered the &lt;i&gt;most important&lt;/i&gt; descriptors, as reported by the RF (see &lt;a href="http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp"&gt;variable importance&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Overall, the highlight of the paper is the use of rigorous statistical tests to compare classifiers, which is generally not seen. 
However, the paper does not really deliver what the title indicates it will and is methodologically lacking in a number of areas.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-117044783462999433?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/117044783462999433/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=117044783462999433' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/117044783462999433'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/117044783462999433'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/02/anti-climactic-comparison-of-qsar.html' title='An Anti-Climactic Comparison of QSAR Classifiers'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-117021715094446777</id><published>2007-01-30T17:56:00.000-08:00</published><updated>2007-01-30T20:19:11.006-08:00</updated><title type='text'>Random Test Set Selection is Not Always Good</title><content type='html'>"Measuring CAMD Technique Performance. 2. How 'Druglike' Are Drugs? Implications of Random Test Set Selection Exemplified Using Druglikeness Classification Models" (A. Good and M.A. Hermsmeier &lt;a href="http://dx.doi.org/10.1021/ci6003493"&gt;10.1021/ci6003493&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;This paper describes issues that arise when testing a QSAR model using a randomly selected test set. 
The main premise of the work is that random test sets can lead to over optimistic predictions by the model. The authors point out that when a test set is selected from a corporate database, which in many cases will have many close analogs, the resultant test set is quite homogeneous and as a result, predictions will overestimate &lt;i&gt;model utility&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;The authors perform drug-likeness classifications (using the ACD and WDI databases) in which they consider a randomly selected test set as well as a test set based on an ontological classification of the WDI (&lt;a href="http://dx.doi.org/10.1021/ci010385k"&gt;Schuffenhauer et al.&lt;/a&gt;). In the latter case they build models (Naive Bayes classifiers) based on structures belonging to a set of ontological classes and then use the model to predict the drug/non-drug classification for structures from a class that was not in the training set. Their results indicate that the models in the former case lead to better results (lower standard deviation).&lt;br /&gt;&lt;br /&gt;The results are not surprising. If the test set is not similar to the training set, QSAR model performance will degrade. On the other hand, it is clear that if your aim is to predict a certain class of compounds, it pays to have a training set that is representative of such compounds. This is evident from the second set of results. The key aspect to note is that when using a random test set from a large drug-like database, good results can generally be expected, assuming that the training set was sufficiently &lt;i&gt;heterogeneous&lt;/i&gt;. &lt;br /&gt;&lt;br /&gt;The authors state that random test sets lead to &lt;i&gt;over optimistic&lt;/i&gt; predictions. Of course, compared to results from targeted test sets (especially if the training set does not contain members from the class used to build the test set), the predictions are over optimistic. But this is to be expected. 
I don't see this as a bad thing.&lt;br /&gt;&lt;br /&gt;It seems to me that the message of the paper is that if you're modeling a certain class of compounds you should have a test set that is representative of those compounds. This is a well known fact.&lt;br /&gt;&lt;br /&gt;To conclude, the paper is a useful reminder that random test sets can (sometimes) lead to misleading results, especially when the source of the test set is a database containing many close analogs. However at the same time, I think that the results obtained from random test sets are to be expected and the conclusions regarding &lt;i&gt;over optimistic&lt;/i&gt; results are really a feature and not a bug. In my mind, what the results of the paper really indicate, is that if a model is applied to data beyond its domain of applicability, the results will not be good.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-117021715094446777?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/117021715094446777/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=117021715094446777' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/117021715094446777'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/117021715094446777'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2007/01/random-test-set-selection-is-not.html' title='Random Test Set Selection is Not Always Good'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-116837393271714956</id><published>2007-01-09T12:18:00.000-08:00</published><updated>2007-01-09T12:18:52.726-08:00</updated><title type='text'>Review of what works - paper by Yvonne Martin</title><content type='html'>A nice paper by Yvonne Martin at Abbott labs: What works and what does not: lessons from experience in a pharmaceutical company, QSAR Comb. Sci., 25, 2006, 12, 1192-1200. Overall messages: there is a difference between the statistical accuracy of a method, and its usefulness to the chemist, and the perspective of a computational chemist on a project team goes beyond just delivering results. Also some interesting nuggets: LogP, etc. calculations are not as accurate as we think, but that doesn't mean they are not useful; compounds with Tanimoto similarity &gt; 0.85 to an active have a 30% chance of being active, based on HTS data analysis; a 0.55 similarity cutoff in clustering allows collation of compounds similar enough to reveal SAR information.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-116837393271714956?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/116837393271714956/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=116837393271714956' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116837393271714956'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116837393271714956'/><link rel='alternate' type='text/html' 
href='http://cheminfoclub.blogspot.com/2007/01/review-of-what-works-paper-by-yvonne.html' title='Review of what works - paper by Yvonne Martin'/><author><name>David</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-116708819788351399</id><published>2006-12-25T14:07:00.000-08:00</published><updated>2006-12-25T15:09:57.946-08:00</updated><title type='text'>Gzip for Molecular Similarity</title><content type='html'>"Similarity By Compression" (J.L. Melville et al., &lt;a href="http://dx.doi.org/10.1021/ci600384z"&gt;10.1021/ci600384z&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;There have been many articles on the use of molecular similarity (&lt;a href="http://dx.doi.org/10.1021/ci040120g"&gt;1&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/jm051209w"&gt;2&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci0503863"&gt;3&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci0400819"&gt;4&lt;/a&gt;), methods for capturing molecular similarity (&lt;a href="http://dx.doi.org/10.1021/ci034207y"&gt;1&lt;/a&gt;, &lt;a href=" http://dx.doi.org/10.1021/ci600214b"&gt;2&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci0601261"&gt;3&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci0501948"&gt;4&lt;/a&gt;, &lt;a href="http://dx.doi.org/10.1021/ci6002152"&gt;5&lt;/a&gt;) and the application of traditional similarity methods to non drug-like molecules such as &lt;a href="http://dx.doi.org/10.1021/ci034271f"&gt;DNA&lt;/a&gt; and &lt;a href="http://dx.doi.org/10.1021/ci0502597"&gt;proteomic maps&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The paper that is the focus of this post presents an alternative approach to the evaluation of molecular similarity using the SMILES representation directly, rather than evaluating fingerprints or molecular descriptors. 
The method is based on the &lt;i&gt;&lt;a href="http://www.complearn.org/ncd.html"&gt;normalized compression distance&lt;/a&gt;&lt;/i&gt;. This metric evaluates the distance between two objects by considering their compressed sizes, when considered individually and when joined (concatenated) together. The premise is that if two objects are similar, then their compressed versions will also be similar, such that the concatenated version will compress very efficiently. &lt;br /&gt;&lt;br /&gt;The idea is interesting and has been used in a number of areas as surveyed in the paper. However, my biggest problem with this approach is that the NCD does not strictly satisfy the &lt;a href="http://mathworld.wolfram.com/Metric.html"&gt;requirements&lt;/a&gt; for a metric. Thus it is possible that given the same SMILES, the NCD does not equal 0.0. This shortcoming arises because the NCD is actually an approximation to the normalized information distance, which is based on the &lt;a href="http://mathworld.wolfram.com/KolmogorovComplexity.html"&gt;Kolmogorov complexity&lt;/a&gt;. The shortcomings are noted in the paper and the authors note that it is possible to &lt;i&gt;convert&lt;/i&gt; the NCD to a proper metric by a scaling process. However, it seems to me that this approach is really a feature of the dataset being studied and does not really give rise to a proper metric (though it may appear so for the set of molecules under consideration). &lt;br /&gt;&lt;br /&gt;The authors' implementation was done in Ruby based on system calls to the command line versions of gzip and bzip. They note that one possible reason for violations of metricity is that the resultant compressed files included header information along with the actual compressed data. 
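&lt;br /&gt;&lt;br /&gt;For reference, the NCD of two strings x and y is usually written as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size in bytes and xy is the concatenation of the two strings. A minimal Python sketch follows. It assumes the zlib module at maximum compression rather than the authors' Ruby calls to the command line tools, and the function names are illustrative only.&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
import zlib


def compressed_size(data):
    """Size in bytes of the zlib-compressed input (stream header included)."""
    return len(zlib.compress(data, 9))


def ncd(smiles_a, smiles_b):
    """Normalized compression distance between two SMILES strings."""
    a = smiles_a.encode("ascii")
    b = smiles_b.encode("ascii")
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)  # compress the concatenation
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    ethanol = "CCO"
    print("aspirin vs ethanol:", ncd(aspirin, ethanol))
    print("aspirin vs itself: ", ncd(aspirin, aspirin))
```
&lt;/pre&gt;
&lt;br /&gt;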
However I quickly implemented their algorithm using the &lt;a href="http://docs.python.org/lib/module-gzip.html"&gt;gzip&lt;/a&gt; module from Python as well as the corresponding Java &lt;a href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/package-summary.html"&gt;class&lt;/a&gt;. It was interesting to note that the two implementations of the gzip algorithm did not give the same value of the NCD for two molecules. Furthermore, when considering a single molecule and evaluating the NCD to itself, neither of the algorithms gave a zero-value and they both differed in the final value that was returned. As I noted above, the authors note this behavior but I feel that it's difficult to consider the NCD as an actual metric - what does a non-zero self-distance mean?&lt;br /&gt;&lt;br /&gt;They tested their similarity metric by simulating a virtual screen whereby they placed 45 or so active compounds amongst approximately 1800 inactives. Then they considered 5 actives and ranked the 1850 compounds based on similarity to each of the five actives. The final score for each library compound was the highest rank it obtained when all 5 target molecules were considered. In addition to their NCD based approach they also considered Tanimoto similarity as well as simple counts of specific atom types.&lt;br /&gt;&lt;br /&gt;The results are interesting. They considered 5 groups of actives and after implementing a scaling and padding method to metricize the NCD, they show that the AUC is maximal when using their compression based approach. However when we go to look at enrichment factors, the NCD approach is best for 2 out of five datasets (when the top 1% of the library is considered) but is the best for 4 out of five datasets when the top 5% of the library is considered. 
Interestingly they noted that the bzip2 based compression metric performed very poorly on all the datasets.&lt;br /&gt;&lt;br /&gt;Overall the idea is certainly interesting and as the authors note, there is minimal parameter tuning, at the cost of loss of metricity. It is also true that their tests turned up some actives that were not also returned by the Tanimoto metric. Given that the overall performance is generally close to (though in some cases better than) the Tanimoto metric based on fingerprints, it's not entirely clear as to the value of this approach - though a combination of the NCD and other metrics might be useful.&lt;br /&gt;&lt;br /&gt;To end, they also note that the compression distance approach might be useful in clustering applications - it's not clear as to why this would be useful as it (intuitively) seems that without a strict metric function, the resultant clustering would be unstable. This might be something interesting to look at.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-116708819788351399?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/116708819788351399/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=116708819788351399' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116708819788351399'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116708819788351399'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2006/12/gzip-for-molecular-similarity.html' title='Gzip for Molecular Similarity'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-116518311117918031</id><published>2006-12-03T13:23:00.000-08:00</published><updated>2006-12-20T10:58:04.073-08:00</updated><title type='text'>Model free approach to the Drug/Non-drug problem</title><content type='html'>"Separating Drugs from Nondrugs: A Statistical Approach Using Atom Pair Distributions" (M.C. Hutter, &lt;a rev='review' href="http://dx.doi.org/10.1021/ci600329u"&gt;10.1021/ci600329u&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;This paper describes an approach to determining whether a given molecule will be considered drug-like or non-drug-like. There's a lot of literature on this topic (&lt;a rev='review' href="http://dx.doi.org/10.1021/cc990032m"&gt;here&lt;/a&gt;, &lt;a rev='review' href="http://dx.doi.org/10.1021/ci000026+"&gt;here&lt;/a&gt;, &lt;a rev='review' href="http://dx.doi.org/10.1021/jm9706776"&gt;here&lt;/a&gt; and &lt;a rev='review' href="http://dx.doi.org/10.1021/cc9800071"&gt;here&lt;/a&gt;). However one of the common features of most approaches to this problem is that they develop a predictive model based on a set of drug-like and non-drug-like examples, and then use that to predict the nature of a new molecule. Essentially, a model is developed that attempts to encode the distribution of &lt;span style="font-style:italic;"&gt;properties&lt;/span&gt; of the two classes of molecules. &lt;br /&gt;&lt;br /&gt;In contrast, the approach described by Hutter does not attempt to develop a predictive model. Instead he uses a modification of the &lt;a href="http://en.wikipedia.org/wiki/Substitution_matrix"&gt;substitution matrices&lt;/a&gt; used in bioinformatics. 
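&lt;br /&gt;&lt;br /&gt;In sequence analysis a substitution matrix entry is typically a log-odds score, i.e. log(observed pair frequency / pair frequency expected by chance). A minimal Python sketch of that idea applied to atom-type pairs is given below. It is an illustration under stated assumptions rather than Hutter's procedure: the expected frequency is derived from the overall atom-type frequencies (BLOSUM-style), the paper's own normalization is not reproduced, and the function name and input format are hypothetical.&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
import math
from collections import Counter


def log_odds_matrix(pairs):
    """Log-odds score for each unordered atom-type pair.

    pairs: list of (type_i, type_j) tuples observed in a set of molecules
    (hypothetical input format, e.g. all 1-2 atom pairs of the drug-like set).
    """
    pair_counts = Counter(tuple(sorted(p)) for p in pairs)
    type_counts = Counter(t for p in pairs for t in p)
    n_pairs = sum(pair_counts.values())
    n_types = sum(type_counts.values())

    scores = {}
    for (ti, tj), count in pair_counts.items():
        observed = count / n_pairs
        p_i = type_counts[ti] / n_types
        p_j = type_counts[tj] / n_types
        # An unordered mixed pair can arise in two orders, hence the factor 2.
        expected = p_i * p_j if ti == tj else 2 * p_i * p_j
        scores[(ti, tj)] = math.log(observed / expected)
    return scores
```
&lt;/pre&gt;
A positive score means the pair occurs more often than chance would suggest, a negative score means it occurs less often.&lt;br /&gt;&lt;br /&gt;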
Thus for a set of drug-like molecules a matrix is constructed such that the element S_{ij} is the &lt;span style="font-style:italic;"&gt;log-odds&lt;/span&gt; of a specified atom-pair occurring non-randomly in the set of molecules. There are issues of normalization and so on, which are covered in detail in the paper.&lt;br /&gt;&lt;br /&gt;As noted, the author uses atom-pairs to characterize the molecules considered. He expands on the set of atom types considered (Carhart, R. et al, &lt;span style="font-style:italic;"&gt;J.Chem.Inf.Comput.Sci&lt;/span&gt;, &lt;span style="font-weight:bold;"&gt;1985&lt;/span&gt;, 25(2); 64-73) by using the atom types from the MM+ forcefield implemented in &lt;a href="http://www.hyper.com/"&gt;Hyperchem&lt;/a&gt;. In addition, he considers 1-1, 1-2, 1-3, 1-4, 1-5 and 1-6 interactions, resulting in 6 matrices for each of the drug-like and non-drug-like groups. To obtain a final &lt;span style="font-style:italic;"&gt;likeliness score&lt;/span&gt; he constructs difference matrices using each of the individual substitution matrices (for the 6 types of interactions) of the drug-like and non-drug-like group.&lt;br /&gt;&lt;br /&gt;The net result of this process is that given a new molecule and atom-pair identification a single number is generated that indicates whether it is drug-like or non-drug-like. Fundamentally, the approach is a form of statistical similarity and being model-free its validity is only dependent on the nature of the populations used for the drug-like and non-drug-like molecules. Given this dependence I would've expected that a larger population would have been used (the work uses 2713 and 1373 compounds for the drug-like and non-drug-like sets). Also given the fact that there are many more non-drugs than drugs it's surprising to see that the training sample size of the non-drug-like compounds is half that of the drug-like compounds. 
It's not clear why this situation occurred.&lt;br /&gt;&lt;br /&gt;Overall, the accuracy is similar to other model-based approaches - 71% of druglike compounds are detected correctly. However only 40% of the nondrug compounds are detected correctly. One possible reason for this is the small relative size of the non-drug training population. Of course, another reason is the fact that they simply considered atom-pair information, which is essentially topological. Thus features such as logP, H-bonds etc. are not taken into account, which would probably be very good indicators of druglike/non-druglike.&lt;br /&gt;&lt;br /&gt;Finally, the graphs in Fig. 3 are good examples of &lt;a href="http://lilt.ilstu.edu/gmklass/pos138/datadisplay/badchart.htm"&gt;Bad Charts&lt;/a&gt;. 3D is nice for real world objects - not plots.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-116518311117918031?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/116518311117918031/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=116518311117918031' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116518311117918031'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116518311117918031'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2006/12/model-free-approach-to-drugnon-drug.html' title='Model free approach to the Drug/Non-drug problem'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-116492276897094881</id><published>2006-11-30T13:39:00.000-08:00</published><updated>2006-12-20T11:01:08.363-08:00</updated><title type='text'>Random fragments for database screening</title><content type='html'>Chemical Database Mining through Entropy-Based Molecular Similarity Assessment of Randomly Generated Structural Fragment Populations (J. Batista &amp;amp; J. Bajorath, &lt;a rev='review' href="http://dx.doi.org/10.1021/ci600377m"&gt;10.1021/ci600377m&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;In this paper the authors describe a conceptually new approach to chemical database searching using similarity methods. Traditionally, such searches are performed by defining a chemical space, either by a set of bits representing a specific set of features or by a collection of descriptors. In either case, the space is &lt;span style="font-style:italic;"&gt;predefined&lt;/span&gt;. Furthermore, depending on the nature of the representation, these approaches can be very complex (one example provided is the Molprint2D (DOI: &lt;a rev='review' href="http://dx.doi.org/10.1021/jm049611i"&gt;10.1021/jm049611i&lt;/a&gt;) fingerprint).&lt;br /&gt;&lt;br /&gt;The approach described by the authors is to take a collection of molecules belonging to a specific class (say actives) and then randomly generate fragments. The fragment generation algorithm is described &lt;a rev='review' href="http://dx.doi.org/10.1021/ci0601261"&gt;here&lt;/a&gt;. The feature of this algorithm that is useful for a database mining task is that the number of fragments is variable and can be controlled via a user parameter. The main consequence of this is that there is no &lt;span style="font-style:italic;"&gt;fixed&lt;/span&gt; chemical space being used; rather, the space is characteristic of the collection of molecules that are fragmented. 
&lt;br /&gt;&lt;br /&gt;Once the molecules have been fragmented each fragment is then characterized using a histogram which gives the frequency of occurrence of the fragment in each of the molecules. The histogram is then characterized using a modified version of the Shannon entropy which essentially describes whether a fragment is restricted to just a single molecule or is found in all molecules or is somewhere in between. In this sense, the entropy measure the authors describe is quite similar to priors used in a &lt;a href="http://en.wikipedia.org/wiki/Naive_Bayesian_classifier"&gt;Naive Bayes&lt;/a&gt; approach. Since the entropy measure characterizes the information content of the fragment, they then select those fragments whose histograms exhibit values in a specified range. With these selected fragments they then compare fragment profiles of the class being considered with fragment profiles of individual database compounds using another entropy based scoring system (proportional Shannon entropy).&lt;br /&gt;&lt;br /&gt;At this point it is not clear to me what the authors mean by a fragment histogram (profile) for a &lt;span style="font-style:italic;"&gt;molecule&lt;/span&gt;, since Figure 1 in the paper describes the histograms for a specific fragment, not for a specific molecule. My understanding here is that a fragment profile for a molecule is simply a count of how many times each fragment (from the set of fragments whose entropy lies in the specified range) occurs in that molecule. 
If this is the case, this starts resembling a fingerprint, except that the &lt;span style="font-style:italic;"&gt;bits&lt;/span&gt; of the fingerprint are now specific to the molecules being considered and are not fixed a priori.&lt;br /&gt;&lt;br /&gt;To perform a screening they considered several classes of molecules (ACE inhibitors, dopamine agonists, etc.) and a source database was constructed using a random subset from &lt;a href="http://blaster.docking.org/zinc/"&gt;ZINC&lt;/a&gt;. To generate the fragment profiles they took a subset of the molecules from each class, and the remainder of each class was mixed in with the database to serve as potential hits.&lt;br /&gt;&lt;br /&gt;The results are pretty interesting, as their retrieval rates are comparable to (and in some cases better than) those of fingerprint (MACCS, Molprint2D, Daylight, MPMFP) similarity approaches, with differences between the two approaches being about 15%. &lt;br /&gt;&lt;br /&gt;Of course, given that random fragments are generated, the fragment population has a significant effect on the retrieval rates. Most of the conclusions regarding this aspect are as one would expect (such as larger populations of fragments leading to better retrieval rates). One interesting property of their fragment populations was that for most of the activity classes, more than 50% of their random fragments (i.e. derived from the original members of the activity class) were also found in at least 10K molecules from the ZINC-derived database, implying that more than half of these fragments are basically background noise. Now, this number is obtained when the largest entropy range is used to select the fragment population (Table 6). For the smallest entropy range most activity classes have 92% or more of the fragments behaving as background noise. 
So it's interesting to see that we get better retrieval rates as we increase the fragment population (by allowing a larger range of entropy values) but that a large proportion of that population is still noise.&lt;br /&gt;&lt;br /&gt;Overall an interesting and conceptually novel paper.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-116492276897094881?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/116492276897094881/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=116492276897094881' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116492276897094881'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116492276897094881'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2006/11/random-fragments-for-database.html' title='Random fragments for database screening'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-116316956768971340</id><published>2006-11-10T06:17:00.001-08:00</published><updated>2006-12-20T11:01:40.076-08:00</updated><title type='text'>A Cheminformatics Review Paper</title><content type='html'>Basic Overview of Chemoinformatics (T. Engel, &lt;a rev='review' href="http://dx.doi.org/10.1021/ci600234z"&gt;10.1021/ci600234z&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;This paper just came out as an ASAP article and presents a review of the field of cheminformatics. 
Given that the field has been in existence for more than 30 years and covers such a broad range of topics, any review is a difficult task, and writing one as a journal-length article even more so.&lt;br /&gt;&lt;br /&gt;However, Engel provides a reasonably comprehensive overview of the field by covering some of the main topics that come under the purview of cheminformatics. The first topic that is covered is the definition of cheminformatics, and he makes a clear distinction between cheminformatics and computational chemistry (which is really theoretical chemistry done with computers). He also highlights the overlap between cheminformatics and bioinformatics and is correct to note that in certain areas of drug design the distinction between these two fields is not very clear.&lt;br /&gt;&lt;br /&gt;Some of the aspects of cheminformatics he touches upon include&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Representation of compounds and reactions&lt;/li&gt;&lt;li&gt;Datatypes&lt;/li&gt;&lt;li&gt;Data sources (databases for literature, properties, patents, reactions, biology)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Search methods&lt;/li&gt;&lt;li&gt;Descriptors&lt;/li&gt;&lt;li&gt;Data analysis&lt;/li&gt;&lt;/ul&gt;The topics listed above will be familiar to readers of the &lt;a href="http://www.amazon.com/Handbook-Chemoinformatics-Knowledge-Representation-Structures/dp/3527306803/sr=1-1/qid=1163168328/ref=sr_1_1/002-0364154-1673676?ie=UTF8&amp;s=books"&gt;Handbook of Chemoinformatics&lt;/a&gt;, which covers these aspects in great detail. However, Engel does provide a handy overview of the topics, and the references also provide a good starting point for further study.&lt;br /&gt;&lt;br /&gt;My issue with the paper is that it could have provided some more detail in the area of data analysis as well as a more critical treatment of the topics covered. Clearly, too much detail would make it longer than allowed for a journal article. 
On the other hand, the section titled &lt;span style="font-style: italic;"&gt;The Datatypes in Chemistry&lt;/span&gt; reads more like a listing of file formats, with a brief mention of the actual datatypes (strings, numbers, bit vectors, hashes, etc.). It would have been useful to have an overview of what areas of cheminformatics utilize different datatypes and how multiple datatypes can be used for a given aspect.&lt;br /&gt;&lt;br /&gt;Another issue is the relatively low amount of discussion devoted to cheminformatics applications. Clearly a comprehensive review of this area would be a book-length project. However, it would have been useful to have a review of some of the current application areas along with examples from recent published research. One area I would have liked to see covered is the convergence of cheminformatics and bioinformatics methods and the use of cheminformatics in bioinformatics topics.&lt;br /&gt;&lt;br /&gt;Overall a useful paper, but it does read like a summary of the &lt;a href="http://www.amazon.com/Handbook-Chemoinformatics-Knowledge-Representation-Structures/dp/3527306803/sr=1-1/qid=1163168328/ref=sr_1_1/002-0364154-1673676?ie=UTF8&amp;amp;s=books"&gt;Handbook&lt;/a&gt; to some extent.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-116316956768971340?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/116316956768971340/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=116316956768971340' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116316956768971340'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116316956768971340'/><link 
rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2006/11/cheminformatics-review-paper_10.html' title='A Cheminformatics Review Paper'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-116309195235721310</id><published>2006-11-09T08:34:00.001-08:00</published><updated>2006-11-09T09:05:52.363-08:00</updated><title type='text'>Building QSAR models using model based descriptors?</title><content type='html'>Recently there was a discussion on the cdk-devel mailing list regarding the inclusion of a descriptor (specifically, estimates of ionization potential) that is actually a predictive model in the descriptor hierarchy.&lt;br /&gt;&lt;br /&gt;The concept of a descriptor is usually some algorithm that characterizes some molecular feature as a numerical value. But this is a little restrictive, as one could also consider a molecular property as a descriptor (examples would be IP, logP, etc.). But it would appear that such molecular properties are the result of the molecular structure. Now given that &lt;span style="font-style: italic;"&gt;ab initio&lt;/span&gt; calculation of many molecular properties is time consuming, we could use approximations (examples would be Gasteiger-Marsili charges or atom additive descriptors such as XLogP) or we could develop a predictive model.&lt;br /&gt;&lt;br /&gt;Consider the use of a predictive model to get a molecular property. To build such a model one would probably use a set of descriptors and choose an appropriate modeling technique. Once the model has been built and validated one can use it to predict the molecular property. 
But it is also important to note that such a model implicitly includes a &lt;span style="font-style: italic;"&gt;random error term.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;Now when we evaluate a descriptor in the traditional sense, the assumption is that the descriptor &lt;span style="font-style: italic;"&gt;exactly&lt;/span&gt; describes the structural feature under consideration. But when we employ a descriptor that is evaluated by a predictive model, we introduce another source of error in any subsequent model that utilizes such a descriptor. In addition, if one considers that many such model-based descriptors are really molecular properties, then it is clear that they include several structural features implicitly. In many cases this can be useful since it makes for a more physical interpretation of the subsequent model (i.e., logP is more interpretable than a 5th order chain descriptor!). But at the same time we lose the structural details of the encoded structure-activity relationship.&lt;br /&gt;&lt;br /&gt;But, in my opinion, more worrisome is the fact that model-based descriptors are an extra source of error in the model they are used in, as well as the fact that the molecules that they are applied to may be outside the applicability domain of the original model.&lt;br /&gt;&lt;br /&gt;Another direction from which this can be considered is that a descriptor simply &lt;span style="font-style: italic;"&gt;describes &lt;/span&gt;a molecular feature. We then build a model using these descriptions that allows us to &lt;span style="font-style: italic;"&gt;explain&lt;/span&gt; a molecular property. The problem is that when we use model-based descriptors they will, in general, try to explain some feature (with a side effect of predicting the feature). 
This is a bit of a generalization, but I believe that Breiman (Statistical Science, 2001, 16(3), 199-231) has provided a very useful categorization of statistical models:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Algorithmic - such models are purely predictive and make no attempt to explain the data and do not have any assumptions about the underlying distribution of the data. Examples include kNN, random forest, etc.&lt;/li&gt;&lt;li&gt;Data models - these models assume that data is generated by a stochastic process and such a model tries to embody that process. An example is OLS.&lt;/li&gt;&lt;/ul&gt;Models of the second type provide predictions as a side-effect of explaining the data. But in either case, they do not necessarily &lt;span style="font-style: italic;"&gt;describe &lt;/span&gt;a molecular feature.&lt;br /&gt;&lt;br /&gt;But of course, much of this discussion is contingent on what one expects out of a model. If all we want is the best prediction possible, then I think that one should use whatever descriptor leads to the best predictive model, assuming that errors can be taken into account at each step of the modeling process. 
On the other hand, if one desires an understanding of the SARs encoded in a model, I believe that the use of purely descriptive (as opposed to explanatory) descriptors is a better approach.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-116309195235721310?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/116309195235721310/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=116309195235721310' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116309195235721310'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116309195235721310'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2006/11/building-qsar-models-using-model-based_09.html' title='Building QSAR models using model based descriptors?'/><author><name>Rajarshi</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-37358649.post-116300196787549860</id><published>2006-11-08T08:05:00.000-08:00</published><updated>2006-11-08T08:06:07.883-08:00</updated><title type='text'>Welcom</title><content type='html'>Welcome! 
Please feel free to publish short summaries and reviews of journal articles pertaining to chemical informatics here -- David Wild&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/37358649-116300196787549860?l=cheminfoclub.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://cheminfoclub.blogspot.com/feeds/116300196787549860/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=37358649&amp;postID=116300196787549860' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116300196787549860'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/37358649/posts/default/116300196787549860'/><link rel='alternate' type='text/html' href='http://cheminfoclub.blogspot.com/2006/11/welcom.html' title='Welcom'/><author><name>David</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
