
Extraction of Conditional Probabilities of the Relationships
Between Drugs, Diseases, and Genes from PubMed
Guided by Relationships in PharmGKB
Martin Theobald, Ph.D. (1), Nigam Shah, M.B.B.S., Ph.D. (2), and Jeff Shrager, Ph.D. (3)
Departments of (1) Computer Science, (2) Biomedical Informatics, Stanford University, Stanford, CA 94305 USA
Abstract
Guided by curated associations between genes, treatments (i.e., drugs), and diseases in [...] networks based on conditional probability tables (cpt's) extracted from co-occurrence statistics over the entire Pubmed corpus, producing a broad-coverage analysis of the relationships between these biological entities. The networks suggest hypotheses regarding drug mechanisms, treatment biomarkers, and/or potential markers of genetic disease. The cpt's enable Trio, an inferential database, to query indirect (inferred) relationships via an SQL-like query language.

The goal of clinical research can be thought of as seeking the conditional probability (cp) of a cure given particular treatments and diseases; in terms of conditional probabilities: statistically quantified and directed relationships of the form p(cure|treatment,disease) [hereafter: p(c|t,d)]. Meta-analysis over clinical trials can obtain an [...] statistically across trials. Such meta-analyses extract a p(c|t,d) that is statistically tacit in the literature. In the present work we explore other potentially useful statistically tacit results available in the medical literature. Specifically, we compute conditional probabilities between treatments (usually drugs), diseases, and genes: p(t|d,g), by analyzing their co-occurrence in Pubmed (www.ncbi.nlm.nih.gov/pubmed). Although not as directly useful as p(c|t,d), these cp's and their algebraic co-forms may be interpreted in a number of useful ways, for example as personalized (e.g., genetically guided) treatment hypotheses [p(t|g,d)], as drug mechanism hypotheses [p(g|t)], as treatment-response predictive biomarkers [p(g|t,d)], or as potential markers of genetic diseases [p(g|d)].

Suppose, for example, a patient is given a particular diagnosis, and a genomic analysis reveals a mutation in a gene for which a targeted treatment has been explored in the scientific literature, but for which there is no approved specific treatment. p(t|g,d) gives one a sense of which treatments have been explored the most in this circumstance. Once a treatment is chosen, [...] responses to watch in this patient (often not the same as the mutated gene). Here p(g|t,d) may offer a sense of what the literature suggests as [...]. Thus the conditional probabilities among these entities across the scientific literature may lead to practical new hypotheses, and support inference to p(c|t,d), or eventually even to the holy grail of personalized genetic medicine: p(c|t,d,g).

Background

Many researchers have extracted association-based knowledge from the medical literature. Zhu et al. [1] computed co-occurrence of compounds and genes, and Jenssen et al. [2] computed a gene-to-gene co-citation network. These are relatively simple computations. Extracting conditional multivariate statistics is [...] computing all combinations of associations, plus background counts for normalization, and the potential vocabulary is very large. Wren [3] extracted a network of associations among genes, diseases, phenotypes, drugs, etc. using the mutual information of shared associations from Pubmed abstracts over a set of 10,000 common words. In order to control the computational complexity and avoid saturation (which is likely [...] calculations to only 100,000 abstracts. Similarly, Narayanasamy et al. [4] mined co-occurrence in Pubmed to build an association graph and ranked associations co-occurring with both the objects (equivalent to mutual information).

Although most of these projects uncovered various suggestive associations, they have either used a small corpus, focused on only one kind of entity (e.g., gene-gene), focused only on co-occurrence (which is symmetric, as opposed to conditional probability), or recognized concepts from only one (or few) ontologies. In the present work we mine quantified, directed (i.e., asymmetric) drug/gene/disease relationships over the entirety of more than 19 million Pubmed abstracts and [...]

We seek to extract all-way co-occurrence-based Bayesian networks among treatments (primarily drugs for this study), diseases, and genes. These can be estimated from subsets of conditional and non-conditional probabilities, which are in turn derived from raw co-occurrence counts of drug/disease/gene entities in domain-specific corpora such as Pubmed. For non-conditional statistics, such a co-occurrence probability would be the number of documents (here, abstracts) that mention these items together, divided by the total number of documents contained in the corpus. The desired conditional probabilities are: p(drug|gene), p(drug|disease), p(drug|gene,disease), etc. One can easily see how to compute such conditional probabilities over an appropriately annotated Pubmed database, simply by counting the single and combinational co-occurrences of all of these entities, and performing the obvious calculation, i.e., p(drug A|gene B, disease C) = (# distinct abstracts containing A and B and C)/(# distinct abstracts containing B and C). Notice that more general relationships are conceivable, i.e., considering many-to-many relationships between drugs, diseases, and genes. In the present experiment we limit our Bayesian network to a maximum of six conditional variables and a single target variable, thus extracting up to 2^6 conditional [...] combinatorial complexity, and hence the number of co-occurrence queries issued against our [...]

The problem with this approach is that an enormous number of combinations must be computed. If there are, say, 20,000 genes, a thousand drugs or investigational drugs, and a thousand diseases, 2^22,000 combinations would have to be calculated. In order to make headway in this endeavor, we need guidance on which combinations to explore. One source of guidance could be a user query about the relationships between particular treatments, diseases, and genes. It seems unlikely, however, that a user would come up with likely combinations a priori. [...] which explicitly (although non-statistically) relates drugs, genes, and diseases (and other entities). PharmGKB offers relationships between drugs, diseases, and genes, based on specific papers and different types of evidence ranging from "clinical outcome" to simply "discussed". (We dropped any marked "not related".) Note that, although we use these relations in pharmGKB to guide our analysis, we do not prioritize the specific papers used in pharmGKB, but use the entire Pubmed database for our statistics. Thus no quantitative bias is introduced by the papers curated into pharmGKB.

Guided by the relations in pharmGKB (late 2007 snapshot of the pharmGKB database), we combined information from a tagged Pubmed corpus created by processing all Medline abstracts using the Mgrep tool (University of Michigan). Mgrep uses all of the alternative strings for UMLS concepts and identifies their occurrence in the abstract using a radix tree [...] processing without sacrificing precision [5]. In our experience, this method has an average precision of about 85% for diseases [6]. (We have not evaluated precision for other entities.) [...] experiment contains ~19 million articles and ~200 concepts assigned to each article resulting [...] Concept Unique Identifier (CUI) assignments. We combined these data with the highly reliable [...] gene/DATA).

Using only the relationships marked as "related" or "positively related", we extracted 1,730 disease/drug/gene relationships with up to 6 conditional variables, and extracted their respective conditional probability tables using co-occurrence statistics over the ~3 billion distinct Pubmed ID/CUI pairs, resulting in 19,092 conditional probabilities (again, compare with ~2^22,000 for the full joint distribution). Although this is clearly still an offline process, requiring several days, once extracted, these tables serve as input for our Bayesian nets and allow for an efficient execution of arbitrary inferential queries; any conditional probability of variables/entities expressed in a pharmGKB relationship can be directly computed from these.

Results and Extensions

The result of our method is a miniature Bayesian network for each of the pharmGKB relationships. For example: p(antidepressants | affective disorders, GNB3) = 0.33 (abbreviated: p(an|af,g) = 0.33). (Values are rounded to 2 decimal places, [...] abbreviations.) That is, out of all the documents that mention "affective disorder" and gene "GNB3", about one third also mention antidepressants. The subordinate relationships in this set include: p(~an|af,g) = 0.67, p(an|~af,g) = 0.04, p(~an|~af,g) = 0.96, p(an|af,~g) = 0.11, p(~an|af,~g) = 0.89, p(an|~af,~g) = 0.0, and p(~an|~af,~g) = 1.0. (Note that complementary conditional probabilities add to 1.0.) Many of these subordinate relationships may, of course, [...] p(azidothymidine | HIV, ABCC4) = 0.6. That is, in 60% of the papers where HIV and ABCC4 are mentioned, azidothymidine is also mentioned.

Note that the order of the contexts (following the vertical bar in the c.p.) is not relevant, but the targeted posterior (left side of the vertical bar) is relevant, and that the relationships are not symmetric across the conditional (vertical bar). Contrast, for example: p(mercaptopurine | azathioprine, thioguanine, TPMT) = 0.84, p(thioguanine | azathioprine, mercaptopurine, [...] mercaptopurine, thioguanine, TPMT) = 0.89. And the subordinate relationships: p(azathioprine | thioguanine, TPMT) = 0.86, p(thioguanine | azathioprine, TPMT) = 0.44.

The most clinically important results are, of course, the conditional probabilities of different treatments (drugs) for the same disease&gene combination. For example: p(salmeterol | Asthma, ADRB2) = 0.07 and p(salbutamol | Asthma, ADRB2) = 0.16.

Aside from direct relationships, one may want to assess indirect (i.e., inferred) relationships based on the very same precomputed nets. There are, of course, many more of these than the already burgeoning set of direct relationships. By directly extracting the conditional probabilities as input into a Bayesian net, our method allows for a more compact representation of the desired dependencies than would be possible via capturing the full joint distribution of all variables involved in such a relationship. Any conditional probability of variables involved in the net can be efficiently derived via Bayesian inference. For example, marginalized conditional probabilities of the form p(disease|gene) can be directly calculated from a conditional probability table initially extracted for p(drug|gene,disease) without going back to the source to extract more co-occurrence statistics. In this special setting, the obtained net always has a tree structure, permitting linear-time inference queries. This approach can be extended to extract arbitrary [...] relationships, by sampling co-occurrence statistics from Pubmed for arbitrary text tokens.

Implementation

The present work is implemented as an extension of the Trio system [7], a database system for the integrated management of data, uncertainty, and lineage. Trio uses an extended relational schema to capture data uncertainty (in the form of alternative attribute values and confidences in alternatives), as well as data lineage (i.e., pointers to internal or external sources of the data). For the specific inference setting explored here, it turns out that the notion of lineage can nicely be generalized to capturing arbitrary relationships between entities (or records in a database), thus providing pointers to other entities (again other records), which allows for a convenient way of encoding Bayesian nets directly on top of this extended relational setting. [...] present/absent combination of variables in a pharmGKB relationship is encoded as a different alternative of such an "uncertain" record, along with a confidence value, which allows us to [...] distributions (including cpt's) for each record in [...]

Moreover, this affords a simple, declarative way of issuing true inference queries on top of the precomputed conditional and non-conditional [...] would simply select the conditional probabilities p(mercaptopurine|azathioprine,thioguanine) from the precomputed cpt's for DRUGS. Conversely (still based on the same input table DRUGS capturing the cpt's for the initial target variable mercaptopurine, but also pointers to the non-conditional priors of azathioprine and thioguanine), we can, for example, initiate an on-[...] p(azathioprine|mercaptopurine), thus swapping the direction of the conditional probability and marginalizing the distribution (i.e., eliminating the conditional variable thioguanine) in a single, [...]

The result of this inferential query is a new cpt for p(azathioprine|mercaptopurine) that had not been precomputed, and whose computation is triggered by the "COMPUTE INFERENCE" clause using the inferencing extension in Trio. Issuing such an inference query is much faster over these simple (in our case tree-like) Bayesian nets than going back to the entire Pubmed database and mining for the respective co-occurrence statistics at query processing time. Issuing an inference query in Trio over the precomputed cpt's takes less than a second, whereas extracting the raw co-occurrence statistics from the entire set of Pubmed abstracts for a single pharmGKB relation with up to 6 variables may take several minutes in our current, [...]

Limitations and Directions

The probabilities that we derive reflect only co-occurrence in the literature, and not, for example, recommendations, so one must be cautious in interpreting these results. What, then, are they telling us, and is what they are telling us useful? Because the literature is historical, these results are not telling us what to try, but what has been tried, or, possibly, what has been suggested (if not actually tried). Under this analysis one might regard a high conditional probability as a sort of ranking of hypotheses regarding potential treatments given the context of disease&gene combinations, and, symmetrically, hypotheses about the potential biomarkers (genes) given the context of a disease&treatment combination.

Interpretation aside, many aspects of the present analysis need improvement before this method can be applied. First, as with most statistical methods applied to natural language, we have ignored the specific relationships between the entities, both in pharmGKB and in Pubmed, and especially the possibility of negatively expressed relationships. Of course, we already know, from [...] correlation, because we filter out those that are marked as "not related" in that database. Moreover, given that our statistics include a huge number of papers, it is unlikely that a large fraction of them are telling us that "drug X does NOT have any effect on disease Y" (etc.), especially as the scientific literature does not often report negative results. Second, the particular tagger that we used does a poor job of [...] resolution is dependent on the UMLS CUIs. We use all synonymous strings for a CUI while doing the tagging, and the output only contains the CUI. This is not a particularly good solution [...]

We recognize that our use of Mgrep, as well as our use of co-occurrence as a substitute for actual relationships, may introduce significant inaccuracies. In a project in progress we are generating parse trees of each sentence in the abstract and then only using the noun phrases for recognizing mentions of diseases and drugs. This [...]

More critically, because this method is focused [...] relations that might be important, but which are [...] resolve this might be to build the method into a search engine and use the combinations that are explicitly searched for as guidance (instead of using pharmGKB); indeed, this is explicitly enabled by the Trio infrastructure, and given the precalculated Pubmed/CUI database, seeking any given relationship set takes only a few minutes over the entire set of annotated Pubmed abstracts. There could be other sources of such guidance as well. For example, one could use the literature itself: co-mentions in specific abstracts, either all of them (only a few million computations vs. 2^22,000), or perhaps a reduced hash that selects all unique co-mentions (certainly less than the [...]

Regardless of these proposed extensions to the [...] approaches to its limitations, exhaustive evaluation is clearly needed to justify its utility.

Acknowledgements

MT's work was supported by NSF grants IIS-0324431 and IIS-0414762 and by grants from the Boeing and Hewlett-Packard Corporations. NS's work was supported by NIH grant U54 HG004028 and a gift from CommerceNet. JS's work was supported by CollabRx, Inc. The [...]

References

1. Zhu, S., Okuno, Y., Tsujimoto, G., Mamitsuka, H. 2005. Mining literature co-occurrence data using a probabilistic model. IPSJ SIG Technical [...]
2. Jenssen, T-K., Lægreid, A., Komorowski, J., Hovig, E. A literature network of human genes for high-throughput analysis of gene expression. [...]
3. Wren, J.D. 2004. Extending the mutual information measure to rank inferred literature relationships. BMC Bioinformatics. 5:145.
4. Narayanasamy, V., Mukhopadhyay, S., Palakal [...] transitive associations among biological objects from text. J Biomed Sci. 11(6):864-73.
5. Dai, M., Shah, N.H., Xuan, W., Musen, M.A., Watson, S.J., Athey, B., Meng, F. 2008. An Efficient Solution for Mapping Free Text to Ontology Terms. Poster at AMIA Summit on Translational Bioinformatics, San Francisco.
6. Bhatia, N., Shah, N.H., Rubin, D.L., Chiang, A.P. [...] Recognizers for Ontology-Based Indexing: MGREP vs. MetaMap. Paper accepted to the AMIA Summit on Translational Bioinformatics, [...]
7. Benjelloun, O., Das Sarma, A., Halevy, A. [...] databases with uncertainty and lineage. The International Journal on Very Large Data Bases, [...]
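
Supplementary sketch. The counting rule stated in the text, p(drug A | gene B, disease C) = (# distinct abstracts containing A and B and C)/(# distinct abstracts containing B and C), and the table of present/absent combinations of the conditional variables can be illustrated with a small in-memory example. This is only a sketch under stated assumptions: the concept names, abstract IDs, and the functions `context_ids`, `conditional_probability`, and `build_cpt` are hypothetical and invented for illustration, and the code does not reproduce the Trio/SQL implementation or any values reported in the paper.

```python
from itertools import product

# Hypothetical toy corpus: 10 abstract IDs and, for each concept, the set of
# abstracts in which a tagger found it. All names and IDs are made up.
universe = set(range(1, 11))
index = {
    "drug_A": {1, 2, 3, 7},
    "gene_B": {1, 2, 4, 5},
    "disease_C": {1, 2, 3, 4, 6},
}

def context_ids(index, universe, present, absent):
    """Abstracts that mention every concept in `present` and none in `absent`."""
    ids = set(universe)
    for concept in present:
        ids &= index[concept]
    for concept in absent:
        ids -= index[concept]
    return ids

def conditional_probability(index, universe, target, present=(), absent=()):
    """p(target | present, ~absent) as a ratio of distinct-abstract counts."""
    ctx = context_ids(index, universe, present, absent)
    if not ctx:
        return None  # undefined: no abstract matches the context
    return len(ctx & index[target]) / len(ctx)

def build_cpt(index, universe, target, conditions):
    """One entry per present/absent combination of the conditional variables,
    i.e. 2**len(conditions) entries keyed by a tuple of booleans."""
    cpt = {}
    for flags in product([True, False], repeat=len(conditions)):
        present = [c for c, f in zip(conditions, flags) if f]
        absent = [c for c, f in zip(conditions, flags) if not f]
        cpt[flags] = conditional_probability(index, universe, target, present, absent)
    return cpt

cpt = build_cpt(index, universe, "drug_A", ["gene_B", "disease_C"])
for flags, p in cpt.items():
    print(flags, p)
```

Each key of `cpt` records whether gene_B and disease_C are present (True) or absent (False) in the context; the complementary probability p(~drug_A | context) is one minus the stored value whenever the context is non-empty. With six conditional variables the same loop would produce 2^6 = 64 entries.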

Source: http://www.mpi-inf.mpg.de/~mtb/pub/amia2009.pdf
