Using knowledge graphs to streamline COVID-19 research

{"value":"Knowledge graphs are a way of organizing information so that it can be more productively explored and analyzed. Like all graphs, they consist of nodes — usually depicted as circles — and edges — usually depicted as line segments connecting nodes. In a knowledge graph, the nodes typically represent entities, and the edges indicate relationships between them. \n\nIn May, Amazon Web Services (Amazon Web Services) publicly released the COVID-19 Knowledge Graph (CKG), which organizes the information in the COVID-19 Open Research Dataset (++[CORD-19](https://www.semanticscholar.org/cord19)++), a growing repository of academic publications about COVID-19 and related topics created by a consortium led by the Allen Institute for AI. CKG powers Amazon Web Services CORD-19 ++[ranking](https://cord19.aws/)++ and recommendation system.\n\nIn a ++[paper](https://www.amazon.science/publications/covid-19-knowledge-graph-accelerating-information-retrieval-and-discovery-for-scientific-literature)++ we presented earlier this month at the AACL-IJCNLP ++[Workshop on Integrating Structured Knowledge and Neural Networks for NLP](https://sites.google.com/view/knlp/)++, we explain how we created CKG, and we describe several possible applications, including the ranking of papers on particular topics and the discovery of related papers.\n\n![image.png](https://dev-media.amazoncloud.cn/af86396eca61486b8a55c11d095b4522_image.png)\n\nAn example of the types of relationships captured by the COVID-19 Knowledge Graph.\n\n\n#### **How is the graph structured?**\n\n\nThe graph has five types of nodes: **paper nodes**, containing metadata about the papers, such as titles and ID numbers; **author nodes**, containing authors’ names; **institution nodes**, containing institutions’ names and locations; **concept nodes**, containing specific medical terms that appear in papers, such as ibuprofen, heart disfunction, and asthma; and **topic nodes**, containing general areas of study such as genomics, epidemiology, and virology. \n\nThe graph also has five types of edges: **authored_by**, linking a paper to its authors; **affiliated_with**, linking authors to their institutions; **associated_concept**, linking a paper to its associated concepts; **associated_topic**, linking a paper to its topics; and **citeps**, linking a paper to other papers that cite it.\n\n\n#### **How was the graph created?**\n\n\nThe standardized format of the papers in the CORD-19 database allows for easy extraction of title, abstract, body, authors and institutions, and citations.\n\nTo identify concepts, we rely on Amazon Web Services ++[Comprehend Medical](https://aws.amazon.com/comprehend/medical/)++, which extracts medical entities from the text and also classifies them into entity types. For instance, given the sentence “Abdominal ultrasound noted acute appendicitis”, Amazon Web Services Comprehend Medical would extract the entities *abdominal (anatomy)*, *ultrasound (test treatment procedure)*, and *acute appendicitis (medical condition)*.\n\nTo extract topics, we use an extension of latent Dirichlet allocation called Z-LDA, which is trained using the title, abstract, and body text from each paper. Z-LDA assumes that the terms most characteristic of a paper reflect some *topic*, and it selects one of those terms as the label for the topic based on frequency of occurrence across the corpus. A list of topics generated in this manner was whittled down to a final 10 topics with the help of medical professionals. 
#### **How was the graph created?**

The standardized format of the papers in the CORD-19 database allows for easy extraction of title, abstract, body, authors and institutions, and citations.

To identify concepts, we rely on [Amazon Comprehend Medical](https://aws.amazon.com/comprehend/medical/), which extracts medical entities from the text and classifies them into entity types. For instance, given the sentence "Abdominal ultrasound noted acute appendicitis", Amazon Comprehend Medical would extract the entities *abdominal (anatomy)*, *ultrasound (test treatment procedure)*, and *acute appendicitis (medical condition)*.
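As an illustration of that step, the following is a minimal sketch of calling Amazon Comprehend Medical through boto3 for a single sentence. It assumes AWS credentials and an example region; the CKG pipeline applies this kind of extraction across the entire corpus rather than one sentence at a time.

```python
# Minimal sketch: extracting medical entities from one sentence with
# Amazon Comprehend Medical via boto3 (requires AWS credentials).
import boto3

client = boto3.client("comprehendmedical", region_name="us-east-1")

response = client.detect_entities_v2(
    Text="Abdominal ultrasound noted acute appendicitis."
)

for entity in response["Entities"]:
    # e.g. "abdominal" -> ANATOMY, "acute appendicitis" -> MEDICAL_CONDITION
    print(entity["Text"], entity["Category"], round(entity["Score"], 2))
```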
To extract topics, we use an extension of latent Dirichlet allocation called Z-LDA, which is trained on the title, abstract, and body text of each paper. Z-LDA assumes that the terms most characteristic of a paper reflect some *topic*, and it selects one of those terms as the label for the topic based on its frequency of occurrence across the corpus. A list of topics generated in this manner was whittled down to a final 10 topics with the help of medical professionals.

#### **Example application: citation-based ranking**

In academia, a standard measure of a paper's relevance is the number of publications that cite it. A graph structure makes it easy to count citations. It also enables customized counts, such as citations by publications that deal with a specific topic or include specific concepts.

![image.png](https://dev-media.amazoncloud.cn/d5cc5080a6634b98b1c8a19d2e09be77_image.png)

This table lists the most-cited papers in the CORD-19 database that feature the concepts asthma, heart disease, and respiratory malfunction, all of which are COVID-19 risk factors.
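Continuing the toy networkx graph from the earlier sketch (not the production graph), a concept-restricted citation ranking could look like the following; the helper function name and paper IDs are invented for illustration.

```python
# Rank papers that feature a given concept by citation count, on the toy
# networkx graph built earlier (illustration only).
from collections import Counter

def most_cited_with_concept(g, concept_node, k=5):
    """Rank papers linked to `concept_node` by how often they are cited."""
    has_concept = {
        u for u, v, key in g.edges(keys=True)
        if key == "associated_concept" and v == concept_node
    }
    counts = Counter({p: 0 for p in has_concept})
    for u, v, key in g.edges(keys=True):
        if key == "cites" and v in has_concept:
            counts[v] += 1  # v features the concept and is cited by u
    return counts.most_common(k)

# Most-cited papers among those associated with the concept "asthma".
print(most_cited_with_concept(g, "concept:asthma"))  # [('paper:123', 1)]
```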
#### **Similar-paper engine**

Given a paper, the similar-paper engine retrieves a list of k similar papers. It uses two different measures of similarity, which are combined in a final step.

One measure uses SciBERT *embeddings*, which are built on top of the popular BERT language model but fine-tuned specifically on scientific texts. SciBERT represents input sentences as points in a multidimensional space, such that sentences relating to the same scientific concepts tend to cluster together.

![image.png](https://dev-media.amazoncloud.cn/3452010e988d423b849fc0925288b78f_image.png)

An example of how we average embeddings for the title, abstract, and body of a paper to create a final embedding.

We create separate embeddings for the title, abstract, and body of a paper and then average them to create a final embedding. Previous research suggests that title embeddings may be easier to distinguish from one another than body embeddings, while body embeddings carry richer information, so we chose an embedding scheme that gives both equal weight. Proximity in the representational space of the averaged embeddings indicates the similarity of the associated papers.
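The snippet below is a minimal sketch of this averaging step, assuming the publicly available allenai/scibert_scivocab_uncased checkpoint from Hugging Face transformers and simple mean pooling over tokens; the pooling details and the handling of long bodies in the production system may differ (bodies longer than 512 tokens would need chunking).

```python
# Sketch: averaging SciBERT embeddings of a paper's title, abstract, and body.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool SciBERT's last hidden states over the input tokens."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)            # (768,)

def paper_embedding(title: str, abstract: str, body: str) -> torch.Tensor:
    """Equal-weight average of the title, abstract, and body embeddings."""
    return torch.stack([embed(title), embed(abstract), embed(body)]).mean(dim=0)
```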
The second model uses a different kind of embedding, [knowledge graph embeddings](https://www.amazon.science/blog/amazons-open-source-tools-make-embedding-knowledge-graphs-much-more-efficient), which attempt to preserve the relationships encoded in a knowledge graph. If two entities are connected in the graph by an edge representing a relationship, then the embedding of the first entity, when added to a vector representing the relationship, should produce a point in the vicinity (ideally, at the exact location) of the second entity.

To create our knowledge graph embeddings, we use the tool [DGL-KE](https://github.com/awslabs/dgl-ke), which was developed at Amazon Web Services and extends our earlier Deep Graph Library (DGL).

As training data, we extract triplets *(h, r, t)* from CKG, where *h* is the head entity, *r* is the relation type, and *t* is the tail entity. These triplets are the positive training examples. Negative examples are created synthetically by randomly replacing the head or tail of existing triplets.

Using these examples, we train our model to differentiate false links from real links, as sketched after the figure below. The result is an embedding for every node in the graph.

![image.png](https://dev-media.amazoncloud.cn/3a0906dbe5844095b90b9716748c7497_image.png)

A diagram of the process of knowledge graph embedding.
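DGL-KE implements several knowledge-graph-embedding models at scale. The toy PyTorch snippet below sketches one of them, a TransE-style objective with randomly corrupted tails, only to make the scoring and negative-sampling idea concrete; it is not the production training code, and the model and hyperparameters used for CKG may differ.

```python
# Toy TransE-style training step, illustrating the kind of objective that
# tools like DGL-KE optimize at scale over (h, r, t) triplets.
import torch
import torch.nn.functional as F

num_entities, num_relations, dim = 1000, 5, 200  # made-up sizes
entity_emb = torch.nn.Embedding(num_entities, dim)
relation_emb = torch.nn.Embedding(num_relations, dim)
optimizer = torch.optim.Adam(
    list(entity_emb.parameters()) + list(relation_emb.parameters()), lr=1e-3
)

def score(h, r, t):
    """TransE score: higher when head + relation lands near tail."""
    return -torch.norm(entity_emb(h) + relation_emb(r) - entity_emb(t), p=2, dim=-1)

def train_step(h, r, t):
    """One step on a batch of positive triplets plus random negatives."""
    t_neg = torch.randint(0, num_entities, t.shape)  # corrupt tails at random
    pos, neg = score(h, r, t), score(h, r, t_neg)
    # Margin ranking loss: positives should score higher than negatives.
    loss = F.relu(1.0 + neg - pos).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example batch of (head, relation, tail) index triplets.
h = torch.randint(0, num_entities, (64,))
r = torch.randint(0, num_relations, (64,))
t = torch.randint(0, num_entities, (64,))
print(train_step(h, r, t))
```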
At the end of this process, we concatenate the semantic embeddings and the knowledge graph embeddings, creating a new, higher-dimensional representational space. By computing the top-k closest vectors (by cosine distance) in this space, we obtain the top-k most-similar papers.

Given the lack of ground truth for paper recommendations, we evaluate the algorithm through a combination of quantitative and qualitative analyses. These include, but are not limited to, popularity analysis, topic intersection between the source paper and its recommendations, low-dimensional clustering, and abstract comparison.

Additional information about our approach can be found in a pair of posts on the AWS blog, "[Exploring scientific research on COVID-19 with Amazon Neptune, Amazon Comprehend Medical, and the Tom Sawyer Graph Database Browser](https://aws.amazon.com/blogs/database/exploring-scientific-research-on-covid-19-with-amazon-neptune-amazon-comprehend-medical-and-the-tom-sawyer-graph-database-browser/)" and "[Building and querying the AWS COVID-19 knowledge graph](https://aws.amazon.com/blogs/database/building-and-querying-the-aws-covid-19-knowledge-graph/)".

**Acknowledgements:** Xiang Song, Colby Wise, Vassilis N. Ioannidis, George Price, Ninad Kulkarni, Ryan Brand, Parminder Bhatia, George Karypis

ABOUT THE AUTHOR

#### **[Miguel Romero Calvo](https://www.amazon.science/author/miguel-calvo)**

Miguel Romero Calvo is a data scientist in the Amazon Machine Learning (ML) Solutions Lab at Amazon Web Services.