Improving “entity linking” between texts and knowledge bases

Natural language processing
Entity linking (EL) is the process of automatically linking entity mentions in text to the corresponding entries in a knowledge base (a database of facts that relate entities), such as Wikidata. For example, in the diagram below, we would aim to link the mention “England” to the entity “England Football Team” as opposed to the entity “England” the country.

![下载.jpg](https://dev-media.amazoncloud.cn/67e3ee9c346242f884b6c3a92dd0a3c1_%E4%B8%8B%E8%BD%BD.jpg)

In this sentence, the entity name “England” should be linked to the knowledge base entry for the English national football team, not the one for England the country.

Entity linking is a common first step in natural-language-processing ([NLP](https://www.amazon.science/tag/nlp)) applications such as question answering, information extraction, and natural-language understanding. It is critical for bridging unstructured text with knowledge bases, which enables access to vast amounts of curated data.

Current EL systems exhibit strong performance on standard datasets, but they have several limitations when deployed in real-world applications. First, they are computationally intensive, which makes large-scale processing expensive.

Second, most EL systems are designed to link to specific knowledge bases (typically Wikipedia) and cannot be easily adapted to other knowledge bases. Finally, the most efficient existing methods cannot link texts to entities that were introduced into the knowledge base after training (a task known as zero-shot EL), meaning they must be frequently retrained to stay up to date.

In the NAACL 2022 industry track, we ++[introduced a new EL system](https://www.amazon.science/publications/refined-an-efficient-zero-shot-capable-approach-to-end-to-end-entity-linking)++ called ReFinED, which addresses all three issues.
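Before going further, the entity-linking contract itself can be made concrete with a minimal sketch. Everything below is illustrative: the knowledge-base IDs, the lookup-table "linker", and the data structures are made up for exposition and are not the ReFinED API.

```python
# Minimal sketch of the entity-linking contract: map each mention span in a
# text to an entry in a knowledge base. All IDs and data are made up.
from dataclasses import dataclass


@dataclass
class LinkedMention:
    text: str          # surface form as it appears in the document
    start: int         # character offset where the mention begins
    end: int           # character offset where the mention ends
    entity_id: str     # knowledge-base identifier the mention resolves to
    entity_label: str  # human-readable name of that knowledge-base entry


def link_entities(text: str, kb: dict) -> list[LinkedMention]:
    """Toy linker: looks up each known surface form in a hypothetical KB
    whose values are (entity_id, entity_label) pairs. A real system must
    instead disambiguate each mention using the surrounding context."""
    mentions = []
    for surface, (eid, label) in kb.items():
        pos = text.find(surface)
        if pos != -1:
            mentions.append(
                LinkedMention(surface, pos, pos + len(surface), eid, label))
    return mentions


# In a football context, "England" should resolve to the team, not the country.
kb = {"England": ("KB:0001", "England national football team")}
result = link_entities("England won the match against Germany.", kb)
print(result[0].entity_label)  # England national football team
```

The hard part, of course, is everything this sketch omits: finding the mention spans and choosing among many candidate entries per surface form, which is what the rest of this post is about.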
We built on this work in a second paper in the main conference, which introduces ++[a novel method to incorporate additional knowledge base information](https://www.amazon.science/publications/improving-entity-disambiguation-by-reasoning-over-a-knowledge-base)++ into the model, further improving its accuracy.

ReFinED surpasses state-of-the-art performance on standard EL datasets by an average of 3.7 points in F1 score (a measure that accounts for both false positives and false negatives), and it is 60 times faster than existing approaches with competitive performance.

ReFinED is capable of generalizing to large-scale knowledge bases such as Wikidata (which has 15 times as many entities as Wikipedia) and of zero-shot entity linking. The combination of speed, accuracy, and scale makes ReFinED an effective and cost-efficient system for extracting entities from web-scale datasets, and the model has been successfully deployed for this purpose within Amazon.

#### **Entity linking with fine-grained types and descriptions**

EL is challenging because entity mentions are often ambiguous, so EL systems must make effective use of context (the surrounding words) to reliably disambiguate entity mentions.

Recent EL systems use deep-learning methods to match mentions not with entities directly but with information stored in the knowledge base, such as textual entity descriptions or fine-grained entity types. This is advantageous for linking to entities not seen in the training data (zero-shot EL), because the information used to describe them will have properties the model has seen during training.

However, such zero-shot-capable approaches are an order of magnitude more computationally expensive than non-zero-shot models, as they require numerous entity types and/or multiple forward passes through the model to encode mentions and descriptions.
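A rough back-of-the-envelope illustration of why per-candidate forward passes dominate cost: the mention and candidate counts below are hypothetical, not measurements from the paper.

```python
# Compare model invocations for one document, assuming a cross-encoder that
# scores each (mention, candidate) pair with its own forward pass versus a
# single-pass design that encodes the whole document once and compares
# against pre-computed candidate representations. Numbers are illustrative.
mentions_per_doc = 20
candidates_per_mention = 30

# Cross-encoder style: one forward pass per mention-candidate pair.
cross_encoder_passes = mentions_per_doc * candidates_per_mention

# Single-pass style: one forward pass per document; candidate descriptions
# are encoded once, offline, and reused across documents.
single_pass_passes = 1

print(cross_encoder_passes)                      # 600
print(cross_encoder_passes / single_pass_passes)  # 600.0
```

Under these toy assumptions, the per-pair design runs the model 600 times per document where the single-pass design runs it once, and the gap grows with document length and candidate-set size.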
Such per-candidate costs make large-scale processing prohibitively expensive for some applications.

Like earlier zero-shot-capable models, ReFinED uses fine-grained entity types and entity descriptions to perform EL. But we use a simple Transformer-based encoder that outperforms more complex architectures, surpassing the state of the art on five EL datasets.

![下载 1.jpg](https://dev-media.amazoncloud.cn/73faad9a2bef467b9bcf7908b1a30cec_%E4%B8%8B%E8%BD%BD%20%281%29.jpg)

ReFinED computes two scores, a description score and a typing score, which evaluate the match between an input sentence and candidate entities in a knowledge base.

Unlike previous work, ReFinED performs mention detection (identifying entity mention spans), fine-grained entity typing (predicting entity types), and entity disambiguation (scoring entities) for all mentions within a document in a single forward pass, making it 60 times as fast as comparable models and therefore approximately 60 times more resource-efficient to run.

Under the hood, ReFinED is a Transformer-based neural network that computes two scores, a description score and an entity typing score, to indicate how suitable an entity is for a given mention.

#### **Incorporating relationship data**

One shortcoming of this approach is that some mentions have candidate entities that cannot be disambiguated using knowledge base entity descriptions and types alone.
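The description-and-typing scoring described above can be sketched as follows. The embeddings, type sets, and the additive way the two scores are combined here are toy assumptions for exposition, not the actual ReFinED architecture, and the sketch also shows how two similar candidates can end up nearly tied.

```python
# Toy two-score disambiguation: a description score (similarity between the
# mention's context embedding and the candidate's description embedding)
# plus a typing score (overlap between types predicted from context and the
# candidate's KB types). A real model learns this combination end to end.
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def disambiguation_score(mention_emb, predicted_types, candidate):
    desc_score = cosine(mention_emb, candidate["description_emb"])
    type_score = (len(predicted_types & candidate["types"])
                  / max(len(predicted_types), 1))
    return desc_score + type_score


# Hypothetical candidates for the mention "Clinton": near-identical
# descriptions and identical types yield near-identical scores.
mention_emb = [0.9, 0.1, 0.2]
predicted_types = {"person", "politician"}
bill = {"description_emb": [0.8, 0.2, 0.20], "types": {"person", "politician"}}
hillary = {"description_emb": [0.8, 0.2, 0.21], "types": {"person", "politician"}}

print(round(disambiguation_score(mention_emb, predicted_types, bill), 3))
print(round(disambiguation_score(mention_emb, predicted_types, hillary), 3))
```

When both candidates are people with similar descriptions and the same types, the two signals give the model essentially no basis for a decision, which is exactly the failure mode the next example exhibits.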
As an illustration, consider the following sentence, along with the entity descriptions and types for two entities that “Clinton” could refer to:

![下载 2.jpg](https://dev-media.amazoncloud.cn/7ec0d630edc741ea965121c92b0259bc_%E4%B8%8B%E8%BD%BD%20%282%29.jpg)

Sometimes, description and type information is not enough to distinguish two knowledge base entries.

Given only the context of the sentence and the knowledge base descriptions and types, it is not possible to correctly decide whether the sentence is referring to Hillary Clinton or Bill Clinton.

Our second NAACL paper, “++[Improving entity disambiguation by reasoning over a knowledge base](https://www.amazon.science/publications/improving-entity-disambiguation-by-reasoning-over-a-knowledge-base)++”, addresses this drawback. We propose an approach that uses additional knowledge base facts associated with the candidate entities.

Knowledge base facts encode the relations between pairs of entities, as in the following examples:

![下载 3.jpg](https://dev-media.amazoncloud.cn/beb3d0bcd7f845938234cb7b4d3504cc_%E4%B8%8B%E8%BD%BD%20%283%29.jpg)

Where type and description information is inadequate to distinguish candidate entities, the model uses additional knowledge base facts.

To use this type of information, we gave our model an additional mechanism that allows it to predict the relationships connecting pairs of mentions in the text. For example, the model would infer from the sentence context that “Clinton”’s place of birth and place of education are “Hope, Arkansas” and “Hot Springs High School”, respectively. We can then match these inferences against facts in the knowledge base.

In this case, as the diagram below shows, we would find that the two predictions match the knowledge base facts for Bill Clinton but not for Hillary Clinton.
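This fact-matching step can be sketched as follows. The relation names, fact triples, and flat per-match bonus are illustrative stand-ins for the learned mechanism described in the paper, not its actual implementation.

```python
# Toy fact matching: (relation, object) pairs predicted from the sentence
# context are checked against each candidate's knowledge-base facts, and
# matches add a bonus to that candidate's disambiguation score.
def fact_match_bonus(predicted_relations, kb_facts, bonus_per_match=1.0):
    """Count how many context-predicted (relation, object) pairs appear
    among the candidate's KB facts and return the resulting score bonus."""
    matches = sum(1 for rel in predicted_relations if rel in kb_facts)
    return matches * bonus_per_match


# Relations the model infers for the mention "Clinton" from the sentence.
predicted = {("place_of_birth", "Hope, Arkansas"),
             ("educated_at", "Hot Springs High School")}

# Hypothetical KB fact sets for the two candidate entities.
bill_facts = {("place_of_birth", "Hope, Arkansas"),
              ("educated_at", "Hot Springs High School"),
              ("position_held", "President of the United States")}
hillary_facts = {("place_of_birth", "Chicago, Illinois"),
                 ("position_held", "United States Secretary of State")}

print(fact_match_bonus(predicted, bill_facts))     # 2.0
print(fact_match_bonus(predicted, hillary_facts))  # 0.0
```

In this toy version, both inferred relations match Bill Clinton's facts and neither matches Hillary Clinton's, so only his score receives a boost.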
As a result, our model would boost the score for Bill Clinton and, we hope, make the correct prediction.

![下载 4.jpg](https://dev-media.amazoncloud.cn/3210d351249143feb9db434ab76d8a51_%E4%B8%8B%E8%BD%BD%20%284%29.jpg)

The addition of a mechanism that exploits knowledge base facts increases model accuracy.

By adding this mechanism to the model, we were able to improve on state-of-the-art performance by 1.3 F1 points on average across six commonly used datasets and by 12.7 F1 points on the “ShadowLink” dataset, which focuses on particularly challenging examples.

ABOUT THE AUTHOR

#### **[Tom Ayoola](https://www.amazon.science/author/tom-ayoola)**

Tom Ayoola is an applied scientist in the Alexa AI organization.

#### **[Joseph Fisher](https://www.amazon.science/author/joseph-fisher)**

Joseph Fisher is an applied scientist in the Alexa AI organization.