# Improving unsupervised sentence-pair comparison

{"value":"Many tasks in natural-language processing and information retrieval involve pairwise comparisons of sentences — for example, sentence similarity detection, paraphrase identification, question-answer entailment, and textual entailment.\n\nThe most accurate method of sentence comparison is so-called cross-encoding, which maps sentences against each other on a pair-by-pair basis. Training cross-encoders, however, requires annotated training data, which is labor intensive to collect.\n\nHow can we train completely unsupervised models for sentence-pair tasks, eliminating the need for data annotation?\n\nAt this year’s International Conference on Learning Representations ([ICLR](https://www.amazon.science/conferences-and-events/iclr-2022)), we are presenting an unsupervised sentence-pair model we call a trans-encoder ([paper](https://www.amazon.science/publications/trans-encoder-unsupervised-sentence-pair-modelling-through-self-and-mutual-distillations), [code](https://github.com/amzn/trans-encoder)), which improves on the prior state of the art by up to 5% on sentence similarity benchmarks.\n\n\n#### **A tale of two encoders**\n\n\nToday, there are basically two paradigms for sentence-pair tasks: cross-encoders and bi-encoders. The choice between the two comes down to the standard trade-off between computational efficiency and performance.\n\n![下载.jpg](https://dev-media.amazoncloud.cn/d328eb0963234e6cb2719d7d5e8a35d9_%E4%B8%8B%E8%BD%BD.jpg)\n\nCross-encoder (left) and bi-encoder (right).\n\n**Cross-encoder**. In a cross-encoder, two sequences are concatenated and sent in one pass to the sentence pair model, which is usually built atop a Transformer-based language model like [BERT ](https://arxiv.org/abs/1810.04805)or [RoBERTa](https://arxiv.org/abs/1907.11692). The attention heads of a Transformer can directly model which elements of one sequence correlate with which elements of the other, enabling the computation of an accurate classification/relevance score.\n\nHowever, a cross-encoder needs to compute a new encoding for every pair of input sentences, resulting in high computational overhead. Cross-encoding is thus impractical for tasks like information retrieval and clustering, which involve massive pairwise sentence comparisons. Also, converting pretrained language models (PLMs) into cross-encoders always requires fine-tuning on annotated data.\n\n**Bi-encoder**. By contrast, in a bi-encoder, each sentence is encoded separately and mapped to a common embedding space, where the distances between them can be measured. As the encoded sentences can be cached and reused, bi-encoding is much more efficient, and the outputs of a bi-encoder can be used off-the-shelf as sentence embeddings for downstream tasks.\n\nThat said, it is well known that in supervised learning, bi-encoders underperform cross-encoders, since they don’t explicitly model interactions between sentences.\n\n\n#### **Trans-encoder: The best of both worlds**\n\nIn our ICLR paper, we ask whether we can leverage the advantages of both bi- and cross-encoders to bootstrap an accurate sentence-pair model in an unsupervised manner.\n\nOur answer — the trans-encoder — is built on the following intuition: As a starting point, we can use bi-encoder representations to fine-tune a cross-encoder. With its more powerful inter-sentence modeling, the cross-encoder should extract more knowledge from the PLMs than the bi-encoder can given the same input data. 
#### **Trans-encoder: The best of both worlds**

In our ICLR paper, we ask whether we can leverage the advantages of both bi- and cross-encoders to bootstrap an accurate sentence-pair model in an unsupervised manner.

Our answer — the trans-encoder — is built on the following intuition: As a starting point, we can use bi-encoder representations to fine-tune a cross-encoder. With its more powerful inter-sentence modeling, the cross-encoder should extract more knowledge from the PLMs than the bi-encoder can given the same input data. In turn, the more powerful cross-encoder can distill its knowledge back into the bi-encoder, improving the accuracy of the more computationally practical model. We can repeat this cycle to iteratively bootstrap both the bi- and cross-encoders.

![The trans-encoder training process](https://dev-media.amazoncloud.cn/4b26410038094a8b9eb9b5abdcf429ec_%E4%B8%8B%E8%BD%BD.jpg)

The trans-encoder training process, in which a bi-encoder trained in an unsupervised fashion creates training targets for a cross-encoder, which in turn outputs training targets for the bi-encoder.

Specifically, the process of training a trans-encoder is as follows:

**Step 1. Transform PLMs into effective bi-encoders**. To transform existing PLMs into bi-encoders, we leverage a simple contrastive-tuning procedure. Given a sentence, we encode it twice with the same PLM. Because of dropout — a standard technique in which a fraction of neural-network nodes are randomly dropped during each pass through the training data, to prevent overfitting — the two passes will produce slightly different encodings.

The bi-encoder is then trained to maximize the similarity of the two almost-identical encodings. This step primes the PLMs to be good at embedding sequences. Details can be found in the prior works [Mirror-BERT](https://arxiv.org/abs/2104.08027) and [SimCSE](https://arxiv.org/abs/2104.08821).
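
The sketch below illustrates this dropout-based contrastive objective in the spirit of SimCSE and Mirror-BERT. It is a simplified illustration, not our exact recipe: the `bert-base-uncased` checkpoint, mean pooling, and the temperature value are assumptions made for the example.

```python
# Simplified sketch of the dropout-based contrastive objective (in the spirit
# of SimCSE / Mirror-BERT). Checkpoint, pooling, and temperature are
# illustrative choices, not the paper's exact setup.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.train()  # keep dropout active so two passes give slightly different encodings

sentences = ["A man is playing a guitar.",
             "The weather is sunny today.",
             "Two dogs are running in a field."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

def mean_pool(hidden, mask):
    # Average the token embeddings, ignoring padding positions.
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)

# Two forward passes over the same batch; dropout makes z1 and z2 differ slightly.
z1 = mean_pool(model(**batch).last_hidden_state, batch["attention_mask"])
z2 = mean_pool(model(**batch).last_hidden_state, batch["attention_mask"])

# InfoNCE loss: each sentence's two encodings form a positive pair, and the
# other sentences in the batch act as in-batch negatives.
temperature = 0.05
sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
labels = torch.arange(len(sentences))
loss = F.cross_entropy(sim, labels)
loss.backward()  # an optimizer step over the PLM's parameters would follow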
**Step 2. Self-distillation: bi- to cross-encoder**. After obtaining a reasonably good bi-encoder from step one, we use it to create training data for a cross-encoder. Specifically, we label sentence pairs with the pairwise similarity scores computed by the bi-encoder and use those scores as training targets for a cross-encoder built on top of a new PLM.
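
A minimal sketch of this pseudo-labeling step, again with sentence-transformers, might look as follows. The tiny pair list, the `all-MiniLM-L6-v2` and `roberta-base` checkpoints, and the training hyperparameters are all illustrative assumptions; in practice a large set of unlabeled sentence pairs is self-labeled.

```python
# Sketch of the bi-to-cross self-distillation step (step 2). The pair list,
# checkpoints, and hyperparameters are purely illustrative.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, util

pairs = [("A man is playing a guitar.", "Someone is strumming a guitar."),
         ("The weather is sunny today.", "It is raining heavily.")]

# 1) Score the unlabeled pairs with the contrastively tuned bi-encoder.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder bi-encoder
emb1 = bi_encoder.encode([a for a, _ in pairs], convert_to_tensor=True)
emb2 = bi_encoder.encode([b for _, b in pairs], convert_to_tensor=True)
pseudo_labels = util.cos_sim(emb1, emb2).diagonal().tolist()

# 2) Use those scores as regression targets for a cross-encoder built on a fresh PLM.
train_examples = [InputExample(texts=[a, b], label=score)
                  for (a, b), score in zip(pairs, pseudo_labels)]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)

cross_encoder = CrossEncoder("roberta-base", num_labels=1)
cross_encoder.fit(train_dataloader=train_loader, epochs=1, warmup_steps=0)
```
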
**Step 3. Self-distillation: cross- to bi-encoder**. A natural next step is to distill the extra knowledge gained by the cross-encoder back into bi-encoder form, which is more useful for downstream tasks. More important, a better bi-encoder can produce even better self-labeled data for tuning the cross-encoder. In this way we can repeat steps two and three, continually bootstrapping the performance of both encoders.

Our paper proposes other techniques, such as mutual distillation, to improve our model’s performance. Please refer to Section 2.4 of the [paper](https://openreview.net/forum?id=AmUhwTOHgm) for more details.
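
Continuing the sketch above (and reusing its `pairs`, `bi_encoder`, and `cross_encoder` objects), the reverse distillation direction can be illustrated as below. Using `CosineSimilarityLoss` as the regression objective is an assumption made for this example, not necessarily our exact loss.

```python
# Sketch of the cross-to-bi self-distillation step (step 3), continuing the
# previous snippet: `pairs`, `bi_encoder`, and `cross_encoder` are reused.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

# Relabel the pairs with the (now stronger) cross-encoder, then regress the
# bi-encoder's cosine similarities onto those scores.
cross_scores = cross_encoder.predict(pairs)
bi_examples = [InputExample(texts=[a, b], label=float(s))
               for (a, b), s in zip(pairs, cross_scores)]
bi_loader = DataLoader(bi_examples, shuffle=True, batch_size=2)

bi_encoder.fit(
    train_objectives=[(bi_loader, losses.CosineSimilarityLoss(bi_encoder))],
    epochs=1,
    warmup_steps=0,
)
```

Re-running the pseudo-labeling sketch from step two with this refreshed bi-encoder, and then this distillation step again, closes the loop and implements the iterative bootstrapping described above.
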
#### **Benchmark: A new state of the art for sentence similarity**

We experiment with the trans-encoder on seven semantic textual similarity (STS) benchmarks. We observe significant improvements over previous unsupervised sentence-pair models across all datasets.

![Trans-encoder results on the STS benchmarks](https://dev-media.amazoncloud.cn/e2a7550443f149e9a721837c1a548f22_%E4%B8%8B%E8%BD%BD.jpg)

Trans-encoder performance on the semantic textual similarity (STS) benchmarks STS 2012-2016, STS-B, and SICK-R.

We also evaluate the trans-encoder on binary-classification and domain-transfer tasks. Please refer to Section 5 of the [paper](https://openreview.net/forum?id=AmUhwTOHgm) for more details.

ABOUT THE AUTHORS

#### **Fangyu Liu**

Fangyu Liu, a PhD student in computation, cognition, and language at the University of Cambridge, was an intern at Amazon when the work was done.

#### **[Yunlong Jiao](https://www.amazon.science/author/yunlong-jiao)**

Yunlong Jiao is an applied scientist with Alexa Shopping.