# New dataset, metrics enable evaluation of bias in language models

{"value":"Language models, which encode the probabilities of particular sequences of words, have been much in the news lately for their almost uncanny ability to produce long, largely coherent texts from just a few seed words, or “prompts”.\n\nLanguage models are also crucial to commercial AI systems that perform tasks like automatic speech recognition, machine translation, and question answering, among other things.\n\nOne reason language models can produce such convincing synthetic texts is that they’re trained on real texts. When the real texts encode harmful social biases, the resulting language models may, too — and so will the applications that depend on them.\n\nEarlier this month, at the ACM Conference on Fairness, Accountability, and Transparency (FAccT), my colleagues and I presented ++[a paper](https://www.amazon.science/publications/bold-dataset-and-metrics-for-measuring-biases-in-open-ended-language-generation)++ in which we describe a new dataset, with more than 23,000 text generation prompts, for testing language models for bias.\n\nWe also described a set of metrics for automatically measuring bias in the resulting texts and showed that they correlated well with human assessments of bias.\n\nOur dataset, which we call ++[BOLD](https://github.com/amazon-research/bold)++, for bias in open-ended language generation dataset, is designed to measure bias across five categories: profession, gender, race, religious belief, and political ideology. \n\nEach of our prompts consists of the first six to nine words of a sentence on Wikipedia. To extract the prompts, we first identified articles that fit into any of our five categories. For professions, we found articles classified according to the ++[18 high-level profession classifications](https://en.wikipedia.org/wiki/Lists_of_occupations)++ in Wikipedia’s taxonomy. To avoid confounding attitudes toward professions with attitudes toward gender, we used only articles about male and female actors to produce our gender prompts. Binary gender classification is one of the limitations of this initial version of our dataset.\n\n![三.gif](https://dev-media.amazoncloud.cn/bdad0546bff8472a887e81fa86d682db_%E4%B8%89.gif)\n\nAn example of our prompt selection method and of the way in which language models can generate texts whose sentiments are more negative than those of the prompt sources.\n\nCREDIT: GLYNIS CONDON\n\nThe racial categories we considered are European Americans, African Americans, Asian Americans, and Latino/Hispanic Americans. From Wikipedia’s list of political ideologies, we selected the classes socialism, populism, nationalism, liberalism, fascism, democracy, conservatism, communism, anarchism, left-wing, and right-wing. Finally, we also used the most common classes from Wikipedia’s list of religious and spiritual beliefs: Sikhism, Judaism, Islam, Hinduism, Christianity, Buddhism, and atheism.\n\nFrom every article that fit into one of these categories, we extracted sentences in which the relevant category term — the name of a profession or a religion, for instance, or of a person whose race or gender was identified in the article metadata — occurred no later than the sixth word of the sentence. Then we kept only enough of the sentence to include the category term and the first five words other than the category term. 
To evaluate bias in the sentences that language models produce from these prompts, we measure five properties: (1) sentiment, or whether the individual words of the sentence indicate positive or negative disposition toward the topic; (2) toxicity, or whether the language used is disrespectful, abusive, unpleasant, or harmful; (3) regard, or whether the sentence as a whole indicates positive or negative disposition, regardless of the valences of individual words; (4) psycholinguistic norms, or the emotions conveyed by word choices, such as joy, anger, or sadness; and (5) gender polarity, or whether a particular class of prompts produces sentences that skew more male or more female.

#### **Methods**

To measure sentiment and regard, we used off-the-shelf classifiers. To measure toxicity, we used a BERT language model fine-tuned on a public toxic-comment dataset.

To measure psycholinguistic norms, we first expanded an existing lexicon of words and their emotional values by using deep learning to predict the emotional values of words not yet in the lexicon. We use a weighted average to aggregate the emotional values of individual words into a value for a complete sentence.

We took two different approaches to measuring gender polarity. In one, we first use embeddings — vector representations of words that capture something of their semantic content — to determine whether particular words are more usually associated with men or women. Then we use a weighted average to aggregate the gender polarities of individual words into a sentence-level score.

In the other approach, we simply take the most gender-polar word in the text and, if it crosses a threshold that we determined empirically, based on annotated texts, we designate the whole text as having that gender polarity.
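The sketch below is a rough Python rendering of these two gender-polarity scores, not the paper's implementation. It assumes that a word's polarity is the difference between its cosine similarity to the embeddings of "she" and "he", that the aggregation weights are the magnitudes of those polarities, and that 0.25 is the decision threshold; `embed` stands in for any pretrained word-embedding lookup.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def word_polarity(word: str, embed) -> float:
    """Positive values lean female, negative lean male (sign convention assumed)."""
    v = embed(word)
    return cosine(v, embed("she")) - cosine(v, embed("he"))


def weighted_polarity_score(text: str, embed) -> float:
    """First approach: weighted average of word polarities, with weights
    proportional to each word's polarity magnitude (an assumed weighting)."""
    scores = np.array([word_polarity(w, embed) for w in text.split()])
    weights = np.abs(scores)
    return float((weights * scores).sum() / (weights.sum() + 1e-8))


def max_word_polarity_label(text: str, embed, threshold: float = 0.25) -> str:
    """Second approach: label the text by its most gender-polar word, if that
    word crosses a threshold (0.25 is a placeholder, not the paper's value)."""
    scores = [word_polarity(w, embed) for w in text.split()]
    if not scores:
        return "neutral"
    extreme = max(scores, key=abs)
    if abs(extreme) < threshold:
        return "neutral"
    return "female" if extreme > 0 else "male"
```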
We applied these metrics as follows: for professions, we measure gender polarity (do texts generated from profession prompts skew more toward one gender or the other?); for gender, we measure the sentiment and psycholinguistic norms of texts produced from gender-specific prompts; for race, we measure the sentiment and the regard of texts produced from race-specific prompts; and for religious beliefs and political ideologies, we measure sentiment, with an additional comparison of the psycholinguistic norms for Islam and Christianity.

We applied our methodology to five popular language models: BERT, GPT-2, and the CTRL models CTRL-WIKI, CTRL-THT, and CTRL-OPN.
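For a sense of how such an evaluation runs end to end, the sketch below generates continuations for a couple of illustrative BOLD-style prompts with GPT-2 (one of the models listed above) and scores them with a generic off-the-shelf sentiment classifier via Hugging Face pipelines. It is a toy setup under those assumptions; the paper's own classifiers, decoding settings, and full prompt set (available in the BOLD repository) differ.

```python
from transformers import pipeline

# GPT-2 generator and a generic sentiment classifier; the paper's exact
# models and decoding parameters are not reproduced here.
generator = pipeline("text-generation", model="gpt2")
sentiment = pipeline("sentiment-analysis")

# Illustrative stand-ins for prompts; the real ~23,000 prompts are in the BOLD repo.
prompts = [
    "The flight nurse arrived at the",
    "As a political ideology, socialism is",
]

for prompt in prompts:
    full_text = generator(prompt, max_new_tokens=30, do_sample=False)[0]["generated_text"]
    continuation = full_text[len(prompt):].strip()
    result = sentiment(continuation)[0]
    print(f"{prompt!r} -> {result['label']} ({result['score']:.2f})")
```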
We did indeed find evidence of bias: for instance, atheism and Islam generated more negative sentiment than other religious or spiritual beliefs, and prompts using the names of African Americans generated more negative sentiment and toxic language than those using the names of people from other ethnic groups.

To validate our metrics, we took a subset of our scored synthetic texts and gave them to workers recruited through Amazon Mechanical Turk for assessment. Our metrics performed well, with accuracy rates and true-negative rates of better than 90% for gender polarity and better than 80% for sentiment and toxicity.

This is a strong signal that existing language models do indeed reflect biases in the texts used to create them and that remediating those biases should be a subject of further study.

ABOUT THE AUTHOR

#### **[Jwala Dhamala](https://www.amazon.science/author/jwala-dhamala)**

Jwala Dhamala is a research scientist in the Alexa AI Natural Understanding group.