{"value":"At the 2021 Conference of the European Association for Computational Linguistics ([EACL](https://2021.eacl.org/)), we received honorable mention in the best-long-paper category for our paper \"Hidden biases in unreliable news detection datasets”, coauthored with Xiang Zhou (while he was an Amazon intern) and Mohit Bansal from the University of North California at Chapel Hill.\n\nIn this paper, we studied datasets used by the research community for developing models to automatically identify unreliable news. We found that the datasets had biases that are responsible for much of the accuracy in identifying unreliable news that previous papers reported. This suggests that models built on these datasets will not generalize well in a real-world setting. \n\n![image.png](https://dev-media.amazoncloud.cn/cc6d729b7a1148979086b98fc0f78689_image.png)\n\nA word cloud of keywords and keyword phrases from the titles of news articles in one dataset that are correlated with incorrect prediction of article accuracy. The size of a word indicates the strength of the correlation. Models trained on the dataset are most prone to errors on the topics of politics and world news and most accurate on sports and entertainment (see below), an indication of bias in the dataset.\n\nTo provide the research community a path forward, we followed up the analysis with a detailed study of the structure of the bias, guidelines for reducing the bias in existing datasets, and guidelines for developing higher-quality datasets in the future. \n\n#### **Data collection**\n\nWe started our analysis by looking at the data collection strategies used for creating unreliable-news-article datasets. Creating such datasets requires collecting news articles and their corresponding labels (for instance, “reliable” or “unreliable”). \n\nAs expected, collecting the labels is the most challenging task. Some fact-checking websites (e.g., PolitiFact, GossipCop) assign labels to individual articles. While this provides accurate labels, the process is both time consuming and expensive, resulting in comparatively small datasets. \n\nAn approach that scales better is assigning a reliability (or bias) score to each news outlet (or site, such as cnn.com or nytimes.com). This is as an easy way to create large-scale datasets, but it generates noisy labels. We studied biases in datasets that take both approaches — site- and article-level labeling.\n\n#### **Keyword correlations**\n\nAs a representative example of a dataset that is annotated at the article level, we studied the popular FakeNewsNet dataset. We trained a simple (logistic-regression) model to predict the labels (“reliable” or “unreliable”) of news items in the dataset on the basis of keywords and found that its accuracy (78%) was almost as high as that of a state-of-the-art BERT-based model (81%). Examining the keywords that drove the model’s performance, we found that celebrity names (“Brad”, “Pitt”, “Jenner”, etc.) predicted the “unreliable” label, while neutral terms like “2018” or “season” predicted the “reliable” label. \n\nThese results indicate that the ability to predict the labels of the articles in such datasets may depend on the presence of simple keywords that flag topics, such as celebrity news, rather than any deeper pattern. This implies that the dataset composition is biased, because it has strong correlations between topic words and the unreliable-news label. 
This topical bias stems partly from the fact-checking sites’ article selections. Another source of bias is that, in constructing FakeNewsNet, the authors used a web search engine, with its own proprietary news-ranking and verification processes, to retrieve the full texts of the news articles (which the fact-checking sites do not provide). This sometimes results in mismatches, in which unreliable content is replaced with reliable content without an update to the label.

#### **Site classification**

We also studied the NELA dataset, which uses site-level labels. We found even more challenges with site-level labeling, mostly due to the weak labeling process: an article from a supposedly unreliable news source can be perfectly factual, and vice versa.

While the literature reports models that are highly accurate at labeling news articles from NELA and similar datasets as reliable or unreliable, we found that much of that accuracy is due to having articles from the same sites in both the training and test data. This means that the model can ignore the task of identifying unreliable content and simply learn that particular sites are labeled reliable or unreliable.

To demonstrate this point, we conducted a “random labels” experiment, in which we randomly shuffled all the site-level labels so that they no longer represented the reliability of a site but were just an arbitrary feature of the site itself. We found that models trained on the random labels performed within 2% of the accuracy of models trained on the true labels. (These models are learning to identify sites, but that isn’t practically useful, because the site name is included in any given article’s web address.)
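The sketch below shows one way to set up such a control, again under assumed column names (`site`, `label`): the labels are randomly permuted across sites, so they carry no information about reliability, and the same classifier is then retrained.

```python
# "Random labels" control (illustrative sketch): permute the site-level labels
# across sites so that each label is just an arbitrary attribute of a site,
# then retrain the same classifier. If accuracy barely drops, the model is
# recognizing sites rather than detecting unreliable content.
import numpy as np
import pandas as pd

def shuffle_site_labels(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Reassign site-level labels by a random permutation over sites."""
    rng = np.random.default_rng(seed)
    site_labels = df.drop_duplicates("site").set_index("site")["label"]
    permuted = pd.Series(rng.permutation(site_labels.to_numpy()),
                         index=site_labels.index)
    out = df.copy()
    out["label"] = out["site"].map(permuted)
    return out

# Usage (with df and the training code from the previous sketch):
# df_random = shuffle_site_labels(df)
# ...train on df_random and compare its test accuracy with the true-label model...
```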
We also showed that while a clean train/test site split is necessary, it is not sufficient to measure a given model’s generalization power. We further tested different site splits and found that performance varies with how similar the test-set sites are to the training-set sites: the more similar the two groups of sites, the higher the accuracy on the test set.
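Here is a sketch of a site-disjoint split using scikit-learn’s `GroupShuffleSplit`, with the site as the group key (column names again assumed as above). This enforces the necessary condition; checking how similar the held-out sites are to the training sites, as we did in the paper, is a separate step not shown here.

```python
# Site-disjoint train/test split (illustrative sketch): no site that appears in
# training also appears in the test set, so a model cannot score well simply by
# memorizing which sites carry which label.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_site(df: pd.DataFrame, test_size: float = 0.2, seed: int = 0):
    """Return train/test frames whose 'site' values do not overlap."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df["site"]))
    train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
    assert set(train_df["site"]).isdisjoint(set(test_df["site"]))
    return train_df, test_df
```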
We then took properly split datasets (those with low similarity between training-set and test-set sites), trained models on them, and examined what kinds of articles were most likely to be erroneously identified as reliable or unreliable. We discovered that the models are most prone to errors on politics and world news and most accurate on sports and entertainment. News reliability matters for any topic, but the degraded performance on politics and world news underscores the importance of improving data for unreliable-news detection.

![image.png](https://dev-media.amazoncloud.cn/d32c4793c4df401c9f043a28f86b00f1_image.png)

A word cloud of keywords and keyword phrases from the titles of news articles in one dataset that are correlated with correct prediction of article reliability. The size of a word indicates the strength of the correlation.

#### **Recommendations**

Our paper showed that, to ensure that improvements in model performance reflect real unreliable-news detection capabilities, the community needs to make several changes to data collection, dataset construction, and experimental design. To facilitate these changes, we provide a table of recommended best practices (see below). We hope that this paper will stimulate quality improvements in unreliable-news modeling, analysis, and data. All of our code is released under the Apache 2.0 license and is [available on GitHub](https://github.com/alexa/unreliable-news-detection-biases).

![image.png](https://dev-media.amazoncloud.cn/80d3d0f0821e445694a11245bd3428a6_image.png)

ABOUT THE AUTHORS

#### **[Heba Elfardy](https://www.amazon.science/author/heba-elfardy)**

Heba Elfardy is an applied scientist in the Alexa AI organization.

#### **[Christos Christodoulopoulos](https://www.amazon.science/author/christos-christodoulopoulos)**

Christos Christodoulopoulos is an applied scientist in the Alexa AI organization.

#### **[Thomas Butler](https://www.amazon.science/author/thomas-butler)**

Thomas Butler is a senior applied scientist in the Alexa AI organization.