{"value":"Differential privacy provides a way to quantify the privacy risks posed by aggregate statistics based on private data. One of its key ideas is that adding noise to data before generating a statistic can protect privacy.\n\nIn the context of machine learning, that means adding noise to training examples before they’re used to train a model. But while that makes it harder for an attacker to identify individual data in the training set, it also tends to reduce the model’s accuracy.\n\nAt the 16th conference of the European Chapter of the Association for Computational Linguistics ([EACL](https://2021.eacl.org/)), my colleagues and I will present [a paper](https://www.amazon.science/publications/adept-auto-encoder-based-differentially-private-text-transformation) where we propose a new differentially private text transformation algorithm, known as ADePT (for autoencoder-based differentially private text), that preserves privacy without losing model utility.\n\n![image.png](https://dev-media.amazoncloud.cn/94620d638631494b9943bb1cafc4efb3_image.png)\n\nAdding noise between the encoder and decoder of an autoencoder transforms the input text while preserving its utility as a training example for a natural-language-understanding system.\nCREDIT: GLYNIS CONDON\n\nADePT uses an autoencoder, a neural network that’s trained to output exactly what it takes as input. But between input and output, the network squeezes its representation of the input data into a relatively small vector. During training, the network learns to produce a vector (an encoding) that preserves enough information about the input that it can be faithfully reconstructed, or decoded.\n\nWith ADePT, we train an autoencoder on phrases typical of the natural-language-understanding (NLU) system we want to build. But at run time, we add noise to the encoding vector before it passes to the decoder. 
Consequently, the vector that the decoder sees doesn’t exactly encode the input phrase; it encodes a phrase near the input phrase in the representation space.\n\nThe output of the decoder is thus an approximation of the input, rather than a reconstruction of it. For instance, given the input “What are the flights on January first 1992 from Boston to San Francisco?”, our noisy autoencoder produced the question “What are the flights on Thursday going from Dallas to San Francisco?” We use the transformed phrases, rather than the original inputs, to train our NLU model.\n\nThe idea behind differential privacy is that, statistically, it should be impossible to tell whether a given data item is or is not in the dataset used to produce an aggregate statistic (or, in this case, to train a machine learning model). More precisely, the difference between the probabilities that the item is or isn’t in the dataset should be below a threshold value.\n\nAccordingly, to evaluate the privacy protection afforded by our transformation algorithm, we test it against an attack known as a membership inference attack (MIA). An MIA infers whether a given data point was part of the target model’s training data. The attacker trains an attack model, essentially a binary classifier, that labels an input sample as a member (present in the training data) or a non-member (not present in the training data). The more accurate this attack model, the less privacy protection the transformation provides.\n\nIn our tests, the target of the attack is an intent classifier trained on transformed versions of the data in the widely used datasets ATIS and SNIPS. 
Below are some anecdotal examples showing that our model’s text transformations offer greater semantic coherence than the baseline:\n\n![image.png](https://dev-media.amazoncloud.cn/b0da2ebab03a4e9987cdf4002b579740_image.png)\n\nOverall, our experiments show that our transformation technique significantly improves model performance over the previous state of the art, while also improving robustness against membership inference attacks.\n\nABOUT THE AUTHOR\n\n#### **[Satyapriya Krishna](https://www.amazon.science/author/satyapriya-krishna)**\n\nSatyapriya Krishna is an applied scientist in the Alexa AI organization.","render":"<p>Differential privacy provides a way to quantify the privacy risks posed by aggregate statistics based on private data. One of its key ideas is that adding noise to data before generating a statistic can protect privacy.</p>\n<p>In the context of machine learning, that means adding noise to training examples before they’re used to train a model. But while that makes it harder for an attacker to identify individual data in the training set, it also tends to reduce the model’s accuracy.</p>\n<p>At the 16th conference of the European Chapter of the Association for Computational Linguistics (<a href=\"https://2021.eacl.org/\" target=\"_blank\">EACL</a>), my colleagues and I will present <a href=\"https://www.amazon.science/publications/adept-auto-encoder-based-differentially-private-text-transformation\" target=\"_blank\">a paper</a> where we propose a new differentially private text transformation algorithm, known as ADePT (for autoencoder-based differentially private text), that preserves privacy without losing model utility.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/94620d638631494b9943bb1cafc4efb3_image.png\" alt=\"image.png\" /></p>\n<p>Adding noise between the encoder and decoder of an autoencoder transforms the input text while preserving its utility as a training example for a natural-language-understanding system.<br />\nCREDIT: 
GLYNIS CONDON</p>\n<p>ADePT uses an autoencoder, a neural network that’s trained to output exactly what it takes as input. But between input and output, the network squeezes its representation of the input data into a relatively small vector. During training, the network learns to produce a vector (an encoding) that preserves enough information about the input that it can be faithfully reconstructed, or decoded.</p>\n<p>With ADePT, we train an autoencoder on phrases typical of the natural-language-understanding (NLU) system we want to build. But at run time, we add noise to the encoding vector before it passes to the decoder. Consequently, the vector that the decoder sees doesn’t exactly encode the input phrase; it encodes a phrase near the input phrase in the representation space.</p>\n<p>The output of the decoder is thus an approximation of the input, rather than a reconstruction of it. For instance, given the input “What are the flights on January first 1992 from Boston to San Francisco?”, our noisy autoencoder produced the question “What are the flights on Thursday going from Dallas to San Francisco?” We use the transformed phrases, rather than the original inputs, to train our NLU model.</p>\n<p>The idea behind differential privacy is that, statistically, it should be impossible to tell whether a given data item is or is not in the dataset used to produce an aggregate statistic (or, in this case, to train a machine learning model). More precisely, the difference between the probabilities that the item is or isn’t in the dataset should be below a threshold value.</p>\n<p>Accordingly, to evaluate the privacy protection afforded by our transformation algorithm, we test it against an attack known as a membership inference attack (MIA). An MIA infers whether a given data point was part of the target model’s training data. 
The attacker trains an attack model, essentially a binary classifier, that labels an input sample as a member (present in the training data) or a non-member (not present in the training data). The more accurate this attack model, the less privacy protection the transformation provides.</p>\n<p>In our tests, the target of the attack is an intent classifier trained on transformed versions of the data in the widely used datasets ATIS and SNIPS. Below are some anecdotal examples showing that our model’s text transformations offer greater semantic coherence than the baseline:</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/b0da2ebab03a4e9987cdf4002b579740_image.png\" alt=\"image.png\" /></p>\n<p>Overall, our experiments show that our transformation technique significantly improves model performance over the previous state of the art, while also improving robustness against membership inference attacks.</p>\n<p>ABOUT THE AUTHOR</p>\n<h4><a id=\"Satyapriya_Krishnahttpswwwamazonscienceauthorsatyapriyakrishna_29\"></a><strong><a href=\"https://www.amazon.science/author/satyapriya-krishna\" target=\"_blank\">Satyapriya Krishna</a></strong></h4>\n<p>Satyapriya Krishna is an applied scientist in the Alexa AI organization.</p>\n"}