Explaining changes in real-world data


The success of deep learning is a testament to the power of statistical correlation: if certain image features are consistently correlated with the label “cat”, you can teach a machine learning model to identify cats.

But sometimes, correlation is not enough; you need to identify causation. For example, during the COVID-19 pandemic, a retailer might have seen a sharp decline in its inventory for a particular product. What caused that decline? An increase in demand? A shortage in supply? Delays in shipping? The failure of a forecasting model? The remedy might vary depending on the cause.

Earlier this month, at the International Conference on Artificial Intelligence and Statistics (AISTATS), my colleagues and I [presented a new technique](https://www.amazon.science/publications/why-did-the-distribution-change) for identifying the causes of shifts in a probability distribution. Our approach involves causal graphs, which are [graphical](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) blueprints of sequential processes.

Each node of the graph, together with its incoming edges, represents a causal mechanism, or the probability that a given event will follow from the event that precedes it. We show how to compute the contribution that changes in the individual mechanisms make to changes in the probability of the final outcome.

We tested our approach using simulated data, so that we could stipulate the probabilities of the individual causal mechanisms, giving us a ground truth to measure against. Our approach yielded estimates that were very close to the ground truth: a deviation of only 0.29 according to [L1 distance](https://en.wikipedia.org/wiki/Taxicab_geometry). And we achieved that performance even at small sample sizes, with as few as 500 samples drawn at random from the probability distributions we stipulated.

Consider a causal graph that represents the factors contributing to the amount of inventory that a retailer has on hand. (This is a drastic simplification; the causal graphs for real-world inventory counts might have dozens of factors, rather than five.)

![image.png](https://dev-media.amazoncloud.cn/c94fa9868d9845069d3a860854e5d29e_image.png)

In this simplified model, the simulation system estimates the cost (X1) of replenishing inventory; the forecasting algorithm estimates demand (X2); a planning algorithm (X3) determines the size and timing of purchase orders; bidding (X4) occurs opportunistically, as when a large supply of some product becomes available at a discounted rate; and together, all those factors contribute to the inventory on hand (X5).

Each input-output relation in this network has an associated conditional probability distribution, or causal mechanism. The probabilities associated with the individual causal mechanisms determine the joint distribution of all the variables (X1-X5), or the probability that any given combination of variables will occur together. That in turn determines the probability distribution of the target variable, the amount of inventory on hand.

A large change to the final outcome may be accompanied by changes to all the causal mechanisms in the graph. Our technique identifies the causal mechanism whose change is most responsible for the change in outcome.

Our fundamental insight is that any given causal mechanism in the graph could, in principle, change without affecting the others. So given a causal graph, the initial causal mechanisms, and data that imply new causal mechanisms, we update the causal mechanisms one by one to determine the influence each has on the outcome.

![image.png](https://dev-media.amazoncloud.cn/76a475081643459e9595d4f75c899c19_image.png)

In this version of the graph, the mechanism for cost has been updated, followed by the mechanism for demand, which accounts for 25% of the total change in on-hand inventory.

The problem with this approach is that our measurement of each node’s contribution depends on the order in which we update the nodes. The measurement evaluates the consequences of changing the node’s causal mechanism given every possible value of the other variables in the graph. But the probabilities of those values change when we update causal mechanisms. So we’ll get different measurements, depending on which causal mechanisms have already been updated.

To address this problem, we run through every permutation of the update order and average the per-node results, an adaptation of a technique from game theory called computing the Shapley value.

In practice, of course, causal mechanisms are something we have to infer from data; we’re not given probability distributions in advance. But to test our approach, we created a simple causal graph in which we could stipulate the distributions. Then, using those distributions, we generated data samples.

Across 100 different random changes to the causal mechanisms of our graph, our method performed very well; with 500 data samples per change, it achieved an average deviation from ground truth of 0.29 as measured by [L1 distance](https://en.wikipedia.org/wiki/Taxicab_geometry). Our ground truth is at least a 3-D vector (6-D at most), with at least one component whose magnitude is at least one (five at most).

We tested different volumes of data samples, from 500 to 4,000, but adding more samples had little effect on the accuracy of the approximation.

Internally, we have also applied our technique to questions of supply chain management. For a particular family of products, we were able to identify the reasons for a steady decline in on-hand inventory during the pandemic, when that figure had held steady for the preceding year.

ABOUT THE AUTHOR

#### **[Kailash Budhathoki](https://www.amazon.science/author/kailash-budhathoki)**

Kailash Budhathoki is an applied scientist at Amazon.
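
The permutation-averaging step described in the article can be illustrated with a small sketch. This is not the authors' implementation. It assumes a hypothetical three-node binary chain (X1 → X2 → X3) with made-up probability tables, and it uses the change in P(X3=1) as a stand-in for the change in outcome. The paper's actual measure and its estimation from data may differ. The names `OLD`, `NEW`, `outcome`, and `shapley_contributions` are invented for this example.

```python
from itertools import permutations

# Hypothetical binary chain X1 -> X2 -> X3. Each mechanism is a conditional
# probability table; all numbers are made up for illustration.
OLD = {
    "X1": 0.5,               # P(X1=1)
    "X2": {0: 0.2, 1: 0.7},  # P(X2=1 | X1)
    "X3": {0: 0.1, 1: 0.8},  # P(X3=1 | X2)
}
NEW = {
    "X1": 0.6,
    "X2": {0: 0.2, 1: 0.9},
    "X3": {0: 0.3, 1: 0.8},
}
NODES = ("X1", "X2", "X3")


def outcome(updated):
    """P(X3=1) when the mechanisms in `updated` are new and the rest are old."""
    mech = {n: (NEW if n in updated else OLD)[n] for n in NODES}
    p_x1 = mech["X1"]
    p_x2 = (1 - p_x1) * mech["X2"][0] + p_x1 * mech["X2"][1]
    return (1 - p_x2) * mech["X3"][0] + p_x2 * mech["X3"][1]


def shapley_contributions():
    """Average each node's marginal effect over every update order."""
    orders = list(permutations(NODES))
    totals = {n: 0.0 for n in NODES}
    for order in orders:
        updated = set()
        for node in order:
            before = outcome(updated)
            updated.add(node)  # swap in this node's new mechanism
            totals[node] += outcome(updated) - before
    return {n: t / len(orders) for n, t in totals.items()}


contributions = shapley_contributions()
total_change = outcome(set(NODES)) - outcome(set())

for node, value in contributions.items():
    print(f"{node}: {value:+.4f}")
print(f"sum of contributions: {sum(contributions.values()):+.4f}")
print(f"total change:         {total_change:+.4f}")
```

In this toy setup, updating X2 first gives a different marginal effect than updating it after X1, which is the order dependence the article points out. Averaging over all orders removes that dependence, and the per-node contributions add up to the total change in the outcome. Enumerating every permutation is only practical for small graphs. That is a limitation of this sketch, not a statement about how the paper handles larger ones.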
