{"value":"Time series forecasting is often hierarchical: a utility company might, for instance, want to forecast energy consumption or supply at regional, state, and national levels; a retailer may want to forecast sales according to increasingly general product features, such as color, model, brand, product category, and so on. \n\nPreviously, the state-of-the-art approach in hierarchical time series forecasting was to learn a separate, local model for each time series in the hierarchy and then apply some postprocessing to reconcile the different levels — to ensure that the sales figure for a certain brand of camera is the sum of the sales figures for the different camera models within that brand, and so on.\n\nThis approach has two main drawbacks. It doesn’t allow different levels of the hierarchy to benefit from each other’s forecasts: at lower levels, historical data often has characteristics such as sparsity or burstiness that can be “aggregated away” at higher levels. And the reconciliation procedure, which is geared to the average case, can flatten out nonlinearities that are highly predictive in particular cases.\n\nIn [a paper we’re presenting](https://www.amazon.science/publications/end-to-end-learning-of-coherent-probabilistic-forecasts-for-hierarchical-time-series) at the International Conference on Machine Learning ([ICML](https://www.amazon.science/conferences-and-events/icml-2021)), we describe a new approach to hierarchical time series forecasting that uses a single machine learning model, trained end to end, to simultaneously predict outputs at every level of the hierarchy and to reconcile them. Contrary to all but one approach from the literature, our approach also allows for probabilistic forecasts — as opposed to single-value (point) forecasts — which are crucial for intelligent downstream decision making. 
\n\n![image.png](https://dev-media.amazoncloud.cn/db53782725974325b0328db489f02a7c_image.png)\n\nThe researchers' method enforces coherence, or agreement among different levels of a hierarchical time series, through projection. The plane (S) is the subspace of coherent samples; yt+h is a sample from the standard distribution (which is always coherent); ŷt+h is the transformation of the sample into a sample from a learned distribution; and ỹt+h is the projection of ŷt+h back into the coherent subspace.\n\nIn tests, we compared our approach to nine previous models on five different datasets. On four of the datasets, our model outperformed all nine baselines, with reductions in error rate ranging from 6% to 19% relative to the second-best performer (which varied from case to case).\n\nOne baseline model had an 8% lower error rate than ours on one dataset, but that same baseline’s methodology means that it didn’t work on another dataset at all. And on the other three datasets, our model had an advantage that ranged from 13% to 44%.\n\n#### **Ensuring trainability**\n\nOur model has two main components. The first is a neural network that takes a hierarchical time series as input and outputs a probabilistic forecast for each level of the hierarchy. Probabilistic forecasts enable more intelligent decision making because they allow us to minimize a notion of [expected](https://en.wikipedia.org/wiki/Expected_value) future costs.\n\nThe second component of our model selects a sample from that distribution and ensures its coherence; that is, it ensures that the values at each level of the hierarchy are sums of the values of the levels below.\n\nOne technical challenge in designing an end-to-end hierarchical forecasting model is ensuring trainability via standard methods. 
Stochastic gradient descent, the learning algorithm for most neural networks, requires differentiable functions; however, in our model, the reconciliation step requires sampling from a probability distribution. Sampling is not ordinarily a differentiable operation, but we make it one by using the reparameterization trick.\n\nThe distribution output by our model’s first component is characterized by a set of parameters; in a normal (Gaussian) distribution, those parameters are the mean and variance. Instead of sampling directly from that distribution, we sample from the standard distribution: in the Gaussian case, that’s the distribution with a mean of 0 and a variance of 1.\n\nWe can convert a sample from the standard distribution into a sample from the learned distribution with a function whose coefficients are the parameters of the learned distribution. Here’s the equation for the Gaussian case, where m and S are the mean and variance, respectively, of the learned distribution, and z is the sample from the standard distribution:\n\n![image.png](https://dev-media.amazoncloud.cn/cf2c27b68d94414ca17f012877e9a111_image.png)\n\nWith this trick, we move the randomness (the sampling procedure) outside the neural network; given z, the above function is deterministic and differentiable. This allows us to incorporate the sampling step into our end-to-end network. While we’ve used the Gaussian distribution as an example, the reparameterization trick works for a wider class of distributions.\n\nWe incorporate the reconciliation step into our network by recasting it as an optimization problem, which we solve as a subroutine of our model’s overall parameter learning. In our model, we represent the hierarchical relationship between time series as a matrix.\n\n![image.png](https://dev-media.amazoncloud.cn/460cdfdcacf54c659d057e4db9af4dc9_image.png)\n\nAt left is a hierarchy and at right the matrix that defines it. 
The columns of the matrix correspond to the entries at the lowest level of the hierarchy (the b’s of the leaf nodes), and the first three rows indicate levels of the hierarchy (summing the b’s, as encoded by the first row of the matrix, yields y).\n\nIn the space of possible samples from the learned distribution, the hierarchy matrix defines a subspace of samples that meet the hierarchical constraint. After transforming our standard-distribution sample into a sample from our learned distribution, we project it back down to the subspace defined by the hierarchy matrix (see animation, above).\n\nEnforcing the coherence constraint thus becomes a matter of minimizing the distance between the transformed sample and its projection, an optimization problem that we can readily solve as part of the overall parameter learning.\n\n![image.png](https://dev-media.amazoncloud.cn/a5c7b371d07a4ea9ba0341b2b799b93b_image.png)\n\nThe complete architecture of our end-to-end model for predicting hierarchical time series.\n\nIn principle, enforcing coherence could lower the accuracy of the model’s predictions. But in practice, the coherence constraint appears to improve the model’s accuracy: it enforces the sharing of information across the hierarchy, and forecasting at higher levels of the hierarchy is often easier. Because of this sharing, we see consistent improvement in accuracy at the lowest level of the hierarchy.\n\nIn our experiments, we used a DeepVAR network for time series prediction and solved the reconciliation problem in closed form. 
But our approach is more general and can be used with many state-of-the-art neural forecasting networks, prediction distributions, projection methods, or loss functions, making it adaptable to a wide range of use cases.\n\nABOUT THE AUTHOR\n\n#### **[Syama Sundar Rangapuram](https://www.amazon.science/author/syama-sundar-rangapuram)**\n\nSyama Sundar Rangapuram is a senior applied scientist with Amazon Web Services.\n\n#### **[Konstantinos Benidis](https://www.amazon.science/author/konstantinos-benidis)**\n\nKonstantinos Benidis is an applied scientist with Amazon's Last Mile team.\n\n#### **[Pedro Mercado](https://www.amazon.science/author/pedro-mercado)**\n\nPedro Mercado is an applied scientist with Amazon Web Services.\n\n#### **[Jan Gasthaus](https://www.amazon.science/author/jan-gasthaus)**\n\nJan Gasthaus is a senior machine learning scientist with Amazon Web Services.\n\n#### **Tim Januschowski**\n","render":"<p>Time series forecasting is often hierarchical: a utility company might, for instance, want to forecast energy consumption or supply at regional, state, and national levels; a retailer may want to forecast sales according to increasingly general product features, such as color, model, brand, product category, and so on.</p>\n<p>Previously, the state-of-the-art approach in hierarchical time series forecasting was to learn a separate, local model for each time series in the hierarchy and then apply some postprocessing to reconcile the different levels — to ensure that the sales figure for a certain brand of camera is the sum of the sales figures for the different camera models within that brand, and so on.</p>\n<p>This approach has two main drawbacks. It doesn’t allow different levels of the hierarchy to benefit from each other’s forecasts: at lower levels, historical data often has characteristics such as sparsity or burstiness that can be “aggregated away” at higher levels. 
And the reconciliation procedure, which is geared to the average case, can flatten out nonlinearities that are highly predictive in particular cases.</p>\n<p>In <a href=\"https://www.amazon.science/publications/end-to-end-learning-of-coherent-probabilistic-forecasts-for-hierarchical-time-series\" target=\"_blank\">a paper we’re presenting</a> at the International Conference on Machine Learning (<a href=\"https://www.amazon.science/conferences-and-events/icml-2021\" target=\"_blank\">ICML</a>), we describe a new approach to hierarchical time series forecasting that uses a single machine learning model, trained end to end, to simultaneously predict outputs at every level of the hierarchy and to reconcile them. Contrary to all but one approach from the literature, our approach also allows for probabilistic forecasts — as opposed to single-value (point) forecasts — which are crucial for intelligent downstream decision making.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/db53782725974325b0328db489f02a7c_image.png\" alt=\"image.png\" /></p>\n<p>The researchers’ method enforces coherence, or agreement among different levels of a hierarchical time series, through projection. The plane (S) is the subspace of coherent samples; yt+h is a sample from the standard distribution (which is always coherent); ŷt+h is the transformation of the sample into a sample from a learned distribution; and ỹt+h is the projection of ŷt+h back into the coherent subspace.</p>\n<p>In tests, we compared our approach to nine previous models on five different datasets. On four of the datasets, our model outperformed all nine baselines, with reductions in error rate ranging from 6% to 19% relative to the second-best performer (which varied from case to case).</p>\n<p>One baseline model had an 8% lower error rate than ours on one dataset, but that same baseline’s methodology means that it didn’t work on another dataset at all. 
And on the other three datasets, our model had an advantage that ranged from 13% to 44%.</p>\n<h4><a id=\"Ensuring_trainability_16\"></a><strong>Ensuring trainability</strong></h4>\n<p>Our model has two main components. The first is a neural network that takes a hierarchical time series as input and outputs a probabilistic forecast for each level of the hierarchy. Probabilistic forecasts enable more intelligent decision making because they allow us to minimize a notion of <a href=\"https://en.wikipedia.org/wiki/Expected_value\" target=\"_blank\">expected</a> future costs.</p>\n<p>The second component of our model selects a sample from that distribution and ensures its coherence; that is, it ensures that the values at each level of the hierarchy are sums of the values of the levels below.</p>\n<p>One technical challenge in designing an end-to-end hierarchical forecasting model is ensuring trainability via standard methods. Stochastic gradient descent, the learning algorithm for most neural networks, requires differentiable functions; however, in our model, the reconciliation step requires sampling from a probability distribution. Sampling is not ordinarily a differentiable operation, but we make it one by using the reparameterization trick.</p>\n<p>The distribution output by our model’s first component is characterized by a set of parameters; in a normal (Gaussian) distribution, those parameters are the mean and variance. Instead of sampling directly from that distribution, we sample from the standard distribution: in the Gaussian case, that’s the distribution with a mean of 0 and a variance of 1.</p>\n<p>We can convert a sample from the standard distribution into a sample from the learned distribution with a function whose coefficients are the parameters of the learned distribution. 
Here’s the equation for the Gaussian case, where m and S are the mean and variance, respectively, of the learned distribution, and z is the sample from the standard distribution:</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/cf2c27b68d94414ca17f012877e9a111_image.png\" alt=\"image.png\" /></p>\n<p>With this trick, we move the randomness (the sampling procedure) outside the neural network; given z, the above function is deterministic and differentiable. This allows us to incorporate the sampling step into our end-to-end network. While we’ve used the Gaussian distribution as an example, the reparameterization trick works for a wider class of distributions.</p>\n<p>We incorporate the reconciliation step into our network by recasting it as an optimization problem, which we solve as a subroutine of our model’s overall parameter learning. In our model, we represent the hierarchical relationship between time series as a matrix.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/460cdfdcacf54c659d057e4db9af4dc9_image.png\" alt=\"image.png\" /></p>\n<p>At left is a hierarchy and at right the matrix that defines it. The columns of the matrix correspond to the entries at the lowest level of the hierarchy (the b’s of the leaf nodes), and the first three rows indicate levels of the hierarchy (summing the b’s, as encoded by the first row of the matrix, yields y).</p>\n<p>In the space of possible samples from the learned distribution, the hierarchy matrix defines a subspace of samples that meet the hierarchical constraint. 
After transforming our standard-distribution sample into a sample from our learned distribution, we project it back down to the subspace defined by the hierarchy matrix (see animation, above).</p>\n<p>Enforcing the coherence constraint thus becomes a matter of minimizing the distance between the transformed sample and its projection, an optimization problem that we can readily solve as part of the overall parameter learning.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/a5c7b371d07a4ea9ba0341b2b799b93b_image.png\" alt=\"image.png\" /></p>\n<p>The complete architecture of our end-to-end model for predicting hierarchical time series.</p>\n<p>In principle, enforcing coherence could lower the accuracy of the model’s predictions. But in practice, the coherence constraint appears to improve the model’s accuracy: it enforces the sharing of information across the hierarchy, and forecasting at higher levels of the hierarchy is often easier. Because of this sharing, we see consistent improvement in accuracy at the lowest level of the hierarchy.</p>\n<p>In our experiments, we used a DeepVAR network for time series prediction and solved the reconciliation problem in closed form. 
But our approach is more general and can be used with many state-of-the-art neural forecasting networks, prediction distributions, projection methods, or loss functions, making it adaptable to a wide range of use cases.</p>\n<p>ABOUT THE AUTHOR</p>\n<h4><a id=\"Syama_Sundar_Rangapuramhttpswwwamazonscienceauthorsyamasundarrangapuram_52\"></a><strong><a href=\"https://www.amazon.science/author/syama-sundar-rangapuram\" target=\"_blank\">Syama Sundar Rangapuram</a></strong></h4>\n<p>Syama Sundar Rangapuram is a senior applied scientist with Amazon Web Services.</p>\n<h4><a id=\"Konstantinos_Benidishttpswwwamazonscienceauthorkonstantinosbenidis_56\"></a><strong><a href=\"https://www.amazon.science/author/konstantinos-benidis\" target=\"_blank\">Konstantinos Benidis</a></strong></h4>\n<p>Konstantinos Benidis is an applied scientist with Amazon’s Last Mile team.</p>\n<h4><a id=\"Pedro_Mercadohttpswwwamazonscienceauthorpedromercado_60\"></a><strong><a href=\"https://www.amazon.science/author/pedro-mercado\" target=\"_blank\">Pedro Mercado</a></strong></h4>\n<p>Pedro Mercado is an applied scientist with Amazon Web Services.</p>\n<h4><a id=\"Jan_Gasthaushttpswwwamazonscienceauthorjangasthaus_64\"></a><strong><a href=\"https://www.amazon.science/author/jan-gasthaus\" target=\"_blank\">Jan Gasthaus</a></strong></h4>\n<p>Jan Gasthaus is a senior machine learning scientist with Amazon Web Services.</p>\n<h4><a id=\"Tim_Januschowski_68\"></a><strong>Tim Januschowski</strong></h4>\n"}