# Using machine learning for virtual-machine placement in the cloud

{"value":"In the cloud, load balancing, or distributing tasks evenly across servers, is essential to providing reliable service. It prevents individual servers from getting overloaded, which degrades their performance.\n\nThe simplest way to prevent server overloads is to cap the number of tasks assigned to each server. But this may result in inefficient resource use, as tasks can vary greatly in their computational demands. The ideal approach to load-balancing would allocate tasks to the minimum number of servers required to prevent overloads.\n\nLast week, at the Conference on Machine Learning and Systems ([MLSys](https://www.amazon.science/conferences-and-events/mlsys-2021)), we presented a [new algorithm](https://www.amazon.science/publications/fireplace-placing-firecracker-virtual-machines-with-hindsight-imitation) for optimizing task distribution, called FirePlace. FirePlace is built around a decision-tree machine learning model, which we train using simulations based on historical data.\n\n![image.png](https://dev-media.amazoncloud.cn/c73931d06d5b49429774e0742dd01376_image.png)\n\nDeciding how to allocate virtual machines (VMs) to cloud servers is a difficult challenge, as the VMs' resource consumption (represented here by the size of the VMs) varies over time. FirePlace combines simulation and machine learning to address that challenge.\nCREDIT: GLYNIS CONDON\n\nIn experiments, we found that FirePlace outperformed both more-complex models, such as long-short-term-memory models and reinforcement learning models, and simpler baselines that have proved effective in practice, such as the power-of-two algorithm.\n\n#### **Firecracker placement**\n\nThe name FirePlace comes from the Firecracker virtual machine (VM), which is used by Amazon Web Services’ (AWS) [Lambda](https://aws.amazon.com/lambda/) service. Lambda provides function execution as a service, sparing customers from provisioning infrastructure themselves and lowering their costs, since they are billed for function execution duration.\n\nIn cloud computing, virtual machines enable secure execution of customer code by moderating that code’s access to server operating systems. Traditionally, a cloud computing service might allot one VM to each application running on its servers. Firecracker, however, allots a separate VM to each function.\n\nFirecracker VMs are secure and lightweight and can be packed densely into servers. Their small size gives them efficiency advantages, but it also makes them less predictable: the resource consumption of a large program is easier to estimate than the resource consumption of a single program function. Optimizing the placement of Firecracker VMs required a new approach to load balancing; hence FirePlace.\n\nFirePlace uses a decision tree model that takes as input the resource consumption status of multiple servers in the fleet; to ensure that the model can deliver a decision within milliseconds, those servers are randomly sampled. The model’s output is the assignment of a new VM to one of the input servers.\n\n#### **Training by simulation**\n\nTo train the model, we use historical data about real Firecracker VMs’ resource consumption, represented as time series. During training, when the model is presented with a new VM to place, each of the currently allocated VMs is at a particular step in its time series. We run a simulation to compute those VMs’ future resource consumption, and on that basis, we can optimize the placement of the new VM. 
#### **Training by simulation**

To train the model, we use historical data about real Firecracker VMs' resource consumption, represented as time series. During training, when the model is presented with a new VM to place, each of the currently allocated VMs is at a particular step in its time series. We run a simulation to compute those VMs' future resource consumption, and on that basis, we can optimize the placement of the new VM. The optimized placement then becomes the training label for the current input.
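The sketch below shows one way such a hindsight label could be computed, assuming the simulation simply adds the new VM's known future trace to each candidate server's projected load and counts overloaded time steps; the simulator, the capacity model, and the objective here are simplifications, not the paper's exact procedure.

```python
import numpy as np

def hindsight_label(server_loads, vm_trace, candidate_servers, capacity=1.0):
    """Pick the training label for one placement decision by simulating the future.

    server_loads: dict mapping server id -> projected load over the next T steps
                  (from the VMs already placed there).
    vm_trace:     the new VM's resource-consumption series over the same T steps,
                  known here because we are replaying historical data.
    Returns the candidate whose placement causes the fewest overloaded steps;
    that choice becomes the supervised label for the current model input.
    """
    def overload_steps(server):
        future = server_loads[server] + vm_trace   # simulate adding the VM
        return int(np.sum(future > capacity))      # count overloaded time steps
    return min(candidate_servers, key=overload_steps)

# Toy usage: four servers, 24 future time steps, a VM that uses 0.3 of capacity throughout.
loads = {s: np.random.default_rng(s).random(24) * 0.8 for s in range(4)}
label = hindsight_label(loads, np.full(24, 0.3), candidate_servers=[0, 1, 2, 3])
```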
In our experiments, our baseline was the surprisingly effective power-of-two algorithm, which is widely used in cloud computing. It randomly picks two servers as potential recipients for a new VM, then selects the less loaded of the two.
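For reference, here is a minimal sketch of the power-of-two rule described above; the load metric is left abstract (any per-server utilization number works).

```python
import random

def power_of_two_choice(server_loads, rng=random):
    """Power-of-two load balancing: sample two distinct servers uniformly at
    random and place the new VM on the less loaded one."""
    a, b = rng.sample(range(len(server_loads)), 2)
    return a if server_loads[a] <= server_loads[b] else b

# Example: loads are fractions of capacity currently in use on each server.
loads = [0.7, 0.2, 0.9, 0.4]
target = power_of_two_choice(loads)
```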
We also compared our approach to neural networks, a long short-term memory (LSTM) network and a temporal convolutional network (TCN), each trained to predict the future resource consumption of a given VM from its resource consumption up to that time.

Finally, we compared our system to one that used reinforcement learning to learn optimal placement of a VM from its previous placement decisions. The learned model performed well on smaller datasets, but as the number of VMs to place grows, the complexity of the problem increases, and the reinforcement learning models failed to converge to a competitive solution.

We evaluated these approaches according to how many servers they needed to serve a given load, subject to a fixed limit on server overloads; the fewer servers, the better. FirePlace improved upon the power-of-two baseline by 10%. The LSTM and TCN approaches were too inaccurate to be competitive.

Lambda has begun to introduce the FirePlace approach in production, where, over time, it can provide real-world validation of our experimental results.

ABOUT THE AUTHOR

#### **[Christopher Kakovitch](https://www.amazon.science/author/christopher-kakovitch)**

Christopher Kakovitch is a senior research scientist with Amazon Web Services.