当内卷风波及代码领域，看Alluxio将会采取怎样的块分配策略

{"value":"![image.png](https://dev-media.amazoncloud.cn/22a94ab9a2a245b1801aba62752ede09_image.png)\n本期分享主题：**《Alluxio块分配策略详解》**\n\n全文主要围绕3个部分进行介绍：【策略详解概述】、【块分配策略介绍】、【代码层面解读】\n\n话不多说，直接上干货↓\n\n## 策略详解概述\nAlluxio 的 Worker 负责存储用户的数据资源，并且数据会以 Block 形式存储在 Worker 的存储目录(tiered storage)中，而存储目录可以有 MEM/SSD/HDD 等多个不同的 level，同一 level 又可以由多个目录组成，那么当用户通过 Alluxio 读写数据的时候，Alluxio 又是如何决定要把一个 Block 放在哪一个目录中呢？本文就从代码角度来分析一下 Block 存储目录的选取流程。\n\n## 块分配策略介绍\nAlluxio 使用块分配策略(Block Allocation Policies) 来定义如何在多个存储目录（同一层或不同层）中分配 Block。\n\n## 目前 Alluxio 的块分配策略主要有 3 种：\n01. alluxio.worker.block.allocator.GreedyAllocator\n\n从顶层到底层，将 Block 分配到第一个能够容纳 Block 的存储目录中。\n\n02. alluxio.worker.block.allocator.MaxFreeAllocator\n\n将Block分配到剩余可用空间最大的存储目录中。\n\n03. alluxio.worker.block.allocator.RoundRobinAllocator\n\n从顶层到底层，循环将 Block 放入每一个存储目录中。\n\n默认使用的策略是 MaxFreeAllocator ，这可以通过 property alluxio.worker.allocator.class 来更改。\n![image.png](https://dev-media.amazoncloud.cn/593a5de7de434830b81349478b8b2f00_image.png)\n\n# # 代码层面解读 #\n## allocateSpace\n代码层面负责给 Block 分配存储目录的函数是 allocateSpace。\n\n当读写数据时，如果需要在 Worker 中存储数据, 会按照 Block Allocation Policies 来为 Block 请求一个存储目录来保存数据，allocateSpace 首先会尝试在传入参数 options 中指定的位置 (默认是 level0) 中寻找，如果指定位置没有找到合适的空间，则会尝试在所有的存储目录中寻找。\n\n```\nprivate StorageDirView allocateSpace(long sessionId, AllocateOptions options)\n throws WorkerOutOfSpaceException, IOException {\n while (true) {\n if (options.isForceLocation()) {\n //...\n } else {\n //...\n dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),\n options.getLocation(), allocatorView, false);\n if (dirView != null) {\n return dirView;\n }\n dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),\n BlockStoreLocation.anyTier(), allocatorView, false);\n if (dirView != null) {\n return dirView;\n }\n }\n //...\n}\n```\n\n可以看到，上文讲述的 allocateSpace 所选取的存储目录是通过动态计算所得到的，但是在某些时候,也需要数据可以写入指定的存储目录或者 level，所以 allocateSpace 也支持通过 evict Block 来让某一个存储目录腾出足够的空间以容纳新的数据(这里后文会详细介绍)。\n\n```\nprivate StorageDirView allocateSpace(long sessionId, AllocateOptions options) {\n StorageDirView dirView;\n//...\n while (true) {\n if (options.isForceLocation()) {\n dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),\n options.getLocation(), allocatorView, true);\n if (dirView != null) {\n return dirView;\n }\n freeSpace(sessionId, options.getSize(), options.getSize(), options.getLocation());\n dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),\n options.getLocation(), allocatorView.refreshView(), true); \n //...\n }\n }\n //...\n } else {\n //...\n }\n return dirView;\n }\n }\n```\n\n从allocateSpace 中具体寻找合适存储目录的类就是 Allocator 接口的几个实现类，也就是上一章节中提出的几个块分配策略： GreedyAllocator/MaxFreeAllocator/RoundRobinAllocator\n\n## Allocator\n```\npublic interface Allocator {\n\n //...\n StorageDirView allocateBlockWithView(long sessionId, long blockSize, BlockStoreLocation location,\n BlockMetadataView view, boolean skipReview);\n}\n```\n\nAllocator 接口中的 allocateBlockWithView 主要负责寻找合适的存储目录，需要关注的参数主要有 3 个：\n\n✓ blockSize: 想要分配多大的空间\n\n✓ location: 想要在哪分配空间\n\n✓ skipReview: 是否跳过 Review\n\n以默认的 alluxio.worker.block.allocator.MaxFreeAllocator#MaxFreeAllocator 为例，可以看到如果想要在所有的位置寻找存储目录，那么它就逐个检查所有存储目录，并返回最大的那个：\n\n```\nprivate StorageDirView allocateBlock(long sessionId, long blockSize,\n BlockStoreLocation location, boolean skipReview) {\n\n if (location.equals(BlockStoreLocation.anyTier())) {\n for (StorageTierView tierView : mMetadataView.getTierViews()) {\n candidateDirView = getCandidateDirInTier(tierView, blockSize, BlockStoreLocation.ANY_MEDIUM);\n if (candidateDirView != null) { // Review\n if (skipReview || mReviewer.acceptAllocation(candidateDirView)) {\n break;\n }\n }\n }\n } \n //...\n return candidateDirView;\n }\n\nprivate StorageDirView getCandidateDirInTier(StorageTierView tierView,\n long blockSize, String mediumType) {\n StorageDirView candidateDirView = null;\n long maxFreeBytes = blockSize - 1;\n for (StorageDirView dirView : tierView.getDirViews()) {\n if ((mediumType.equals(BlockStoreLocation.ANY_MEDIUM)\n || dirView.getMediumType().equals(mediumType))\n && dirView.getAvailableBytes() > maxFreeBytes) {\n maxFreeBytes = dirView.getAvailableBytes();\n candidateDirView = dirView;\n }\n }\n return candidateDirView;\n }\n```\n\n从代码中可以看到，当找到 candidateDirView 后，还需要经过一个 Review 的流程, 才能最终决定返回哪个存储目录，那么 Review 的流程又有什么用呢？\n\n## Block Allocation Review Policies\nReview 是对块分配策略的一种补充，它给块分配策略带来了一些额外的限制(例如SoftLimit/HardLimit)以及一定随机性。目前 Alluxio 有两 Review 策略：\n\n01 alluxio.worker.block.reviewer.AcceptingReviewer\n\n直接通过所有 Review，相当于没有进行 Review\n\n02 alluxio.worker.block.reviewer.ProbabilisticBufferReviewer\n\n会根据当前的可用空间剩余大小来 Reviewer 之前的分配结果\n\n## 以默认的 ProbabilisticBufferReviewer 来看：\n\n```\npublic class ProbabilisticBufferReviewer implements Reviewer {\n //...\n double getProbability(StorageDirView dirView) {\n //...\n if (availableBytes > mSoftLimitBytes) {\n return 1.0;\n }\n if (availableBytes <= mHardLimitBytes) {\n return 0.0;\n }\n \n double x = capacityBytes - availableBytes;\n double k = 1.0 / (mHardLimitBytes - mSoftLimitBytes); // If HardLimit = SoftLimit, then we would have returned in the previous if-else\n double b = (capacityBytes - mHardLimitBytes + 0.0) / (mSoftLimitBytes - mHardLimitBytes);\n double y = k * x + b;\n return y;\n }\n}\n```\n\nProbabilisticBufferReviewer\n\n- 如果当前的存储目录剩余空间小于 mHardLimitBytes ，那么就会直接返回 0，表示未通过 review ;\n\n- 如果当前的存储目录剩余空间大于 mSoftLimitBytes ，那么就会直接返回 1，表示通过 review ;\n\n- 如果剩余空间大小介于 mHardLimitBytes 与 mSoftLimitBytes 之间，那么就会返回(0, 1) 之间的一个值。\n\n## freeSpace\n上文中提到 allocateSpace 支持通过 evict Block 来让某一个存储目录腾出足够的空间以容纳新的数据。那么当进行 evict 的时候，又是如何决定该 evict 哪一个 Block 呢？\n\n**evict Block 是通过 freeSpace 完成的，如果空间不足，在 freeSpace 中会逐个淘汰 Block，直到腾出了足够的空间。**\n\n```\npublic synchronized void freeSpace(long sessionId, long minContiguousBytes,\n long minAvailableBytes, BlockStoreLocation location) {\n Iterator<Long> evictionCandidates = mBlockIterator.getIterator(location, BlockOrder.NATURAL);\n while (true) {\n //...\n if (contiguousSpaceFound && availableBytesFound) {\n break;\n }\n\n if (!evictionCandidates.hasNext()) {\n break;\n }\n long blockToDelete = evictionCandidates.next();\n if (evictorView.isBlockEvictable(blockToDelete)) { // 有一些 block 是不会被 evict 的\n try {\n BlockMeta blockMeta = mMetaManager.getBlockMeta(blockToDelete);\n removeBlockFileAndMeta(blockMeta);\n //...\n }\n //...\n }\n```\n\n**当然, 有一些 block 不会被 evict ：**\n\n```\npublic boolean isBlockEvictable(long blockId) {\n boolean pinned = isBlockPinned(blockId);\n boolean locked = isBlockLocked(blockId);\n boolean marked = isBlockMarked(blockId);\n boolean isEvictable = !pinned && !locked && !marked;\n if (!isEvictable) {\n LOG.debug(\"Block not evictable: {}. Pinned: {}, Locked: {}, Marked: {}\", blockId, pinned,\n locked, marked);\n }\n\n return isEvictable;\n }\n```\n\n**evict 的规则目前有 2 种：**\n\n01 alluxio.worker.block.annotator.LRUAnnotator\n\nLRU 规则, 即首先会被淘汰的是最久未被访问的\n\n02 alluxio.worker.block.annotator.LRFUAnnotator\n\n能够以LRFU 与 LRU 的规则结合, 也可以通过参数让该规则接近 LRFU 或者 LRU\n\n默认使用的策略是 LRUAnnotator ，这可以通过 property alluxio.worker.block.annotator.class 来更改。\n\n**以默认的 LRUAnnotator 来看：**\n\n```\npublic class LRUAnnotator implements BlockAnnotator<LRUAnnotator.LRUSortedField> {\n private static final Logger LOG = LoggerFactory.getLogger(LRUAnnotator.class);\n\n private AtomicLong mLRUClock;\n\n @Override\n public BlockSortedField updateSortedField(long blockId, LRUSortedField oldValue) {\n long clockValue = mLRUClock.incrementAndGet();\n return new LRUSortedField(clockValue);\n }\n\n /**\n * Sorted-field for LRU.\n */\n protected class LRUSortedField implements BlockSortedField {\n private Long mClockValue;\n\n private LRUSortedField(long clockValue) {\n mClockValue = clockValue;\n }\n\n @Override\n public int compareTo(BlockSortedField o) {\n Preconditions.checkState(o instanceof LRUSortedField);\n return mClockValue.compareTo(((LRUSortedField) o).mClockValue);\n }\n //...\n}\n```\n\n可以看到 LRUAnnotator 内部通过一个单调递增的 AtomicLong 来标识每个 Block 的访问顺序， AtomicLong 越大，代表越先被访问。","render":"<img src=\"https://dev-media.amazoncloud.cn/22a94ab9a2a245b1801aba62752ede09_image.png\" alt=\"image.png\" /> \n本期分享主题：《Alluxio块分配策略详解》\n全文主要围绕3个部分进行介绍：【策略详解概述】、【块分配策略介绍】、【代码层面解读】\n话不多说，直接上干货↓\n<h2><a id=\"_7\"></a>策略详解概述</h2>\nAlluxio 的 Worker 负责存储用户的数据资源，并且数据会以 Block 形式存储在 Worker 的存储目录(tiered storage)中，而存储目录可以有 MEM/SSD/HDD 等多个不同的 level，同一 level 又可以由多个目录组成，那么当用户通过 Alluxio 读写数据的时候，Alluxio 又是如何决定要把一个 Block 放在哪一个目录中呢？本文就从代码角度来分析一下 Block 存储目录的选取流程。\n<h2><a id=\"_10\"></a>块分配策略介绍</h2>\nAlluxio 使用块分配策略(Block Allocation Policies) 来定义如何在多个存储目录（同一层或不同层）中分配 Block。\n<h2><a id=\"_Alluxio__3__13\"></a>目前 Alluxio 的块分配策略主要有 3 种：</h2>\n<ol>\n<li>alluxio.worker.block.allocator.GreedyAllocator</li>\n</ol>\n从顶层到底层，将 Block 分配到第一个能够容纳 Block 的存储目录中。\n<ol start=\"2\">\n<li>alluxio.worker.block.allocator.MaxFreeAllocator</li>\n</ol>\n将Block分配到剩余可用空间最大的存储目录中。\n<ol start=\"3\">\n<li>alluxio.worker.block.allocator.RoundRobinAllocator</li>\n</ol>\n从顶层到底层，循环将 Block 放入每一个存储目录中。\n默认使用的策略是 MaxFreeAllocator ，这可以通过 property alluxio.worker.allocator.class 来更改。 \n<img src=\"https://dev-media.amazoncloud.cn/593a5de7de434830b81349478b8b2f00_image.png\" alt=\"image.png\" />\n<h1><a id=\"__29\"></a># 代码层面解读</h1>\n<h2><a id=\"allocateSpace_30\"></a>allocateSpace</h2>\n代码层面负责给 Block 分配存储目录的函数是 allocateSpace。\n当读写数据时，如果需要在 Worker 中存储数据, 会按照 Block Allocation Policies 来为 Block 请求一个存储目录来保存数据，allocateSpace 首先会尝试在传入参数 options 中指定的位置 (默认是 level0) 中寻找，如果指定位置没有找到合适的空间，则会尝试在所有的存储目录中寻找。\n<pre><code class=\"lang-\">private StorageDirView allocateSpace(long sessionId, AllocateOptions options)\n throws WorkerOutOfSpaceException, IOException {\n while (true) {\n if (options.isForceLocation()) {\n //...\n } else {\n //...\n dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),\n options.getLocation(), allocatorView, false);\n if (dirView != null) {\n return dirView;\n }\n dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),\n BlockStoreLocation.anyTier(), allocatorView, false);\n if (dirView != null) {\n return dirView;\n }\n }\n //...\n}\n</code></pre>\n可以看到，上文讲述的 allocateSpace 所选取的存储目录是通过动态计算所得到的，但是在某些时候,也需要数据可以写入指定的存储目录或者 level，所以 allocateSpace 也支持通过 evict Block 来让某一个存储目录腾出足够的空间以容纳新的数据(这里后文会详细介绍)。\n<pre><code class=\"lang-\">private StorageDirView allocateSpace(long sessionId, AllocateOptions options) {\n StorageDirView dirView;\n//...\n while (true) {\n if (options.isForceLocation()) {\n dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),\n options.getLocation(), allocatorView, true);\n if (dirView != null) {\n return dirView;\n }\n freeSpace(sessionId, options.getSize(), options.getSize(), options.getLocation());\n dirView = mAllocator.allocateBlockWithView(sessionId, options.getSize(),\n options.getLocation(), allocatorView.refreshView(), true); \n //...\n }\n }\n //...\n } else {\n //...\n }\n return dirView;\n }\n }\n</code></pre>\n从allocateSpace 中具体寻找合适存储目录的类就是 Allocator 接口的几个实现类，也就是上一章节中提出的几个块分配策略： GreedyAllocator/MaxFreeAllocator/RoundRobinAllocator\n<h2><a id=\"Allocator_88\"></a>Allocator</h2>\n<pre><code class=\"lang-\">public interface Allocator {\n\n //...\n StorageDirView allocateBlockWithView(long sessionId, long blockSize, BlockStoreLocation location,\n BlockMetadataView view, boolean skipReview);\n}\n</code></pre>\nAllocator 接口中的 allocateBlockWithView 主要负责寻找合适的存储目录，需要关注的参数主要有 3 个：\n✓ blockSize: 想要分配多大的空间\n✓ location: 想要在哪分配空间\n✓ skipReview: 是否跳过 Review\n以默认的 alluxio.worker.block.allocator.MaxFreeAllocator#MaxFreeAllocator 为例，可以看到如果想要在所有的位置寻找存储目录，那么它就逐个检查所有存储目录，并返回最大的那个：\n<pre><code class=\"lang-\">private StorageDirView allocateBlock(long sessionId, long blockSize,\n BlockStoreLocation location, boolean skipReview) {\n\n if (location.equals(BlockStoreLocation.anyTier())) {\n for (StorageTierView tierView : mMetadataView.getTierViews()) {\n candidateDirView = getCandidateDirInTier(tierView, blockSize, BlockStoreLocation.ANY_MEDIUM);\n if (candidateDirView != null) { // Review\n if (skipReview || mReviewer.acceptAllocation(candidateDirView)) {\n break;\n }\n }\n }\n } \n //...\n return candidateDirView;\n }\n\nprivate StorageDirView getCandidateDirInTier(StorageTierView tierView,\n long blockSize, String mediumType) {\n StorageDirView candidateDirView = null;\n long maxFreeBytes = blockSize - 1;\n for (StorageDirView dirView : tierView.getDirViews()) {\n if ((mediumType.equals(BlockStoreLocation.ANY_MEDIUM)\n || dirView.getMediumType().equals(mediumType))\n && dirView.getAvailableBytes() > maxFreeBytes) {\n maxFreeBytes = dirView.getAvailableBytes();\n candidateDirView = dirView;\n }\n }\n return candidateDirView;\n }\n</code></pre>\n从代码中可以看到，当找到 candidateDirView 后，还需要经过一个 Review 的流程, 才能最终决定返回哪个存储目录，那么 Review 的流程又有什么用呢？\n<h2><a id=\"Block_Allocation_Review_Policies_144\"></a>Block Allocation Review Policies</h2>\nReview 是对块分配策略的一种补充，它给块分配策略带来了一些额外的限制(例如SoftLimit/HardLimit)以及一定随机性。目前 Alluxio 有两 Review 策略：\n01 alluxio.worker.block.reviewer.AcceptingReviewer\n直接通过所有 Review，相当于没有进行 Review\n02 alluxio.worker.block.reviewer.ProbabilisticBufferReviewer\n会根据当前的可用空间剩余大小来 Reviewer 之前的分配结果\n<h2><a id=\"_ProbabilisticBufferReviewer__155\"></a>以默认的 ProbabilisticBufferReviewer 来看：</h2>\n<pre><code class=\"lang-\">public class ProbabilisticBufferReviewer implements Reviewer {\n //...\n double getProbability(StorageDirView dirView) {\n //...\n if (availableBytes > mSoftLimitBytes) {\n return 1.0;\n }\n if (availableBytes <= mHardLimitBytes) {\n return 0.0;\n }\n \n double x = capacityBytes - availableBytes;\n double k = 1.0 / (mHardLimitBytes - mSoftLimitBytes); // If HardLimit = SoftLimit, then we would have returned in the previous if-else\n double b = (capacityBytes - mHardLimitBytes + 0.0) / (mSoftLimitBytes - mHardLimitBytes);\n double y = k * x + b;\n return y;\n }\n}\n</code></pre>\nProbabilisticBufferReviewer\n<ul>\n<li>\n如果当前的存储目录剩余空间小于 mHardLimitBytes ，那么就会直接返回 0，表示未通过 review ;\n</li>\n<li>\n如果当前的存储目录剩余空间大于 mSoftLimitBytes ，那么就会直接返回 1，表示通过 review ;\n</li>\n<li>\n如果剩余空间大小介于 mHardLimitBytes 与 mSoftLimitBytes 之间，那么就会返回(0, 1) 之间的一个值。\n</li>\n</ul>\n<h2><a id=\"freeSpace_186\"></a>freeSpace</h2>\n上文中提到 allocateSpace 支持通过 evict Block 来让某一个存储目录腾出足够的空间以容纳新的数据。那么当进行 evict 的时候，又是如何决定该 evict 哪一个 Block 呢？\nevict Block 是通过 freeSpace 完成的，如果空间不足，在 freeSpace 中会逐个淘汰 Block，直到腾出了足够的空间。\n<pre><code class=\"lang-\">public synchronized void freeSpace(long sessionId, long minContiguousBytes,\n long minAvailableBytes, BlockStoreLocation location) {\n Iterator<Long> evictionCandidates = mBlockIterator.getIterator(location, BlockOrder.NATURAL);\n while (true) {\n //...\n if (contiguousSpaceFound && availableBytesFound) {\n break;\n }\n\n if (!evictionCandidates.hasNext()) {\n break;\n }\n long blockToDelete = evictionCandidates.next();\n if (evictorView.isBlockEvictable(blockToDelete)) { // 有一些 block 是不会被 evict 的\n try {\n BlockMeta blockMeta = mMetaManager.getBlockMeta(blockToDelete);\n removeBlockFileAndMeta(blockMeta);\n //...\n }\n //...\n }\n</code></pre>\n当然, 有一些 block 不会被 evict ：\n<pre><code class=\"lang-\">public boolean isBlockEvictable(long blockId) {\n boolean pinned = isBlockPinned(blockId);\n boolean locked = isBlockLocked(blockId);\n boolean marked = isBlockMarked(blockId);\n boolean isEvictable = !pinned && !locked && !marked;\n if (!isEvictable) {\n LOG.debug("Block not evictable: {}. Pinned: {}, Locked: {}, Marked: {}", blockId, pinned,\n locked, marked);\n }\n\n return isEvictable;\n }\n</code></pre>\nevict 的规则目前有 2 种：\n01 alluxio.worker.block.annotator.LRUAnnotator\nLRU 规则, 即首先会被淘汰的是最久未被访问的\n02 alluxio.worker.block.annotator.LRFUAnnotator\n能够以LRFU 与 LRU 的规则结合, 也可以通过参数让该规则接近 LRFU 或者 LRU\n默认使用的策略是 LRUAnnotator ，这可以通过 property alluxio.worker.block.annotator.class 来更改。\n以默认的 LRUAnnotator 来看：\n<pre><code class=\"lang-\">public class LRUAnnotator implements BlockAnnotator<LRUAnnotator.LRUSortedField> {\n private static final Logger LOG = LoggerFactory.getLogger(LRUAnnotator.class);\n\n private AtomicLong mLRUClock;\n\n @Override\n public BlockSortedField updateSortedField(long blockId, LRUSortedField oldValue) {\n long clockValue = mLRUClock.incrementAndGet();\n return new LRUSortedField(clockValue);\n }\n\n /**\n * Sorted-field for LRU.\n */\n protected class LRUSortedField implements BlockSortedField {\n private Long mClockValue;\n\n private LRUSortedField(long clockValue) {\n mClockValue = clockValue;\n }\n\n @Override\n public int compareTo(BlockSortedField o) {\n Preconditions.checkState(o instanceof LRUSortedField);\n return mClockValue.compareTo(((LRUSortedField) o).mClockValue);\n }\n //...\n}\n</code></pre>\n可以看到 LRUAnnotator 内部通过一个单调递增的 AtomicLong 来标识每个 Block 的访问顺序， AtomicLong 越大，代表越先被访问。\n"}

亚马逊云科技解决方案基于行业客户应用场景及技术领域的解决方案

联系亚马逊云科技专家