Speeding database queries by rewriting redundancies

SQL database queries often include repetitions of the same operation. For instance, finding an entry in a table that corresponds to a particular person might involve pulling up all the entries with the person’s first name and all the entries with the person’s last name and computing their intersection. If the first- and last-name searches require querying the same database table twice, it’s a redundancy that can increase retrieval time.

In a paper we presented last week at the IEEE International Conference on Data Engineering (ICDE), we describe a [method for rewriting complex SQL queries](https://www.amazon.science/publications/computation-reuse-via-fusion-in-amazon-athena) so as to eliminate such redundancies. Sometimes, that involves retrieving a superset of entries and then winnowing them down according to additional criteria. But in general, a little extra computation after retrieval is more efficient than multiple queries of the same table.

![Standard query plan (left) and rewritten query plan (right)](https://dev-media.amazoncloud.cn/948b1124f50d49bab399d550ba5d19ab_%E4%B8%8B%E8%BD%BD.jpg)

A query plan is the sequence of steps required to execute a SQL query. At left is a diagram depicting the standard query plan for a complex query in the TPC-DS dataset. At right is the much simpler query plan produced by the Amazon researchers' new rewriting rules.

In experiments on the TPC-DS benchmark database, with 3 TB of data, our techniques improved the overall execution time on 99 queries by 14%, compared to the baseline. When restricted to those queries that are directly transformed by our rewrite rules, we observed a 60% improvement in performance, with some queries executing more than six times as quickly.


#### **Query rewrite**


A query plan is a sequence of steps, such as data scans and aggregations, that are used to execute a query. Query plan optimization is the process of choosing the most efficient query plan from a large number of possible alternatives.

The focus of our work is to identify subqueries that compute on overlapping data and fuse them into a single computation, with compensating actions (post-retrieval computations) to reconstruct the original results. The technique does not require the subqueries to be syntactically the same or to produce the same output.

Consider the following query as an example:

```
WITH cte as (...complex_subquery...)
SELECT customer_id FROM cte WHERE fname = 'John'
UNION ALL
SELECT customer_id FROM cte WHERE lname = 'Smith'
```

The query uses the block **cte** twice in the FROM clause. This is suboptimal, especially if the duplicated computation is expensive. Our technique can identify such patterns and rewrite them. For instance, the above query becomes

```
WITH cte as (...complex_subquery...)
SELECT customer_id FROM cte, (VALUES (1), (2)) T(tag)
WHERE (fname = 'John' AND tag=1)
 OR (lname = 'Smith' AND tag=2)
```

Although the common expressions are not exactly the same (there are different filter conditions in the WHERE clause), we are able to rewrite the query into a fragment that generates a superset of the required rows and columns and handles the differences via compensating actions.

Note that, in general, not every query with repeated expressions can be rewritten to eliminate the duplicated work. However, beyond the query pattern shown here, there are several scenarios in which rewrites are applicable.


#### **Building blocks of the rewriting rules**


Here are the primitives used in our new query plan optimization rules. Specifically, we define a function Fuse that takes two input plans and returns either ⊥ (when fusion is not possible) or a 4-tuple fused result. If Fuse(P1, P2) = (P, M, L, R), then

- P is the resulting fused plan. The schema of P includes all output columns in P1 and, optionally, additional output columns from P2.
- M is a mapping from the output columns of P2 to output columns of P.
- L and R are two filter conditions defined over the output columns of P to restore P1 and P2, respectively.

Semantically, we can reconstruct P1 and P2 as follows:

```
P1 = Project_{outCols(P1)}(Filter_L(P))
P2 = Project_{M(outCols(P2))}(Filter_R(P))
```

where outCols(P) denotes the output columns of plan P.

Fuse is a recursive function that can handle different operators such as table scan, filter, projection, join, aggregation, and distinct aggregation.


#### **Optimization rules**


We have introduced several optimization rules that rewrite the query plan based on the primitives defined above. New rules can be added when we find a sufficiently prevalent pattern for which a semantically equivalent representation is available.


- **GroupByJoinToWindow**
 This rule transforms a common pattern in which an expression is aggregated and joined back to itself to obtain additional information on the aggregated rows. Intuitively, it is a calculation that extends an input relation with aggregates computed on a subset of columns. Window functions operate in this manner and can be used to rewrite the original pattern.
- **JoinOnKeys**
 This rule addresses a common pattern in which similar subqueries, which return different views of the same data, are self-joined together. Because of the existence of keys, each row from the left matches at most one row from the right. The rewrite therefore extends each matching row with columns from both sides.
- **UnionAllOnJoin**
 This rule handles scenarios in which customers combine results of two computations that are very similar overall but differ on a single table (e.g., they union together some analytical insight applied over different fact tables).
- **UnionAll**
 This rule corresponds to the example query in the preceding section. It is a common pattern that customers use to compute a common expression and then union not necessarily disjoint subsets of the result with different projections.

The work presented in the paper is already used in production. It is worth noting that although Athena benefits from it, the same techniques are applicable to other database systems, since they do not require implementing new operators or execution models.

We are glad to see that, as a result, our customers are running their queries faster and, because less data is scanned, lowering their bills.

ABOUT THE AUTHOR

#### **[Wei Zheng](https://www.amazon.science/author/wei-zheng)**

Wei Zheng is a senior software engineer with Amazon Web Services.
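As a supplementary illustration of the UnionAll example from the "Query rewrite" section, the following minimal sketch checks on a toy table that the original and rewritten queries return the same multiset of rows. It is not the Athena implementation. The `customers` table, its four sample rows, and the function name `run_union_all_demo` are assumptions made for the demonstration, and the tag relation is written as a `tags(tag)` CTE so that the query runs on SQLite, where the article uses `(VALUES (1), (2)) T(tag)`.

```python
import sqlite3


def run_union_all_demo():
    """Run the original and the rewritten query on a toy table; return both results sorted."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data standing in for the article's complex subquery.
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER, fname TEXT, lname TEXT);
        INSERT INTO customers VALUES
            (1, 'John', 'Smith'),
            (2, 'John', 'Doe'),
            (3, 'Jane', 'Smith'),
            (4, 'Jane', 'Doe');
        """
    )

    # Original pattern: the cte block is referenced twice.
    original = """
        WITH cte AS (SELECT customer_id, fname, lname FROM customers)
        SELECT customer_id FROM cte WHERE fname = 'John'
        UNION ALL
        SELECT customer_id FROM cte WHERE lname = 'Smith'
    """

    # Rewritten pattern: cte is referenced once and cross-joined with a
    # two-row tag relation; each tag selects one branch of the old UNION ALL.
    rewritten = """
        WITH cte AS (SELECT customer_id, fname, lname FROM customers),
             tags(tag) AS (VALUES (1), (2))
        SELECT customer_id FROM cte, tags
        WHERE (fname = 'John' AND tag = 1)
           OR (lname = 'Smith' AND tag = 2)
    """

    original_ids = sorted(row[0] for row in conn.execute(original))
    rewritten_ids = sorted(row[0] for row in conn.execute(rewritten))
    conn.close()
    return original_ids, rewritten_ids


if __name__ == "__main__":
    original_ids, rewritten_ids = run_union_all_demo()
    print(original_ids, rewritten_ids)
```

In this toy data, customer 1 satisfies both filters and so appears twice in both results. This shows why the tag column is needed: a plain `WHERE fname = 'John' OR lname = 'Smith'` would return that row only once and would not preserve the UNION ALL semantics.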