Optimize your Amazon Redshift query performance with automated materialized views

海外精选
海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时,内容中提到的“AWS” 是 “Amazon Web Services” 的缩写,在此网站不作为商标展示。
0
0
{"value":"[Amazon Redshift](https://aws.amazon.com/redshift/) is a fast, fully managed cloud data warehouse database that makes it cost-effective to analyze your data using standard SQL and business intelligence tools. Amazon Redshift allows you to analyze structured and semi-structured data and seamlessly query data lakes and operational databases, using Amazon Web Servicesdesigned hardware and automated machine learning (ML)-based tuning to deliver top-tier price-performance at scale.\n\nAlthough Amazon Redshift provides excellent price performance out of the box, it offers additional optimizations that can improve this performance and allow you to achieve even faster query response times from your data warehouse.\n\nFor example, you can physically tune tables in a data model to minimize the amount of data scanned and distributed within a cluster, which speeds up operations such as table joins and range-bound scans. Amazon Redshift now automates this tuning with the [automatic table optimization (ATO)](https://aws.amazon.com/blogs/big-data/automate-your-amazon-redshift-performance-tuning-with-automatic-table-optimization/) feature.\n\nAnother optimization for reducing query runtime is to precompute query results in the form of a materialized view. Materialized views store precomputed query results that future similar queries can use. This improves query performance because many computation steps can be skipped and the precomputed results returned directly. Unlike a simple cache, many materialized views can be incrementally refreshed when DML changes are applied on the underlying (base) tables and can be used by other similar queries, not just the query used to create the materialized view.\n\nAmazon Redshift introduced materialized views in March 2020. In June 2020, support for external tables was added. With these releases, you could use materialized views on both local and external tables to deliver low-latency performance by using precomputed views in your queries. However, this approach required you to be aware of what materialized views were available on the cluster, and if they were up to date.\n\nIn November 2020, materialized view automatic refresh and query rewrite features were added. With materialized view-aware automatic rewriting, data analysts get the benefit of materialized views for their queries and dashboards without having to query the materialized view directly. The analyst may not even be aware the materialized views exist. The auto rewrite feature enables this by rewriting queries to use materialized views without the query needing to explicitly reference them. In addition, auto refresh keeps materialized views up to date when base table data is changed, and there are available cluster resources for the materialized view maintenance.\n\nHowever, materialized views still have to be manually created, monitored, and maintained by data engineers or DBAs. To reduce this overhead, Amazon Redshift has introduced the [Automated Materialized View](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-auto-mv.html) (AutoMV) feature, which goes one step further and automatically creates materialized views for queries with common recurring joins and aggregations.\n\nThis post explains what materialized views are, how manual materialized views work and the benefits they provide, and what’s required to build and maintain manual materialized views to achieve performance improvements and optimization. Then we explain how this is greatly simplified with the new automated materialized view feature.\n\n\n#### **Manually create materialized views**\n\n\nA materialized view is a database object that stores precomputed query results in a materialized (persisted) dataset. Similar queries can use the precomputed results from the materialized view and skip the expensive tasks of reading the underlying tables and performing joins and aggregates, thereby improving the query performance.\n\nFor example, you can improve the performance of a dashboard by materializing the results of its queries into a materialized view or multiple materialized views. When the dashboard is opened or refreshed, it can use the precomputed results from the materialized view instead of rereading the base tables and reprocessing the queries. By creating a materialized view once and querying it multiple times, redundant processing can be avoided, improving query performance and freeing up resources for other processing on the database.\n\nTo demonstrate this, we use the following query, which returns daily order and sales numbers. It joins two tables and aggregates at the day level.\n\n```\nSET enable_result_cache_for_session TO OFF;\n\nSELECT o.o_orderdate AS order_date\n ,SUM(l.l_extendedprice) AS ext_price_total\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderdate >= '1997-01-01'\nAND o.o_orderdate < '1998-01-01'\nGROUP BY o.o_orderdate\nORDER BY 1;\n```\n\nAt the top of the query, we set ```enable_result_cache_for_session``` to ```OFF```. This setting disables the results cache, so we can see the full processing runtime each time we run the query. Unlike a materialized view, the results cache is a simple cache that stores the results of a single query in memory, it can’t be used by other similar queries, is not updated when the base tables are modified, and because it isn’t persisted, can be aged-out of memory by more frequently used queries.\n\nWhen we run this query on a 10-node ra3.4xl cluster with the [TPC-H 3 TB](https://github.com/awslabs/amazon-redshift-utils/tree/master/src/CloudDataWarehouseBenchmark/Cloud-DWB-Derived-from-TPCH) dataset, it returns in approximately 20 seconds. If we need to run this query or similar queries more than once, we can create a materialized view with the [CREATE MATERIALIZED VIEW](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-create-sql-command.html) command and query the materialized view object directly, which has the same structure as a table:\n\n```\nCREATE MATERIALIZED VIEW mv_daily_sales\nAS\nSELECT o.o_orderdate AS order_date\n ,SUM(l.l_extendedprice) AS ext_price_total\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderdate >= '1997-01-01'\nAND o.o_orderdate < '1998-01-01'\nGROUP BY o.o_orderdate;\n\nSELECT order_date\n ,ext_price_total\nFROM mv_daily_sales\nORDER BY 1;\n```\n\nBecause the join and aggregations have been precomputed, it runs in approximately 900 milliseconds, a performance improvement of 96%.\n\nAs we have just shown, you can query the materialized view directly; however, Amazon Redshift can automatically rewrite a query to use one or more materialized views. The [query rewrite](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-auto-rewrite.html) feature transparently rewrites the query as it’s being run to retrieve precomputed results from a materialized view. This process is automatically triggered on [eligible](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-auto-rewrite.html#mv_auto-rewrite_limitations) and up-to-date materialized views, if the query contains the same base tables and joins, and has similar aggregations as the materialized view.\n\nFor example, if we rerun the sales query, because it’s eligible for rewriting, it’s automatically rewritten to use the ```mv_daily_sales``` materialized view. We start with the original query:\n\n```\nSELECT o.o_orderdate AS order_date\n ,SUM(l.l_extendedprice) AS ext_price_total\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderdate >= '1997-01-01'\nAND o.o_orderdate < '1998-01-01'\nGROUP BY o.o_orderdate\nORDER BY 1;\n```\n\nInternally, the query is rewritten to the following SQL and run. This process is completely transparent to the user.\n\n```\nSELECT order_date\n ,ext_price_total\nFROM mv_daily_sales\nORDER BY 1;\n```\n\nThe rewriting can be confirmed by looking at the query’s explain plan:\n\n```\nEXPLAIN SELECT o.o_orderdate AS order_date\n ,SUM(l.l_extendedprice) AS ext_price_total\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderdate >= '1997-01-01'\nAND o.o_orderdate < '1998-01-01'\nGROUP BY o.o_orderdate;\n\n+------------------------------------------------------------------------------------------------+\n|QUERY PLAN |\n+------------------------------------------------------------------------------------------------+\n|XN HashAggregate (cost=5.47..5.97 rows=200 width=31) |\n| -> XN Seq Scan on mv_tbl__mv_daily_sales__0 derived_table1 (cost=0.00..3.65 rows=365 width=31)|\n+------------------------------------------------------------------------------------------------+\n```\n\nThe plan shows the query has been rewritten and has retrieved the results from the ```mv_daily_sales``` materialized view, not the query’s base tables: ```orders``` and ```lineitem```.\n\nOther queries that use the same base tables and level of aggregation, or a level of aggregation derived from the materialized view’s level, are also rewritten. For example:\n\n```\nEXPLAIN SELECT date_trunc('month', o.o_orderdate) AS order_month\n ,SUM(l.l_extendedprice) AS ext_price_total\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderdate >= '1997-01-01'\nAND o.o_orderdate < '1998-01-01'\nGROUP BY order_month;\n\n+------------------------------------------------------------------------------------------------+\n|QUERY PLAN |\n+------------------------------------------------------------------------------------------------+\n|XN HashAggregate (cost=7.30..10.04 rows=365 width=19) |\n| -> XN Seq Scan on mv_tbl__mv_daily_sales__0 derived_table1 (cost=0.00..5.47 rows=365 width=19)|\n+------------------------------------------------------------------------------------------------+\n```\n\nIf data in the ```orders``` or ```lineitem``` table changes, ```mv_daily_sales``` becomes stale; this means the materialized view isn’t reflecting the state of its base tables. If we update a row in ```lineitem``` and check the [stv_mv_info](https://docs.aws.amazon.com/redshift/latest/dg/r_STV_MV_INFO.html) system table, we can see the ```is_stale``` flag is set to ```t``` (true):\n\n```\nUPDATE lineitem\nSET l_extendedprice = 5000\nWHERE l_orderkey = 2362252519\nAND l_linenumber = 1;\n\nSELECT name\n ,is_stale\nFROM stv_mv_info\nWHERE name = 'mv_daily_sales';\n\n+--------------+--------+\n|name |is_stale|\n+--------------+--------+\n|mv_daily_sales|t |\n+--------------+--------+\n```\n\nWe can now manually refresh the materialized view using the [REFRESH MATERIALIZED VIEW](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-refresh-sql-command.html\n) statement:\n\n```\nREFRESH MATERIALIZED VIEW mv_daily_sales;\n\nSELECT name\n ,is_stale\nFROM stv_mv_info\nWHERE name = 'mv_daily_sales';\n\n+--------------+--------+\n|name |is_stale|\n+--------------+--------+\n|mv_daily_sales|f |\n+--------------+--------+\n```\n\nThere are two types of materialized view refresh: full and incremental. A full refresh reruns the underlying SQL statement and rebuilds the whole materialized view. An incremental refresh only updates specific rows affected by the source data change. To see if a materialized view is eligible for incremental refreshes, view the ```state``` column in the ```stv_mv_info``` system table. A state of ```0``` indicates the materialized view will be fully refreshed, and a state of ```1``` indicates the materialized view will be incrementally refreshed.\n\n```\nSELECT name\n ,state\nFROM stv_mv_info\nWHERE name = 'mv_daily_sales';\n\n+--------------+--------+\n|name |state |\n+--------------+--------+\n|mv_daily_sales| 1|\n+--------------+--------+\n```\n\nYou can schedule manual refreshes on the Amazon Redshift console if you need to refresh a materialized view at fixed periods, such as once per hour. For more information, refer to [Scheduling a query on the Amazon Redshift console](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-schedule-query.html).\n\nAs well as the ability to do a manual refresh, Amazon Redshift can also automatically refresh materialized views. The auto refresh feature intelligently determines when to refresh the materialized view, and if you have multiple materialized views, which order to refresh them in. Amazon Redshift considers the benefit of refreshing a materialized view (how often the materialized view is used, what performance gain the materialized view provides) and the cost (resources required for the refresh, current system load, available system resources).\n\nThis intelligent refreshing has a number of benefits. Because not all materialized views are equally important, deciding when and in which order to refresh materialized views on a large system is a complex task for a DBA to solve. Also, the DBA needs to consider other workloads running on the system, and try to ensure the latency of critical workloads is not increased by the effect of refreshing materialized views. The auto refresh feature helps remove the need for a DBA to do these difficult and time-consuming tasks.\n\nYou can set a materialized view to be automatically refreshed in the ```CREATE MATERIALIZED VIEW``` statement with the ```AUTO REFRESH YES``` parameter:\n\n```\nCREATE MATERIALIZED VIEW mv_daily_sales\nAUTO REFRESH YES\nAS\nSELECT ...\n```\n\nNow when the source data of the materialized view changes, the materialized view is automatically refreshed. We can view the status of the refresh in the [svl_mv_refresh_status](https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_MV_REFRESH_STATUS.html) system table. For example:\n\n```\nUPDATE lineitem\nSET l_extendedprice = 6000\nWHERE l_orderkey = 2362252519\nAND l_linenumber = 1;\n\nSELECT mv_name\n ,starttime\n ,endtime\n ,status\n ,refresh_type\nFROM svl_mv_refresh_status\nWHERE mv_name = 'mv_daily_sales';\n\n+--------------+--------------------------+--------------------------+---------------------------------------------+------------+\n|mv_name |starttime |endtime |status |refresh_type|\n+--------------+--------------------------+--------------------------+---------------------------------------------+------------+\n|mv_daily_sales|2022-05-06 14:07:24.857074|2022-05-06 14:07:33.342346|Refresh successfully updated MV incrementally|Auto |\n+--------------+--------------------------+--------------------------+---------------------------------------------+------------+\n\n```\n\nTo remove a materialized view, we use the [DROP MATERIALIZED VIEW](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-drop-sql-command.html) command:\n\n```\nDROP MATERIALIZED VIEW mv_daily_sales;\n```\n\nNow that you’ve seen what materialized views are, their benefits, and how they are created, used, and removed, let’s discuss the drawbacks. Designing and implementing a set of materialized views to help improve overall query performance on a database requires a skilled resource to perform several involved and time-consuming tasks:\n\n- Analyzing queries run on the system\n- Identifying which queries are run regularly and provide business benefit\n- Prioritizing the identified queries\n- Determining if the performance improvement is worth creating a materialized view and storing the dataset\n- Physically creating and refreshing the materialized views\n- Monitoring the usage of the materialized views\n- Dropping materialized views that are rarely or never used or can’t be refreshed due to the structure of base tables changing\n\nSignificant skill, effort, and time is required to design and create materialized views that provide an overall benefit. Also, ongoing monitoring is needed to identify poorly designed or underutilized materialized views that are occupying resources without providing gains.\n\nAmazon Redshift now has a feature to automate this process, [Automated Materialized Views](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-auto-mv.html) (AutoMVs). We explain how AutoMVs work and how to use them on your cluster in the following sections.\n\n\n#### **Automatically create materialized views**\n\n\nWhen the AutoMV feature is enabled on an Amazon Redshift cluster (it’s enabled by default), Amazon Redshift monitors recently run queries and identifies any that could have their performance improved by a materialized view. Expensive parts of the query, such as aggregates and joins that can be persisted into materialized views and reused by future queries, are then extracted from the main query and any subqueries. The extracted query parts are then rewritten into create materialized view statements (candidate materialized views) and stored for further processing.\n\nThe candidate materialized views are not just one-to-one copies of queries; extra processing is applied to create generalized materialized views that can be used by queries similar to the original query. In the following example, the result set is limited by the filters ```o_orderpriority = '1-URGENT'``` and ```l_shipmode ='AIR'```. Therefore, a materialized view built from this result set could only serve queries selecting that limited range of data.\n\n```\nSELECT o.o_orderdate\n ,SUM(l.l_extendedprice)\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderpriority = '1-URGENT'\nAND l.l_shipmode ='AIR'\nGROUP BY o.o_orderdate;\n```\n\nAmazon Redshift uses many techniques to create generalized materialized views; one of these techniques is called predicate elevation. To apply predicate elevation to this query, the filtered columns ```o_orderpriority``` and ```l_shipmode``` are moved into the GROUP BY clause, thereby storing the full range of data in the materialized view, which allows similar queries to use the same materialized view. This approach is driven by dashboard-like workloads that often issue identical queries with different filter predicates.\n\n```\nSELECT o.o_orderdate\n ,o.o_orderpriority\n ,l.l_shipmode\n ,SUM(l.l_extendedprice)\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nGROUP BY o.o_orderdate\n ,o.o_orderpriority\n ,l.l_shipmode;\n```\n\nIn the next processing step, ML algorithms are applied to calculate which of the candidate materialized views provides the best performance benefit and system-wide performance optimization. The algorithms follow similar logic to the auto refresh feature mentioned previously. For each candidate materialized view, Amazon Redshift calculates a benefit, which corresponds to the expected performance improvement should the materialized view be materialized and used in the workload. In addition, it calculates a cost corresponding to the system resources required to create and maintain the candidate. Existing manual materialized views are also considered; an AutoMV will not be created if a manual materialized view already exists that covers the same scope, and manual materialized views have auto refresh priority over AutoMVs.\n\nThe list of materialized views is then sorted in order of overall cost-benefit, taking into consideration [workload management](https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html) (WLM) [query priorities](https://docs.aws.amazon.com/redshift/latest/dg/query-priority.html), with materialized views related to queries on a higher priority queue ordered before materialized views related to queries on a lower priority queue. After the list of materialized views has been fully sorted, they’re automatically created and populated in the background in the prioritized order.\n\nThe created AutoMVs are then monitored by a background process that checks their activity, such as how often they have been queried and refreshed. If the process determines that an AutoMV is not being used or refreshed, for example due to the base table’s structure changing, it is dropped.\n\n##### **Example**\n\nTo demonstrate this process in action, we use the following query taken from the 3 TB [Cloud DW Benchmark](https://github.com/awslabs/amazon-redshift-utils/tree/master/src/CloudDataWarehouseBenchmark/Cloud-DWB-Derived-from-TPCH), a performance testing benchmark derived from TPC-H. You can load the benchmark data into your cluster and follow along with the example.\n\n```\nSET enable_result_cache_for_session TO OFF;\n\nSELECT /* TPC-H Q12 */\n l_shipmode\n , SUM(CASE\n WHEN o_orderpriority = '1-URGENT'\n OR o_orderpriority = '2-HIGH'\n THEN 1\n ELSE 0\n END) AS high_line_count\n , SUM(CASE\n WHEN o_orderpriority '1-URGENT'\n AND o_orderpriority '2-HIGH'\n THEN 1\n ELSE 0\n END) AS low_line_count\nFROM orders\n , lineitem\nWHERE o_orderkey = l_orderkey\nAND l_shipmode IN ('MAIL', 'SHIP')\nAND l_commitdate < l_receiptdate\nAND l_shipdate = DATE '1994-01-01'\nAND l_receiptdate < DATEADD(YEAR, 1, CAST('1994-01-01' AS DATE))\nGROUP BY l_shipmode\nORDER BY l_shipmode;\n```\n\nWe run the query three times and then wait for 30 minutes. On a 10-node ra3.4xl cluster, the query runs in approximately 8 seconds.\n\nDuring the 30 minutes, Amazon Redshift assesses the benefit of materializing candidate AutoMVs. It computes a sorted list of candidate materialized views and creates the most beneficial ones with incremental refresh, auto refresh, and query rewrite enabled. When the query or similar queries run, they’re automatically and transparently rewritten to use one or more of the created AutoMVs.\n\nOngoing, if data in the base tables is modified (i.e. the AutoMV becomes stale), an incremental refresh automatically runs, inserting, updating, and deleting rows in the AutoMV to bring its data to the latest state.\n\nRerunning the query shows that it runs in approximately 800 milliseconds, a performance improvement of 90%. We can confirm the query is using the AutoMV by checking the explain plan:\n\n```\nEXPLAIN SELECT /* TPC-H Q12 */\n l_shipmode\n ,\n SUM(CASE\n WHEN o_orderpriority = '1-URGENT'\n OR o_orderpriority = '2-HIGH'\n THEN 1\n ELSE 0\n END) AS high_line_count\n , SUM(CASE\n WHEN o_orderpriority <> '1-URGENT'\n AND o_orderpriority <> '2-HIGH'\n THEN 1\n ELSE 0\n END) AS low_line_count\nFROM orders\n , lineitem\nWHERE o_orderkey = l_orderkey\nAND l_shipmode IN ('MAIL', 'SHIP')\nAND l_commitdate < l_receiptdate\nAND l_shipdate < l_commitdate\nAND l_receiptdate >= DATE '1994-01-01'\nAND l_receiptdate < DATEADD(YEAR, 1, CAST('1994-01-01' AS DATE))\nGROUP BY l_shipmode\nORDER BY l_shipmode;\n\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|QUERY PLAN |\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|XN Merge (cost=1000000000354.23..1000000000354.23 rows=1 width=30) |\n| Merge Key: derived_table1.grvar_1 |\n| -> XN Network (cost=1000000000354.23..1000000000354.23 rows=1 width=30) |\n| Send to leader |\n| -> XN Sort (cost=1000000000354.23..1000000000354.23 rows=1 width=30) |\n| Sort Key: derived_table1.grvar_1 |\n| -> XN HashAggregate (cost=354.21..354.22 rows=1 width=30) |\n| -> XN Seq Scan on mv_tbl__auto_mv_2000__0 derived_table1 (cost=0.00..349.12 rows=679 width=30) |\n| Filter: ((grvar_2 < '1995-01-01'::date) AND (grvar_2 >= '1994-01-01'::date) AND ((grvar_1 = 'SHIP'::bpchar) OR (grvar_1 = 'MAIL'::bpchar)))|\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n```\n\nTo demonstrate how AutoMVs can also improve the performance of similar queries, we change some of the filters on the original query. In the following example, we change the filter on ```l_shipmode``` from ```IN ('MAIL', 'SHIP')``` to ```IN ('TRUCK', 'RAIL', 'AIR')```, and change the filter on ```l_receiptdate``` to the first 6 months of the previous year. The query runs in approximately 900 milliseconds and, looking at the explain plan, we confirm it’s using the AutoMV:\n\n```\nEXPLAIN SELECT /* TPC-H Q12 modified */\n l_shipmode\n , SUM(CASE\n WHEN o_orderpriority = '1-URGENT'\n OR o_orderpriority = '2-HIGH'\n THEN 1\n ELSE 0\n END) AS high_line_count\n , SUM(CASE\n WHEN o_orderpriority <> '1-URGENT'\n AND o_orderpriority <> '2-HIGH'\n THEN 1\n ELSE 0\n END) AS low_line_count\nFROM orders\n , lineitem\nWHERE o_orderkey = l_orderkey\nAND l_shipmode IN ('TRUCK', 'RAIL', 'AIR')\nAND l_commitdate < l_receiptdate\nAND l_shipdate < l_commitdate\nAND l_receiptdate >= DATE '1993-01-01'\nAND l_receiptdate < DATE '1993-07-01'\nGROUP BY l_shipmode\nORDER BY l_shipmode;\n\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|QUERY PLAN |\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|XN Merge (cost=1000000000396.30..1000000000396.31 rows=1 width=30) |\n| Merge Key: derived_table1.grvar_1 |\n| -> XN Network (cost=1000000000396.30..1000000000396.31 rows=1 width=30) |\n| Send to leader |\n| -> XN Sort (cost=1000000000396.30..1000000000396.31 rows=1 width=30) |\n| Sort Key: derived_table1.grvar_1 |\n| -> XN HashAggregate (cost=396.29..396.29 rows=1 width=30) |\n| -> XN Seq Scan on mv_tbl__auto_mv_2000__0 derived_table1 (cost=0.00..392.76 rows=470 width=30) |\n| Filter: ((grvar_2 < '1993-07-01'::date) AND (grvar_2 >= '1993-01-01'::date) AND ((grvar_1 = 'AIR'::bpchar) OR (grvar_1 = 'RAIL'::bpchar) OR (grvar_1 = 'TRUCK'::bpchar)))|\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n```\n\nThe AutoMV feature is transparent to users and is fully system managed. Therefore, unlike manual materialized views, AutoMVs are not visible to users and can’t be queried directly. They also don’t appear in any system tables like ```stv_mv_info``` or ```svl_mv_refresh_status```.\n\nFinally, if the AutoMV hasn’t been used for some time by the workload, it’s automatically dropped and the storage released. When we rerun the query after this, the runtime returns to the original 8 seconds because the query is now using the base tables. This can be confirmed by examining the explain plan.\n\nThis example illustrates that the AutoMV feature reduces the effort and time required to create and maintain materialized views.\n\n\n#### **Performance tests and results**\n\n\nTo see how well AutoMVs work in practice, we ran tests using the 1 TB and 3 TB versions of the [Cloud DW benchmark derived from TPC-H](https://github.com/awslabs/amazon-redshift-utils/tree/master/src/CloudDataWarehouseBenchmark/Cloud-DWB-Derived-from-TPCH). This test consists of a power run script with 22 queries that is run three times with the results cache off. The tests were run with two different clusters: 4-node ra3.4xlarge and 2-node ra3.16xlarge with a concurrency of 1 and 5.\n\nThe Cloud DW benchmark is derived from the TPC-H benchmark. It isn’t comparable to published TPC-H results, because the results of our tests don’t fully comply with the specification.\n\nThe following table shows our results.\n\n![image.png](https://dev-media.amazoncloud.cn/25e5c52940d24dda9506a62ed3c80bac_image.png)\n\nThe AutoMV feature improved query performance by up to 19% without any manual intervention.\n\n\n#### **Summary**\n\n\nIn this post, we first presented manual materialized views, their various features, and how to take advantage of them. We then looked into the effort and time required to design, create, and maintain materialized views to provide performance improvements in a data warehouse.\n\nNext, we discussed how AutoMVs help overcome these challenges and seamlessly provide performance improvements for SQL queries and dashboards. We went deeper into the details of how AutoMVs work and discussed how ML algorithms determine which materialized views to create based on the predicted performance improvement and overall benefit they will provide compared to the cost required to create and maintain them. Then we covered some of the internal processing logic such as how predicate elevation creates generalized materialized views that can be used by a range of queries, not just the original query that triggered the materialized view creation.\n\nFinally, we showed the results of a performance test on an industry benchmark where the AutoMV feature improved performance by up to 19%.\n\nAs we have demonstrated, automated materialized views provide performance improvements to a data warehouse without requiring any manual effort or specialized expertise. They transparently work in the background, optimizing your workload performance and automatically adapting when your workloads change.\n\nAutomated materialized views are enabled by default. We encourage you to monitor any performance improvements they have on your current clusters. If you’re new to Amazon Redshift, try the [Getting Started tutorial](http://docs.aws.amazon.com/redshift/latest/gsg/welcome.html) and use the [free trial](https://aws.amazon.com/redshift/free-trial/) to create and provision your first cluster and experiment with the feature.\n\n\n##### **About the Authors**\n\n![image.png](https://dev-media.amazoncloud.cn/231acb789ead4708be892fcd47499c13_image.png)\n\n**Adam Gatt** is a Senior Specialist Solution Architect for Analytics at Amazon Web Services. He has over 20 years of experience in data and data warehousing and helps customers build robust, scalable and high-performance analytics solutions in the cloud.\n\n![image.png](https://dev-media.amazoncloud.cn/bcfe0e2b9e6043d39ca08781df226564_image.png)\n\n**Rahul Chaturvedi** is an Analytics Specialist Solutions Architect at Amazon Web Services. Prior to this role, he was a Data Engineer at Amazon Advertising and Prime Video, where he helped build petabyte-scale data lakes for self-serve analytics.","render":"<p><a href=\"https://aws.amazon.com/redshift/\" target=\"_blank\">Amazon Redshift</a> is a fast, fully managed cloud data warehouse database that makes it cost-effective to analyze your data using standard SQL and business intelligence tools. Amazon Redshift allows you to analyze structured and semi-structured data and seamlessly query data lakes and operational databases, using Amazon Web Servicesdesigned hardware and automated machine learning (ML)-based tuning to deliver top-tier price-performance at scale.</p>\n<p>Although Amazon Redshift provides excellent price performance out of the box, it offers additional optimizations that can improve this performance and allow you to achieve even faster query response times from your data warehouse.</p>\n<p>For example, you can physically tune tables in a data model to minimize the amount of data scanned and distributed within a cluster, which speeds up operations such as table joins and range-bound scans. Amazon Redshift now automates this tuning with the <a href=\"https://aws.amazon.com/blogs/big-data/automate-your-amazon-redshift-performance-tuning-with-automatic-table-optimization/\" target=\"_blank\">automatic table optimization (ATO)</a> feature.</p>\n<p>Another optimization for reducing query runtime is to precompute query results in the form of a materialized view. Materialized views store precomputed query results that future similar queries can use. This improves query performance because many computation steps can be skipped and the precomputed results returned directly. Unlike a simple cache, many materialized views can be incrementally refreshed when DML changes are applied on the underlying (base) tables and can be used by other similar queries, not just the query used to create the materialized view.</p>\n<p>Amazon Redshift introduced materialized views in March 2020. In June 2020, support for external tables was added. With these releases, you could use materialized views on both local and external tables to deliver low-latency performance by using precomputed views in your queries. However, this approach required you to be aware of what materialized views were available on the cluster, and if they were up to date.</p>\n<p>In November 2020, materialized view automatic refresh and query rewrite features were added. With materialized view-aware automatic rewriting, data analysts get the benefit of materialized views for their queries and dashboards without having to query the materialized view directly. The analyst may not even be aware the materialized views exist. The auto rewrite feature enables this by rewriting queries to use materialized views without the query needing to explicitly reference them. In addition, auto refresh keeps materialized views up to date when base table data is changed, and there are available cluster resources for the materialized view maintenance.</p>\n<p>However, materialized views still have to be manually created, monitored, and maintained by data engineers or DBAs. To reduce this overhead, Amazon Redshift has introduced the <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-auto-mv.html\" target=\"_blank\">Automated Materialized View</a> (AutoMV) feature, which goes one step further and automatically creates materialized views for queries with common recurring joins and aggregations.</p>\n<p>This post explains what materialized views are, how manual materialized views work and the benefits they provide, and what’s required to build and maintain manual materialized views to achieve performance improvements and optimization. Then we explain how this is greatly simplified with the new automated materialized view feature.</p>\n<h4><a id=\"Manually_create_materialized_views_17\"></a><strong>Manually create materialized views</strong></h4>\n<p>A materialized view is a database object that stores precomputed query results in a materialized (persisted) dataset. Similar queries can use the precomputed results from the materialized view and skip the expensive tasks of reading the underlying tables and performing joins and aggregates, thereby improving the query performance.</p>\n<p>For example, you can improve the performance of a dashboard by materializing the results of its queries into a materialized view or multiple materialized views. When the dashboard is opened or refreshed, it can use the precomputed results from the materialized view instead of rereading the base tables and reprocessing the queries. By creating a materialized view once and querying it multiple times, redundant processing can be avoided, improving query performance and freeing up resources for other processing on the database.</p>\n<p>To demonstrate this, we use the following query, which returns daily order and sales numbers. It joins two tables and aggregates at the day level.</p>\n<pre><code class=\"lang-\">SET enable_result_cache_for_session TO OFF;\n\nSELECT o.o_orderdate AS order_date\n ,SUM(l.l_extendedprice) AS ext_price_total\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderdate &gt;= '1997-01-01'\nAND o.o_orderdate &lt; '1998-01-01'\nGROUP BY o.o_orderdate\nORDER BY 1;\n</code></pre>\n<p>At the top of the query, we set <code>enable_result_cache_for_session</code> to <code>OFF</code>. This setting disables the results cache, so we can see the full processing runtime each time we run the query. Unlike a materialized view, the results cache is a simple cache that stores the results of a single query in memory, it can’t be used by other similar queries, is not updated when the base tables are modified, and because it isn’t persisted, can be aged-out of memory by more frequently used queries.</p>\n<p>When we run this query on a 10-node ra3.4xl cluster with the <a href=\"https://github.com/awslabs/amazon-redshift-utils/tree/master/src/CloudDataWarehouseBenchmark/Cloud-DWB-Derived-from-TPCH\" target=\"_blank\">TPC-H 3 TB</a> dataset, it returns in approximately 20 seconds. If we need to run this query or similar queries more than once, we can create a materialized view with the <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-create-sql-command.html\" target=\"_blank\">CREATE MATERIALIZED VIEW</a> command and query the materialized view object directly, which has the same structure as a table:</p>\n<pre><code class=\"lang-\">CREATE MATERIALIZED VIEW mv_daily_sales\nAS\nSELECT o.o_orderdate AS order_date\n ,SUM(l.l_extendedprice) AS ext_price_total\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderdate &gt;= '1997-01-01'\nAND o.o_orderdate &lt; '1998-01-01'\nGROUP BY o.o_orderdate;\n\nSELECT order_date\n ,ext_price_total\nFROM mv_daily_sales\nORDER BY 1;\n</code></pre>\n<p>Because the join and aggregations have been precomputed, it runs in approximately 900 milliseconds, a performance improvement of 96%.</p>\n<p>As we have just shown, you can query the materialized view directly; however, Amazon Redshift can automatically rewrite a query to use one or more materialized views. The <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-auto-rewrite.html\" target=\"_blank\">query rewrite</a> feature transparently rewrites the query as it’s being run to retrieve precomputed results from a materialized view. This process is automatically triggered on <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-auto-rewrite.html#mv_auto-rewrite_limitations\" target=\"_blank\">eligible</a> and up-to-date materialized views, if the query contains the same base tables and joins, and has similar aggregations as the materialized view.</p>\n<p>For example, if we rerun the sales query, because it’s eligible for rewriting, it’s automatically rewritten to use the <code>mv_daily_sales</code> materialized view. We start with the original query:</p>\n<pre><code class=\"lang-\">SELECT o.o_orderdate AS order_date\n ,SUM(l.l_extendedprice) AS ext_price_total\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderdate &gt;= '1997-01-01'\nAND o.o_orderdate &lt; '1998-01-01'\nGROUP BY o.o_orderdate\nORDER BY 1;\n</code></pre>\n<p>Internally, the query is rewritten to the following SQL and run. This process is completely transparent to the user.</p>\n<pre><code class=\"lang-\">SELECT order_date\n ,ext_price_total\nFROM mv_daily_sales\nORDER BY 1;\n</code></pre>\n<p>The rewriting can be confirmed by looking at the query’s explain plan:</p>\n<pre><code class=\"lang-\">EXPLAIN SELECT o.o_orderdate AS order_date\n ,SUM(l.l_extendedprice) AS ext_price_total\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderdate &gt;= '1997-01-01'\nAND o.o_orderdate &lt; '1998-01-01'\nGROUP BY o.o_orderdate;\n\n+------------------------------------------------------------------------------------------------+\n|QUERY PLAN |\n+------------------------------------------------------------------------------------------------+\n|XN HashAggregate (cost=5.47..5.97 rows=200 width=31) |\n| -&gt; XN Seq Scan on mv_tbl__mv_daily_sales__0 derived_table1 (cost=0.00..3.65 rows=365 width=31)|\n+------------------------------------------------------------------------------------------------+\n</code></pre>\n<p>The plan shows the query has been rewritten and has retrieved the results from the <code>mv_daily_sales</code> materialized view, not the query’s base tables: <code>orders</code> and <code>lineitem</code>.</p>\n<p>Other queries that use the same base tables and level of aggregation, or a level of aggregation derived from the materialized view’s level, are also rewritten. For example:</p>\n<pre><code class=\"lang-\">EXPLAIN SELECT date_trunc('month', o.o_orderdate) AS order_month\n ,SUM(l.l_extendedprice) AS ext_price_total\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderdate &gt;= '1997-01-01'\nAND o.o_orderdate &lt; '1998-01-01'\nGROUP BY order_month;\n\n+------------------------------------------------------------------------------------------------+\n|QUERY PLAN |\n+------------------------------------------------------------------------------------------------+\n|XN HashAggregate (cost=7.30..10.04 rows=365 width=19) |\n| -&gt; XN Seq Scan on mv_tbl__mv_daily_sales__0 derived_table1 (cost=0.00..5.47 rows=365 width=19)|\n+------------------------------------------------------------------------------------------------+\n</code></pre>\n<p>If data in the <code>orders</code> or <code>lineitem</code> table changes, <code>mv_daily_sales</code> becomes stale; this means the materialized view isn’t reflecting the state of its base tables. If we update a row in <code>lineitem</code> and check the <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/r_STV_MV_INFO.html\" target=\"_blank\">stv_mv_info</a> system table, we can see the <code>is_stale</code> flag is set to <code>t</code> (true):</p>\n<pre><code class=\"lang-\">UPDATE lineitem\nSET l_extendedprice = 5000\nWHERE l_orderkey = 2362252519\nAND l_linenumber = 1;\n\nSELECT name\n ,is_stale\nFROM stv_mv_info\nWHERE name = 'mv_daily_sales';\n\n+--------------+--------+\n|name |is_stale|\n+--------------+--------+\n|mv_daily_sales|t |\n+--------------+--------+\n</code></pre>\n<p>We can now manually refresh the materialized view using the <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-refresh-sql-command.html\" target=\"_blank\">REFRESH MATERIALIZED VIEW</a> statement:</p>\n<pre><code class=\"lang-\">REFRESH MATERIALIZED VIEW mv_daily_sales;\n\nSELECT name\n ,is_stale\nFROM stv_mv_info\nWHERE name = 'mv_daily_sales';\n\n+--------------+--------+\n|name |is_stale|\n+--------------+--------+\n|mv_daily_sales|f |\n+--------------+--------+\n</code></pre>\n<p>There are two types of materialized view refresh: full and incremental. A full refresh reruns the underlying SQL statement and rebuilds the whole materialized view. An incremental refresh only updates specific rows affected by the source data change. To see if a materialized view is eligible for incremental refreshes, view the <code>state</code> column in the <code>stv_mv_info</code> system table. A state of <code>0</code> indicates the materialized view will be fully refreshed, and a state of <code>1</code> indicates the materialized view will be incrementally refreshed.</p>\n<pre><code class=\"lang-\">SELECT name\n ,state\nFROM stv_mv_info\nWHERE name = 'mv_daily_sales';\n\n+--------------+--------+\n|name |state |\n+--------------+--------+\n|mv_daily_sales| 1|\n+--------------+--------+\n</code></pre>\n<p>You can schedule manual refreshes on the Amazon Redshift console if you need to refresh a materialized view at fixed periods, such as once per hour. For more information, refer to <a href=\"https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-schedule-query.html\" target=\"_blank\">Scheduling a query on the Amazon Redshift console</a>.</p>\n<p>As well as the ability to do a manual refresh, Amazon Redshift can also automatically refresh materialized views. The auto refresh feature intelligently determines when to refresh the materialized view, and if you have multiple materialized views, which order to refresh them in. Amazon Redshift considers the benefit of refreshing a materialized view (how often the materialized view is used, what performance gain the materialized view provides) and the cost (resources required for the refresh, current system load, available system resources).</p>\n<p>This intelligent refreshing has a number of benefits. Because not all materialized views are equally important, deciding when and in which order to refresh materialized views on a large system is a complex task for a DBA to solve. Also, the DBA needs to consider other workloads running on the system, and try to ensure the latency of critical workloads is not increased by the effect of refreshing materialized views. The auto refresh feature helps remove the need for a DBA to do these difficult and time-consuming tasks.</p>\n<p>You can set a materialized view to be automatically refreshed in the <code>CREATE MATERIALIZED VIEW</code> statement with the <code>AUTO REFRESH YES</code> parameter:</p>\n<pre><code class=\"lang-\">CREATE MATERIALIZED VIEW mv_daily_sales\nAUTO REFRESH YES\nAS\nSELECT ...\n</code></pre>\n<p>Now when the source data of the materialized view changes, the materialized view is automatically refreshed. We can view the status of the refresh in the <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_MV_REFRESH_STATUS.html\" target=\"_blank\">svl_mv_refresh_status</a> system table. For example:</p>\n<pre><code class=\"lang-\">UPDATE lineitem\nSET l_extendedprice = 6000\nWHERE l_orderkey = 2362252519\nAND l_linenumber = 1;\n\nSELECT mv_name\n ,starttime\n ,endtime\n ,status\n ,refresh_type\nFROM svl_mv_refresh_status\nWHERE mv_name = 'mv_daily_sales';\n\n+--------------+--------------------------+--------------------------+---------------------------------------------+------------+\n|mv_name |starttime |endtime |status |refresh_type|\n+--------------+--------------------------+--------------------------+---------------------------------------------+------------+\n|mv_daily_sales|2022-05-06 14:07:24.857074|2022-05-06 14:07:33.342346|Refresh successfully updated MV incrementally|Auto |\n+--------------+--------------------------+--------------------------+---------------------------------------------+------------+\n\n</code></pre>\n<p>To remove a materialized view, we use the <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-drop-sql-command.html\" target=\"_blank\">DROP MATERIALIZED VIEW</a> command:</p>\n<pre><code class=\"lang-\">DROP MATERIALIZED VIEW mv_daily_sales;\n</code></pre>\n<p>Now that you’ve seen what materialized views are, their benefits, and how they are created, used, and removed, let’s discuss the drawbacks. Designing and implementing a set of materialized views to help improve overall query performance on a database requires a skilled resource to perform several involved and time-consuming tasks:</p>\n<ul>\n<li>Analyzing queries run on the system</li>\n<li>Identifying which queries are run regularly and provide business benefit</li>\n<li>Prioritizing the identified queries</li>\n<li>Determining if the performance improvement is worth creating a materialized view and storing the dataset</li>\n<li>Physically creating and refreshing the materialized views</li>\n<li>Monitoring the usage of the materialized views</li>\n<li>Dropping materialized views that are rarely or never used or can’t be refreshed due to the structure of base tables changing</li>\n</ul>\n<p>Significant skill, effort, and time is required to design and create materialized views that provide an overall benefit. Also, ongoing monitoring is needed to identify poorly designed or underutilized materialized views that are occupying resources without providing gains.</p>\n<p>Amazon Redshift now has a feature to automate this process, <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-auto-mv.html\" target=\"_blank\">Automated Materialized Views</a> (AutoMVs). We explain how AutoMVs work and how to use them on your cluster in the following sections.</p>\n<h4><a id=\"Automatically_create_materialized_views_244\"></a><strong>Automatically create materialized views</strong></h4>\n<p>When the AutoMV feature is enabled on an Amazon Redshift cluster (it’s enabled by default), Amazon Redshift monitors recently run queries and identifies any that could have their performance improved by a materialized view. Expensive parts of the query, such as aggregates and joins that can be persisted into materialized views and reused by future queries, are then extracted from the main query and any subqueries. The extracted query parts are then rewritten into create materialized view statements (candidate materialized views) and stored for further processing.</p>\n<p>The candidate materialized views are not just one-to-one copies of queries; extra processing is applied to create generalized materialized views that can be used by queries similar to the original query. In the following example, the result set is limited by the filters <code>o_orderpriority = '1-URGENT'</code> and <code>l_shipmode ='AIR'</code>. Therefore, a materialized view built from this result set could only serve queries selecting that limited range of data.</p>\n<pre><code class=\"lang-\">SELECT o.o_orderdate\n ,SUM(l.l_extendedprice)\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nWHERE o.o_orderpriority = '1-URGENT'\nAND l.l_shipmode ='AIR'\nGROUP BY o.o_orderdate;\n</code></pre>\n<p>Amazon Redshift uses many techniques to create generalized materialized views; one of these techniques is called predicate elevation. To apply predicate elevation to this query, the filtered columns <code>o_orderpriority</code> and <code>l_shipmode</code> are moved into the GROUP BY clause, thereby storing the full range of data in the materialized view, which allows similar queries to use the same materialized view. This approach is driven by dashboard-like workloads that often issue identical queries with different filter predicates.</p>\n<pre><code class=\"lang-\">SELECT o.o_orderdate\n ,o.o_orderpriority\n ,l.l_shipmode\n ,SUM(l.l_extendedprice)\nFROM orders o\nINNER JOIN lineitem l\n ON o.o_orderkey = l.l_orderkey\nGROUP BY o.o_orderdate\n ,o.o_orderpriority\n ,l.l_shipmode;\n</code></pre>\n<p>In the next processing step, ML algorithms are applied to calculate which of the candidate materialized views provides the best performance benefit and system-wide performance optimization. The algorithms follow similar logic to the auto refresh feature mentioned previously. For each candidate materialized view, Amazon Redshift calculates a benefit, which corresponds to the expected performance improvement should the materialized view be materialized and used in the workload. In addition, it calculates a cost corresponding to the system resources required to create and maintain the candidate. Existing manual materialized views are also considered; an AutoMV will not be created if a manual materialized view already exists that covers the same scope, and manual materialized views have auto refresh priority over AutoMVs.</p>\n<p>The list of materialized views is then sorted in order of overall cost-benefit, taking into consideration <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html\" target=\"_blank\">workload management</a> (WLM) <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/query-priority.html\" target=\"_blank\">query priorities</a>, with materialized views related to queries on a higher priority queue ordered before materialized views related to queries on a lower priority queue. After the list of materialized views has been fully sorted, they’re automatically created and populated in the background in the prioritized order.</p>\n<p>The created AutoMVs are then monitored by a background process that checks their activity, such as how often they have been queried and refreshed. If the process determines that an AutoMV is not being used or refreshed, for example due to the base table’s structure changing, it is dropped.</p>\n<h5><a id=\"Example_283\"></a><strong>Example</strong></h5>\n<p>To demonstrate this process in action, we use the following query taken from the 3 TB <a href=\"https://github.com/awslabs/amazon-redshift-utils/tree/master/src/CloudDataWarehouseBenchmark/Cloud-DWB-Derived-from-TPCH\" target=\"_blank\">Cloud DW Benchmark</a>, a performance testing benchmark derived from TPC-H. You can load the benchmark data into your cluster and follow along with the example.</p>\n<pre><code class=\"lang-\">SET enable_result_cache_for_session TO OFF;\n\nSELECT /* TPC-H Q12 */\n l_shipmode\n , SUM(CASE\n WHEN o_orderpriority = '1-URGENT'\n OR o_orderpriority = '2-HIGH'\n THEN 1\n ELSE 0\n END) AS high_line_count\n , SUM(CASE\n WHEN o_orderpriority '1-URGENT'\n AND o_orderpriority '2-HIGH'\n THEN 1\n ELSE 0\n END) AS low_line_count\nFROM orders\n , lineitem\nWHERE o_orderkey = l_orderkey\nAND l_shipmode IN ('MAIL', 'SHIP')\nAND l_commitdate &lt; l_receiptdate\nAND l_shipdate = DATE '1994-01-01'\nAND l_receiptdate &lt; DATEADD(YEAR, 1, CAST('1994-01-01' AS DATE))\nGROUP BY l_shipmode\nORDER BY l_shipmode;\n</code></pre>\n<p>We run the query three times and then wait for 30 minutes. On a 10-node ra3.4xl cluster, the query runs in approximately 8 seconds.</p>\n<p>During the 30 minutes, Amazon Redshift assesses the benefit of materializing candidate AutoMVs. It computes a sorted list of candidate materialized views and creates the most beneficial ones with incremental refresh, auto refresh, and query rewrite enabled. When the query or similar queries run, they’re automatically and transparently rewritten to use one or more of the created AutoMVs.</p>\n<p>Ongoing, if data in the base tables is modified (i.e. the AutoMV becomes stale), an incremental refresh automatically runs, inserting, updating, and deleting rows in the AutoMV to bring its data to the latest state.</p>\n<p>Rerunning the query shows that it runs in approximately 800 milliseconds, a performance improvement of 90%. We can confirm the query is using the AutoMV by checking the explain plan:</p>\n<pre><code class=\"lang-\">EXPLAIN SELECT /* TPC-H Q12 */\n l_shipmode\n ,\n SUM(CASE\n WHEN o_orderpriority = '1-URGENT'\n OR o_orderpriority = '2-HIGH'\n THEN 1\n ELSE 0\n END) AS high_line_count\n , SUM(CASE\n WHEN o_orderpriority &lt;&gt; '1-URGENT'\n AND o_orderpriority &lt;&gt; '2-HIGH'\n THEN 1\n ELSE 0\n END) AS low_line_count\nFROM orders\n , lineitem\nWHERE o_orderkey = l_orderkey\nAND l_shipmode IN ('MAIL', 'SHIP')\nAND l_commitdate &lt; l_receiptdate\nAND l_shipdate &lt; l_commitdate\nAND l_receiptdate &gt;= DATE '1994-01-01'\nAND l_receiptdate &lt; DATEADD(YEAR, 1, CAST('1994-01-01' AS DATE))\nGROUP BY l_shipmode\nORDER BY l_shipmode;\n\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|QUERY PLAN |\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|XN Merge (cost=1000000000354.23..1000000000354.23 rows=1 width=30) |\n| Merge Key: derived_table1.grvar_1 |\n| -&gt; XN Network (cost=1000000000354.23..1000000000354.23 rows=1 width=30) |\n| Send to leader |\n| -&gt; XN Sort (cost=1000000000354.23..1000000000354.23 rows=1 width=30) |\n| Sort Key: derived_table1.grvar_1 |\n| -&gt; XN HashAggregate (cost=354.21..354.22 rows=1 width=30) |\n| -&gt; XN Seq Scan on mv_tbl__auto_mv_2000__0 derived_table1 (cost=0.00..349.12 rows=679 width=30) |\n| Filter: ((grvar_2 &lt; '1995-01-01'::date) AND (grvar_2 &gt;= '1994-01-01'::date) AND ((grvar_1 = 'SHIP'::bpchar) OR (grvar_1 = 'MAIL'::bpchar)))|\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n</code></pre>\n<p>To demonstrate how AutoMVs can also improve the performance of similar queries, we change some of the filters on the original query. In the following example, we change the filter on <code>l_shipmode</code> from <code>IN ('MAIL', 'SHIP')</code> to <code>IN ('TRUCK', 'RAIL', 'AIR')</code>, and change the filter on <code>l_receiptdate</code> to the first 6 months of the previous year. The query runs in approximately 900 milliseconds and, looking at the explain plan, we confirm it’s using the AutoMV:</p>\n<pre><code class=\"lang-\">EXPLAIN SELECT /* TPC-H Q12 modified */\n l_shipmode\n , SUM(CASE\n WHEN o_orderpriority = '1-URGENT'\n OR o_orderpriority = '2-HIGH'\n THEN 1\n ELSE 0\n END) AS high_line_count\n , SUM(CASE\n WHEN o_orderpriority &lt;&gt; '1-URGENT'\n AND o_orderpriority &lt;&gt; '2-HIGH'\n THEN 1\n ELSE 0\n END) AS low_line_count\nFROM orders\n , lineitem\nWHERE o_orderkey = l_orderkey\nAND l_shipmode IN ('TRUCK', 'RAIL', 'AIR')\nAND l_commitdate &lt; l_receiptdate\nAND l_shipdate &lt; l_commitdate\nAND l_receiptdate &gt;= DATE '1993-01-01'\nAND l_receiptdate &lt; DATE '1993-07-01'\nGROUP BY l_shipmode\nORDER BY l_shipmode;\n\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|QUERY PLAN |\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n|XN Merge (cost=1000000000396.30..1000000000396.31 rows=1 width=30) |\n| Merge Key: derived_table1.grvar_1 |\n| -&gt; XN Network (cost=1000000000396.30..1000000000396.31 rows=1 width=30) |\n| Send to leader |\n| -&gt; XN Sort (cost=1000000000396.30..1000000000396.31 rows=1 width=30) |\n| Sort Key: derived_table1.grvar_1 |\n| -&gt; XN HashAggregate (cost=396.29..396.29 rows=1 width=30) |\n| -&gt; XN Seq Scan on mv_tbl__auto_mv_2000__0 derived_table1 (cost=0.00..392.76 rows=470 width=30) |\n| Filter: ((grvar_2 &lt; '1993-07-01'::date) AND (grvar_2 &gt;= '1993-01-01'::date) AND ((grvar_1 = 'AIR'::bpchar) OR (grvar_1 = 'RAIL'::bpchar) OR (grvar_1 = 'TRUCK'::bpchar)))|\n+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n</code></pre>\n<p>The AutoMV feature is transparent to users and is fully system managed. Therefore, unlike manual materialized views, AutoMVs are not visible to users and can’t be queried directly. They also don’t appear in any system tables like <code>stv_mv_info</code> or <code>svl_mv_refresh_status</code>.</p>\n<p>Finally, if the AutoMV hasn’t been used for some time by the workload, it’s automatically dropped and the storage released. When we rerun the query after this, the runtime returns to the original 8 seconds because the query is now using the base tables. This can be confirmed by examining the explain plan.</p>\n<p>This example illustrates that the AutoMV feature reduces the effort and time required to create and maintain materialized views.</p>\n<h4><a id=\"Performance_tests_and_results_415\"></a><strong>Performance tests and results</strong></h4>\n<p>To see how well AutoMVs work in practice, we ran tests using the 1 TB and 3 TB versions of the <a href=\"https://github.com/awslabs/amazon-redshift-utils/tree/master/src/CloudDataWarehouseBenchmark/Cloud-DWB-Derived-from-TPCH\" target=\"_blank\">Cloud DW benchmark derived from TPC-H</a>. This test consists of a power run script with 22 queries that is run three times with the results cache off. The tests were run with two different clusters: 4-node ra3.4xlarge and 2-node ra3.16xlarge with a concurrency of 1 and 5.</p>\n<p>The Cloud DW benchmark is derived from the TPC-H benchmark. It isn’t comparable to published TPC-H results, because the results of our tests don’t fully comply with the specification.</p>\n<p>The following table shows our results.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/25e5c52940d24dda9506a62ed3c80bac_image.png\" alt=\"image.png\" /></p>\n<p>The AutoMV feature improved query performance by up to 19% without any manual intervention.</p>\n<h4><a id=\"Summary_429\"></a><strong>Summary</strong></h4>\n<p>In this post, we first presented manual materialized views, their various features, and how to take advantage of them. We then looked into the effort and time required to design, create, and maintain materialized views to provide performance improvements in a data warehouse.</p>\n<p>Next, we discussed how AutoMVs help overcome these challenges and seamlessly provide performance improvements for SQL queries and dashboards. We went deeper into the details of how AutoMVs work and discussed how ML algorithms determine which materialized views to create based on the predicted performance improvement and overall benefit they will provide compared to the cost required to create and maintain them. Then we covered some of the internal processing logic such as how predicate elevation creates generalized materialized views that can be used by a range of queries, not just the original query that triggered the materialized view creation.</p>\n<p>Finally, we showed the results of a performance test on an industry benchmark where the AutoMV feature improved performance by up to 19%.</p>\n<p>As we have demonstrated, automated materialized views provide performance improvements to a data warehouse without requiring any manual effort or specialized expertise. They transparently work in the background, optimizing your workload performance and automatically adapting when your workloads change.</p>\n<p>Automated materialized views are enabled by default. We encourage you to monitor any performance improvements they have on your current clusters. If you’re new to Amazon Redshift, try the <a href=\"http://docs.aws.amazon.com/redshift/latest/gsg/welcome.html\" target=\"_blank\">Getting Started tutorial</a> and use the <a href=\"https://aws.amazon.com/redshift/free-trial/\" target=\"_blank\">free trial</a> to create and provision your first cluster and experiment with the feature.</p>\n<h5><a id=\"About_the_Authors_443\"></a><strong>About the Authors</strong></h5>\n<p><img src=\"https://dev-media.amazoncloud.cn/231acb789ead4708be892fcd47499c13_image.png\" alt=\"image.png\" /></p>\n<p><strong>Adam Gatt</strong> is a Senior Specialist Solution Architect for Analytics at Amazon Web Services. He has over 20 years of experience in data and data warehousing and helps customers build robust, scalable and high-performance analytics solutions in the cloud.</p>\n<p><img src=\"https://dev-media.amazoncloud.cn/bcfe0e2b9e6043d39ca08781df226564_image.png\" alt=\"image.png\" /></p>\n<p><strong>Rahul Chaturvedi</strong> is an Analytics Specialist Solutions Architect at Amazon Web Services. Prior to this role, he was a Data Engineer at Amazon Advertising and Prime Video, where he helped build petabyte-scale data lakes for self-serve analytics.</p>\n"}
目录
亚马逊云科技解决方案 基于行业客户应用场景及技术领域的解决方案
联系亚马逊云科技专家
亚马逊云科技解决方案
基于行业客户应用场景及技术领域的解决方案
联系专家
0
目录
关闭