Migrate a large data warehouse from Greenplum to Amazon Redshift using Amazon SCT – Part 2

海外精选
海外精选的内容汇集了全球优质的亚马逊云科技相关技术内容。同时,内容中提到的“AWS” 是 “Amazon Web Services” 的缩写,在此网站不作为商标展示。
0
0
{"value":"In this second post of a multi-part series, we share best practices for choosing the optimal [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) cluster, data architecture, converting [stored procedures](https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-create.html), compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns. You can check out the first post of this series for guidance on planning, running, and validation of a large-scale data warehouse migration from Greenplum to [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) using [Amazon Web Services Schema Conversion Tool](https://aws.amazon.com/dms/schema-conversion-tool/) (Amazon Web Services SCT).\n\n\n\n#### **Choose your optimal [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) cluster**\n\n\n[Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) has two types of clusters: provisioned and [serverless](https://aws.amazon.com/redshift/redshift-serverless/). For provisioned clusters, you need to set up the same with required compute resources. [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) Serverless can run high-performance analytics in the cloud at any scale. For more information, refer to [Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure](https://aws.amazon.com/blogs/aws/introducing-amazon-redshift-serverless-run-analytics-at-any-scale-without-having-to-manage-infrastructure/).\n\nAn [Amazon Redshift cluster](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html) consists of nodes. Each cluster has a leader node and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query run plans. The leader node then coordinates the parallel run of these plans with the compute nodes and aggregates the intermediate results from these nodes. It then returns the results to the client applications.\n\nWhen determining your type of cluster, consider the following:\n\n- Estimate the size of the input data compressed, vCPU, and performance. As of this writing, we recommend the [Amazon Redshift RA3](https://aws.amazon.com/redshift/features/ra3/) instance with managed storage, which scales compute and storage independently for fast query performance.\n- [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) provides an automated “Help me choose” cluster based on the size of your data.\n- A main advantage of a cloud [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) data warehouse is that you’re no longer stuck with hardware and commodities like old guard data warehouses. For faster innovation, you have the option to try different cluster options and choose the optimized one in terms of performance and cost.\n- At the time of development or pilot, you can usually start with a smaller number of nodes. As you move to production, you can adjust the number of nodes based on your usage pattern. When right-sizing your clusters, we recommend choosing the reserved instance type to cut down the cost even further. The public-facing utility [Simple Replay](https://github.com/awslabs/amazon-redshift-utils/tree/master/src/SimpleReplay) can help you determine performance against different cluster types and sizes by replaying the customer workload. For provisioned clusters, if you’re planning to use the recommended RA3 instance, you can [compare different node types](https://github.com/awslabs/amazon-redshift-utils/tree/master/src/SimpleReplay) to determine the right instance type.\n- Based on your workload pattern, [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) supports [resize](https://docs.aws.amazon.com/redshift/latest/mgmt/managing-cluster-operations.html), [pause and stop](https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/), and [concurrency scaling](https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling.html) of the cluster. [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) [workload management (WLM)](https://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html) enables effective and flexible management of memory and query concurrency.\n\n\n#### **Create data extraction tasks with Amazon Web Services SCT**\n\n\nWith Amazon Web Services SCT extraction agents, you can migrate your source tables in parallel. These extraction agents authenticate using a valid user on the data source, allowing you to adjust the resources available for that user during the extraction. Amazon Web Services SCT agents process the data locally and upload it to [Amazon Simple Storage Service](http://aws.amazon.com/s3) ([Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail)) through the network (via [AWS Direct Connect](https://aws.amazon.com/directconnect/)). We recommend having a consistent network bandwidth between your Greenplum machine where the Amazon Web Services SCT agent is installed and your Amazon Web Services Region.\n\nIf you have tables around 20 million rows or 1 TB in size, you can use the virtual partitioning feature on Amazon Web Services SCT to extract data from those tables. This creates several sub-tasks and parallelizes the data extraction process for this table. Therefore, we recommend creating two groups of tasks for each schema that you migrate: one for small tables and one for large tables using virtual partitions.\n\nFor more information, refer to [Creating, running, and monitoring an Amazon Web Services SCT data extraction task](https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/agents.dw.html#agents.Tasks).\n\n\n#### **Data architecture**\n\n\nTo simplify and modernize your data architecture, consider the following:\n\n- Establish accountability and authority to enforce enterprise data standards and policies.\n- Formalize the data and analytics operating model between enterprise and business units and functions.\n- Simplify the data technology ecosystem through rationalization and modernization of data assets and tools or technology.\n- Develop organizational constructs that facilitate more robust integration of the business and delivery teams, and build data-oriented products and solutions to address the business problems and opportunities throughout the lifecycle.\n- Back up the data periodically so that if something is wrong, you have the ability to replay.\n- During planning, design, execution, and throughout implementation and maintenance, ensure data quality management is added to achieve the desired outcome.\n- Simple is the key to an easy, fast, intuitive, and low-cost solution. Simple scales much better than complex. Simple makes it possible to think big (Invent and Simplify is another Amazon leadership principle). Simplify the legacy process by migrating only the necessary data used in tables and schemas. For example, if you’re performing truncate and load for incremental data, identify a watermark and only process incremental data.\n- You may have use cases that requiring record-level inserts, updates, and deletes for privacy regulations and simplified pipelines; simplified file management and near-real-time data access; or simplified change data capture (CDC) data pipeline development. We recommend using purposeful tools based on your use case. Amazon Web Services offers the options to use [Apache HUDI](https://hudi.apache.org/) with [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi.html) and [Amazon Web Services Glue](https://aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector/).\n\n\n#### **Migrate stored procedures**\n\n\nIn this section, we share best practices for stored procedure migration from Greenplum to [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail). Data processing pipelines with complex business logic often use stored procedures to perform the data transformation. We advise using big data processing like Amazon Web Services Glue or [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) to modernize your extract, transform, and load (ETL) jobs. For more information, check out [Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift](https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/). For time-sensitive migration to cloud-native data warehouses like [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail), redesigning and developing the entire pipeline in a cloud-native ETL tool might be time-consuming. Therefore, migrating the stored procedures from Greenplum to [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) [stored procedures](https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-overview.html) can be the right choice.\n\nFor a successful migration, make sure to follow [Amazon Redshift stored procedure best practices](https://aws.amazon.com/blogs/big-data/bringing-your-stored-procedures-to-amazon-redshift/):\n\n- Specify the schema name while creating a stored procedure. This helps facilitate schema-level security and you can enforce grants or revoke access control.\n- To prevent naming conflicts, we recommend naming procedures using the prefix ```sp_```. [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) reserves the ```sp_ ```prefix exclusively for stored procedures. By prefixing your procedure names with ```sp_```, you ensure that your procedure name won’t conflict with any existing or future [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) procedure names.\n- Qualify your database objects with the schema name in the stored procedure.\n- Follow the minimal required access rule and revoke unwanted access. For similar implementation, make sure the stored procedure run permission is not open to ALL.\n- The SECURITY attribute controls a procedure’s privileges to access database objects. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default. For more information, refer to [Security and privileges for stored procedures](https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-security-and-privileges.html).\n- Managing transactions when it comes to stored procedures are important. For more information, refer to [Managing transactions](https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-transaction-management.html).\n- TRUNCATE issues a commit implicitly inside a stored procedure. It interferes with the transaction block by committing the current transaction and creating a new one. Exercise caution while using TRUNCATE to ensure it never breaks the atomicity of the transaction. This also applies for COMMIT and ROLLBACK.\n- Adhere to cursor constraints and understand [performance considerations while using cursor](https://docs.aws.amazon.com/redshift/latest/dg/declare.html#declare-constraints). You should use set-based SQL logic and temporary tables while processing large datasets.\n- Avoid hardcoding in stored procedures. Use [dynamic SQL](https://docs.aws.amazon.com/redshift/latest/dg/c_PLpgSQL-statements.html#r_PLpgSQL-dynamic-sql) to construct SQL queries dynamically at runtime. Ensure appropriate logging and error handling of the dynamic SQL.\n- For exception handling, you can write RAISE statements as part of the stored procedure code. For example, you can raise an exception with a custom message or insert a record into a logging table. For unhandled exceptions like WHEN OTHERS, use built-in functions like SQLERRM or SQLSTATE to pass it on to the calling application or program. As of this writing, [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) limits calling a stored procedure from the exception block.\n\n\n#### **Sequences**\n\nYou can use [IDENTITY](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_examples.html) columns, system timestamps, or epoch time as an option to ensure uniqueness. The IDENTITY column or a timestamp-based solution might have sparse values, so if you need a continuous number sequence, you need to use dedicated number tables. You can also use of the [RANK()](https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html) or [ROW_NUMBER()](https://docs.aws.amazon.com/redshift/latest/dg/r_WF_ROW_NUMBER.html) [window function](https://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html) over the entire set. Alternatively, get the high-water mark from the existing ID column from the table and increment the values while inserting records.\n\n\n#### **Character datatype length**\n\n\nGreenplum char and varchar data type length is specified in terms of character length, including multi-byte ones. [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) [character types](https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html) are defined in terms of bytes. For table columns using multi-byte character sets in Greenplum, the converted table column in [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) should allocate adequate storage to the actual byte size of the source data.\n\nAn easy workaround is to set the [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) character column length to four times larger than the corresponding Greenplum column length.\n\nA best practice is to use the [smallest possible](https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-smallest-column-size.html) column size. [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) doesn’t allocate storage space according to the length of the attribute; it allocates storage according to the real length of the stored string. However, at runtime, while processing queries, [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) allocates memory according to the length of the attribute. Therefore, not setting a default size of four times greater helps from a performance perspective.\n\nAn efficient solution is to analyze production datasets and determine the maximum byte size length of the Greenplum character columns. Add a 20% buffer to support future incremental growth on the table.\n\nTo arrive at the actual byte size length of an existing column, run the Greenplum data structure character utility from the Amazon Web Services Samples [GitHub repo](https://github.com/aws-samples/gp-rs-dataprofiling).\n\n\n#### **Numeric precision and scale**\n\nThe [Amazon Redshift numeric data type](https://docs.aws.amazon.com/redshift/latest/dg/r_Numeric_types201.html) has a limit to store up to maximum precision of 38, whereas in a Greenplum database, you can define a numeric column without any defined length.\n\nAnalyze your production datasets and determine numeric overflow candidates using the Greenplum data structure numeric utility from the Amazon Web Services Samples [GitHub repo](https://github.com/aws-samples/gp-rs-dataprofiling). For numeric data, you have options to tackle this based on your use case. For numbers with a decimal part, you have the option to round the data based on the data type without any data loss in the whole number part. For future reference, you can a keep copy of the column in VARCHAR or store in an S3 data lake. If you see an extremely small percentage of an outlier of overflow data, clean up the source data for quality data migration.\n\n\n#### **SQL queries and functions**\n\nWhile converting SQL scripts or stored procedures to [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail), if you encounter unsupported functions, database objects, or code blocks for which you might have to rewrite the query, create [user-defined functions](https://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html) (UDFs), or redesign. You can create a custom scalar UDF using either a SQL SELECT clause or a Python program. The new function is stored in the database and is available for any user with sufficient privileges to run. You run a custom scalar UDF in much the same way as you run existing [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) functions to match any functionality of legacy databases. The following are some examples of alternate query statements and ways to achieve specific aggregations that might be required during a code rewrite.\n\n\n#### **AGE**\n\nThe Greenplum function [AGE ()](https://www.postgresql.org/docs/8.3/functions-datetime.html) returns an interval subtracting from the current date. You could accomplish the same using a subset of [MONTHS_BETWEEN()](https://docs.aws.amazon.com/redshift/latest/dg/r_MONTHS_BETWEEN_function.html), [ADD_MONTH()](https://docs.aws.amazon.com/redshift/latest/dg/r_ADD_MONTHS.html), [DATEDIFF()](https://docs.aws.amazon.com/redshift/latest/dg/r_DATEDIFF_function.html), and [TRUNC()](https://docs.aws.amazon.com/redshift/latest/dg/r_TRUNC.html) functions based on your use case.\n\nThe following example [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) query calculates the gap between the date 2001-04-10 and 1957-06-13 in terms of year, month, and days. You can apply this to any date column in a table.\n\n```\\nselect\\n\\ttrunc(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))/ 12) years,\\n\\tmod(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4, 12) months,\\n\\t'2001-04-10'::date -add_months('1957-06-13'::date,\\n\\ttrunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4) days;\\n```\n\n![image.png](https://dev-media.amazoncloud.cn/47f92b19bfca406bb432421da461db0f_image.png)\n\n##### **COUNT**\n\nIf you have a use case to get distinct aggregation in the [Count()](https://docs.aws.amazon.com/redshift/latest/dg/r_WF_COUNT.html) window function, you could accomplish the same using a combination of the [Dense_Rank](https://docs.aws.amazon.com/redshift/latest/dg/r_WF_DENSE_RANK.html) and [Max()](https://docs.aws.amazon.com/redshift/latest/dg/r_MAX.html) window functions.\n\nThe following example [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) query calculates the distinct item count for a given date of sale:\n\n```\\nselect\\n\\tsale_date,\\n\\titem,\\n\\tcost,\\n\\tmax(densernk) over (partition by sale_date order by item rows between unbounded preceding and unbounded following) as distinct_itemcount\\nfrom\\n\\t(\\n\\tselect\\n\\t\\t*,dense_rank() over (partition by sale_date order by item) as densernk\\n\\tfrom\\n\\t\\ttestaggr)\\norder by\\n\\tsale_date,\\n\\titem,\\ncost;\\n```\n\n![image.png](https://dev-media.amazoncloud.cn/b6558e608e354b1cac0702fc75fa0362_image.png)\n\n\n##### **ORDER BY**\n\n[Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) aggregate [window functions](https://docs.aws.amazon.com/redshift/latest/dg/r_Window_function_synopsis.html) with an ORDER BY clause require a mandatory frame.\n\nThe following example [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) query creates a cumulative sum of cost by sale date and orders the results by item within the partition:\n\n\n```\\nselect\\n\\t*,\\n sum(cost) over (partition by sale_date\\norder by\\n\\titem rows between unbounded preceding and unbounded following) as total_cost_by_date\\nfrom\\n\\ttestaggr\\norder by\\n\\tsale_date,\\n\\titem,\\n\\tcost;\\n```\n\n![image.png](https://dev-media.amazoncloud.cn/2db27a6d71b84c7c8ad6dbf4a0f62683_image.png)\n\n\n\n##### **STRING_AGG**\n\nIn Greenplum, STRING_AGG() is an aggregate function, which is used to concatenate a list of strings. In [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail), use the [LISTAGG()](https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html) function.\n\nThe following example [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) query returns a semicolon-separated list of email addresses for each department:\n\n```\\nselect\\n\\tdept,\\n\\tlistagg(email_address,';') \\nwithin group (order by dept) as email_list\\nfrom\\n\\temployee_contact\\ngroup by\\n\\tdept\\norder by\\n\\tdept;\\n```\n\n![image.png](https://dev-media.amazoncloud.cn/2a0bb26cdff449979d51d4798d6d2977_image.png)\n\n##### **ARRAY_AGG**\n\nIn Greenplum, ARRAY_AGG() is an aggregate function that takes a set of values as input and returns an array. In [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail), use a combination of the [LISTAGG](https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html) and [SPLIT_TO_ARRAY()](https://docs.aws.amazon.com/redshift/latest/dg/split_to_array.html) functions. The [SPLIT_TO_ARRAY()](https://docs.aws.amazon.com/redshift/latest/dg/split_to_array.html) function returns a SUPER datatype.\n\nThe following example [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) query returns an array of email addresses for each department:\n\n```\\nselect\\n\\tdept,\\n\\tSPLIT_TO_ARRAY(email_list,\\n\\t';') email_array\\nfrom\\n\\t(\\n\\tselect\\n\\t\\tdept,\\n\\t\\tlistagg(email_address,\\n\\t\\t';') \\nwithin group (\\n\\t\\torder by dept) as email_list\\n\\tfrom\\n\\t\\temployee_contact\\n\\tgroup by\\n\\t\\tdept\\n\\torder by\\n\\t\\tdept);\\n\\n```\n\n![image.png](https://dev-media.amazoncloud.cn/20c464d8196041f2949e27c5d0df4fd0_image.png)\n\nTo retrieve array elements from a SUPER expression, you can use the [SUBARRAY()](https://docs.aws.amazon.com/redshift/latest/dg/r_subarray.html) function:\n\n```\\n select\\n\\tSUBARRAY( email_array,\\n\\t0,\\n\\t1 ) first_element,\\n\\tSUBARRAY( email_array,\\n\\t1,\\n\\t1) second_element,\\n\\tSUBARRAY( email_array,\\n\\t0 ) all_element\\nfrom\\n\\ttestarray\\nwhere\\n\\tdept = 'HR';\\n\\n```\n\nhttps://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/06/28/BDB-1418-image031.png\n\n\n##### **UNNEST**\n\n\nIn Greenplum, you can use the UNNEST function to split an array and convert the array elements into a set of rows. In [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail), you can use PartiQL syntax to iterate over SUPER arrays. For more information, refer to [Querying semistructured data](https://docs.aws.amazon.com/redshift/latest/dg/query-super.html).\n\n```\\ncreate temp table unnesttest as\\nselect\\n\\tjson_parse('{\\"scalar_array\\": [0,1,2,3,4,5.5,6,7.9,8,9]}') as data;\\n\\nselect\\n\\telement\\nfrom\\n\\tunnesttest as un,\\n\\tun.data.scalar_array as element at index;\\n```\n\n![image.png](https://dev-media.amazoncloud.cn/ac491989acae435c908f1bf01f55d3c0_image.png)\n\n##### **WHERE**\n\nYou can’t use a window function in the WHERE clause of a query in [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail). Instead, construct the query using the WITH clause and then refer the calculated column in the WHERE clause.\n\nThe following example [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) query returns the sale date, item, and cost from a table for the sales dates where the total sale is more than 100:\n\n```\\nwith aggrcost as (\\nselect\\n\\tsale_date ,\\n\\titem,\\n\\tcost,\\n\\tsum(cost) over (partition by sale_date) as total_sale\\nfrom\\n\\ttestaggr )\\nselect\\n\\t*\\nfrom\\n\\taggrcost\\nwhere\\n\\ttotal_sale > 100;\\n```\n\n\n![image.png](https://dev-media.amazoncloud.cn/77aefaccef1041159d925ede920084e8_image.png)\n\nRefer to the following table for additional Greenplum date/time functions along with the [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) equivalent to accelerate you code migration.\n\n![image.png](https://dev-media.amazoncloud.cn/72e35dbf8fbd4ceaa0d770795c766ae3_image.png)\n\n![image.png](https://dev-media.amazoncloud.cn/327c8caf0081426ebe27f0ac1ed11538_image.png)\n![image.png](https://dev-media.amazoncloud.cn/3b93804518c049018bc5720da5e84699_image.png)\n\n\n#### **[Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) utility for troubleshooting or running diagnostics for the cluster**\n\n\nThe [Amazon Redshift Utilities GitHub repo](https://github.com/awslabs/amazon-redshift-utils) contains a set of utilities to accelerate troubleshooting or analysis on [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail). Such utilities consist of queries, views, and scripts. They are not deployed by default onto [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) clusters. The best practice is to deploy the needed views into the admin schema.\n\n\n#### **Conclusion**\n\n\nIn this post, we covered prescriptive guidance around data types, functions, and stored procedures to accelerate the migration process from Greenplum to [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail). Although this post describes modernizing and moving to a cloud warehouse, you should be augmenting this transformation process towards a full-fledged [modern data architecture](https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/). The Amazon Web Services Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purposeful data stores like [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail), [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail), [Amazon Timestream](https://aws.amazon.com/timestream/), and others based on your use case.\n\n\n\n##### **About the Authors**\n\n\n![image.png](https://dev-media.amazoncloud.cn/0364a28b989948c389792d1fd9aef6ff_image.png)\n\n**Suresh Patnam** is a Principal Solutions Architect at Amazon Web Services. He is passionate about helping businesses of all sizes transforming into fast-moving digital organizations focusing on big data, data lakes, and AI/ML. Suresh holds a MBA degree from Duke University- Fuqua School of Business and MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.\n\n![image.png](https://dev-media.amazoncloud.cn/ef5e51fa9372474ebde872fb3f8853bb_image.png)\n\n**Arunabha Datta** is a Sr. Data Architect at Amazon Web Services Professional Services. He collaborates with customers and partners to architect and implement modern data architecture using Amazon Web Services Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.","render":"<p>In this second post of a multi-part series, we share best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-create.html\\" target=\\"_blank\\">stored procedures</a>, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns. You can check out the first post of this series for guidance on planning, running, and validation of a large-scale data warehouse migration from Greenplum to [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) using <a href=\\"https://aws.amazon.com/dms/schema-conversion-tool/\\" target=\\"_blank\\">Amazon Web Services Schema Conversion Tool</a> (Amazon Web Services SCT).</p>\\n<h4><a id=\\"Choose_your_optimal_Amazon_Redshift_cluster_4\\"></a><strong>Choose your optimal Amazon Redshift cluster</strong></h4>\\n<p>Amazon Redshift has two types of clusters: provisioned and <a href=\\"https://aws.amazon.com/redshift/redshift-serverless/\\" target=\\"_blank\\">serverless</a>. For provisioned clusters, you need to set up the same with required compute resources. [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) Serverless can run high-performance analytics in the cloud at any scale. For more information, refer to <a href=\\"https://aws.amazon.com/blogs/aws/introducing-amazon-redshift-serverless-run-analytics-at-any-scale-without-having-to-manage-infrastructure/\\" target=\\"_blank\\">Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure</a>.</p>\\n<p>An <a href=\\"https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html\\" target=\\"_blank\\">Amazon Redshift cluster</a> consists of nodes. Each cluster has a leader node and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query run plans. The leader node then coordinates the parallel run of these plans with the compute nodes and aggregates the intermediate results from these nodes. It then returns the results to the client applications.</p>\\n<p>When determining your type of cluster, consider the following:</p>\n<ul>\\n<li>Estimate the size of the input data compressed, vCPU, and performance. As of this writing, we recommend the <a href=\\"https://aws.amazon.com/redshift/features/ra3/\\" target=\\"_blank\\">Amazon Redshift RA3</a> instance with managed storage, which scales compute and storage independently for fast query performance.</li>\\n<li>Amazon Redshift provides an automated “Help me choose” cluster based on the size of your data.</li>\n<li>A main advantage of a cloud Amazon Redshift data warehouse is that you’re no longer stuck with hardware and commodities like old guard data warehouses. For faster innovation, you have the option to try different cluster options and choose the optimized one in terms of performance and cost.</li>\n<li>At the time of development or pilot, you can usually start with a smaller number of nodes. As you move to production, you can adjust the number of nodes based on your usage pattern. When right-sizing your clusters, we recommend choosing the reserved instance type to cut down the cost even further. The public-facing utility <a href=\\"https://github.com/awslabs/amazon-redshift-utils/tree/master/src/SimpleReplay\\" target=\\"_blank\\">Simple Replay</a> can help you determine performance against different cluster types and sizes by replaying the customer workload. For provisioned clusters, if you’re planning to use the recommended RA3 instance, you can <a href=\\"https://github.com/awslabs/amazon-redshift-utils/tree/master/src/SimpleReplay\\" target=\\"_blank\\">compare different node types</a> to determine the right instance type.</li>\\n<li>Based on your workload pattern, Amazon Redshift supports <a href=\\"https://docs.aws.amazon.com/redshift/latest/mgmt/managing-cluster-operations.html\\" target=\\"_blank\\">resize</a>, <a href=\\"https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/\\" target=\\"_blank\\">pause and stop</a>, and <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/concurrency-scaling.html\\" target=\\"_blank\\">concurrency scaling</a> of the cluster. [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/c_workload_mngmt_classification.html\\" target=\\"_blank\\">workload management (WLM)</a> enables effective and flexible management of memory and query concurrency.</li>\\n</ul>\n<h4><a id=\\"Create_data_extraction_tasks_with_Amazon_Web_Services_SCT_20\\"></a><strong>Create data extraction tasks with Amazon Web Services SCT</strong></h4>\\n<p>With Amazon Web Services SCT extraction agents, you can migrate your source tables in parallel. These extraction agents authenticate using a valid user on the data source, allowing you to adjust the resources available for that user during the extraction. Amazon Web Services SCT agents process the data locally and upload it to <a href=\\"http://aws.amazon.com/s3\\" target=\\"_blank\\">Amazon Simple Storage Service</a> ([Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail)) through the network (via <a href=\\"https://aws.amazon.com/directconnect/\\" target=\\"_blank\\">AWS Direct Connect</a>). We recommend having a consistent network bandwidth between your Greenplum machine where the Amazon Web Services SCT agent is installed and your Amazon Web Services Region.</p>\\n<p>If you have tables around 20 million rows or 1 TB in size, you can use the virtual partitioning feature on Amazon Web Services SCT to extract data from those tables. This creates several sub-tasks and parallelizes the data extraction process for this table. Therefore, we recommend creating two groups of tasks for each schema that you migrate: one for small tables and one for large tables using virtual partitions.</p>\n<p>For more information, refer to <a href=\\"https://docs.aws.amazon.com/SchemaConversionTool/latest/userguide/agents.dw.html#agents.Tasks\\" target=\\"_blank\\">Creating, running, and monitoring an Amazon Web Services SCT data extraction task</a>.</p>\\n<h4><a id=\\"Data_architecture_30\\"></a><strong>Data architecture</strong></h4>\\n<p>To simplify and modernize your data architecture, consider the following:</p>\n<ul>\\n<li>Establish accountability and authority to enforce enterprise data standards and policies.</li>\n<li>Formalize the data and analytics operating model between enterprise and business units and functions.</li>\n<li>Simplify the data technology ecosystem through rationalization and modernization of data assets and tools or technology.</li>\n<li>Develop organizational constructs that facilitate more robust integration of the business and delivery teams, and build data-oriented products and solutions to address the business problems and opportunities throughout the lifecycle.</li>\n<li>Back up the data periodically so that if something is wrong, you have the ability to replay.</li>\n<li>During planning, design, execution, and throughout implementation and maintenance, ensure data quality management is added to achieve the desired outcome.</li>\n<li>Simple is the key to an easy, fast, intuitive, and low-cost solution. Simple scales much better than complex. Simple makes it possible to think big (Invent and Simplify is another Amazon leadership principle). Simplify the legacy process by migrating only the necessary data used in tables and schemas. For example, if you’re performing truncate and load for incremental data, identify a watermark and only process incremental data.</li>\n<li>You may have use cases that requiring record-level inserts, updates, and deletes for privacy regulations and simplified pipelines; simplified file management and near-real-time data access; or simplified change data capture (CDC) data pipeline development. We recommend using purposeful tools based on your use case. Amazon Web Services offers the options to use <a href=\\"https://hudi.apache.org/\\" target=\\"_blank\\">Apache HUDI</a> with <a href=\\"https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi.html\\" target=\\"_blank\\">Amazon EMR</a> and <a href=\\"https://aws.amazon.com/blogs/big-data/writing-to-apache-hudi-tables-using-aws-glue-connector/\\" target=\\"_blank\\">Amazon Web Services Glue</a>.</li>\\n</ul>\n<h4><a id=\\"Migrate_stored_procedures_45\\"></a><strong>Migrate stored procedures</strong></h4>\\n<p>In this section, we share best practices for stored procedure migration from Greenplum to Amazon Redshift. Data processing pipelines with complex business logic often use stored procedures to perform the data transformation. We advise using big data processing like Amazon Web Services Glue or Amazon EMR to modernize your extract, transform, and load (ETL) jobs. For more information, check out <a href=\\"https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/\\" target=\\"_blank\\">Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift</a>. For time-sensitive migration to cloud-native data warehouses like [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail), redesigning and developing the entire pipeline in a cloud-native ETL tool might be time-consuming. Therefore, migrating the stored procedures from Greenplum to [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-overview.html\\" target=\\"_blank\\">stored procedures</a> can be the right choice.</p>\\n<p>For a successful migration, make sure to follow <a href=\\"https://aws.amazon.com/blogs/big-data/bringing-your-stored-procedures-to-amazon-redshift/\\" target=\\"_blank\\">Amazon Redshift stored procedure best practices</a>:</p>\\n<ul>\\n<li>Specify the schema name while creating a stored procedure. This helps facilitate schema-level security and you can enforce grants or revoke access control.</li>\n<li>To prevent naming conflicts, we recommend naming procedures using the prefix <code>sp_</code>. [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) reserves the <code>sp_ </code>prefix exclusively for stored procedures. By prefixing your procedure names with <code>sp_</code>, you ensure that your procedure name won’t conflict with any existing or future [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) procedure names.</li>\\n<li>Qualify your database objects with the schema name in the stored procedure.</li>\n<li>Follow the minimal required access rule and revoke unwanted access. For similar implementation, make sure the stored procedure run permission is not open to ALL.</li>\n<li>The SECURITY attribute controls a procedure’s privileges to access database objects. When you create a stored procedure, you can set the SECURITY attribute to either DEFINER or INVOKER. If you specify SECURITY INVOKER, the procedure uses the privileges of the user invoking the procedure. If you specify SECURITY DEFINER, the procedure uses the privileges of the owner of the procedure. INVOKER is the default. For more information, refer to <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-security-and-privileges.html\\" target=\\"_blank\\">Security and privileges for stored procedures</a>.</li>\\n<li>Managing transactions when it comes to stored procedures are important. For more information, refer to <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-transaction-management.html\\" target=\\"_blank\\">Managing transactions</a>.</li>\\n<li>TRUNCATE issues a commit implicitly inside a stored procedure. It interferes with the transaction block by committing the current transaction and creating a new one. Exercise caution while using TRUNCATE to ensure it never breaks the atomicity of the transaction. This also applies for COMMIT and ROLLBACK.</li>\n<li>Adhere to cursor constraints and understand <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/declare.html#declare-constraints\\" target=\\"_blank\\">performance considerations while using cursor</a>. You should use set-based SQL logic and temporary tables while processing large datasets.</li>\\n<li>Avoid hardcoding in stored procedures. Use <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/c_PLpgSQL-statements.html#r_PLpgSQL-dynamic-sql\\" target=\\"_blank\\">dynamic SQL</a> to construct SQL queries dynamically at runtime. Ensure appropriate logging and error handling of the dynamic SQL.</li>\\n<li>For exception handling, you can write RAISE statements as part of the stored procedure code. For example, you can raise an exception with a custom message or insert a record into a logging table. For unhandled exceptions like WHEN OTHERS, use built-in functions like SQLERRM or SQLSTATE to pass it on to the calling application or program. As of this writing, Amazon Redshift limits calling a stored procedure from the exception block.</li>\n</ul>\\n<h4><a id=\\"Sequences_64\\"></a><strong>Sequences</strong></h4>\\n<p>You can use <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_examples.html\\" target=\\"_blank\\">IDENTITY</a> columns, system timestamps, or epoch time as an option to ensure uniqueness. The IDENTITY column or a timestamp-based solution might have sparse values, so if you need a continuous number sequence, you need to use dedicated number tables. You can also use of the <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_WF_RANK.html\\" target=\\"_blank\\">RANK()</a> or <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_WF_ROW_NUMBER.html\\" target=\\"_blank\\">ROW_NUMBER()</a> <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html\\" target=\\"_blank\\">window function</a> over the entire set. Alternatively, get the high-water mark from the existing ID column from the table and increment the values while inserting records.</p>\\n<h4><a id=\\"Character_datatype_length_69\\"></a><strong>Character datatype length</strong></h4>\\n<p>Greenplum char and varchar data type length is specified in terms of character length, including multi-byte ones. Amazon Redshift <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html\\" target=\\"_blank\\">character types</a> are defined in terms of bytes. For table columns using multi-byte character sets in Greenplum, the converted table column in [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) should allocate adequate storage to the actual byte size of the source data.</p>\\n<p>An easy workaround is to set the Amazon Redshift character column length to four times larger than the corresponding Greenplum column length.</p>\n<p>A best practice is to use the <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-smallest-column-size.html\\" target=\\"_blank\\">smallest possible</a> column size. [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) doesn’t allocate storage space according to the length of the attribute; it allocates storage according to the real length of the stored string. However, at runtime, while processing queries, [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) allocates memory according to the length of the attribute. Therefore, not setting a default size of four times greater helps from a performance perspective.</p>\\n<p>An efficient solution is to analyze production datasets and determine the maximum byte size length of the Greenplum character columns. Add a 20% buffer to support future incremental growth on the table.</p>\n<p>To arrive at the actual byte size length of an existing column, run the Greenplum data structure character utility from the Amazon Web Services Samples <a href=\\"https://github.com/aws-samples/gp-rs-dataprofiling\\" target=\\"_blank\\">GitHub repo</a>.</p>\\n<h4><a id=\\"Numeric_precision_and_scale_83\\"></a><strong>Numeric precision and scale</strong></h4>\\n<p>The <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_Numeric_types201.html\\" target=\\"_blank\\">Amazon Redshift numeric data type</a> has a limit to store up to maximum precision of 38, whereas in a Greenplum database, you can define a numeric column without any defined length.</p>\\n<p>Analyze your production datasets and determine numeric overflow candidates using the Greenplum data structure numeric utility from the Amazon Web Services Samples <a href=\\"https://github.com/aws-samples/gp-rs-dataprofiling\\" target=\\"_blank\\">GitHub repo</a>. For numeric data, you have options to tackle this based on your use case. For numbers with a decimal part, you have the option to round the data based on the data type without any data loss in the whole number part. For future reference, you can a keep copy of the column in VARCHAR or store in an S3 data lake. If you see an extremely small percentage of an outlier of overflow data, clean up the source data for quality data migration.</p>\\n<h4><a id=\\"SQL_queries_and_functions_90\\"></a><strong>SQL queries and functions</strong></h4>\\n<p>While converting SQL scripts or stored procedures to Amazon Redshift, if you encounter unsupported functions, database objects, or code blocks for which you might have to rewrite the query, create <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/user-defined-functions.html\\" target=\\"_blank\\">user-defined functions</a> (UDFs), or redesign. You can create a custom scalar UDF using either a SQL SELECT clause or a Python program. The new function is stored in the database and is available for any user with sufficient privileges to run. You run a custom scalar UDF in much the same way as you run existing [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) functions to match any functionality of legacy databases. The following are some examples of alternate query statements and ways to achieve specific aggregations that might be required during a code rewrite.</p>\\n<h4><a id=\\"AGE_95\\"></a><strong>AGE</strong></h4>\\n<p>The Greenplum function <a href=\\"https://www.postgresql.org/docs/8.3/functions-datetime.html\\" target=\\"_blank\\">AGE ()</a> returns an interval subtracting from the current date. You could accomplish the same using a subset of <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_MONTHS_BETWEEN_function.html\\" target=\\"_blank\\">MONTHS_BETWEEN()</a>, <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_ADD_MONTHS.html\\" target=\\"_blank\\">ADD_MONTH()</a>, <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_DATEDIFF_function.html\\" target=\\"_blank\\">DATEDIFF()</a>, and <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_TRUNC.html\\" target=\\"_blank\\">TRUNC()</a> functions based on your use case.</p>\\n<p>The following example Amazon Redshift query calculates the gap between the date 2001-04-10 and 1957-06-13 in terms of year, month, and days. You can apply this to any date column in a table.</p>\n<pre><code class=\\"lang-\\">select\\n\\ttrunc(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))/ 12) years,\\n\\tmod(trunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4, 12) months,\\n\\t'2001-04-10'::date -add_months('1957-06-13'::date,\\n\\ttrunc(months_between('2001-04-10'::date, '1957-06-13'::date))::int4) days;\\n</code></pre>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/47f92b19bfca406bb432421da461db0f_image.png\\" alt=\\"image.png\\" /></p>\n<h5><a id=\\"COUNT_111\\"></a><strong>COUNT</strong></h5>\\n<p>If you have a use case to get distinct aggregation in the <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_WF_COUNT.html\\" target=\\"_blank\\">Count()</a> window function, you could accomplish the same using a combination of the <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_WF_DENSE_RANK.html\\" target=\\"_blank\\">Dense_Rank</a> and <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_MAX.html\\" target=\\"_blank\\">Max()</a> window functions.</p>\\n<p>The following example Amazon Redshift query calculates the distinct item count for a given date of sale:</p>\n<pre><code class=\\"lang-\\">select\\n\\tsale_date,\\n\\titem,\\n\\tcost,\\n\\tmax(densernk) over (partition by sale_date order by item rows between unbounded preceding and unbounded following) as distinct_itemcount\\nfrom\\n\\t(\\n\\tselect\\n\\t\\t*,dense_rank() over (partition by sale_date order by item) as densernk\\n\\tfrom\\n\\t\\ttestaggr)\\norder by\\n\\tsale_date,\\n\\titem,\\ncost;\\n</code></pre>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/b6558e608e354b1cac0702fc75fa0362_image.png\\" alt=\\"image.png\\" /></p>\n<h5><a id=\\"ORDER_BY_138\\"></a><strong>ORDER BY</strong></h5>\\n<p>Amazon Redshift aggregate <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_Window_function_synopsis.html\\" target=\\"_blank\\">window functions</a> with an ORDER BY clause require a mandatory frame.</p>\\n<p>The following example Amazon Redshift query creates a cumulative sum of cost by sale date and orders the results by item within the partition:</p>\n<pre><code class=\\"lang-\\">select\\n\\t*,\\n sum(cost) over (partition by sale_date\\norder by\\n\\titem rows between unbounded preceding and unbounded following) as total_cost_by_date\\nfrom\\n\\ttestaggr\\norder by\\n\\tsale_date,\\n\\titem,\\n\\tcost;\\n</code></pre>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/2db27a6d71b84c7c8ad6dbf4a0f62683_image.png\\" alt=\\"image.png\\" /></p>\n<h5><a id=\\"STRING_AGG_163\\"></a><strong>STRING_AGG</strong></h5>\\n<p>In Greenplum, STRING_AGG() is an aggregate function, which is used to concatenate a list of strings. In Amazon Redshift, use the <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html\\" target=\\"_blank\\">LISTAGG()</a> function.</p>\\n<p>The following example Amazon Redshift query returns a semicolon-separated list of email addresses for each department:</p>\n<pre><code class=\\"lang-\\">select\\n\\tdept,\\n\\tlistagg(email_address,';') \\nwithin group (order by dept) as email_list\\nfrom\\n\\temployee_contact\\ngroup by\\n\\tdept\\norder by\\n\\tdept;\\n</code></pre>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/2a0bb26cdff449979d51d4798d6d2977_image.png\\" alt=\\"image.png\\" /></p>\n<h5><a id=\\"ARRAY_AGG_184\\"></a><strong>ARRAY_AGG</strong></h5>\\n<p>In Greenplum, ARRAY_AGG() is an aggregate function that takes a set of values as input and returns an array. In Amazon Redshift, use a combination of the <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html\\" target=\\"_blank\\">LISTAGG</a> and <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/split_to_array.html\\" target=\\"_blank\\">SPLIT_TO_ARRAY()</a> functions. The <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/split_to_array.html\\" target=\\"_blank\\">SPLIT_TO_ARRAY()</a> function returns a SUPER datatype.</p>\\n<p>The following example Amazon Redshift query returns an array of email addresses for each department:</p>\n<pre><code class=\\"lang-\\">select\\n\\tdept,\\n\\tSPLIT_TO_ARRAY(email_list,\\n\\t';') email_array\\nfrom\\n\\t(\\n\\tselect\\n\\t\\tdept,\\n\\t\\tlistagg(email_address,\\n\\t\\t';') \\nwithin group (\\n\\t\\torder by dept) as email_list\\n\\tfrom\\n\\t\\temployee_contact\\n\\tgroup by\\n\\t\\tdept\\n\\torder by\\n\\t\\tdept);\\n\\n</code></pre>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/20c464d8196041f2949e27c5d0df4fd0_image.png\\" alt=\\"image.png\\" /></p>\n<p>To retrieve array elements from a SUPER expression, you can use the <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/r_subarray.html\\" target=\\"_blank\\">SUBARRAY()</a> function:</p>\\n<pre><code class=\\"lang-\\"> select\\n\\tSUBARRAY( email_array,\\n\\t0,\\n\\t1 ) first_element,\\n\\tSUBARRAY( email_array,\\n\\t1,\\n\\t1) second_element,\\n\\tSUBARRAY( email_array,\\n\\t0 ) all_element\\nfrom\\n\\ttestarray\\nwhere\\n\\tdept = 'HR';\\n\\n</code></pre>\\n<p>https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/06/28/BDB-1418-image031.png</p>\n<h5><a id=\\"UNNEST_236\\"></a><strong>UNNEST</strong></h5>\\n<p>In Greenplum, you can use the UNNEST function to split an array and convert the array elements into a set of rows. In Amazon Redshift, you can use PartiQL syntax to iterate over SUPER arrays. For more information, refer to <a href=\\"https://docs.aws.amazon.com/redshift/latest/dg/query-super.html\\" target=\\"_blank\\">Querying semistructured data</a>.</p>\\n<pre><code class=\\"lang-\\">create temp table unnesttest as\\nselect\\n\\tjson_parse('{&quot;scalar_array&quot;: [0,1,2,3,4,5.5,6,7.9,8,9]}') as data;\\n\\nselect\\n\\telement\\nfrom\\n\\tunnesttest as un,\\n\\tun.data.scalar_array as element at index;\\n</code></pre>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/ac491989acae435c908f1bf01f55d3c0_image.png\\" alt=\\"image.png\\" /></p>\n<h5><a id=\\"WHERE_255\\"></a><strong>WHERE</strong></h5>\\n<p>You can’t use a window function in the WHERE clause of a query in Amazon Redshift. Instead, construct the query using the WITH clause and then refer the calculated column in the WHERE clause.</p>\n<p>The following example Amazon Redshift query returns the sale date, item, and cost from a table for the sales dates where the total sale is more than 100:</p>\n<pre><code class=\\"lang-\\">with aggrcost as (\\nselect\\n\\tsale_date ,\\n\\titem,\\n\\tcost,\\n\\tsum(cost) over (partition by sale_date) as total_sale\\nfrom\\n\\ttestaggr )\\nselect\\n\\t*\\nfrom\\n\\taggrcost\\nwhere\\n\\ttotal_sale &gt; 100;\\n</code></pre>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/77aefaccef1041159d925ede920084e8_image.png\\" alt=\\"image.png\\" /></p>\n<p>Refer to the following table for additional Greenplum date/time functions along with the Amazon Redshift equivalent to accelerate you code migration.</p>\n<p><img src=\\"https://dev-media.amazoncloud.cn/72e35dbf8fbd4ceaa0d770795c766ae3_image.png\\" alt=\\"image.png\\" /></p>\n<p><img src=\\"https://dev-media.amazoncloud.cn/327c8caf0081426ebe27f0ac1ed11538_image.png\\" alt=\\"image.png\\" /><br />\\n<img src=\\"https://dev-media.amazoncloud.cn/3b93804518c049018bc5720da5e84699_image.png\\" alt=\\"image.png\\" /></p>\n<h4><a id=\\"Amazon_Redshift_utility_for_troubleshooting_or_running_diagnostics_for_the_cluster_289\\"></a><strong>Amazon Redshift utility for troubleshooting or running diagnostics for the cluster</strong></h4>\\n<p>The <a href=\\"https://github.com/awslabs/amazon-redshift-utils\\" target=\\"_blank\\">Amazon Redshift Utilities GitHub repo</a> contains a set of utilities to accelerate troubleshooting or analysis on [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail). Such utilities consist of queries, views, and scripts. They are not deployed by default onto [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) clusters. The best practice is to deploy the needed views into the admin schema.</p>\\n<h4><a id=\\"Conclusion_295\\"></a><strong>Conclusion</strong></h4>\\n<p>In this post, we covered prescriptive guidance around data types, functions, and stored procedures to accelerate the migration process from Greenplum to Amazon Redshift. Although this post describes modernizing and moving to a cloud warehouse, you should be augmenting this transformation process towards a full-fledged <a href=\\"https://aws.amazon.com/big-data/datalakes-and-analytics/modern-data-architecture/\\" target=\\"_blank\\">modern data architecture</a>. The Amazon Web Services Cloud enables you to be more data-driven by supporting multiple use cases. For a modern data architecture, you should use purposeful data stores like [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail), [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail), <a href=\\"https://aws.amazon.com/timestream/\\" target=\\"_blank\\">Amazon Timestream</a>, and others based on your use case.</p>\\n<h5><a id=\\"About_the_Authors_302\\"></a><strong>About the Authors</strong></h5>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/0364a28b989948c389792d1fd9aef6ff_image.png\\" alt=\\"image.png\\" /></p>\n<p><strong>Suresh Patnam</strong> is a Principal Solutions Architect at Amazon Web Services. He is passionate about helping businesses of all sizes transforming into fast-moving digital organizations focusing on big data, data lakes, and AI/ML. Suresh holds a MBA degree from Duke University- Fuqua School of Business and MS in CIS from Missouri State University. In his spare time, Suresh enjoys playing tennis and spending time with his family.</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/ef5e51fa9372474ebde872fb3f8853bb_image.png\\" alt=\\"image.png\\" /></p>\n<p><strong>Arunabha Datta</strong> is a Sr. Data Architect at Amazon Web Services Professional Services. He collaborates with customers and partners to architect and implement modern data architecture using Amazon Web Services Analytics services. In his spare time, Arunabha enjoys photography and spending time with his family.</p>\n"}
目录
亚马逊云科技解决方案 基于行业客户应用场景及技术领域的解决方案
联系亚马逊云科技专家
亚马逊云科技解决方案
基于行业客户应用场景及技术领域的解决方案
联系专家
0
目录
关闭