New – Amazon Redshift Integration with Apache Spark

{"value":"[Apache Spark](https://spark.apache.org/) is an open-source, distributed processing system commonly used for big data workloads. Spark application developers working in [Amazon EMR](https://aws.amazon.com/emr), [Amazon SageMaker](https://aws.amazon.com/sagemaker), and [AWS Glue](https://aws.amazon.com/glue) often use third-party Apache Spark connectors that allow them to read and write the data with [Amazon Redshift](https://aws.amazon.com/redshift). These third-party connectors are not regularly maintained, supported, or tested with various versions of Spark for production.\n\nToday we are announcing the general availability of **[Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) integration for Apache Spark**, which makes it easy to build and run Spark applications on [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) and Redshift Serverless, enabling customers to open up the data warehouse for a broader set of AWS analytics and machine learning (ML) solutions.\n\nWith [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) integration for Apache Spark, you can get started in seconds and effortlessly build Apache Spark applications in a variety of languages, such as Java, Scala, and Python.\n\nYour applications can read from and write to your [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) data warehouse without compromising on the performance of the applications or transactional consistency of the data, as well as performance improvements with pushdown optimizations.\n\n[Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) integration for Apache Spark builds on an [existing open source connector project](https://github.com/spark-redshift-community/spark-redshift) and enhances it for performance and security, helping customers gain up to 10x faster application performance. We thank the original contributors on the project who collaborated with us to make this happen. As we make further enhancements we will continue to contribute back into the open source project.\n\n### ++Getting Started with Spark Connector for [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail)++\nTo get started, you can go to AWS analytics and ML services, use data frame or Spark SQL code in a Spark job or Notebook to connect to the [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) data warehouse, and start running queries in seconds.\n\nIn this launch, [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) 6.9, EMR Serverless, and [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) 4.0 come with the pre-packaged connector and JDBC driver, and you can just start writing code. EMR 6.9 provides a sample notebook, and EMR Serverless provides a sample Spark Job too.\n\nFirst, you should set [AWS Identity and Access Management](https://aws.amazon.com/iam) ([AWS IAM](https://aws.amazon.com/cn/iam/?trk=cndc-detail)) authentication between Redshift and Spark, between [Amazon Simple Storage Service](https://aws.amazon.com/s3) ([Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail)) and Spark, and between Redshift and [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail). 
### ++Amazon EMR++

If you already have an Amazon Redshift data warehouse and the data available, you can create a database user and grant it the right level of permissions. To use this with Amazon EMR, you need to upgrade to Amazon EMR 6.9, which includes the packaged `spark-redshift` connector. Select the `emr-6.9.0` release when you create an EMR cluster on Amazon EC2.

![image.png](https://dev-media.amazoncloud.cn/878dc6f8590c470d8c1785181536af0b_image.png)

You can also use EMR Serverless, creating your Spark application with the `emr-6.9.0` release to run your workload.

![image.png](https://dev-media.amazoncloud.cn/7cbca1171b4e4ad899c587256dc296cf_image.png)

EMR Studio also provides an example Jupyter notebook, configured to connect to an Amazon Redshift Serverless endpoint and loaded with sample data, that you can use to get started quickly.

Here is a Scala example that builds an application with both the Spark DataFrame API and Spark SQL. It uses IAM-based credentials for connecting to Redshift and an IAM role for unloading and loading data from S3.

```
import org.apache.spark.sql.functions.col

// Create the JDBC connection URL and define the Redshift options
val jdbcURL = "jdbc:redshift:iam://<RedshiftEndpoint>:<Port>/<Database>?DbUser=<RsUser>"
val tempS3Dir = "s3://<bucket>/<temp-prefix>/"                // temporary S3 location for UNLOAD/COPY
val roleARN = "arn:aws:iam::<AccountId>:role/<RedshiftRole>"  // IAM role Redshift assumes for S3 access
val rsOptions = Map(
  "url" -> jdbcURL,
  "tempdir" -> tempS3Dir,
  "aws_iam_role" -> roleARN
)
// Reference the sales table from Redshift
val sales_df = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .options(rsOptions)
  .option("dbtable", "sales")
  .load()
sales_df.createOrReplaceTempView("sales")
// Reference the date table from Redshift
val date_df = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .options(rsOptions)
  .option("dbtable", "date")
  .load()
// Total quantity sold on 2008-01-05 using the DataFrame API
sales_df.join(date_df, sales_df("dateid") === date_df("dateid"))
  .where(col("caldate") === "2008-01-05")
  .groupBy().sum("qtysold")
  .select(col("sum(qtysold)"))
  .show()
```
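The example above reads from Redshift; the connector also supports writing a DataFrame back (the AWS Glue section below shows the same write path through DynamicFrames). The following PySpark sketch appends a small DataFrame to a hypothetical `sales_staging` table; the endpoint, bucket, and role values are placeholders, and the option names mirror the read example above, so check them against the connector version you are running.

```
# Hedged PySpark sketch: write a DataFrame back to Redshift through the connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-spark-write-example").getOrCreate()

jdbc_url = "jdbc:redshift:iam://<RedshiftEndpoint>:<Port>/<Database>?DbUser=<RsUser>"
temp_s3_dir = "s3://<bucket>/<temp-prefix>/"
redshift_iam_role = "arn:aws:iam::<AccountId>:role/<RedshiftRole>"

# A small example DataFrame to write (placeholder data matching the sample schema).
rows = [(1, "2008-01-05", 10), (2, "2008-01-05", 4)]
df = spark.createDataFrame(rows, ["salesid", "caldate", "qtysold"])

(df.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", jdbc_url)
    .option("tempdir", temp_s3_dir)
    .option("aws_iam_role", redshift_iam_role)
    .option("dbtable", "sales_staging")  # hypothetical target table
    .mode("append")                      # or "overwrite" / "error"
    .save())
```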
If Amazon Redshift and Amazon EMR are in different VPCs, you have to configure VPC peering or enable cross-VPC access. Assuming both Amazon Redshift and Amazon EMR are in the same virtual private cloud (VPC), you can create a Spark job or notebook, connect to the Amazon Redshift data warehouse, and write Spark code that uses the Amazon Redshift connector.

To learn more, see [Use Spark on Amazon Redshift with a connector](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-redshift.html) in the AWS documentation.

### ++AWS Glue++

When you use AWS Glue 4.0, the spark-redshift connector is available both as a source and a target. In Glue Studio, you can use a visual ETL job to read from or write to a Redshift data warehouse simply by selecting a Redshift connection to use within a built-in Redshift source or target node.

The Redshift connection contains the Redshift connection details along with the credentials needed to access Redshift with the proper permissions.

To get started, choose **Jobs** in the left menu of the **Glue Studio** console. Using either of the visual modes, you can easily add and edit a source or target node and define a range of transformations on the data without writing any code.

![image.png](https://dev-media.amazoncloud.cn/478a4aaa9c7d412aa90c4c25dc717c93_image.png)

Choose **Create**, and you can add and edit the source, target, and transform nodes in the job diagram. Here, you choose Amazon Redshift as the **Source** and **Target**.

![image.png](https://dev-media.amazoncloud.cn/13e9243b7c77402c9617bffe4595caec_image.png)

Once completed, the Glue job can be executed on the Glue Apache Spark engine, which automatically uses the latest spark-redshift connector.

The following Python script shows an example job that reads from and writes to Redshift with a DynamicFrame using the spark-redshift connector.

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Placeholder connection settings
url = "jdbc:redshift://<RedshiftEndpoint>:<Port>/dev"
dbtable = "<schema.table>"
redshiftTmpDir = "s3://<bucket>/<temp-prefix>/"
aws_iam_role = "arn:aws:iam::<AccountId>:role/<RedshiftRole>"
user = "<RsUser>"
password = "<RsPassword>"

print("================ DynamicFrame Read ===============")
read_options = {
    "url": url,
    "dbtable": dbtable,
    "redshiftTmpDir": redshiftTmpDir,
    "tempdir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "autopushdown": "true",
    "include_column_list": "false"
}

redshift_read = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options=read_options
)

print("================ DynamicFrame Write ===============")

write_options = {
    "url": url,
    "dbtable": dbtable,
    "user": user,
    "password": password,
    "redshiftTmpDir": redshiftTmpDir,
    "tempdir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "autopushdown": "true",
    "DbUser": user
}

print("================ dyf write result: check redshift table ===============")
redshift_write = glueContext.write_dynamic_frame.from_options(
    frame=redshift_read,
    connection_type="redshift",
    connection_options=write_options
)

job.commit()
```
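If a transformation is easier to express with Spark SQL, you can convert between a Glue DynamicFrame and a Spark DataFrame inside the same job before writing back. The sketch below continues the script above (it reuses `redshift_read`, `spark`, `glueContext`, and `write_options`), aggregates sales per date with Spark SQL, and writes the result to a hypothetical `daily_sales_totals` table; treat it as an illustrative pattern rather than a required step.

```
# Hedged sketch: mix DynamicFrames and Spark SQL in the same Glue job.
from awsglue.dynamicframe import DynamicFrame

# Convert the DynamicFrame read from Redshift into a Spark DataFrame...
sales_df = redshift_read.toDF()
sales_df.createOrReplaceTempView("sales")

# ...run a Spark SQL transformation...
daily_totals = spark.sql("""
    SELECT dateid, SUM(qtysold) AS total_qtysold
    FROM sales
    GROUP BY dateid
""")

# ...and convert back to a DynamicFrame so it can be written with write_dynamic_frame.
daily_totals_dyf = DynamicFrame.fromDF(daily_totals, glueContext, "daily_totals_dyf")

glueContext.write_dynamic_frame.from_options(
    frame=daily_totals_dyf,
    connection_type="redshift",
    connection_options={**write_options, "dbtable": "daily_sales_totals"}  # hypothetical target table
)
```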
When you set up the job details, you can only use the **Glue 4.0 – Supports Spark 3.3, Python 3** version for this integration.

![image.png](https://dev-media.amazoncloud.cn/8e9479b0ed494976829538b4349c52b5_image.png)

To learn more, see [Creating ETL jobs with AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/ug/creating-jobs-chapter.html) and [Using connectors and connections with AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/ug/connectors-chapter.html) in the AWS documentation.

### ++Gaining the Best Performance++

With the Amazon Redshift integration for Apache Spark, the Spark connector automatically applies predicate and query pushdown to optimize performance. You can gain an additional performance improvement by using the default Parquet format for the unload operations performed by this integration.

As the following sample code shows, the Spark connector turns the supported functions into a SQL query and runs the query in Amazon Redshift.

```
import sqlContext.implicits._

val sample = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("unload_s3_format", "PARQUET")
  .option("dbtable", "event")
  .load()

// Create temporary views for data frames created earlier so they can be accessed via Spark SQL
sales_df.createOrReplaceTempView("sales")
date_df.createOrReplaceTempView("date")
// Show the total sales on a given date using the Spark SQL API
spark.sql(
  """SELECT sum(qtysold)
    | FROM sales, date
    | WHERE sales.dateid = date.dateid
    | AND caldate = '2008-01-05'""".stripMargin).show()
```

Amazon Redshift integration for Apache Spark adds pushdown capabilities for operations such as sort, aggregate, limit, join, and scalar functions so that only the relevant data is moved from the Redshift data warehouse to the consuming Spark application, thereby improving performance.
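One way to see what the connector pushes down is to inspect the query plan of a filtered or aggregated DataFrame with `explain()`. The following PySpark sketch assumes a `sales_df` that was read through the connector as in the earlier examples, and the filter value is hypothetical; exactly how the pushed-down query appears in the plan depends on the connector version.

```
# Hedged sketch: inspect the physical plan to see the effect of pushdown.
from pyspark.sql import functions as F

filtered = (sales_df
    .where(F.col("dateid") == 1827)   # hypothetical filter value
    .groupBy("dateid")
    .agg(F.sum("qtysold").alias("total_qtysold")))

# Print the physical plan; when pushdown applies, the filter (and, depending on
# the connector version, the aggregation) is reflected in the source scan
# instead of running as separate Spark stages over the full table.
filtered.explain()
```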
### ++Available Now++

The Amazon Redshift integration for Apache Spark is now available in all Regions that support Amazon EMR 6.9, AWS Glue 4.0, and Amazon Redshift. You can start using the feature directly from EMR 6.9 and AWS Glue Studio 4.0 with the new Spark 3.3.0 version.

Give it a try, and please send us feedback either in the [AWS re:Post for Amazon Redshift](https://repost.aws/tags/TAByF7MpfSQUCX_lAeDTvODw/amazon-redshift) or through your usual AWS Support contacts.

– [Channy](https://twitter.com/)

![image.png](https://dev-media.amazoncloud.cn/8219653c601d4ee78ed3ead32d79cc1d_image.png)

### **[Channy Yun](https://aws.amazon.com/blogs/aws/author/channy-yun/)**

Channy Yun is a Principal Developer Advocate for AWS, passionate about helping developers build modern applications on the latest AWS services. A pragmatic developer and blogger at heart, he loves community-driven learning and sharing of technology, which has funneled developers into global AWS User Groups. His main topics are open source, containers, storage, networking & security, and IoT. Follow him on Twitter at @channyun.