{"value":"Data engineers use various Python packages to meet their data processing requirements while building data pipelines with [AWS Glue PySpark Jobs](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python.html). Languages like Python and Scala are commonly used in data pipeline development. Developers can take advantage of open-source packages for these languages, or even build their own, to make tasks such as data manipulation and analysis easier and faster. However, managing standardized packages can be cumbersome when multiple teams use different versions of packages, install non-approved packages, and duplicate development effort because they lack visibility into what is available at the enterprise level. This can be especially challenging in large enterprises with multiple data engineering teams.\n\nETL developers often need additional packages for their [AWS Glue ETL](https://aws.amazon.com/glue/) jobs. With security being job zero for customers, many will restrict egress traffic from their VPC to the public internet, and they need a way to manage the packages used by applications including their data processing pipelines.\n\nOur proposed solution enables teams with network egress restrictions to manage packages centrally with [AWS CodeArtifact](https://aws.amazon.com/codeartifact/) and use their favorite libraries in their [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) ETL PySpark code. 
In this post, we’ll describe how CodeArtifact can be used for managing packages and modules for [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) ETL jobs, and we’ll demo a solution using Glue PySpark jobs that run within VPC Subnets that have no internet access.\n\n#### **Solution overview**\nThe solution uses CodeArtifact as a tool to make it easier for organizations of any size to securely store, publish, and share software packages used in their ETL with [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail). VPC Endpoints will be enabled for CodeArtifact and Glue to enable [private link connections](https://docs.aws.amazon.com/vpc/latest/privatelink/endpoint-services-overview.html). [AWS Step Functions](https://aws.amazon.com/step-functions/) makes it easy to coordinate the orchestration of components used in the data processing pipeline. Native integrations with both CodeArtifact and [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) enable the workflow to both authenticate the request to CodeArtifact and start the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) ETL job.\n\nThe following architecture shows an implementation of a solution using [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail), CodeArtifact, and Step Functions to use additional Python modules without egress internet access. The solution is deployed using [AWS Cloud Development Kit](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) ([AWS CDK](https://aws.amazon.com/cdk/)), an open-source software development framework to define your cloud application resources using familiar programming languages.\n\n![image.png](https://dev-media.amazoncloud.cn/66434a0f15084a15b1cc321a1d5af724_image.png)\n\nFig 1: Architecture Diagram for the Solution\n\nTo illustrate how to set up this architecture, we’ll walk you through the following steps:\n\n1. 
Deploy an [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) stack to provision the following AWS resources:\n\t- CodeArtifact\n\t- An [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) job\n\t- Step Functions workflow\n\t- [Amazon Simple Storage Service](https://aws.amazon.com/cn/s3/?trk=cndc-detail) ([Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail)) bucket\n\t- VPC with a private Subnet and VPC Endpoints to [Amazon S3](https://aws.amazon.com/s3/) and CodeArtifact\n2. Validate the deployment.\n3. Run a sample workflow – This workflow will run an [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) PySpark job that uses a custom Python library and an upgraded version of boto3.\n4. Clean up your resources.\n\n#### **Prerequisites**\nMake sure that you complete the following steps as prerequisites:\n- Have an AWS account. For this post, you configure the required AWS resources using [AWS CloudFormation](https://aws.amazon.com/cloudformation/). If you haven’t signed up, complete the following tasks:\n\t- Create an account. For instructions, see [Sign Up for AWS](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/).\n\t- Create an [AWS Identity and Access Management](http://aws.amazon.com/iam) (IAM) user. 
For instructions, see Create IAM User.\n- Have the following installed and configured on your machine:\n\t- [AWS Command Line Interface (AWS CLI), authenticated and configured](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)\n\t- [Python 3.8+](https://www.python.org/downloads/)\n\t- [AWS CDK](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html)\n\t- [Git](http://git-scm.com/downloads)\n\n#### **The solution**\n##### **Launching your [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) Stack**\n**Step 1:** Using your device’s command line, check out our Git repository to a local directory on your device:\n\n```\\ngit clone https://github.com/aws-samples/python-lib-management-without-internet-for-aws-glue-in-private-subnets.git\\n```\n\n**Step 2**: Change directories to the new directory [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) script location:\n\n```\\ncd python-lib-management-without-internet-for-aws-glue-in-private-subnets/scripts/s3\\n```\n**Step 3**: Download the following CSV, which contains New York City Taxi and Limousine Commission (TLC) Trip weekly trips. 
This will serve as the input source for the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Job:\n\n```\\naws s3 cp s3://nyc-tlc/misc/FOIL_weekly_trips_apps.csv .\\n```\n**Step 4**: Change the directories to the path where the app.py file is located (in reference to the previous step, execute the following step):\n\n```\\ncd ../..\\n```\n**Step 5**: Create a virtual environment:\n\nmacOS/Linux:\n```\\npython3 -m venv .env\\n```\n\nWindows:\n```\\npython -m venv .env\\n```\n\n**Step 6:** Activate the virtual environment after the init process completes and the virtual environment is created:\n\nmacOS/Linux:\n```\\nsource .env/bin/activate\\n```\n\nWindows:\n```\\n.env\\\\Scripts\\\\activate.bat\\n```\n\n**Step 7:** Install the required dependencies:\n\n```\\npip3 install -r requirements.txt\\n```\n\n**Step 8:** Make sure that your AWS profile is setup along with the region that you want to deploy as mentioned in the prerequisite. Synthesize the templates. [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) apps use code to define the infrastructure, and when run they produce or “synthesize” a CloudFormation template for each stack defined in the application:\n\n```\\ncdk synthesize\\n```\n**Step 9:** BootStrap the cdk app using the following command:\n\n```\\ncdk bootstrap aws://<AWS_ACCOUNTID>/<AWS_REGION>\\n```\nReplace the place holder AWS_ACCOUNTID and AWS_REGION with your AWS account ID and the region to be deployed.\n\nThis step provisions the initial resources, including an [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) bucket for storing files and IAM roles that grant permissions needed to perform deployments.\n\n**Step 10:** Deploy the solution. By default, some actions that could potentially make security changes require approval. In this deployment, you’re creating an IAM role. 
The following command overrides the approval prompts, but if you would like to manually accept the prompts, then omit the ```--require-approval never``` flag:\n\n```\\ncdk deploy \\"*\\" --require-approval never\\n```\nWhile the [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) deploys the CloudFormation stacks, you can follow the deployment progress in your terminal:\n\n![image.png](https://dev-media.amazoncloud.cn/b5021653e4e2463d8fc2cb589cccd528_image.png)\n\nFig 2: [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) Deployment progress in terminal\n\nOnce the deployment is successful, you’ll see the successful status as follows:\n\n![image.png](https://dev-media.amazoncloud.cn/93c75fe7da3a4cd897e40a1b4e958cfd_image.png)\n\nFig 3: [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) Deployment completion success\n\n**Step 11:** Log in to the [AWS Console](https://aws.amazon.com/console/), go to CloudFormation, and see the output of the ```ApplicationStack``` stack:\n\n![image.png](https://dev-media.amazoncloud.cn/1823e5b4745d483c95931d686b0b6df6_image.png)\n\nFig 4: [AWS CloudFormation](https://aws.amazon.com/cn/cloudformation/?trk=cndc-detail) stack output\n\nNote the values of the ```DomainName``` and ```RepositoryName``` variables. We’ll use them in the next step to upload our artifacts\n\n**Step 12:** We will upload a custom library into the repo that we created. This will be used by our Glue ETL job.\n\n- Install twine using pip:\n```\\npython3 -m pip install twine\\n```\nThe custom python package ```glueutils-0.2.0.tar.gz``` can be found under this folder of the cloned repo:\n```\\ncd scripts/custom_glue_library\\n```\n- Configure twine with the login command (additional details [here](https://docs.aws.amazon.com/codeartifact/latest/ug/python-configure-twine.html) ). 
Refer to step 11 for the ```DomainName``` and ```RepositoryName``` from the CloudFormation output:\n```\\naws codeartifact login --tool twine --domain <DomainName> --domain-owner <AWS_ACCOUNTID> --repository <RepositoryName>\\n```\n- Publish Python package assets:\n```\\ntwine upload --repository codeartifact glueutils-0.2.0.tar.gz\\n```\n\n![image.png](https://dev-media.amazoncloud.cn/8879e08ee988453798e08b24be4d0b5f_image.png)\n\nFig 5: Python package publishing using twine\n\n#### **Validate the Deployment**\nThe [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) stack will deploy the following AWS resources:\n\n1. [Amazon Virtual Private Cloud (Amazon VPC)](https://aws.amazon.com/vpc/)\n\t- One Private Subnet\n2. [AWS CodeArtifact](https://aws.amazon.com/cn/codeartifact/?trk=cndc-detail)\n\t- CodeArtifact Repository\n\t- CodeArtifact Domain\n\t- CodeArtifact Upstream Repository\n3. [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail)\n\t- [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Job\n\t- [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Database\n\t- [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Connection\n4. AWS Step Function\n5. [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) Bucket for [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) and also for storing scripts and CSV file\n6. IAM Roles and Policies\n7. 
[Amazon Elastic Compute Cloud (Amazon EC2)](https://aws.amazon.com/ec2/) Security Group\n\n**Step 1:** In the AWS Console, switch to the AWS account and Region where the resources are deployed.\n\n**Step 2**: Open the Subnets page (```https://<region>.console.aws.amazon.com/vpc/home?region=<region>#subnets:```) (replace region with the AWS Region where your resources are deployed).\n\n**Step 3:** Select the subnet named ```ApplicationStack/enterprise-repo-vpc/Enterprise-Repo-Private-Subnet1```.\n\n**Step 4**: Select the Route Table and validate that it has no Internet Gateway or NAT Gateway routes to the internet, and that it’s similar to the following image:\n\n![image.png](https://dev-media.amazoncloud.cn/88454222589148e2a321655ae3980111_image.png)\n\nFig 6: Route table validation\n\n**Step 5:** Navigate to the CodeArtifact console and review the repositories created. The ```enterprise-repo``` is your local repository, and ```pypi-store``` is the upstream repository connected to PyPI, providing artifacts from pypi.org.\n\n![image.png](https://dev-media.amazoncloud.cn/bca763b8d0bf4f86a46b7dfcc655b373_image.png)\n\nFig 7: AWS CodeArtifact repositories created\n\n**Step 6**: Navigate to ```enterprise-repo``` and search for ```glueutils```. This is the custom Python package that we published.\n\n![image.png](https://dev-media.amazoncloud.cn/24c49d41dd184cafb3c76fefdea23a77_image.png)\n\nFig 8: AWS CodeArtifact custom Python package published\n\n**Step 7**: Navigate to the Step Functions Console and review the ```enterprise-repo-step-function``` as follows:\n\n![image.png](https://dev-media.amazoncloud.cn/648c711d8e56419a98f88dc4be2a066b_image.png)\n\nFig 9: [AWS Step Functions](https://aws.amazon.com/cn/step-functions/?trk=cndc-detail) workflow\n\nThe diagram shows how the Step Functions workflow will orchestrate the pattern.\n\n1. 
The first step ```CodeArtifactGetAuthorizationToken``` calls the [getAuthorizationToken](https://docs.aws.amazon.com/codeartifact/latest/APIReference/API_GetAuthorizationToken.html) API to generate a temporary authorization token for accessing repositories in the domain (this token is valid for 15 mins.).\n2. The next step ```GenerateCodeArtifactURL``` takes the authorization token from the response and generates the CodeArtifact URL.\n3. Then, this will move into the ```GlueStartJobRun``` state, which makes a synchronous API call to run the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) job.\n\n**Step 8**: Navigate to the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Console and select the **Jobs** tab, then select ```enterprise-repo-glue-job```.\n\nThe [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) job is created with the following script and [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Connection ```enterprise-repo-glue-connection```. The [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) connection is a Data Catalog object that enables the job to connect to sources and APIs from within the VPC. The network type connection runs the job from within the private subnet to make requests to [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) and CodeArtifact over the VPC endpoint connection. 
This enables the job to run without any traffic through the internet.\n\nNote the connections section in the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) PySpark Job, which makes the Glue job run on the private subnet in the VPC provisioned.\n\n![image.png](https://dev-media.amazoncloud.cn/47a8006088ec40cbb78acb76d2a81151_image.png)\n\nFig 10: [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) network connections\n\nThe job takes an [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) bucket, Glue Database, Python Job Installer Option, and Additional Python Modules as job parameters. The parameters ```--additional-python-modules``` and ```--python-modules-installer-option``` are passed to install the selected Python module from a PyPI repository hosted in [AWS CodeArtifact](https://aws.amazon.com/cn/codeartifact/?trk=cndc-detail).\n\nThe script itself first reads the taxi data in CSV format from the [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) input path. It then performs a light transformation that sums the total trips by year, week, and app. The output is written to an [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) path as Parquet. A partitioned table in the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Data Catalog will either be created or updated if it already exists.\n\nYou can find the Glue PySpark script [here](https://github.com/aws-samples/python-lib-management-without-internet-for-aws-glue-in-private-subnets/blob/main/scripts/glue/job.py).\n\n#### **Run a sample workflow**\nThe following steps will demonstrate how to run a sample workflow:\n\n**Step 1**: Navigate to the Step Functions Console and select the ```enterprise-repo-step-function```.\n\n**Step 2**: Select **Start execution** and provide the input shown below. We’re including the ```glueutils``` and latest boto3 libraries as part of the job run. 
We recommend pinning your Python dependencies to avoid breaking changes introduced by a future version of a dependency. In the example below, the latest available version of boto3 and version 0.2.0 of ```glueutils``` will be installed. To pin boto3 to a specific release, specify it as ```boto3==1.24.2``` (the latest release at the time of publishing this post).\n```\\n{\\"pythonmodules\\": \\"boto3,glueutils==0.2.0\\"}\\n```\n**Step 3**: Select **Start execution** and wait until **Execution Status** is ```Succeeded```. This may take a few minutes.\n\n**Step 4**: Navigate to the CodeArtifact Console to review the ```enterprise-repo``` repository. You’ll see the cached PyPI packages and all of their dependencies pulled down from PyPI.\n\n**Step 5**: In the Glue Console under the Runs section of the ```enterprise-repo-glue-job```, you’ll see the parameters passed:\n\n![image.png](https://dev-media.amazoncloud.cn/4727d8e38cbd4ff3bf1a6f8cf2382432_image.png)\n\nFig 11: [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) job execution history\n\nNote the ```--index-url```, which was passed as a parameter to the Glue ETL job. 
The token is valid only for 15 minutes.\n\n**Step 6**: Navigate to the [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) Console and go to the ```/aws/glue-jobs``` log group to verify that the packages were installed from the local repo.\n\n![image.png](https://dev-media.amazoncloud.cn/13188290cfdd451aa6e5d1efb2963ce6_image.png)\nYou will see that the 2 package names passed as parameters are installed with the corresponding versions.\n\nFig 12 : [Amazon CloudWatch](https://aws.amazon.com/cn/cloudwatch/?trk=cndc-detail) logs details for the Glue job\n\n**Step 7**: Navigate to the [Amazon Athena](https://aws.amazon.com/athena/) console and select **Query Editor**.\n\n**Step 8**: Run the following query to validate the output of the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) job:\n```\\nSELECT year, app, SUM(total_trips) as sum_of_total_trips \\nFROM \\n\\"codeartifactblog_glue_db\\".\\"taxidataparquet\\" \\nGROUP BY year, app;\\n```\n#### **Clean up**\nMake sure that you clean up all of the other AWS resources that you created in the [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) Stack deployment. You can delete these resources via the [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) Destroy command as follows or the [CloudFormation console](https://console.aws.amazon.com/cloudformation/home?region=us-east-1%22%20\\\\l%20%22/stacks).\n\nTo destroy the resources using [AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail), follow these steps:\n1. Follow Steps 1-6 from the ‘**Launching your CDK Stack**’ section.\n2. Destroy the app by executing the following command:\n```\\ncdk destroy\\n```\n#### **Conclusion**\nIn this post, we demonstrated how CodeArtifact can be used for managing Python packages and modules for [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) jobs that run within VPC Subnets that have no internet access. 
We also demonstrated how the versions of existing packages can be updated (i.e., boto3) and a custom Python library (glueutils) that is developed locally is also managed through CodeArtifact.\n\nThis post enables you to use your favorite Python packages with [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) ETL PySpark jobs by modifying the input to the AWS StepFunctions workflow (Step 2 in the Run a Sample workflow section).\n\n##### **About the Authors**\n\n![image.png](https://dev-media.amazoncloud.cn/5b62a6f5d48c4f1eac25618957903c7f_image.png)\n\n**Bret Pontillo** is a Data & ML Engineer with AWS Professional Services. He works closely with enterprise customers building data lakes and analytical applications on the AWS platform. In his free time, Bret enjoys traveling, watching sports, and trying new restaurants.\n\n![image.png](https://dev-media.amazoncloud.cn/7893a3ed17cb479c93e2531cb46b0d58_image.png)\n\n**Gaurav Gundal** is a DevOps consultant with AWS Professional Services, helping customers build solutions on the customer platform. When not building, designing, or developing solutions, Gaurav spends time with his family, plays guitar, and enjoys traveling to different places.\n\n![image.png](https://dev-media.amazoncloud.cn/9cd2176ce25744be808f4147b950e64d_image.png)\n\n**Ashok Padmanabhan** is a Sr. IOT Data Architect with AWS Professional Services, helping customers build data and analytics platform and solutions. When not helping customers build and design data lakes, Ashok enjoys spending time at the beach near his home in Florida.","render":"<p>Data engineers use various Python packages to meet their data processing requirements while building data pipelines with <a href=\\"https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python.html\\" target=\\"_blank\\">AWS Glue PySpark Jobs</a>. Languages like Python and Scala are commonly used in data pipeline development. 
Developers can take advantage of open-source packages for these languages, or even build their own, to make tasks such as data manipulation and analysis easier and faster. However, managing standardized packages can be cumbersome when multiple teams use different versions of packages, install non-approved packages, and duplicate development effort because they lack visibility into what is available at the enterprise level. This can be especially challenging in large enterprises with multiple data engineering teams.</p>\\n<p>ETL developers often need additional packages for their <a href=\\"https://aws.amazon.com/glue/\\" target=\\"_blank\\">AWS Glue ETL</a> jobs. With security being job zero for customers, many will restrict egress traffic from their VPC to the public internet, and they need a way to manage the packages used by applications including their data processing pipelines.</p>\\n<p>Our proposed solution enables teams with network egress restrictions to manage packages centrally with <a href=\\"https://aws.amazon.com/codeartifact/\\" target=\\"_blank\\">AWS CodeArtifact</a> and use their favorite libraries in their AWS Glue ETL PySpark code. In this post, we’ll describe how CodeArtifact can be used for managing packages and modules for AWS Glue ETL jobs, and we’ll demo a solution using Glue PySpark jobs that run within VPC Subnets that have no internet access.</p>\\n<h4><a id=\\"Solution_overview_6\\"></a><strong>Solution overview</strong></h4>\\n<p>The solution uses CodeArtifact as a tool to make it easier for organizations of any size to securely store, publish, and share software packages used in their ETL with AWS Glue. 
VPC Endpoints will be enabled for CodeArtifact and Glue to enable <a href=\\"https://docs.aws.amazon.com/vpc/latest/privatelink/endpoint-services-overview.html\\" target=\\"_blank\\">private link connections</a>. <a href=\\"https://aws.amazon.com/step-functions/\\" target=\\"_blank\\">AWS Step Functions</a> makes it easy to coordinate the orchestration of components used in the data processing pipeline. Native integrations with both CodeArtifact and [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) enable the workflow to both authenticate the request to CodeArtifact and start the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) ETL job.</p>\\n<p>The following architecture shows an implementation of a solution using AWS Glue, CodeArtifact, and Step Functions to use additional Python modules without egress internet access. The solution is deployed using AWS Cloud Development Kit (<a href=\\"https://aws.amazon.com/cdk/\\" target=\\"_blank\\">AWS CDK</a>), an open-source software development framework to define your cloud application resources using familiar programming languages.</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/66434a0f15084a15b1cc321a1d5af724_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 1: Architecture Diagram for the Solution</p>\n<p>To illustrate how to set up this architecture, we’ll walk you through the following steps:</p>\n<ol>\\n<li>Deploying an AWS CDK stack to provision the following AWS Resources\\n<ul>\\n<li>CodeArtifact</li>\n<li>An AWS Glue job</li>\n<li>Step Functions workflow</li>\n<li>Amazon Simple Storage Service (Amazon S3) bucket</li>\n<li>VPC with a private Subnet and VPC Endpoints to <a href=\\"https://aws.amazon.com/s3/\\" target=\\"_blank\\">Amazon S3</a> and CodeArtifact</li>\\n</ul>\n</li>\\n<li>Validate the Deployment.</li>\n<li>Run a Sample Workflow – This workflow will run an AWS Glue PySpark job that uses a custom Python library, and an upgraded version of boto3.</li>\n<li>Cleaning up your 
resources.</li>\n</ol>\\n<h4><a id=\\"Prerequisites_27\\"></a><strong>Prerequisites</strong></h4>\\n<p>Make sure that you complete the following steps as prerequisites:</p>\n<ul>\\n<li>Have an AWS account. For this post, you configure the required AWS resources using <a href=\\"https://aws.amazon.com/cloudformation/\\" target=\\"_blank\\">AWS CloudFormation</a>. If you haven’t signed up, complete the following tasks:\n<ul>\\n<li>Create an account. For instructions, see <a href=\\"https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/\\" target=\\"_blank\\">Sign Up for AWS</a>)</li>\\n<li>Create an <a href=\\"http://aws.amazon.com/iam\\" target=\\"_blank\\">AWS Identity and Access Management</a> (IAM) user. For instructions, see Create IAM User.</li>\\n</ul>\n</li>\\n<li>Have the following installed and configured on your machine:\\n<ul>\\n<li><a href=\\"https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html\\" target=\\"_blank\\">AWS Command Line Interface (AWS CLI), authenticated and configured</a></li>\\n<li><a href=\\"https://www.python.org/downloads/\\" target=\\"_blank\\">Python 3.8+</a></li>\\n<li><a href=\\"https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html\\" target=\\"_blank\\">AWS CDK</a></li>\\n<li><a href=\\"http://git-scm.com/downloads\\" target=\\"_blank\\">Git</a></li>\\n</ul>\n</li>\\n</ul>\n<h4><a id=\\"The_solution_38\\"></a><strong>The solution</strong></h4>\\n<h5><a id=\\"Launching_your_AWS_CDK_Stack_39\\"></a><strong>Launching your AWS CDK Stack</strong></h5>\\n<p><strong>Step 1:</strong> Using your device’s command line, check out our Git repository to a local directory on your device:</p>\\n<pre><code class=\\"lang-\\">git clone https://github.com/aws-samples/python-lib-management-without-internet-for-aws-glue-in-private-subnets.git\\n</code></pre>\\n<p><strong>Step 2</strong>: Change directories to the new directory [Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail) 
script location:</p>\\n<pre><code class=\\"lang-\\">cd python-lib-management-without-internet-for-aws-glue-in-private-subnets/scripts/s3\\n</code></pre>\\n<p><strong>Step 3</strong>: Download the following CSV, which contains New York City Taxi and Limousine Commission (TLC) Trip weekly trips. This will serve as the input source for the [AWS Glue](https://aws.amazon.com/cn/glue/?trk=cndc-detail) Job:</p>\\n<pre><code class=\\"lang-\\">aws s3 cp s3://nyc-tlc/misc/FOIL_weekly_trips_apps.csv .\\n</code></pre>\\n<p><strong>Step 4</strong>: Change the directories to the path where the app.py file is located (in reference to the previous step, execute the following step):</p>\\n<pre><code class=\\"lang-\\">cd ../..\\n</code></pre>\\n<p><strong>Step 5</strong>: Create a virtual environment:</p>\\n<p>macOS/Linux:</p>\n<pre><code class=\\"lang-\\">python3 -m venv .env\\n</code></pre>\\n<p>Windows:</p>\n<pre><code class=\\"lang-\\">python -m venv .env\\n</code></pre>\\n<p><strong>Step 6:</strong> Activate the virtual environment after the init process completes and the virtual environment is created:</p>\\n<p>macOS/Linux:</p>\n<pre><code class=\\"lang-\\">source .env/bin/activate\\n</code></pre>\\n<p>Windows:</p>\n<pre><code class=\\"lang-\\">.env\\\\Scripts\\\\activate.bat\\n</code></pre>\\n<p><strong>Step 7:</strong> Install the required dependencies:</p>\\n<pre><code class=\\"lang-\\">pip3 install -r requirements.txt\\n</code></pre>\\n<p><strong>Step 8:</strong> Make sure that your AWS profile is setup along with the region that you want to deploy as mentioned in the prerequisite. Synthesize the templates. 
[AWS CDK](https://aws.amazon.com/cn/cdk/?trk=cndc-detail) apps use code to define the infrastructure, and when run they produce or “synthesize” a CloudFormation template for each stack defined in the application:</p>\\n<pre><code class=\\"lang-\\">cdk synthesize\\n</code></pre>\\n<p><strong>Step 9:</strong> BootStrap the cdk app using the following command:</p>\\n<pre><code class=\\"lang-\\">cdk bootstrap aws://<AWS_ACCOUNTID>/<AWS_REGION>\\n</code></pre>\\n<p>Replace the place holder AWS_ACCOUNTID and AWS_REGION with your AWS account ID and the region to be deployed.</p>\n<p>This step provisions the initial resources, including an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments.</p>\n<p><strong>Step 10:</strong> Deploy the solution. By default, some actions that could potentially make security changes require approval. In this deployment, you’re creating an IAM role. The following command overrides the approval prompts, but if you would like to manually accept the prompts, then omit the <code>--require-approval never</code> flag:</p>\\n<pre><code class=\\"lang-\\">cdk deploy "*" --require-approval never\\n</code></pre>\\n<p>While the AWS CDK deploys the CloudFormation stacks, you can follow the deployment progress in your terminal:</p>\n<p><img src=\\"https://dev-media.amazoncloud.cn/b5021653e4e2463d8fc2cb589cccd528_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 2: AWS CDK Deployment progress in terminal</p>\n<p>Once the deployment is successful, you’ll see the successful status as follows:</p>\n<p><img src=\\"https://dev-media.amazoncloud.cn/93c75fe7da3a4cd897e40a1b4e958cfd_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 3: AWS CDK Deployment completion success</p>\n<p><strong>Step 11:</strong> Log in to the <a href=\\"https://aws.amazon.com/console/\\" target=\\"_blank\\">AWS Console</a>, go to CloudFormation, and see the output of the <code>ApplicationStack</code> stack:</p>\\n<p><img 
src=\\"https://dev-media.amazoncloud.cn/1823e5b4745d483c95931d686b0b6df6_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 4: AWS CloudFormation stack output</p>\n<p>Note the values of the <code>DomainName</code> and <code>RepositoryName</code> variables. We’ll use them in the next step to upload our artifacts</p>\\n<p><strong>Step 12:</strong> We will upload a custom library into the repo that we created. This will be used by our Glue ETL job.</p>\\n<ul>\\n<li>Install twine using pip:</li>\n</ul>\\n<pre><code class=\\"lang-\\">python3 -m pip install twine\\n</code></pre>\\n<p>The custom python package <code>glueutils-0.2.0.tar.gz</code> can be found under this folder of the cloned repo:</p>\\n<pre><code class=\\"lang-\\">cd scripts/custom_glue_library\\n</code></pre>\\n<ul>\\n<li>Configure twine with the login command (additional details <a href=\\"https://docs.aws.amazon.com/codeartifact/latest/ug/python-configure-twine.html\\" target=\\"_blank\\">here</a> ). Refer to step 11 for the <code>DomainName</code> and <code>RepositoryName</code> from the CloudFormation output:</li>\\n</ul>\n<pre><code class=\\"lang-\\">aws codeartifact login --tool twine --domain <DomainName> --domain-owner <AWS_ACCOUNTID> --repository <RepositoryName>\\n</code></pre>\\n<ul>\\n<li>Publish Python package assets:</li>\n</ul>\\n<pre><code class=\\"lang-\\">twine upload --repository codeartifact glueutils-0.2.0.tar.gz\\n</code></pre>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/8879e08ee988453798e08b24be4d0b5f_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 5: Python package publishing using twine</p>\n<h4><a id=\\"Validate_the_Deployment_153\\"></a><strong>Validate the Deployment</strong></h4>\\n<p>The AWS CDK stack will deploy the following AWS resources:</p>\n<ol>\\n<li><a href=\\"https://aws.amazon.com/vpc/\\" target=\\"_blank\\">Amazon Virtual Private Cloud (Amazon VPC)</a>\n<ul>\\n<li>One Private Subnet</li>\n</ul>\\n</li>\n<li>AWS CodeArtifact\\n<ul>\\n<li>CodeArtifact 
Repository</li>\n<li>CodeArtifact Domain</li>\n<li>CodeArtifact Upstream Repository</li>\n</ul>\\n</li>\n<li>AWS Glue\\n<ul>\\n<li>AWS Glue Job</li>\n<li>AWS Glue Database</li>\n<li>AWS Glue Connection</li>\n</ul>\\n</li>\n<li>AWS Step Functions state machine</li>\n<li>Amazon S3 Bucket for AWS CDK and also for storing the scripts and the CSV file</li>\n<li>IAM Roles and Policies</li>\n<li><a href=\\"https://aws.amazon.com/ec2/\\" target=\\"_blank\\">Amazon Elastic Compute Cloud (Amazon EC2)</a> Security Group</li>\\n</ol>\n<p><strong>Step 1:</strong> In the AWS Console, browse to the AWS account and Region to which the resources are deployed.</p>\\n<p><strong>Step 2:</strong> Browse the Subnet page (<code>https://<region>.console.aws.amazon.com/vpc/home?region=<region>#subnets:</code>), replacing region with the actual AWS Region to which your resources are deployed.</p>\\n<p><strong>Step 3:</strong> Select the Subnet named <code>ApplicationStack/enterprise-repo-vpc/Enterprise-Repo-Private-Subnet1</code>.</p>\\n<p><strong>Step 4</strong>: Select the Route Table and validate that it has no Internet Gateway or NAT Gateway route to the internet, and that it’s similar to the following image:</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/88454222589148e2a321655ae3980111_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 6: Route table validation</p>\n<p><strong>Step 5:</strong> Navigate to the CodeArtifact console and review the repositories created. The <code>enterprise-repo</code> is your local repository, and <code>pypi-store</code> is the upstream repository connected to PyPI, providing artifacts from pypi.org.</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/bca763b8d0bf4f86a46b7dfcc655b373_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 7: AWS CodeArtifact repositories created</p>\n<p><strong>Step 6</strong>: Navigate to <code>enterprise-repo</code> and search for <code>glueutils</code>. 
This is the custom Python package that we published.</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/24c49d41dd184cafb3c76fefdea23a77_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 8: AWS CodeArtifact custom Python package published</p>\n<p><strong>Step 7</strong>: Navigate to the Step Functions Console and review the <code>enterprise-repo-step-function</code> as follows:</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/648c711d8e56419a98f88dc4be2a066b_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 9: AWS Step Functions workflow</p>\n<p>The diagram shows how the Step Functions workflow will orchestrate the pattern.</p>\n<ol>\\n<li>The first step, <code>CodeArtifactGetAuthorizationToken</code>, calls the <a href=\\"https://docs.aws.amazon.com/codeartifact/latest/APIReference/API_GetAuthorizationToken.html\\" target=\\"_blank\\">getAuthorizationToken</a> API to generate a temporary authorization token for accessing repositories in the domain (this token is valid for 15 minutes).</li>\\n<li>The next step, <code>GenerateCodeArtifactURL</code>, takes the authorization token from the response and generates the CodeArtifact URL.</li>\\n<li>The workflow then moves into the <code>GlueStartJobRun</code> state, which makes a synchronous API call to run the <a href=\\"https://aws.amazon.com/cn/glue/?trk=cndc-detail\\" target=\\"_blank\\">AWS Glue</a> job.</li>\\n</ol>\n<p><strong>Step 8</strong>: Navigate to the <a href=\\"https://aws.amazon.com/cn/glue/?trk=cndc-detail\\" target=\\"_blank\\">AWS Glue</a> Console and select the <strong>Jobs</strong> tab, then select <code>enterprise-repo-glue-job</code>.</p>\\n<p>The AWS Glue job is created with the following script and the AWS Glue Connection <code>enterprise-repo-glue-connection</code>. The <a href=\\"https://aws.amazon.com/cn/glue/?trk=cndc-detail\\" target=\\"_blank\\">AWS Glue</a> connection is a Data Catalog object that enables the job to connect to sources and APIs from within the VPC. 
The network-type connection runs the job from within the private subnet to make requests to <a href=\\"https://aws.amazon.com/cn/s3/?trk=cndc-detail\\" target=\\"_blank\\">Amazon S3</a> and CodeArtifact over the VPC endpoint connections. This enables the job to run without sending any traffic over the internet.</p>\\n<p>Note the connections section in the AWS Glue PySpark Job, which makes the Glue job run on the private subnet in the provisioned VPC.</p>\n<p><img src=\\"https://dev-media.amazoncloud.cn/47a8006088ec40cbb78acb76d2a81151_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 10: AWS Glue network connections</p>\n<p>The job takes an Amazon S3 bucket, Glue Database, Python Job Installer Option, and Additional Python Modules as job parameters. The parameters <code>--additional-python-modules</code> and <code>--python-modules-installer-option</code> are passed to install the selected Python modules from a PyPI repository hosted in <a href=\\"https://aws.amazon.com/cn/codeartifact/?trk=cndc-detail\\" target=\\"_blank\\">AWS CodeArtifact</a>.</p>\\n<p>The script first reads the taxi data in CSV format from the Amazon S3 input path. It then performs a light transformation that sums the total trips by year, week, and app, and writes the output to an Amazon S3 path as Parquet. 
A partitioned table in the AWS Glue Data Catalog is created, or updated if it already exists.</p>\n<p>You can find the Glue PySpark script <a href=\\"https://github.com/aws-samples/python-lib-management-without-internet-for-aws-glue-in-private-subnets/blob/main/scripts/glue/job.py\\" target=\\"_blank\\">here</a>.</p>\\n<h4><a id=\\"Run_a_sample_workflow_223\\"></a><strong>Run a sample workflow</strong></h4>\\n<p>The following steps will demonstrate how to run a sample workflow:</p>\n<p><strong>Step 1</strong>: Navigate to the Step Functions Console and select the <code>enterprise-repo-step-function</code>.</p>\\n<p><strong>Step 2</strong>: Select <strong>Start execution</strong> and enter the following input. We’re including the <code>glueutils</code> and boto3 libraries as part of the job run. We recommend pinning your Python dependencies to avoid breaking changes caused by a future version of a dependency. In the following example, the latest available version of boto3 and version 0.2.0 of <code>glueutils</code> will be installed. To pin boto3 to a specific release, specify a version such as <code>boto3==1.24.2</code> (the latest release at the time of publishing this post).</p>\\n<pre><code class=\\"lang-\\">{"pythonmodules": "boto3,glueutils==0.2.0"}\\n</code></pre>\\n<p><strong>Step 3</strong>: Select <strong>Start execution</strong> and wait until <strong>Execution Status</strong> is <code>Succeeded</code>. This may take a few minutes.</p>\\n<p><strong>Step 4</strong>: Navigate to the CodeArtifact Console to review the <code>enterprise-repo</code> repository. 
You’ll see the cached PyPI packages and all of their dependencies pulled down from PyPI.</p>\\n<p><strong>Step 5</strong>: In the Glue Console, under the Runs section of the <code>enterprise-repo-glue-job</code>, you’ll see the parameters passed:</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/4727d8e38cbd4ff3bf1a6f8cf2382432_image.png\\" alt=\\"image.png\\" /></p>\n<p>Fig 11: AWS Glue job execution history</p>\n<p>Note the <code>--index-url</code>, which was passed as a parameter to the Glue ETL job. The token is valid for only 15 minutes.</p>\\n<p><strong>Step 6</strong>: Navigate to the <a href=\\"https://aws.amazon.com/cloudwatch/\\" target=\\"_blank\\">Amazon CloudWatch</a> Console and go to the <code>/aws/glue-jobs</code> log group to verify that the packages were installed from the local repo.</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/13188290cfdd451aa6e5d1efb2963ce6_image.png\\" alt=\\"image.png\\" /><br />\\nYou will see that the two packages passed as parameters are installed with the corresponding versions.</p>\n<p>Fig 12: Amazon CloudWatch logs details for the Glue job</p>\n<p><strong>Step 7</strong>: Navigate to the <a href=\\"https://aws.amazon.com/athena/\\" target=\\"_blank\\">Amazon Athena</a> console and select <strong>Query Editor</strong>.</p>\\n<p><strong>Step 8</strong>: Run the following query to validate the output of the <a href=\\"https://aws.amazon.com/cn/glue/?trk=cndc-detail\\" target=\\"_blank\\">AWS Glue</a> job:</p>\\n<pre><code class=\\"lang-\\">SELECT year, app, SUM(total_trips) as sum_of_total_trips \\nFROM \\n"codeartifactblog_glue_db"."taxidataparquet" \\nGROUP BY year, app;\\n</code></pre>\\n<h4><a id=\\"Clean_up_260\\"></a><strong>Clean up</strong></h4>\\n<p>Make sure that you clean up all of the AWS resources that you created in the AWS CDK Stack deployment. 
You can delete these resources with the AWS CDK <code>destroy</code> command, as follows, or through the <a href=\\"https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks\\" target=\\"_blank\\">CloudFormation console</a>.</p>\\n<p>To destroy the resources using AWS CDK, follow these steps:</p>\n<ol>\\n<li>Follow Steps 1-6 from the ‘<strong>Launching your CDK Stack</strong>’ section.</li>\\n<li>Destroy the app by executing the following command:</li>\n</ol>\\n<pre><code class=\\"lang-\\">cdk destroy\\n</code></pre>\\n<h4><a id=\\"Conclusion_269\\"></a><strong>Conclusion</strong></h4>\\n<p>In this post, we demonstrated how CodeArtifact can be used for managing Python packages and modules for AWS Glue jobs that run within VPC Subnets that have no internet access. We also demonstrated how the version of an existing package (for example, boto3) can be updated, and how a custom Python library that is developed locally (glueutils) can also be managed through CodeArtifact.</p>\n<p>This post enables you to use your favorite Python packages with AWS Glue ETL PySpark jobs by modifying the input to the AWS Step Functions workflow (Step 2 in the Run a sample workflow section).</p>\n<h5><a id=\\"About_the_Authors_274\\"></a><strong>About the Authors</strong></h5>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/5b62a6f5d48c4f1eac25618957903c7f_image.png\\" alt=\\"image.png\\" /></p>\n<p><strong>Bret Pontillo</strong> is a Data & ML Engineer with AWS Professional Services. He works closely with enterprise customers building data lakes and analytical applications on the AWS platform. In his free time, Bret enjoys traveling, watching sports, and trying new restaurants.</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/7893a3ed17cb479c93e2531cb46b0d58_image.png\\" alt=\\"image.png\\" /></p>\n<p><strong>Gaurav Gundal</strong> is a DevOps consultant with AWS Professional Services, helping customers build solutions on the customer platform. 
When not building, designing, or developing solutions, Gaurav spends time with his family, plays guitar, and enjoys traveling to different places.</p>\\n<p><img src=\\"https://dev-media.amazoncloud.cn/9cd2176ce25744be808f4147b950e64d_image.png\\" alt=\\"image.png\\" /></p>\n<p><strong>Ashok Padmanabhan</strong> is a Sr. IoT Data Architect with AWS Professional Services, helping customers build data and analytics platforms and solutions. When not helping customers build and design data lakes, Ashok enjoys spending time at the beach near his home in Florida.</p>\n"}