Welcome to #160 of the Amazon Web Services open source newsletter, where we try and share all the important open source news, projects, events, and content that open source builders want. This week we have new projects that include tools to help you build data workflows, Terraform modules to help you incorporate temporary elevated access controls, integrating Tailscale to change your traffic flows, a neat Amazon Lambda debugging tool, Go bindings for Cedar, and more.
We also feature content this week on lots of great open source technologies, including Apache Spark, Cedar, Keycloak, Apache Cassandra, Ray, Apache Airflow, deequ, Babelfish for Aurora PostgreSQL, MySQL, PyTorch, ROS, Kotlin, Amazon SDK for pandas, Apache Kafka, Notation, and more! Finally, don't miss the videos and events sections, and make sure you check out events happening this week including the open source data meetup in London where I will be speaking.
Please please please take 1 minute to [complete this short survey](https://pulse.buildon.aws/survey/PJRTOUMK?trk=cndc-detail?trk=cndc-detail) and bask in the warmth of having helped improve how we support open source.
#### **Celebrating open source contributors**
The articles and projects shared in this newsletter are only possible thanks to the many contributors in open source. I would like to shout out and thank those folks who really do power open source and enable us all to learn and build on top of what they have created.
So thank you to the following open source heroes: Suman Debnath, Justin Garrison, Aidan Steele, Nathan Glover, Lars Jacobsson, John Mille, Abdel Jaidi, Anton Kukushkin, Leon Luttenberger, Jimmy Ray, Jesse Butler, Stuti Deshpande, Jesus Max Hernandez, Divya Gaitonde, Joseph Barlan, Aniket Jiddigoudar, Parnab Basak, Fernando Gamero, Shubham Mehta, Michael Raney, Meet Bhagdev, Lotfi Mouhib, Vara Bonthu, Praveen Koorse, and Mike Hicks, Jeremy Cowan and Shane Corbett
#### **Latest open source projects**
The great thing about open source projects is that you can review the source code. If you like the look of these projects, make sure you that take a look at the code, and if it is useful to you, get in touch with the maintainer to provide feedback, suggestions or even submit a contribution.
**Amazon Web Services-ddk**
[Amazon Web Services-ddk](https://aws-oss.beachgeek.co.uk/2wh?trk=cndc-detail) (DataOps Development Kit) is an open source development framework to help you build data workflows and modern data architecture on Amazon Web Services. From the README:
The DDK offers high-level abstractions of Amazon CDK constructs, allowing you to quickly build pipelines that manage data flows on Amazon Web Services. You can create your pipelines using Typescript or Python. We want our customers to focus on writing code that adds business value, whether that is a data transformation, cleaning data to train a model, or creating a report. We believe that orchestrating pipelines, creating infrastructure, and creating the DevOps behind that infrastructure is undifferentiated heavy lifting and should be done as easily as possible using a robust framework.
**terraform-Amazon Web Services-sso-elevator**
[terraform-Amazon Web Services-sso-elevator](https://github.com/fivexl/terraform-aws-sso-elevator?trk=cndc-detail) is a Terraform module for implementing temporary elevated access via Amazon IAM Identity Center (Successor to Amazon Single Sign-On) and Slack. From the README:
While temporary credentials through SSO/IAM Identity Center is a significant improvement compared to static IAM users, there is still a risk of IDP credentials compromise. It makes creating permanently assigned permission sets a tricky task - balancing between allowing enough and mitigating the risk of credentials compromise. To address this, we've created this Terraform module - Amazon SSO Elevator. It allows requesting and granting temporary elevated access for Amazon SSO through a Slack request/approval workflow to reduce security risks. And it is now open-sourced. Despite its recent debut in the open-source space, it's been used and proven valuable by our customers.
Check out the rest of the README for additional details and features this provides, and you can also check out the demo. This is certainly on my todo list to try out.
And here is a [link](https://aws-oss.beachgeek.co.uk/2wj?trk=cndc-detail) to the Terraform registry.
[freedata](https://github.com/aidansteele/freedata?trk=cndc-detail) when it comes to pushing creativity boundaries in what you can do with Amazon Web Services, few people reach Amazon Web Services Hero Aidan Steele's level. Freedata alters the traffic patterns from your webapp-hosting EC2 instances via Amazon Systems Manager Session Manager and Tailscale. Session Manager streams are limited to about 10mbit. Not to be taken seriously, but fun nevertheless.
[lambda-debug](https://github.com/ljacobsson/lambda-debug#lambda-debug?trk=cndc-detail) is the latest update from our good friend and Amazon Web Services Community Builder Lars Jacobsson. Lambda-Debug is a tool that enables you to invoke Amazon Lambda functions in the cloud from any event source and intercept the requests with breakpoints locally. There's similar functionality in SST which this is inspired by, but is designed to work with Amazon SAM and CloudFormation and, in a later version, CDK. As always, documentation and examples of how you can get started are bar raising.
##### **Amazon Web Services-greengrass-json-gzip**
[Amazon Web Services-greengrass-json-gzip](https://github.com/t04glovern/aws-greengrass-json-gzip?trk=cndc-detail) is an Amazon Greengrass component from Nathan Glover that takes a stream of JSON messages from StreamManager and batches them into a gzip file. It uses a JSON Line (JSONL) format for the messages. This is a prototype component, and Nathan is experimenting how to approach batching high-frequency data direct to S3 pre-partitioned - without relying on Cloud services like Kinesis or Firehose.
[cedar-go](https://github.com/Joffref/cedar?trk=cndc-detail) is a project from Mathis Joffre that provides a Go binding for Amazon Cedar Policy using wasm to embed the Cedar engine with near zero overhead. The repo provides an example of how to use the Cedar engine to evaluate a policy inside your Go code. Very cool.
#### **Demos, Samples, Solutions and Workshops**
[game-streaming-on-kubernetes](https://github.com/rothgar/game-streaming-on-kubernetes?trk=cndc-detail) this repo from Justin Garrison is an example of how to set up a Kubernetes cluster and an example workload to stream a game from Kubernetes using [Sunshine](https://github.com/LizardByte/Sunshine?trk=cndc-detail). [Sunshine](https://github.com/LizardByte/Sunshine?trk=cndc-detail) is an open source implementation of NVIDIA GameStream and Moonlight is an open source client for playing games. This repo heavily builds on the work from Games on Whales (GOW). They provided the main containers and an example implementation of a Kubernetes deployment. The workload in this example folder has been changed but will be upstreamed at some point.
[pycon_polars101](https://github.com/debnsuma/pycon_polars101?trk=cndc-detail) is a workshop that my colleague Suman Debnath recently delivered at PyCon. In this workshop you will start with Polars basics and compare Polars with Pandas DataFrame. You will walk through code exploring functions and features of Polars, for example load and transform data from CSV, Excel, or Parquet, perform data analysis in parallel and prepare your data for machine learning pipelines and shall compare with Pandas and Spark. The workshop will focus on the following, which makes Polars special: parallel hashing, lazy execution, and expresive API
#### **Amazon Web Services and Community blog posts**
As part of getting familiar with Amazon Identity Centre and looking to integrate an identity provider other than Active Directory (and the other usual suspects) I decided to deploy Keycloak. Keycloak is a very feature rich, open source alternative for those folks who want to run their own identity and access management service. I thought this would be straight forward to setup on my Amazon account, but it turned out anything other than that! I put together a quick post to share my findings and to help future explorers looking to deploy Keycloak on Amazon Web Services. Check it out at [Integrating Keycloak as my Identity Provider for IAM Identity Centre: Part one, deploying Keycloak on Amazon Web Services](https://dev.to/aws/integrating-keycloak-as-my-identity-provider-for-iam-identity-centre-part-one-deploying-keycloak-on-aws-2ol1?trk=cndc-detail) \[hands on]
**Amazon SDK for pandas**
In the post, [Advanced patterns with Amazon SDK for pandas on Amazon Glue for Ray](https://aws.amazon.com/cn/blogs/big-data/advanced-patterns-with-aws-sdk-for-pandas-on-aws-glue-for-ray/?trk=cndc-detail), Abdel Jaidi, Anton Kukushkin, and Leon Luttenberger showcase some more advanced patterns to run your workloads using Amazon SDK for pandas. In particular, these examples demonstrated how Ray is used within the library to distribute operations for several other Amazon services, not just Amazon S3. \[hands on]
[Notation](https://github.com/notaryproject/notation?trk=cndc-detail) is a CLI project to add signatures as standard items in the registry ecosystem, and to build a set of simple tooling for signing and verifying these signatures. Jimmy Ray and Jesse Butler have put together this great post, [Announcing Container Image Signing with Amazon Signer and Amazon EKS](https://aws.amazon.com/cn/blogs/containers/announcing-container-image-signing-with-aws-signer-and-amazon-eks/?trk=cndc-detail) that walks you through the newly launched Amazon Signer Container Image Signing. This new service provides the capability for customers to have native Amazon support for signing and verifying container images stored in container registries like Amazon Elastic Container Registry (Amazon ECR). Amazon Signer is a fully managed code signing service to ensure trust and integrity of your code, and supports signing and verifying container images using Notation.
The blog post looks at some of the upstream projects that Amazon Web Services teams have been contributing to (such as kyverno-notation-Amazon Web Services and Ratify). Also don't miss Jimmy's sample repo,[ k8s-notary-admission](https://github.com/aws-samples/k8s-notary-admission?trk=cndc-detail) that provides a non-production example of how to use the Notation CLI with Amazon Signer to verify container image signatures in Kubernetes for container images stored in private Amazon ECR registries.
Make sure you check out this weeks must read post.
Amazon Glue Data Quality allows you to measure and monitor the quality of your data so that you can make good business decisions. Built on top of the open-source [DeeQu framework](https://github.com/awslabs/deequ?trk=cndc-detail), Amazon Glue Data Quality provides a managed, serverless experience.
Last week there was a series of posts that are worth checking out if you are exploring this open source framework. The series kicked off with, [Getting started with Amazon Glue Data Quality from the Amazon Glue Data Catalog](https://aws.amazon.com/cn/blogs/big-data/getting-started-with-aws-glue-data-quality-from-the-aws-glue-data-catalog/?trk=cndc-detail) where Stuti Deshpande, Jesus Max Hernandez, Divya Gaitonde, Joseph Barlan, and Aniket Jiddigoudar talk about the ease and speed of incorporating data quality rules using Amazon Glue Data Quality into your Data Catalog tables and how to run recommendations and evaluate data quality against your tables. The post provides links to the other posts that explore different dimensions of how to use Amazon Glue Data Quality. Make sure you check out this series of posts, top notch stuff.
Amazon Web Services Community Builder John Milne was a guest on Build on Open Source a while back ([check out the session](https://www.twitch.tv/videos/1770306287?collection=a7OU6drfRxfAag?trk=cndc-detail)) talking about his open source journey and showing us some of his open source projects. In his latest blog post,[ Journey of creating a new Amazon CloudFormation resource](https://dev.to/aws-builders/journey-of-creating-a-new-aws-cloudformation-resource-3483?trk=cndc-detail) shows you how to create custom Cloudformation resources and how to integrate them using ecs-compose-x and [troposphere](https://github.com/cloudtools/troposphere?trk=cndc-detail).
### **LLM roundup**
Last week we saw a number of interesting posts covering the very hot large language model space. The pick of these for me were:
• [Train a Large Language Model on a single Amazon SageMaker GPU with Hugging Face and LoRA](https://aws.amazon.com/cn/blogs/machine-learning/train-a-large-language-model-on-a-single-amazon-sagemaker-gpu-with-hugging-face-and-lora/?trk=cndc-detail) shows you how to train the 7-billion-parameter BloomZ model using just a single graphics processing unit (GPU) on Amazon SageMaker \[hands on]
• [Announcing the launch of new Hugging Face LLM Inference containers on Amazon SageMaker](https://aws.amazon.com/cn/blogs/machine-learning/announcing-the-launch-of-new-hugging-face-llm-inference-containers-on-amazon-sagemaker/?trk=cndc-detail) looks at the release of a new Hugging Face Deep Learning Container (DLC) for inference with Large Language Models (LLMs) \[hands on]
### **Big data and analytics**
• Best practices for migrating SQL Server MERGE statements to Babelfish for Aurora PostgreSQL covers one use case of the Babelfish Compass tool to convert them to equivalent Babelfish T-SQL code \[hands on]
• [Cross-account Amazon Aurora MySQL migration with Aurora cloning and binlog replication for reduced downtime](https://aws.amazon.com/cn/blogs/database/cross-account-amazon-aurora-mysql-migration-with-aurora-cloning-and-binlog-replication-for-reduced-downtime/?trk=cndc-detail) explains the various steps involved to migrate your Aurora MySQL cluster from one Amazon account to another Amazon account while setting up binary log replication to reduce the downtime \[hands on]
• [Deep dive on Amazon MSK tiered storage](https://aws.amazon.com/cn/blogs/big-data/deep-dive-on-amazon-msk-tiered-storage/?trk=cndc-detail) dives deep into the underlying infrastructure affects Apache Kafka performance when you use Amazon Managed Streaming for Apache Kafka (Amazon MSK) tiered storage \[hands on]
• [Stream data with Amazon DocumentDB, Amazon MSK Serverless, and Amazon MSK Connect](https://aws.amazon.com/cn/blogs/database/stream-data-with-amazon-documentdb-amazon-msk-serverless-and-amazon-msk-connect/?trk=cndc-detail) is a hands on guide to help you run and configure the open-source MongoDB Kafka connector to move data between Amazon MSK and Amazon DocumentDB \[hands on]
#### **Other posts and quick reads**
• [Build high-performance ML models using PyTorch 2.0 on Amazon Web Services – Part 1](https://aws.amazon.com/cn/blogs/machine-learning/part-1-build-high-performance-ml-models-using-pytorch-2-0-on-aws/?trk=cndc-detail) demonstrates the performance and ease of running large-scale, high-performance distributed ML model training and deployment using PyTorch 2.0 on Amazon Web Services \[hands on]
• [Create RESTful APIs on Amazon Web Services with OpenAPI Specification (With No Coding)](https://aws.amazon.com/cn/blogs/opensource/create-restful-apis-on-aws-with-openapi-specification-with-no-coding/?trk=cndc-detail) looks at how you can use Amazon Web Services serverless technologies and the OpenAPI specification \[hands on]
• [Announcing Amazon S3 checksums support in the Amazon SDK for Kotlin](https://aws.amazon.com/cn/blogs/developer/announcing-amazon-s3-checksums-support-in-the-aws-sdk-for-kotlin/?trk=cndc-detail) shows you how to begin using Amazon S3 checksums with the Amazon SDK for Kotlin \[hands on]
• [Orchestrate NVIDIA Isaac Sim and ROS 2 Navigation on Amazon RoboMaker with a public container image](https://aws.amazon.com/cn/blogs/iot/orchestrate-nvidia-isaac-sim-ros-2-navigation-aws-robomaker-public-container-image/?trk=cndc-detail) dives deep into how to run high-fidelity simulations using NVIDIA Isaac Sim and ROS 2 Navigation on Amazon RoboMaker \[hands on]
#### **Quick updates**
Amazon Managed Workflows for Apache Airflow (MWAA) now supports in-place version upgrades for environments version 2.x and later. Amazon MWAA is a managed orchestration service for Apache Airflow that makes it easier to set up and operate end-to-end data pipelines in the cloud. With in-place version upgrades on Amazon MWAA, customers can seamlessly upgrade their existing version 2.x environments to newer available Airflow versions while retaining the execution history and configurations, enabling them to leverage the latest capabilities of the Airflow platform. Amazon MWAA manages the entire upgrade process, from provisioning new Apache Airflow versions to upgrading the metadata database. In case of a failure during the upgrade, Amazon MWAA is designed to roll back to the previous stable version and the associated metadata database snapshot.
[Make sure you read Introducing in-place version upgrades with Amazon MWAA](https://aws.amazon.com/cn/blogs/big-data/introducing-in-place-version-upgrades-with-amazon-mwaa/?trk=cndc-detail), where Parnab Basak, Fernando Gamero, and Shubham Mehta walk you through this in depth and look at typical use cases and considerations you need to think about.
Amazon Keyspaces (for Apache Cassandra) is a scalable, serverless, highly available, and fully managed Apache Cassandra-compatible database service. Amazon Keyspaces now supports Multi-Region Replication. Amazon Keyspaces Multi-Region Replication is a new capability that provides you with automated, fully-managed, active-active replication across the Amazon Regions of your choice. You can improve both availability and resiliency from regional degradation while also benefiting from low latency local reads and writes for global applications. With Multi-Region Replication, Keyspaces asynchronously replicates data between Regions and data is typically propagated within a second. Multi-Region Replication also eliminates the difficult work of resolving update conflicts and correcting for data divergence issues, enabling you to focus on your application.
For those wanting more details, then check out this post, [Announcing Amazon Keyspaces Multi-Region Replication](https://aws.amazon.com/cn/blogs/big-data/introducing-in-place-version-upgrades-with-amazon-mwaa/?trk=cndc-detail) where Michael Raney and Meet Bhagdev look at the benefits and use cases of this new feature and demonstrate how to get started using Multi-Region Replication in Amazon Keyspaces.
Amazon EMR on EKS now supports Spark Operator and spark-submit as new job submission models for Apache Spark, in addition to the existing StartJobRun API. With today’s launch, you now have the flexibility to submit your Apache Spark jobs via your preferred submission model on Amazon EMR on EKS without needing to change your application.
Prior to today’s launch, you could only submit Apache Spark jobs via the StartJobRun API, including using the Amazon CLI and Amazon Controllers for Kubernetes (ACK). Customers with existing Apache Spark applications running Spark Operator or spark-submit would have to make changes to their applications in order to use Amazon EMR on EKS. With this feature, you can now run your applications on EMR on EKS without changing them, benefit from the EMR Spark runtime performance and features, and save time by using the spark-submit and Spark Operator you are already familiar with.
Read [Introducing Amazon EMR on EKS job submission with Spark Operator and spark-submit](https://aws.amazon.com/cn/blogs/big-data/introducing-amazon-emr-on-eks-job-submission-with-spark-operator-and-spark-submit/?trk=cndc-detail) where Lotfi Mouhib and Vara Bonthu walk you through the process of setting up and running Spark jobs using both Spark Operator and spark-submit.
Amazon Glue for Ray, a data integration engine option on Amazon Glue, is now generally available. Amazon Glue for Ray helps data engineers and ETL (extract, transform, and load) developers scale their Python jobs. Amazon Glue for Ray combines that serverless capability for data integration with Ray (ray.io), a popular new open-source compute framework that helps you scale Python workloads.
Similar to Apache Spark and Python engines on Amazon Glue, you only pay for the resources that you use while running code, and you don’t need to configure or tune the resources. Amazon Glue for Ray facilitates the distributed processing of your Python code over multi-node clusters. You can create and run Ray jobs anywhere that you can run Amazon Glue ETL jobs. This includes existing Amazon Glue jobs, command line interfaces (CLIs), and APIs. You can select the Amazon Glue for Ray engine locally or through notebooks on Amazon Glue Studio and Amazon SageMaker Studio Notebook. When the Ray job is ready, you can run it on demand or on a schedule.
Amazon Lambda now supports Ruby 3.2 as both a managed runtime and a container base image. Developers creating serverless applications in Lambda with Ruby 3.2 can take advantage of new features such as endless methods, a new Data class, improved pattern matching, and performance improvements. For more information on Lambda’s support for Ruby 3.2, see our blog post at Ruby 3.2 runtime now available in Amazon Lambda.
To deploy Lambda functions using Ruby 3.2, upload the code through the Lambda console and select the Ruby 3.2 runtime. You can also use the Amazon CLI, Amazon Serverless Application Model (Amazon SAM) and Amazon CloudFormation to deploy and manage serverless applications written in Ruby 3.2. Additionally, you can also use the Amazon-provided Ruby 3.2 base image to build and deploy Ruby 3.2 functions using a container image. To migrate existing Lambda functions running earlier Ruby versions, review your code for compatibility with Ruby 3.2 and then update the function runtime to Ruby 3.2.
To dive deeper, read the post[ Ruby 3.2 runtime now available in Amazon Web Services Lambda](https://aws.amazon.com/cn/blogs/compute/ruby-3-2-runtime-now-available-in-aws-lambda/?trk=cndc-detail) where Praveen Koorse goes into more details including stuff you need to know to get started.
#### **Videos of the week**
Check out Mike Hicks as he demo's features of the Cedar policy language during the Open Source Summit NA in Vancouver last month. You should also read the supporting post in the New Stack, [The Cedar Programming Language: Authorization Simplified](https://thenewstack.io/the-cedar-programming-language-authorization-simplified/?trk=cndc-detail)
Join Jeremy Cowan and Shane Corbett from the Containers from the Couch posse as they speak with guest Natan Yellin at how you can optimise your container resources using krr]\(<https://aws-oss.beachgeek.co.uk/2vv>), an open source project from Robusta.
**Build on Open Source**
For those unfamiliar with this show, Build on Open Source is where we go over this newsletter and then invite special guests to dive deep into their open source project. Expect plenty of code, demos and hopefully laughs. We have put together a playlist so that you can easily access all (sixteen) of the episodes of the Build on Open Source show. [Build on Open Source playlist.](https://www.twitch.tv/collections/a7OU6drfRxfAag?trk=cndc-detail)
We are currently planning the third series - if you have an open source project you want to talk about, get in touch and we might be able to feature your project in future episodes of Build on Open Source.
### **Events for your diary**
If you are planning any events in 2023, either virtual, in person, or hybrid, get in touch as I would love to share details of your event with readers.
**London Open Source Data Infrastructure Meetup Huckletree Shoreditch, June 14th 6:30PM - 9:30PM**
As part of London Tech Week Aiven are co-hosting the 'London Open Source Data Infrastructure Meetup' on the 14th of June. This event is in collaboration with Hugh Evans from Daemon the 'AI and Deep Learning for Enterprise' group. The speaker panel includes Ricardo Sueiras from Amazon Web Services discussing the wonders of Apache Airflow, Davies Oludare from Confluent shedding light on the intricate world of Kafka, and Ed Shee from Seldon breaking down the nuances of ML-powered summarisation.
Spaces are limited, so make sure you reserve your spot by [checking and out the meetup page](https://www.meetup.com/login/?returnUri=https%3A%2F%2Fwww.meetup.com%2Fuk-open-source-data-infrastructure-meetup%2Fevents%2F293469951?trk=cndc-detail)
**Open Source Festival**
**Lagos, Nigeria - June 15th-17th**
An established and essential event for all open source developers, Open Source Festival promises to be even bigger and better than previous years. I feel very privileged to have spoken at this event in 2020, and so make sure you check this out if you are in the region or perhaps if you are maybe looking to sponsor an open source event, then maybe this is the one you should check out.
Check out more on their homepage, [Open Source Festival](https://festival.oscafrica.org/?trk=cndc-detail)
**Live Amazon Web Services and Apache Hudi Workshop Online, June 22nd at 8am PT**
Nadine Farah and Soumil S are teaming up to deliver a live workshop that will mimic a typical use case, in this instance, building near real time dashboards for a fictional ride sharing company.
Make sure you reserve your place by going to the registration page, [Live AWS and Apache Hudi Workshop: Build a ride share lakehouse platform](https://www.onehouse.ai/live-aws-and-apache-hudi-workshop-build-a-ride-share-lakehouse-platform?trk=cndc-detail)
**Every other Thursday, next one 16th February**
The Cortex community call happens every two weeks on Thursday, alternating at 1200 UTC and 1700 UTC. You can check out the GitHub project for more details, go to the [Community Meetings ](https://github.com/cortexproject/cortex#community-meetings?trk=cndc-detail)section. The community calls keep a rolling doc of previous meetings, so you can catch up on the previous discussions. Check the [Cortex Community Meetings Notes](https://docs.google.com/document/d/1shtXSAqp3t7fiC-9uZcKkq3mgwsItAJlH6YW6x1joZo/edit?trk=cndc-detail) for more info.
**Every other Tuesday, 3pm GMT**
This regular meet-up is for anyone interested in OpenSearch & Open Distro. All skill levels are welcome and they cover and welcome talks on topics including: search, logging, log analytics, and data visualisation.
Sign up to the next session, [OpenSearch Community Meeting](https://www.meetup.com/OpenSearch/?trk=cndc-detail)
### **Stay in touch with open source at Amazon Web Services**
Remember to check out the [Open Source homepage](https://aws.amazon.com/cn/opensource/?opensource-all.sort-by=item.additionalFields.startDate\&opensource-all.sort-order=asc\&blog-posts-content-open-source.sort-by=item.additionalFields.createdDate\&blog-posts-content-open-source.sort-order=desc?trk=cndc-detail) to keep up to date with all our activity in open source by following us on [@Amazon Web ServicesOpen](https://twitter.com/AWSOpen?trk=cndc-detail)