Enable federated governance using Trino and Apache Ranger on Amazon EMR

{"value":"Managing data through a central data platform simplifies staffing and training challenges and reduces the costs. However, it can create scaling, ownership, and accountability challenges, because central teams may not understand the specific needs of a data domain, whether it’s because of data types and storage, security, data catalog requirements, or specific technologies needed for data processing. One of the architecture patterns that has emerged recently to tackle this challenge is the data mesh architecture, which gives ownership and autonomy to individual teams who own the data. One of the major components of implementing a data mesh architecture lies in enabling federated governance, which includes centralized authorization and audits.\n\n[Apache Ranger](https://ranger.apache.org/) is an open-source project that provides authorization and audit capabilities for Hadoop and related big data applications like Apache Hive, Apache HBase, and Apache Kafka.\n\n[Trino](https://trino.io/), on the other hand, is a highly parallel and distributed query engine, and provides federated access to data by using connectors to multiple backend systems like Hive, [Amazon Redshift](http://aws.amazon.com/redshift), and [Amazon OpenSearch Service](https://aws.amazon.com/opensearch-service/). Trino acts as a single access point to query all data sources.\n\nBy combining Trino query federation features with the authorization and audit capability of Apache Ranger, you can enable federated governance. This allows multiple [purpose-built data engines](https://aws.amazon.com/products/databases/) to function as one, with a single centralized place to manage data access controls.\n\nThis post shares details on how to architect this solution using the new EMR Ranger Trino plugin on [Amazon EMR](https://aws.amazon.com/cn/emr/?trk=cndc-detail) 6.7.\n\n\n#### **Solution overview**\n\n\n[Trino](https://trino.io/) allows you to query data in different sources, using an extensive set of connectors. This feature enables you to have a single point of entry for all data sources that can be queried through SQL.\n\nThe following diagram illustrates the high-level overview of the architecture.\n\n![image.png](https://dev-media.amazoncloud.cn/c5a466ea9e304475956e6025117bc7c8_image.png)\n\nThis architecture is based on four major components:\n\n- Windows AD, which is responsible for providing the identities of users across the system. It’s mainly composed of a key distribution center (KDC) that provides kerberos tickets to AD users to interact with the EMR cluster, and a Lightweight Directory Access Protocol (LDAP) server that defines the organization of users in logical structures.\n- An Apache Ranger server, which runs on an [Amazon Elastic Compute Cloud](http://aws.amazon.com/ec2) (Amazon EC2) instance whose lifecycle is independent from the one of the EMR cluster. 
#### **Prerequisites**

Before getting started, you must have the following prerequisites. For more information, refer to the **Prerequisites** and **Setting up your resources** sections in [Introducing Amazon EMR integration with Apache Ranger](https://aws.amazon.com/blogs/big-data/introducing-amazon-emr-integration-with-apache-ranger/).

- [Amazon Web Services Identity and Access Management](http://aws.amazon.com/iam) (IAM) roles for Apache Ranger and other Amazon Web Services services, and updates to the Amazon EC2 role for EMR. For more information, see [IAM roles for native integration with Apache Ranger](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ranger-iam.html).
- Certificates and private keys for the Apache Ranger plugins and the Apache Ranger server uploaded to [Amazon Web Services Secrets Manager](https://aws.amazon.com/secrets-manager/).
- TLS/SSL mutual authentication enabled between the Apache Ranger server and the Apache Ranger plugins running on the EMR cluster.
- A self-managed Apache Ranger server (2.x only) outside of an EMR cluster.
- A CloudWatch log group for Apache Ranger audits.
- All Amazon EMR nodes, including core and task nodes, must be able to communicate with the Apache Ranger admin servers, [Amazon Simple Storage Service](https://aws.amazon.com/cn/s3/?trk=cndc-detail) (Amazon S3), [Amazon Web Services Key Management Service](http://aws.amazon.com/kms) (AWS KMS) if using EMRFS SSE-KMS, CloudWatch, and [Amazon Web Services Security Token Service](https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html) (Amazon Web Services STS). A quick connectivity check is sketched after this list.
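
As a minimal sketch of that last check, you can probe the Ranger admin TLS endpoint (port 6182, as used throughout this post) from an EMR node; the server address is a placeholder:

```
# Run from any EMR node. <RANGER SERVER ADDRESS> is a placeholder.
# -k skips certificate verification; any HTTP status code back (even
# 401) proves the Ranger admin TLS endpoint is reachable.
curl -sk -o /dev/null -w '%{http_code}\n' \
  'https://<RANGER SERVER ADDRESS>:6182/service/public/v2/api/servicedef'
```
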
To set up the new Apache Ranger Trino plugin, the following steps are required:

1. Delete any existing Presto service definitions in the Apache Ranger admin server:

```
# Delete Presto service definition
curl -f -u <admin user login>:<ranger admin password> -X DELETE -k 'https://<RANGER SERVER ADDRESS>:6182/service/public/v2/api/servicedef/name/presto'
```

2. Download and add the new Apache Ranger service definition for Trino in the Apache Ranger admin server:

```
wget https://s3.amazonaws.com/elasticmapreduce/ranger/service-definitions/version-2.0/ranger-servicedef-amazon-emr-trino.json

curl -u <admin user login>:<ranger admin password> -X POST -d @ranger-servicedef-amazon-emr-trino.json \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-k 'https://<RANGER SERVER ADDRESS>:6182/service/public/v2/api/servicedef'
```

3. Create a new Amazon EMR security configuration for the Apache Ranger installation to include the Trino policy repository details. For more information, see [Create the EMR security configuration](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ranger-security-config.html).
4. Optionally, if you want to use the Hue UI to run Trino SQL, add the `hue` user to the Apache Ranger admin server. Run the following command on the Ranger admin server:

```
# Note: the input parameter is the Ranger host IP address
set -x
ranger_server_fqdn=$1
RANGER_HTTP_URL=https://$ranger_server_fqdn:6182

cat > hueuser.json << EOF
{
  "name": "hue1",
  "firstName": "hue",
  "lastName": "",
  "loginId": "hue1",
  "emailAddress" : null,
  "description" : "hue user",
  "password" : "user1pass",
  "groupIdList": [],
  "groupNameList": [],
  "status": 1,
  "isVisible": 1,
  "userRoleList": [ "ROLE_USER" ],
  "userSource": 0
}
EOF

# Add the user
curl -u admin:admin -v -i -s -X POST -d @hueuser.json -H "Accept: application/json" -H "Content-Type: application/json" -k $RANGER_HTTP_URL/service/xusers/secure/users
```

After you add the `hue` user, it's used to impersonate SQL calls submitted by AD users.

**Warning**: The impersonation feature should always be used carefully, to avoid giving any or all users access to elevated privileges.

![image.png](https://dev-media.amazoncloud.cn/02a5769e22624853b7a2e4fa5217349f_image.png)
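
To confirm that the Trino service definition from steps 1 and 2 was registered, you can list the service definitions through the same REST endpoint; a sketch, reusing the placeholder credentials and address from the steps above:

```
# List registered service definitions and confirm a Trino entry exists.
# Placeholders as in the steps above; any "trino" match in the JSON
# response indicates the service definition was registered.
curl -s -u <admin user login>:<ranger admin password> \
  -H "Accept: application/json" \
  -k 'https://<RANGER SERVER ADDRESS>:6182/service/public/v2/api/servicedef' \
  | grep -io 'trino' | sort -u
```
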
This post also demonstrates the capabilities of running queries against external databases, such as Amazon Redshift and PostgreSQL, using [Trino connectors](https://trino.io/docs/current/connector.html), while controlling access at the database, table, row, and column level using Apache Ranger policies. This requires you to set up the database engines you want to connect with. The following example demonstrates using the [Amazon Redshift connector](https://trino.io/docs/current/connector/redshift.html). To set up the connector, create the file `redshift.properties` under `/etc/trino/conf.dist/catalog` on all Amazon EMR nodes and restart the Trino server.

- Create the `redshift.properties` property file on all the Amazon EMR nodes:

```
# Create a new redshift.properties file
touch /etc/trino/conf.dist/catalog/redshift.properties
```

- Update the property file with the Amazon Redshift cluster details:

```
connector.name=redshift
connection-url=jdbc:redshift://XXXXX:5439/dev
connection-user=XXXX
connection-password=XXXXX
```

- Restart the Trino server:

```
# Restart Trino server
sudo systemctl stop trino-server.service
sudo systemctl start trino-server.service
```

- In a production environment, you can automate this step by using the following EMR classification:

```
{
  "Classification": "trino-connector-redshift",
  "Properties": {
    "connector.name": "redshift",
    "connection-url": "jdbc:redshift://XXXXX:5439/dev",
    "connection-user": "XXXX",
    "connection-password": "XXXX"
  }
}
```
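
After the restart, the new catalog (named after the properties file, so `redshift` here) should be queryable. As a quick check, a sketch that lists its schemas from the Trino CLI; add the same illustrative Kerberos options shown earlier if your cluster requires them:

```
# The catalog name matches the redshift.properties file name.
# Server port is illustrative; add --krb5-* options on a Kerberized cluster.
trino-cli --server https://$(hostname -f):8446 \
  --execute 'SHOW SCHEMAS FROM redshift'
```
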
#### **Test your setup**

In this section, we go through an example where the data is distributed across Amazon Redshift for dimension tables and Hive for fact tables. We can use Trino to join data between these two engines.

On Amazon Redshift, let's define a new dimension table called `products` and load it with data:

```
--- Set up the products table in Redshift
 > create table public.products 
 (company VARCHAR, link VARCHAR, price FLOAT, product_category VARCHAR, 
 release_date VARCHAR, sku VARCHAR);

--- Copy data from S3

 > COPY public.products
 FROM 's3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger/data/staging/products/'
 IAM_ROLE '<XXXXXXXXX>'
 FORMAT AS PARQUET;
```

Then use the Hue UI to create the Hive external table `orders`:

```
CREATE EXTERNAL TABLE IF NOT EXISTS default.orders 
(customer_id STRING, order_date STRING, price DOUBLE, sku STRING)
STORED AS PARQUET
LOCATION 's3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger/data/staging/orders';
```

Now let's use Trino to join both datasets:

```
-- Join the dimension table in Redshift (products) with the fact table in Hive (orders)
-- to get the sum of sales by product_category and sku
SELECT sum(orders.price) total_sales, products.sku, products.product_category
FROM hive.default.orders JOIN redshift.public.products ON orders.sku = products.sku
GROUP BY products.sku, products.product_category LIMIT 10
```

The following screenshot shows our results.

![image.png](https://dev-media.amazoncloud.cn/732aa991f2454527afb98997d52eda26_image.png)

##### **Row filtering and column masking**

Apache Ranger supports policies to allow or deny access based on several attributes, including user, group, and predefined roles, as well as dynamic attributes like IP address and time of access. In addition, the model supports authorization based on the classification of resources such as PII, FINANCE, and SENSITIVE.

Another feature is the ability to allow users to access only a subset of rows in a table, or to restrict users to masked or redacted values of sensitive data. Examples of this include restricting users to only the records of customers located in the same country where the user works, or allowing a doctor to see only the records of patients associated with that doctor.

The following screenshots show how, by using Trino Ranger policies, you can enable row filtering and column masking of data in Amazon Redshift tables.

The example policy masks the `firstname` column, and applies a filter condition on the `city` column to restrict users to view rows for a specific city only.

![image.png](https://dev-media.amazoncloud.cn/608a24587efe48cd863e64d281773abd_image.png)

![image.png](https://dev-media.amazoncloud.cn/4b56f9e60fe4467d98e8be5603732504_image.png)

The following screenshot shows our results.

![image.png](https://dev-media.amazoncloud.cn/52433fdb5b9245b894d7a8cf60ba57b8_image.png)
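
To verify a policy like this end to end, you can run the same query as a restricted AD user and confirm the masking and filtering take effect. A sketch, assuming a hypothetical `customers` table in the Redshift catalog and the illustrative Kerberos settings from earlier:

```
# Query the protected table as a restricted AD user. With the policy in
# place, firstname should come back masked and only rows matching the
# policy's city filter should be returned. Table name, server port, and
# Kerberos values are illustrative.
trino-cli --server https://$(hostname -f):8446 \
  --krb5-principal 'analyst1@<YOUR_AD_REALM>' \
  --krb5-keytab-path /path/to/analyst1.keytab \
  --krb5-remote-service-name trino \
  --execute 'SELECT firstname, city FROM redshift.public.customers LIMIT 10'
```
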
##### **Dynamic row filtering using user session context**

The Trino Ranger plugin passes down Trino session data like `current_user()` that you can use in the policy definition. This can vastly simplify your row filter conditions by removing the need for hardcoded values and a per-user mapping lookup: a single filter expression that compares a column with the session user applies to everyone. For more details on dynamic row filtering, refer to the post *Row-level filtering and column-masking using Apache Ranger policies in Apache Hive*.

##### **Known issue with Amazon EMR 6.7**

Amazon EMR 6.7 has a known issue when enabling Kerberos one-way trust with Microsoft Windows AD. Run [this bootstrap script](https://github.com/aws-samples/aws-emr-apache-ranger/blob/main/aws_emr_blog_v3/scripts/remove-yum-package-name-validator.sh) following [these instructions](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html#bootstrapCustom) as part of the cluster launch.

#### **Limitations**

When using this solution, keep in mind the following limitations:

- Non-Kerberos clusters are not supported.
- Audit logs are not visible on the Apache Ranger UI, because they are sent to CloudWatch.
- The Amazon Web Services Glue Data Catalog isn't supported as the Apache Hive Metastore.
- The integration between Amazon EMR and Apache Ranger limits the available applications. For a full list, refer to [Application support and limitations](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ranger-app-support.html).

#### **Troubleshooting**

If you can't log in to the EMR cluster's nodes as an AD user and you get the error message `Permission denied, please try again`, restart the SSSD service:

![image.png](https://dev-media.amazoncloud.cn/c23c7c0e9fd24a8792096588f89ebd8e_image.png)

```
sudo service sssd restart
```

If you're unable to download policies from the Ranger admin server and you get the error message `Error getting policies` with HTTP status code 400, the cause is either an expired certificate or an incorrectly set up Ranger policy definition.

To fix this, check the Ranger admin logs. If they show the following error, it's likely that the certificates have expired:

```
VXResponse@347f4bdbstatusCode={1} msgDesc={Unauthorized access - unable to get client certificate} messageList={[VXMessage={org.apache.ranger.view.VXMessage@7ebc298cname={OPER_NOT_ALLOWED_FOR_ENTITY} rbKey={xa.error.oper_not_allowed_for_state} message={Operation not allowed for entity} objectId={null} fieldName={null} }]} }
```

Perform the following steps to address the issue:

- Recreate the certificates using the [create-tls-certs.sh](https://github.com/aws-samples/aws-emr-apache-ranger/blob/main/aws_emr_blog_v3/scripts/emr-tls/create-tls-certs.sh) script and upload them to Secrets Manager.
- Update the Ranger admin server configuration with the new certificate, and restart the Ranger admin service.
- Create a new EMR security configuration using the new certificate, and relaunch the EMR cluster using the new security configuration.

The issue can also be caused by a misconfigured Ranger policy definition. The Ranger admin service policy definition should trust the self-signed certificate chain. Make sure the following configuration attribute for the service definitions has the correct domain name or pattern to match the domain name used for your EMR cluster nodes.

![image.png](https://dev-media.amazoncloud.cn/97398bd049bf48ae8faf5cee1889fe9a_image.png)

If the EMR cluster keeps failing with the error message `Terminated with errors: An internal error occurred`, check the Amazon EMR primary node secret agent logs.

![image.png](https://dev-media.amazoncloud.cn/14b7829d10ff4017ad93e4079193ae1e_image.png)

If they show the following message, the cluster is failing because the specified CloudWatch log group doesn't exist:

```
Exception in thread "main" com.amazonaws.services.logs.model.ResourceNotFoundException: The specified log group does not exist. (Service: AWSLogs; Status Code: 400; Error Code: ResourceNotFoundException; Request ID: d9fa2ef1-17cb-4b8f-a07f-8a6aea245173; Proxy: null)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415)
 at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384)
```

A query run through `trino-cli` might fail with the error `Unable to obtain password from user`. For example:

```
ERROR remote-task-callback-42 io.trino.execution.StageStateMachine Stage 20220422_023236_00005_zn4vb.1 failed
com.google.common.util.concurrent.UncheckedExecutionException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: javax.security.auth.login.LoginException: Unable to obtain password from user
```

This issue can occur due to incorrect realm names in the `/etc/trino/conf.dist/catalog/hive.properties` file. Check the domain or realm name and other Kerberos-related configs there. Optionally, also check the `/etc/trino/conf.dist/trino-env.sh` and `/etc/trino/conf.dist/config.properties` files in case any config changes have been made.
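
A quick, read-only way to review those Kerberos and realm settings in one pass is sketched below; the paths are the EMR defaults referenced in this post:

```
# Review Kerberos/realm-related settings across the Trino config files
# referenced above. Read-only; run on the EMR primary node.
grep -iE 'kerberos|krb5|principal|realm' \
  /etc/trino/conf.dist/catalog/hive.properties \
  /etc/trino/conf.dist/config.properties \
  /etc/trino/conf.dist/trino-env.sh
```
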
#### **Clean up**

Clean up the resources created either manually or by the Amazon Web Services CloudFormation template provided in the GitHub repo to avoid unnecessary cost to your Amazon Web Services account. You can delete the CloudFormation stack by selecting the stack on the Amazon Web Services CloudFormation console and choosing **Delete**. This action deletes all the resources the stack provisioned. If you manually updated a template-provisioned resource, you may encounter issues during cleanup and need to clean those up independently.

#### **Conclusion**

A data mesh approach encourages the idea of data domains where each domain team owns their data and is responsible for its quality and accuracy. This draws parallels with a microservices architecture. Building federated data governance like we show in this post is at the core of implementing a data mesh architecture. Combining the powerful query federation capabilities of Trino with the centralized authorization and audit capabilities of Apache Ranger provides an end-to-end solution to operate and govern a data mesh platform.

In addition to the already available Ranger integration capabilities for Apache SparkSQL, Amazon S3, and Apache Hive, starting from the 6.7 release, Amazon EMR includes a plugin for the Ranger Trino integration. For more information, refer to the [EMR Trino Ranger plugin](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ranger-trino.html).

##### **About the authors**

![image.png](https://dev-media.amazoncloud.cn/7d39f4e0f6504e908dc7feb98f713201_image.png)

**Varun Rao Bhamidimarri** is a Sr. Manager on the Amazon Web Services Analytics Specialist Solutions Architect team. His focus is helping customers with the adoption of cloud-enabled analytics solutions to meet their business requirements. Outside of work, he loves spending time with his wife and two kids, staying healthy, meditating, and he recently picked up gardening during the lockdown.

![image.png](https://dev-media.amazoncloud.cn/7940710fc5a14602bea03fba3db00ee7_image.png)

**Partha Sarathi Sahoo** is an Analytics Specialist TAM at Amazon Web Services based in Sydney, Australia. He brings 15+ years of technology expertise and helps enterprise customers optimize analytics workloads. He has worked extensively on both on-premises and cloud big data workloads, along with various ETL platforms, in his previous roles. He also actively conducts proactive operational reviews around analytics services like Amazon EMR, Redshift, and OpenSearch.

![image.png](https://dev-media.amazoncloud.cn/469c604359e14590a607f0a45ecca6fe_image.png)

**Anis Harfouche** is a Data Architect at Amazon Web Services Professional Services. He helps customers achieve their business outcomes by designing, building, and deploying data solutions based on Amazon Web Services services.