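### **Industry Background and User Requirements**
The widespread use of Transformer models in intelligent driving algorithms places higher demands on compute, networking and storage, security, and operations. Developing and validating these algorithms requires more powerful computing resources, stable high-speed network connectivity, sufficient storage capacity on fast storage devices, effective data protection to keep data secure and confidential, and efficient operations tooling to keep services available.
* **Compute requirements**: Transformer models typically contain a large number of parameters and complex computations, so they are compute-intensive. Developing and validating intelligent driving algorithms involves large-scale training and optimization as well as real-time inference and decision-making. This calls for more powerful computing resources, such as high-performance GPUs or dedicated AI chips, so that training, inference, and experiments run efficiently. During development and validation, companies often purchase high-performance development kits in bulk, but the high up-front cost and long delivery lead times of a large one-off purchase are major challenges.
* **Networking and storage requirements**: Developing and validating intelligent driving algorithms involves moving large volumes of data, such as sensor data, model files, and logs. Distributed computing and collaborative development need stable, high-speed network connections, sufficient storage capacity, and fast storage devices to support data sharing and collaboration. Real-time data collection and processing additionally need low-latency connections so that data is transferred quickly. Purchasing and deploying network and storage equipment in-house costs a company significant money and time.
* **Data security requirements**: Large volumes of data flow through the development and validation process, and protecting that data is critical. The development and validation environment therefore needs strict data security measures, such as data encryption, access control, and secure transport protocols, to prevent data leaks and unauthorized access. Data security is a broad and complex field, and building and maintaining a complete data security system requires substantial funding and specialist staff.
* **Operations requirements**: To keep development and validation running, a company needs a dedicated team that continuously installs, configures, monitors, and troubleshoots the hardware, network, storage, and security stack. Supporting intelligent driving algorithm work raises the bar for this team, which needs the relevant skills and experience to manage and maintain that environment effectively.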
![image.png](https://dev-media.amazoncloud.cn/5e18fafc3ff240e6aff87a5c2788ded6_image.png "image.png")
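This article describes how Lenovo and Amazon Web Services, building on the SOAFEE (Scalable Open Architecture For Embedded Edge) architecture and focusing on model development, simulation, and validation within the intelligent driving data loop, jointly explored a cloud-edge integrated hybrid cloud supercomputing platform for intelligent driving.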
![image.png](https://dev-media.amazoncloud.cn/38458afe4a0640f394ea2aa15f0a9560_image.png "image.png")
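The platform gives automotive software teams a way to free their members (designers, developers, integrators, testers, and operations staff) from hardware and location constraints so they can develop software more efficiently. Because the platform is delivered as a cloud service, computing resources are available on a pay-as-you-go basis, which lowers cost and risk. Compared with traditional virtual simulation systems or cloud development platforms, the solution offers clear advantages.
### **Solution**
The solution uses Amazon Web Services cloud services together with on-premises compute to build a cloud-edge integrated hybrid cloud supercomputing platform for intelligent driving. With it, developers can develop and test intelligent driving systems more efficiently while reducing hardware cost and operational effort.
To manage compute effectively, **a SOCA (Scale-Out Computing on AWS) platform is built on Amazon Web Services. It dynamically allocates cloud and on-premises resources according to task requirements so that each task runs in a suitable environment, which lets it schedule intelligent driving development and validation tasks efficiently.**
To provide compute for validation, Drive AGX Orin Kits are deployed in Lenovo's on-premises data center. The Orin SoC in the Drive AGX Orin Kit is an automotive-grade compute chip, and the kits run HiL validation on collected data stored in S3/EFS in the cloud, giving algorithm development and validation a solid compute foundation. To protect data in transit and the network, the Drive AGX Orin Kits are connected to the SOCA platform over a VPN.
To keep the service flexible, the platform offers customers cloud-based SiL and on-premises HiL development and validation of intelligent driving algorithms as a cloud service. **This model lets customers scale the service as demand grows and gives them convenient remote access, so they can use it anytime, anywhere.**
### **System Architecture Diagram**
The SOCA platform is built in the cloud and handles user login management, the web UI, storage management, and cluster job scheduling and management. Public and Private Subnets isolate the resources: cloud and on-premises compute resources sit in the Private Subnet, and on-premises devices connect to the cloud through OpenVPN and transfer data over the VPN. The web UI and the scheduler are both deployed in the Public Subnet.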
![image.png](https://dev-media.amazoncloud.cn/f36497d1186947d397f217933db54118_image.png "image.png")
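The main system components and their functions are:
1. An [Amazon OpenSearch Service](https://aws.amazon.com/cn/opensearch-service/?trk=cndc-detail) cluster stores job and host information.
2. [Elastic Load Balancing](https://aws.amazon.com/cn/elasticloadbalancing/?trk=cndc-detail) keeps the platform reachable across Availability Zones and its resources highly available.
3. An Amazon Elastic Compute Cloud (Amazon EC2) instance acts as the scheduler and dynamically provisions the cloud resources that submitted jobs need. The same instance hosts the SOCA web UI, through which users and administrators interact with the environment.
4. 2D or 3D workstations running NICE Desktop Cloud Visualization (DCV) serve as remote desktops for submitting batch jobs and running GUI tools.
5. An Amazon EC2 instance runs the OpenVPN Server, which encrypts the connection between cloud and on-premises resources. The VPN IP range is added to the route table, and the server protects the internal resources in the Private Subnet.
6. On premises, NVIDIA Drive AGX Orin Kits serve as HiL simulation compute resources and join the Amazon Web Services network through an OpenVPN client.
7. Secrets Manager, Certificate Manager, security groups, and Identity and Access Management (IAM) secure the platform.
8. [Amazon Elastic File System](https://aws.amazon.com/cn/efs/?trk=cndc-detail) ([Amazon EFS](https://aws.amazon.com/cn/efs/?trk=cndc-detail)) stores customer data, [Amazon Simple Storage Service](https://aws.amazon.com/cn/s3/?trk=cndc-detail) ([Amazon S3](https://aws.amazon.com/cn/s3/?trk=cndc-detail)) stores persistent logs, and [Amazon FSx for Lustre](https://aws.amazon.com/cn/fsx/lustre/?trk=cndc-detail) is available as an optional parallel file system.
9. Lambda validates the required prerequisites and creates a default signed certificate for the Application Load Balancer (ALB), which manages access to DCV workstation sessions.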
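### **Steps to Build the System**
##### **Prerequisites**
* An Amazon Web Services account with Admin user permissions; the deployment targets the Beijing Region
* NVIDIA Drive AGX Orin Kit
##### **This walkthrough has four parts**:
1. Build the SOCA platform on Amazon Web Services
2. Set up the on-premises Drive AGX Orin environment
3. Build a VPN that encrypts the connection between Amazon Web Services and the on-premises Orin
4. Configure PBS on the Orin so that SOCA can schedule and manage it
##### **Implementation Details**
###### **1. Build the SOCA platform on Amazon Web Services**
SOCA (Scale-Out Computing on AWS) is an open-source solution that helps customers deploy and operate a multi-user environment for compute-intensive workflows such as computer-aided engineering (CAE). This walkthrough uses SOCA as the compute scheduling and task orchestration platform, and modifies SOCA's functionality to build a hybrid cloud supercomputing platform.
For the base SOCA deployment, see the Scale-Out Computing solution page on Amazon Web Services:
https://aws.amazon.com/cn/solutions/implementations/scale-out-computing-on-aws/?trk=cndc-detail
###### **2. Set up the on-premises Drive AGX Orin Kit environment**
Deploy the NVIDIA Drive AGX Orin Kit on premises and connect the device to the internet.
Set up the following software environment on the NVIDIA Drive AGX Orin Kit; refer to the relevant guides for the installation procedure: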
* NVIDIA DRIVE OS 6.0.5 (Ubuntu 20.04)
* CUDA 11.4
* DriveWorks 5.8
* TensorRT 8.4.12
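###### **3. Build a VPN that encrypts the connection between Amazon Web Services and the on-premises Drive Orin**
The VPN connects the NVIDIA Drive AGX Orin Kit to the SOCA platform. It secures data in transit and merges the Orin into SOCA's network environment, which allows SOCA to manage and schedule the NVIDIA Drive AGX Orin Kit.
The OpenVPN Server runs in the cloud on an EC2 instance in the Public Subnet.
Note: when configuring the OpenVPN Server instance, set its source/destination check to Stop. Add the OpenVPN address range to the security groups of the Private and Public Subnets.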
![image.png](https://dev-media.amazoncloud.cn/b1a4cd8c5af542f8a4048fb46832e8a7_image.png "image.png")
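The console settings in the note above can also be expressed as AWS CLI calls. The following is a minimal dry-run sketch that only prints the commands, so nothing is changed in the account. The function name, all resource IDs, and the `10.8.0.0/24` VPN range are hypothetical placeholders, and opening every TCP port to the VPN range is an assumption that should be narrowed for a real deployment.

```shell
#!/bin/sh
# Dry-run sketch: print the AWS CLI calls that correspond to the console
# settings described above. Nothing is executed against the account.
# All IDs and the CIDR passed in are hypothetical placeholders.

vpn_network_commands() (
    instance_id=$1   # EC2 instance running the OpenVPN server
    sg_id=$2         # security group attached to the subnet resources
    rtb_id=$3        # route table of the Private Subnet
    vpn_cidr=$4      # address range handed out to OpenVPN clients

    # 1. Stop the source/destination check so the instance can forward VPN traffic.
    echo "aws ec2 modify-instance-attribute --instance-id $instance_id --no-source-dest-check"
    # 2. Allow TCP traffic from the VPN range (assumption: all ports; narrow as needed).
    echo "aws ec2 authorize-security-group-ingress --group-id $sg_id --protocol tcp --port 0-65535 --cidr $vpn_cidr"
    # 3. Route the VPN range through the OpenVPN server instance.
    echo "aws ec2 create-route --route-table-id $rtb_id --destination-cidr-block $vpn_cidr --instance-id $instance_id"
)

vpn_network_commands i-0123456789abcdef0 sg-0123456789abcdef0 rtb-0123456789abcdef0 10.8.0.0/24
```

Review the printed commands and run them manually once the IDs match the resources created by the SOCA deployment.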
Deploy the OpenVPN client on the NVIDIA Drive AGX Orin Kit. For a reference deployment method, see "Connect using an OpenVPN client application":
https://docs.aws.amazon.com/zh_cn/vpn/latest/clientvpn-user/linux.html?trk=cndc-detail
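###### **4. Install and configure OpenPBS on the NVIDIA Drive AGX Orin Kit so that SOCA can schedule it (run the following steps on the Orin unless stated otherwise)**
1\. OpenPBS installation scripts, three files in total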
* environment
```
export SOCA_CONFIGURATION=soca-lenovo
export SOCA_BASE_OS=ubuntu20
export SOCA_JOB_QUEUE=alwayson
export SOCA_JOB_OWNER=socaadmin
export SOCA_JOB_NAME=always_on_capacity
export SOCA_JOB_PROJECT=False
export SOCA_VERSION=2.7.4
export SOCA_JOB_EFA=false
export SOCA_JOB_ID=184dab82-151a-4678-be3f-03b5930560c9
export SOCA_SCRATCH_SIZE=0
export SOCA_INSTALL_BUCKET=soca-lenovo-beijing
export SOCA_INSTALL_BUCKET_FOLDER=soca-lenovo
export SOCA_FSX_LUSTRE_BUCKET=false
export SOCA_FSX_LUSTRE_DNS=false
export SOCA_INSTANCE_TYPE=t4g.large
export SOCA_INSTANCE_HYPERTHREADING=false
export SOCA_SYSTEM_METRICS=false
export SOCA_OSDOMAIN_ENDPOINT=vpc-soca-lenovo-larhtlupw25x3ub7ni2mk3zndi.cn-north-1.es.amazonaws.com.cn
export SOCA_ANALYTICS_ENGINE=elasticsearch
export SOCA_AUTH_PROVIDER=openldap
export SOCA_HOST_SYSTEM_LOG=/apps/soca/soca-lenovo/cluster_node_bootstrap/logs/184dab82-151a-4678-be3f-03b5930560c9/ip-10-0-197-101
export AWS_STACK_ID=soca-lenovo-keepforever-alwayson-184dab82-151a-4678-be3f-03b5930560c9
export AWS_DEFAULT_REGION=cn-north-1
```
Adjust these parameter values to match the resources created when SOCA was installed, using the figure below as a reference:
![image.png](https://dev-media.amazoncloud.cn/891dc2a98cc742acb99b44926adde3ed_image.png "image.png")
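Because the install script sources this file and branches on several of its variables, it can help to check the file before running anything. The sketch below reports which variables are missing or empty. The helper name `missing_vars` and the choice of required variables are assumptions made for illustration, not part of SOCA.

```shell
#!/bin/sh
# Sketch: report which SOCA variables an environment file fails to define.
# The list of required names is an assumption chosen for illustration.

REQUIRED_VARS="SOCA_CONFIGURATION SOCA_BASE_OS SOCA_JOB_QUEUE SOCA_AUTH_PROVIDER"

missing_vars() (
    env_file=$1
    # The dot command searches PATH for bare names, so force a path with a slash.
    case $env_file in
        */*) ;;
        *) env_file=./$env_file ;;
    esac
    # Runs in a subshell: clear inherited values so only the file counts.
    for name in $REQUIRED_VARS; do
        unset "$name"
    done
    . "$env_file"
    # Print the name of every variable that is still unset or empty.
    for name in $REQUIRED_VARS; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
)
```

Calling `missing_vars /etc/environment` prints one variable name per line and prints nothing when all of them are set.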
* config.cfg
```
# Python
PYTHON_VERSION="3.9.16"
PYTHON_TGZ="Python-3.9.16.tgz"
# PYTHON_URL="https://www.python.org/ftp/python/3.9.16/Python-3.9.16.tgz"
PYTHON_URL="https://s3.cn-northwest-1.amazonaws.com.cn/soca-china-2.7.4-northwest1/download/Python-3.9.16.tgz"
PYTHON_HASH="38c99c7313f416dcf3238f5cf444c6c2"
# Scheduler
OPENPBS_VERSION="22.05.11"
OPENPBS_TGZ="openpbs-22.05.11.tar.gz"
OPENPBS_URL="https://s3.cn-northwest-1.amazonaws.com.cn/soca-china-2.7.4-northwest1/download/openpbs-22.05.11.tar.gz"
OPENPBS_HASH="1d687431da849a952eee738201840094"
# OpenMPI
OPENMPI_VERSION="4.1.5"
OPENMPI_TGZ="openmpi-4.1.5.tar.gz"
OPENMPI_URL="https://s3.cn-northwest-1.amazonaws.com.cn/soca-china-2.7.4-northwest1/download/openmpi-4.1.5.tar.gz"
OPENMPI_HASH="2593008bea4bc721b9f304428abbf94b"
# DCV
DCV_X86_64_VERSION="2023.0-14852-el7-x86_64"
DCV_X86_64_TGZ="nice-dcv-2023.0-14852-el7-x86_64.tgz"
DCV_X86_64_URL="https://s3.cn-northwest-1.amazonaws.com.cn/soca-china-2.7.4-northwest1/download/nice-dcv-2023.0-14852-el7-x86_64.tgz"
DCV_X86_64_HASH="7cc461ffa3477a8ab1d49416d27db47a"
DCV_AARCH64_VERSION="2023.0-14852-el7-aarch64"
DCV_AARCH64_TGZ="nice-dcv-2023.0-14852-el7-aarch64.tgz"
DCV_AARCH64_URL="https://s3.cn-northwest-1.amazonaws.com.cn/soca-china-2.7.4-northwest1/download/nice-dcv-2023.0-14852-el7-aarch64.tgz"
DCV_AARCH64_HASH="97a3547e123142f0927810039a503850"
# EFA
EFA_VERSION="1.22.1"
EFA_TGZ="aws-efa-installer-1.22.1.tar.gz"
EFA_URL="https://s3.cn-northwest-1.amazonaws.com.cn/soca-china-2.7.4-northwest1/download/aws-efa-installer-1.22.1.tar.gz"
EFA_HASH="600c0ad7cdbc06e8e846cb763f92901b"
# Metric Beat (Deprecated after migration to OpenSearch). See computeNodeConfigureMetrics.sh to re-enable if needed
METRICBEAT_RPM="metricbeat-oss-7.6.2-x86_64.rpm"
METRICBEAT_URL="https://s3.cn-northwest-1.amazonaws.com.cn/soca-china-2.7.4-northwest1/download/metricbeat-oss-7.6.2-x86_64.rpm"
METRICBEAT_HASH="631a7e53a47c53b092f64db9cd8a96a8"
# SSM
SSM_X86_64_URL="https://s3.cn-northwest-1.amazonaws.com.cn/soca-china-2.7.4-northwest1/download/amazon-ssm-agent.rpm"
SSM_AARCH64_URL="https://s3.cn-northwest-1.amazonaws.com.cn/soca-china-2.7.4-northwest1/download/amazon-ssm-agent.rpm"
#CHINA
PIP_CHINA_MIRROR="https://opentuna.cn/pypi/web/simple"
CENTOS_CHINA_REPO="https://soca-china-deployment.s3.cn-northwest-1.amazonaws.com.cn/scale-out-computing-on-aws/v2.7.0/CentOS-Base-china.repo"
# NVM
NVM_INSTALL_SCRIPT="install.sh"
NVM_URL="https://mylab-soca.s3.cn-northwest-1.amazonaws.com.cn/nvm-sh/nvm/v0.38.0/install.sh"
NVM_HASH="88725c9e15c45165fba796d63aa0a6ce"
# EPEL
EPEL_RPM="epel-release-latest-9.noarch.rpm"
EPEL_URL="https://s3.cn-northwest-1.amazonaws.com.cn/soca-china-2.7.4-northwest1/download/epel-release-latest-9.noarch.rpm"
# System libraries
SYSTEM_PKGS=(
wget
chrony
cpp
e2fsprogs
e2fsprogs-libs
gcc
gcc-c++
gcc-gfortran
glibc
glibc-common
glibc-devel
glibc-headers
gssproxy
htop
kernel
kernel-devel
kernel-headers
keyutils
keyutils-libs-devel
krb5-devel
krb5-libs
libbasicobjects
libcollection
libcom_err
libcom_err-devel
libevent
libffi-devel
libgcc
libgfortran
libgomp
libini_config
libkadm5
libmpc
libnfsidmap
libpath_utils
libquadmath
libquadmath-devel
libref_array
libselinux
libselinux-devel
libselinux-python
libselinux-utils
libsepol
libsepol-devel
libss
libstdc++
libstdc++-devel
libtalloc
libtevent
libtirpc
libverto-devel
libverto-tevent
libglvnd-devel
mpfr
mdadm
nvme-cli
elfutils-libelf-devel
nfs-utils
git
htop
jq
openssl
openssl-devel
openssl-libs
pcre
pcre-devel
perl
perl-Carp
perl-Encode
perl-Env
perl-Exporter
perl-File-Path
perl-File-Temp
perl-Filter
perl-Getopt-Long
perl-HTTP-Tiny
perl-PathTools
perl-Pod-Escapes
perl-Pod-Perldoc
perl-Pod-Simple
perl-Pod-Usage
perl-Scalar-List-Utils
perl-Socket
perl-Storable
perl-Switch
perl-Text-ParseWords
perl-Time-HiRes
perl-Time-Local
perl-constant
perl-libs
perl-macros
perl-parent
perl-podlators
perl-threads
perl-threads-shared
quota
quota-nls
rpcbind
sqlite-devel
system-lsb
nss-pam-ldapd
tcp_wrappers
vim
zlib
zlib-devel
redhat-lsb
)
SCHEDULER_PKGS=(
dejavu-fonts-common
dejavu-sans-fonts
fontconfig
fontpackages-filesystem
freetype
htop
hwloc
hwloc-libs
libICE
libSM
libX11
libX11-common
libX11-devel
libXau
libXft
libXrender
libical
libpng
libtool-ltdl
libxcb
postgresql
postgresql-contrib
postgresql-devel
postgresql-libs
postgresql-server
tcl
tk
rpm-build
libtool
hwloc-devel
libXt-devel
libedit-devel
libical-devel
ncurses-devel
perl
python3
python3-pip
python3-devel
tcl-devel
tk-devel
swig
expat-devel
openssl-devel
libXext
libXft
autoconf
automake
)
OPENLDAP_SERVER_PKGS=(
compat-openldap
cyrus-sasl
cyrus-sasl-devel
openldap
openldap-clients
openldap-devel
openldap-servers
unixODBC
unixODBC-devel
)
SSSD_PKGS=(
adcli
avahi-libs
bind-libs
bind-libs-lite
bind-license
bind-utils
c-ares
cups-libs
cyrus-sasl-gssapi
http-parser
krb5-workstation
libdhash
libipa_hbac
libldb
libsmbclient
libsss_autofs
libsss_certmap
libsss_idmap
libsss_nss_idmap
libsss_sudo
libtalloc
libtdb
libtevent
libwbclient
oddjob
oddjob-mkhomedir
python-sssdconfig
realmd
samba-client-libs
samba-common
samba-common-libs
samba-common-tools
sssd
sssd-ad
sssd-client
sssd-common
sssd-common-pac
sssd-ipa
sssd-krb5
sssd-krb5-common
sssd-ldap
sssd-proxy
)
# Packages to install Gnome on Amazon Linux 2
DCV_AMAZONLINUX_PKGS=(
gdm
gnome-session
gnome-classic-session
gnome-session-xsession
gnome-terminal
gnu-free-fonts-common
gnu-free-mono-fonts
gnu-free-sans-fonts
gnu-free-serif-fonts
xorg-x11-server-Xorg
xorg-x11-server-utils
xorg-x11-utils
)
# for Ubuntu18.04
DCV_UBUNTU_X86_64_VERSION="2022.0-12123-ubuntu1804-x86_64"
DCV_UBUNTU_X86_64_TGZ="nice-dcv-2022.0-12123-ubuntu1804-x86_64.tgz"
DCV_UBUNTU_X86_64_URL="https://d1uj6qtbmh3dt5.cloudfront.net/2022.0/Servers/nice-dcv-2022.0-12123-ubuntu1804-x86_64.tgz"
DCV_UBUNTU_X86_64_HASH="7816e6ebb8c83c1f3efba778846f0e7b"
# for Ubuntu18.04
SYSTEM_PKGS_UBUNTU18=(
gcc
make
libtool
libhwloc-dev
libx11-dev
libxt-dev
libedit-dev
libical-dev
ncurses-dev
perl
postgresql-server-dev-all
postgresql-contrib
python3-dev
tcl-dev
tk-dev
swig
libexpat-dev
libssl-dev
libxext-dev
libxft-dev
autoconf
automake
g++
net-tools
)
# for Ubuntu18.04
OPENLDAP_SERVER_PKGS_UBUNTU18=(
slapd
ldap-utils
unixodbc
unixodbc-dev
)
# for Ubuntu18.04
SSSD_PKGS_UBUNTU18=(
libavahi-common3
libbind9-160
bind9utils
libc-ares2
libcups2
libsasl2-2
libsasl2-modules-ldap
libsasl2-modules-gssapi-heimdal
libhttp-parser2.7.1
krb5-kdc
krb5-admin-server
krb5-config
libdhash1
python3-libipa-hbac
libldb1
libsmbclient
libsss-certmap0
libsss-idmap0
libsss-nss-idmap0
python3-libsss-nss-idmap
libsss-simpleifp0
libsss-sudo
libtalloc2
libtdb1
libtevent0
libwbclient0
oddjob
oddjob-mkhomedir
realmd
samba-common
sssd
sssd-ad
sssd-common
sssd-ipa
sssd-krb5
sssd-krb5-common
sssd-ldap
sssd-proxy
sssd-tools
)
# for Ubuntu20.04
DCV_UBUNTU20_X86_64_VERSION="2022.0-11954-ubuntu2004-x86_64"
DCV_UBUNTU20_X86_64_TGZ="nice-dcv-2022.0-11954-ubuntu2004-x86_64.tgz"
DCV_UBUNTU20_X86_64_URL="https://d1uj6qtbmh3dt5.cloudfront.net/2022.0/Servers/nice-dcv-2022.0-11954-ubuntu2004-x86_64.tgz"
DCV_UBUNTU20_X86_64_HASH="2639b0b0cd35b2cf27e8e301d57c9e31"
# for Ubuntu20.04
SYSTEM_PKGS_UBUNTU20=(
gcc
make
libtool
libhwloc-dev
libx11-dev
libxt-dev
libedit-dev
libical-dev
ncurses-dev
perl
postgresql-server-dev-all
postgresql-contrib
python3-dev
tcl-dev
tk-dev
swig
libexpat-dev
libssl-dev
libxext-dev
libxft-dev
autoconf
automake
g++
net-tools
nfs-common
)
# for Ubuntu20.04
OPENLDAP_SERVER_PKGS_UBUNTU20=(
slapd
ldap-utils
unixodbc
unixodbc-dev
)
# for Ubuntu20.04
SSSD_PKGS_UBUNTU20=(
libavahi-common3
bind9-libs
bind9-utils
libc-ares2
libcups2
libsasl2-2
libsasl2-modules-ldap
libsasl2-modules-gssapi-heimdal
libhttp-parser2.9
krb5-kdc
krb5-admin-server
krb5-config
libdhash1
python3-libipa-hbac
libldb2
libsmbclient
libsss-certmap0
libsss-idmap0
libsss-nss-idmap0
python3-libsss-nss-idmap
libsss-simpleifp0
libsss-sudo
libtalloc2
libtdb1
libtevent0
libwbclient0
oddjob
oddjob-mkhomedir
realmd
samba-common
sssd
sssd-ad
sssd-common
sssd-ipa
sssd-krb5
sssd-krb5-common
sssd-ldap
sssd-proxy
sssd-tools
)
```
* HybridComputeNode.sh
```
#!/bin/bash -xe
######################################################################################################################
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. #
# #
# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance #
# with the License. A copy of the License is located at #
# #
# http://www.apache.org/licenses/LICENSE-2.0 #
# #
# or in the 'license' file accompanying this file. This file is distributed on an 'AS IS' BASIS, WITHOUT WARRANTIES #
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions #
# and limitations under the License. #
######################################################################################################################
set -x
source /etc/environment
source /root/config.cfg
# Require the scheduler hostname as the first argument
if [[ $# -lt 1 ]]; then
exit 1
fi
# In case AMI already have PBS installed, force it to stop
service pbs stop || true
SCHEDULER_HOSTNAME=$1
AWS=$(command -v aws)
# Prepare PBS/System
cd ~
# Check if we're using a customized AMI
if [[ ! -f /root/soca_preinstalled_packages.log ]]; then
# Install System required libraries / EPEL
if [[ $SOCA_BASE_OS == "rhel7" ]]; then
curl "$EPEL_URL" -o $EPEL_RPM
if [[ $(md5sum "$EPEL_RPM" | awk '{print $1}') != "$EPEL_HASH" ]]; then
echo -e "FATAL ERROR: Checksum for EPEL failed. File may be compromised." > /etc/motd
exit 1
fi
yum -y install $EPEL_RPM
yum install -y $(echo ${SYSTEM_PKGS[*]} ${SCHEDULER_PKGS[*]}) --enablerepo rhel-7-server-rhui-optional-rpms
elif [[ $SOCA_BASE_OS == "centos7" ]]; then
yum -y install epel-release
yum install -y $(echo ${SYSTEM_PKGS[*]} ${SCHEDULER_PKGS[*]})
elif [[ $SOCA_BASE_OS == "ubuntu18" ]]; then
apt-get update
apt-get install -y $(echo ${SYSTEM_PKGS_UBUNTU18[*]})
elif [[ $SOCA_BASE_OS == "ubuntu20" ]]; then
apt-get update
apt-get install -y $(echo ${SYSTEM_PKGS_UBUNTU20[*]})
else
# AL2
sudo amazon-linux-extras install -y epel
yum install -y $(echo ${SYSTEM_PKGS[*]} ${SCHEDULER_PKGS[*]})
fi
if [[ $SOCA_BASE_OS == "ubuntu18" ]]; then
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y $(echo ${OPENLDAP_SERVER_PKGS_UBUNTU18[*]} ${SSSD_PKGS_UBUNTU18[*]})
apt-get install -y adcli
elif [[ $SOCA_BASE_OS == "ubuntu20" ]]; then
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y $(echo ${OPENLDAP_SERVER_PKGS_UBUNTU20[*]} ${SSSD_PKGS_UBUNTU20[*]})
apt-get install -y adcli
else
yum install -y $(echo ${OPENLDAP_SERVER_PKGS[*]} ${SSSD_PKGS[*]})
fi
fi
# Install OpenPBS if needed
cd ~
OPENPBS_INSTALLED_VERS=$(/opt/pbs/bin/qstat --version | awk '{print $NF}')
if [[ "$OPENPBS_INSTALLED_VERS" != "$OPENPBS_VERSION" ]]; then
echo "OpenPBS Not Detected, Installing OpenPBS ..."
cd ~
wget $OPENPBS_URL
if [[ $(md5sum $OPENPBS_TGZ | awk '{print $1}') != $OPENPBS_HASH ]]; then
echo -e "FATAL ERROR: Checksum for OpenPBS failed. File may be compromised." > /etc/motd
exit 1
fi
tar zxvf $OPENPBS_TGZ
cd openpbs-$OPENPBS_VERSION
./autogen.sh
./configure --prefix=/opt/pbs
make -j6
make install -j6
/opt/pbs/libexec/pbs_postinstall
chmod 4755 /opt/pbs/sbin/pbs_iff /opt/pbs/sbin/pbs_rcp
systemctl disable pbs
else
echo "OpenPBS already installed, and at correct version."
fi
# Edit path with new scheduler/python locations
echo "export PATH=\"/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/pbs/bin:/opt/pbs/sbin:/opt/pbs/bin:/apps/soca/$SOCA_CONFIGURATION/python/latest/bin\"" >> /etc/environment
# Configure Host
SERVER_IP=$(hostname -I)
SERVER_HOSTNAME=$(hostname)
SERVER_HOSTNAME_ALT=$(echo $SERVER_HOSTNAME | cut -d. -f1)
echo $SERVER_IP $SERVER_HOSTNAME $SERVER_HOSTNAME_ALT >> /etc/hosts
# Configure Ldap if specified
if [[ "$SOCA_AUTH_PROVIDER" == "openldap" ]]; then
MAX_ATTEMPT=10
LDAP_NAME=$($AWS secretsmanager get-secret-value --secret-id $SOCA_CONFIGURATION --query SecretString --output text | grep -oP '"LdapName": \"(.*?)\"' | sed 's/"LdapName": //g' | tr -d '"')
CURRENT_ATTEMPT=0
SLEEP_INTERVAL=180
# Loop to make sure SecretsManager produces a result in case we are ready too quickly for it
LDAP_CONFIG=$($AWS secretsmanager get-secret-value --secret-id $SOCA_CONFIGURATION --query SecretString --output text)
while [[ $? -ne 0 ]] && [[ $CURRENT_ATTEMPT -le $MAX_ATTEMPT ]]
do
echo "Amazon Secrets Manager is not ready yet. Sleeping $SLEEP_INTERVAL seconds.. Loop count is: $CURRENT_ATTEMPT/$MAX_ATTEMPT"
sleep $SLEEP_INTERVAL
((CURRENT_ATTEMPT=CURRENT_ATTEMPT+1))
LDAP_CONFIG=$($AWS secretsmanager get-secret-value --secret-id $SOCA_CONFIGURATION --query SecretString --output text)
done
LDAP_BASE=$(echo "$LDAP_CONFIG" | grep -oP '"LdapBase":\s*\"(.*?)\"' | sed 's/"LdapBase":\s*//g' | tr -d '"')
LDAP_NAME=$(echo "$LDAP_CONFIG" | grep -oP '"LdapName":\s*\"(.*?)\"' | sed 's/"LdapName":\s*//g' | tr -d '"')
if [[ $SOCA_BASE_OS == "ubuntu18" || $SOCA_BASE_OS == "ubuntu20" ]]; then
echo "TLS_CACERT /etc/openldap/cacerts/openldap-server.pem" > /etc/ldap/ldap.conf
echo "URI ldap://$LDAP_NAME" >> /etc/ldap/ldap.conf
echo "BASE $LDAP_BASE" >> /etc/ldap/ldap.conf
mkdir -p /etc/openldap/cacerts
else
echo "URI ldap://$LDAP_NAME" >> /etc/openldap/ldap.conf
echo "BASE $LDAP_BASE" >> /etc/openldap/ldap.conf
fi
if [ -e /etc/sssd/sssd.conf ]; then
cp /etc/sssd/sssd.conf /etc/sssd/sssd.conf.orig
fi
echo -e "[domain/default]
enumerate = True
autofs_provider = ldap
cache_credentials = True
ldap_search_base = $LDAP_BASE
id_provider = ldap
auth_provider = ldap
chpass_provider = ldap
sudo_provider = ldap
ldap_sudo_search_base = ou=Sudoers,$LDAP_BASE
ldap_uri = ldap://$SCHEDULER_HOSTNAME
ldap_id_use_start_tls = True
use_fully_qualified_names = False
ldap_tls_cacertdir = /etc/openldap/cacerts
[sssd]
services = nss, pam, autofs, sudo
full_name_format = %2\$s\%1\$s
domains = default
[nss]
homedir_substring = /data/home
[pam]
[sudo]
ldap_sudo_full_refresh_interval=86400
ldap_sudo_smart_refresh_interval=3600
[autofs]
[ssh]
[pac]
[ifp]
[secrets]" > /etc/sssd/sssd.conf
echo | openssl s_client -connect $SCHEDULER_HOSTNAME:389 -starttls ldap > /root/open_ssl_ldap
# -p: the directory may already exist (it is created earlier on Ubuntu)
mkdir -p /etc/openldap/cacerts/
cat /root/open_ssl_ldap | openssl x509 > /etc/openldap/cacerts/openldap-server.pem
authconfig --disablesssd --disablesssdauth --disableldap --disableldapauth --disablekrb5 --disablekrb5kdcdns --disablekrb5realmdns --disablewinbind --disablewinbindauth --disablewinbindkrb5 --disableldaptls --disablerfc2307bis --updateall
sss_cache -E
authconfig --enablesssd --enablesssdauth --enableldap --enableldaptls --enableldapauth --ldapserver=ldap://$SCHEDULER_HOSTNAME --ldapbasedn=$LDAP_BASE --enablelocauthorize --enablemkhomedir --enablecachecreds --updateall
authconfig --enablesssd --enablesssdauth --enablelocauthorize --enablemkhomedir --enablecachecreds --updateall
else
# Configure Active Directory auth
if [[ ! -f /apps/soca/$SOCA_CONFIGURATION/cluster_node_bootstrap/ad_automation/domain_name.cache ]]; then
DS_DOMAIN_NAME=$($AWS secretsmanager get-secret-value --secret-id $SOCA_CONFIGURATION --query SecretString --output text | grep -oP '"DSDomainName": \"(.*?)\"' | sed 's/"DSDomainName": //g' | tr -d '"')
else
DS_DOMAIN_NAME=$(cat /apps/soca/$SOCA_CONFIGURATION/cluster_node_bootstrap/ad_automation/domain_name.cache)
fi
UPPER_DS_DOMAIN_NAME=$(echo $DS_DOMAIN_NAME | tr a-z A-Z)
# Retrieve account with join permission if available, otherwise query SecretManager
if [[ ! -f /apps/soca/$SOCA_CONFIGURATION/cluster_node_bootstrap/ad_automation/join_domain_user.cache ]]; then
DS_DOMAIN_ADMIN_USERNAME=$($AWS secretsmanager get-secret-value --secret-id $SOCA_CONFIGURATION --query SecretString --output text | grep -oP '"DSDomainAdminUsername": \"(.*?)\"' | sed 's/"DSDomainAdminUsername": //g' | tr -d '"')
echo -n $DS_DOMAIN_ADMIN_USERNAME > /apps/soca/$SOCA_CONFIGURATION/cluster_node_bootstrap/ad_automation/join_domain_user.cache
else
DS_DOMAIN_ADMIN_USERNAME=$(cat /apps/soca/$SOCA_CONFIGURATION/cluster_node_bootstrap/ad_automation/join_domain_user.cache)
fi
if [[ ! -f /apps/soca/$SOCA_CONFIGURATION/cluster_node_bootstrap/ad_automation/join_domain.cache ]]; then
DS_DOMAIN_ADMIN_PASSWORD=$($AWS secretsmanager get-secret-value --secret-id $SOCA_CONFIGURATION --query SecretString --output text | grep -oP '"DSDomainAdminPassword": \"(.*?)\"' | sed 's/"DSDomainAdminPassword": //g' | tr -d '"')
echo -n $DS_DOMAIN_ADMIN_PASSWORD > /apps/soca/$SOCA_CONFIGURATION/cluster_node_bootstrap/ad_automation/join_domain.cache
else
DS_DOMAIN_ADMIN_PASSWORD=$(cat /apps/soca/$SOCA_CONFIGURATION/cluster_node_bootstrap/ad_automation/join_domain.cache)
fi
SERVER_UPPER_HOSTNAME=$(hostname | awk '{split($0,h,"."); print toupper(h[1])}')
ADCLI=$(command -v adcli)
REALM=$(command -v realm)
MAX_ATTEMPT=10
CURRENT_ATTEMPT=0
echo $DS_DOMAIN_ADMIN_PASSWORD | $REALM join --user $DS_DOMAIN_ADMIN_USERNAME $UPPER_DS_DOMAIN_NAME --verbose
while [[ $? -ne 0 ]] && [[ $CURRENT_ATTEMPT -le $MAX_ATTEMPT ]]
do
SLEEP_TIME=$(( RANDOM % 60 ))
id $DS_DOMAIN_ADMIN_USERNAME
echo "Realm join didn't complete successfully. Retrying in $SLEEP_TIME seconds... Loop count is: $CURRENT_ATTEMPT/$MAX_ATTEMPT"
sleep $SLEEP_TIME
((CURRENT_ATTEMPT=CURRENT_ATTEMPT+1))
echo $DS_DOMAIN_ADMIN_PASSWORD | $ADCLI delete-computer -U $DS_DOMAIN_ADMIN_USERNAME --stdin-password --domain=$DS_DOMAIN_NAME $SERVER_UPPER_HOSTNAME
echo $DS_DOMAIN_ADMIN_PASSWORD | $REALM leave --user $DS_DOMAIN_ADMIN_USERNAME $UPPER_DS_DOMAIN_NAME --verbose
echo $DS_DOMAIN_ADMIN_PASSWORD | $REALM join --user $DS_DOMAIN_ADMIN_USERNAME $UPPER_DS_DOMAIN_NAME --verbose
done
echo -e "
## Add the \"AWS Delegated Administrators\" group from the ${DS_DOMAIN_NAME} domain.
%AWS\ Delegated\ Administrators ALL=(ALL:ALL) ALL
" >> /etc/sudoers
cp /etc/sssd/sssd.conf /etc/sssd/sssd.conf.orig
echo -e "[sssd]
domains = default
config_file_version = 2
services = nss, pam
[domain/default]
ad_domain = $DS_DOMAIN_NAME
krb5_realm = $UPPER_DS_DOMAIN_NAME
realmd_tags = manages-system joined-with-samba
cache_credentials = True
id_provider = ad
krb5_store_password_if_offline = True
default_shell = /bin/bash
ldap_id_mapping = True
use_fully_qualified_names = False
fallback_homedir = /data/home/%u
access_provider = ad
[nss]
homedir_substring = /data/home
[pam]
[autofs]
[ssh]
[secrets]" > /etc/sssd/sssd.conf
fi
chmod 600 /etc/sssd/sssd.conf
systemctl enable sssd
systemctl restart sssd
echo "sudoers: files sss" >> /etc/nsswitch.conf
# Disable SELINUX & firewalld
REQUIRE_REBOOT=0
if [[ -z $(grep SELINUX=disabled /etc/selinux/config) ]]; then
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
REQUIRE_REBOOT=1
fi
systemctl stop firewalld
systemctl disable firewalld
# Disable StrictHostKeyChecking
echo "StrictHostKeyChecking no" >> /etc/ssh/ssh_config
echo "UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config
# Configure PBS
cp /etc/pbs.conf /etc/pbs.conf.orig
echo -e "PBS_SERVER=$SCHEDULER_HOSTNAME
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
" > /etc/pbs.conf
cp /var/spool/pbs/mom_priv/config /var/spool/pbs/mom_priv/config.orig
echo -e "
\$clienthost $SCHEDULER_HOSTNAME
\$usecp *:/dev/null /dev/null
\$usecp *:/data /data
" > /var/spool/pbs/mom_priv/config
# Configure Chrony
yum remove -y ntp
mv /etc/chrony.conf /etc/chrony.conf.original
echo -e """
# use the local instance NTP service, if available
server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
# !!! [BEGIN] SOCA REQUIREMENT
# You will need to open UDP egress traffic on your security group if you want to enable public pool
#pool 2.amazon.pool.ntp.org iburst
# !!! [END] SOCA REQUIREMENT
# Record the rate at which the system clock gains/losses time.
driftfile /var/lib/chrony/drift
# Allow the system clock to be stepped in the first three updates
# if its offset is larger than 1 second.
makestep 1.0 3
# Specify file containing keys for NTP authentication.
keyfile /etc/chrony.keys
# Specify directory for log files.
logdir /var/log/chrony
# save data between restarts for fast re-load
dumponexit
dumpdir /var/run/chrony
""" > /etc/chrony.conf
#systemctl enable chronyd
if [[ $SOCA_BASE_OS == "ubuntu18" || $SOCA_BASE_OS == "ubuntu20" ]]; then
systemctl start chrony
fi
# Disable ulimit
echo -e "
* hard memlock unlimited
* soft memlock unlimited
" >> /etc/security/limits.conf
```
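Two patterns recur in the install script: checking a download against an MD5 digest before using it, and retrying a call that may not be ready yet. The sketch below isolates both as small helpers. The names `verify_md5` and `retry` are hypothetical and do not appear in SOCA; the sketch assumes `md5sum` and `awk` are on the PATH, as they are in the script.

```shell
#!/bin/sh
# Sketch of two patterns used by the install script: checksum verification
# of a downloaded file and a bounded retry loop. Helper names are hypothetical.

# Succeeds only if the MD5 digest of the file equals the expected value.
verify_md5() (
    file=$1
    expected=$2
    actual=$(md5sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
)

# Runs a command until it succeeds, at most max_attempts times, sleeping
# interval seconds between attempts. The command runs in a subshell, so
# variables it assigns are not visible to the caller.
retry() (
    max_attempts=$1
    interval=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        "$@" && exit 0
        echo "Attempt $attempt/$max_attempts failed" >&2
        [ "$attempt" -lt "$max_attempts" ] && sleep "$interval"
        attempt=$((attempt + 1))
    done
    exit 1
)
```

For example, `retry 10 180 aws secretsmanager get-secret-value --secret-id "$SOCA_CONFIGURATION"` would mirror the script's wait for Secrets Manager, and `verify_md5 "$OPENPBS_TGZ" "$OPENPBS_HASH"` would mirror its OpenPBS checksum test.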
###### 2\. Mount the directories SOCA requires from EFS
```
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 10.0.139.159:/ /apps
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport 10.0.130.227:/ /data
```
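Mounts made with `mount` do not survive a reboot. As a sketch under that assumption, the helper below prints `/etc/fstab` entries that use the same NFS options plus `_netdev`, which defers mounting until networking is up. The function name is hypothetical, and because the Orin reaches EFS through the VPN, the tunnel must also be up before these mounts can succeed.

```shell
#!/bin/sh
# Sketch: build /etc/fstab lines for the two EFS mounts so they can be
# restored after a reboot. It only prints; nothing is written to disk.

efs_fstab_line() (
    server=$1       # EFS mount target IP or DNS name
    mount_point=$2  # local directory, for example /apps or /data
    opts="nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev"
    printf '%s:/ %s nfs4 %s 0 0\n' "$server" "$mount_point" "$opts"
)

efs_fstab_line 10.0.139.159 /apps
efs_fstab_line 10.0.130.227 /data
```

Append the printed lines to `/etc/fstab` only after confirming that the manual mounts work.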
###### 3\. Run the installation script
```
sudo bash ./HybridComputeNode.sh <SOCA-scheduler-hostname>
```
###### 4\. Start OpenPBS and related services
```
systemctl restart slapd
systemctl restart sssd
systemctl start pbs
```
###### 5\. Grant the admin user the NVIDIA-related system permissions
```
sudo usermod -aG video admin
```
###### 6\. Add the Orin as an OpenPBS node (run this step on the SOCA scheduler node)
```
qmgr -c "create node tegra-ubuntu queue=alwayson"
qmgr -c "set node tegra-ubuntu resources_available.instance_type=tegra.2xlarge"
```
Run pbsnodes to confirm that the Orin was added to the list of candidate nodes:
```
pbsnodes -a
```
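To check the result from a script rather than by eye, the node state can be extracted from the `pbsnodes -a` output. This is a minimal sketch that assumes the usual layout, in which each node name sits on an unindented line followed by indented `key = value` attributes. The helper name `node_state` is hypothetical.

```shell
#!/bin/sh
# Sketch: extract the state of one node from pbsnodes-style output read on
# standard input. Assumes unindented node names and indented attributes.

node_state() {
    awk -v node="$1" '
        # An unindented, non-empty line starts a new node record.
        /^[^ \t]/ { current = $1 }
        # Inside the requested record, print the value of the state attribute.
        current == node && $1 == "state" && $2 == "=" { print $3; exit }
    '
}
```

On the scheduler node, `pbsnodes -a | node_state tegra-ubuntu` prints the state of the Orin node, and prints nothing if the node is not registered.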
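### **User Experience**
Users build applications in SOCA and, through the visual web UI, deploy algorithm tasks to the NVIDIA Drive AGX Orin Kit and monitor their execution status.
The following example runs inference with a YOLO V5 model on the NVIDIA Drive AGX Orin Kit to recognize objects in an image.
1. Open the SOCA web UI and log in with a user name and password.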
![image.png](https://dev-media.amazoncloud.cn/b9388114791b4ca493816a4ec4105588_image.png "image.png")
![image.png](https://dev-media.amazoncloud.cn/d7d902e699b94d0faf4935c05b837929_image.png "image.png")
2. Select My Files -> Upload File and upload the model to validate.
![image.png](https://dev-media.amazoncloud.cn/3e394cf15cb34996b147967fad3cdefc_image.png "image.png")
3. Click the simulation icon next to the model you just uploaded.
![image.png](https://dev-media.amazoncloud.cn/75ac27dac1d947069490e1cc932b92bc_image.png "image.png")
4. Select the predefined SOCA application, enter the key parameters, and submit the job.
![image.png](https://dev-media.amazoncloud.cn/84ca7473be884fa2a3b4a578a1d49cc8_image.png "image.png")
![image.png](https://dev-media.amazoncloud.cn/25ced91e2a9843a589160d4cdbf84f11_image.png "image.png")
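5. The job is submitted to the Orin, which runs the model inference pipeline. The output images are synchronized to SOCA automatically through the shared storage volume.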
![image.png](https://dev-media.amazoncloud.cn/61a5ea2f429c4b908e10dcb76e2aa324_image.png "image.png")
6. Download the output images from SOCA.
![image.png](https://dev-media.amazoncloud.cn/6b10d712101a43c0bd20ad63724b32be_image.png "image.png")
7. Open the image. The image content is recognized correctly, which validates the model on the Orin.
![image.png](https://dev-media.amazoncloud.cn/fc112a2fd60a4596a73f8ce84a81c585_image.png "image.png")
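Behind the web UI, a submission like this ends up as a PBS job in the `alwayson` queue that the Orin node belongs to. The sketch below generates such a job script as an illustration of what an equivalent command-line submission could look like. It is not the application profile used in the walkthrough: the function name, job name, working directory, resource request, and the `detect.py` command line with its file names are all hypothetical.

```shell
#!/bin/sh
# Sketch: generate a PBS job script that targets the alwayson queue.
# The job name, paths, and inference command are hypothetical placeholders.

make_job_script() (
    job_name=$1
    queue=$2
    workdir=$3
    shift 3
    printf '#!/bin/bash\n'
    printf '#PBS -N %s\n' "$job_name"      # job name shown by qstat
    printf '#PBS -q %s\n' "$queue"         # queue the Orin node was added to
    printf '#PBS -l select=1:ncpus=1\n'    # assumption: one chunk with one CPU
    printf 'cd %s\n' "$workdir"            # directory on the shared volume
    printf '%s\n' "$*"                     # the command to run on the node
)

make_job_script yolo_infer alwayson /data/home/admin python3 detect.py --weights model.pt --source input.jpg
```

Saving the output to a file and passing it to `qsub` on the scheduler node would submit it; results written under the shared `/data` volume would then be visible from SOCA.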
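### **Benefits of the Solution**
With the SOCA-based hybrid cloud architecture, engineers and operations staff can use cloud and on-premises resources efficiently to develop, validate, and operate in-vehicle software. They can start cloud compute resources on their own for development and SiL simulation, while on-premises hardware is managed through the same platform for HiL development and validation. The unified platform integrates cloud and on-premises resources and gives engineers and operations staff a consistent working environment, which improves productivity. The architecture lets them draw on elastic cloud compute so that autonomous driving algorithms and models are developed efficiently, supporting rapid iteration and helping deliver advanced intelligent driving solutions to customers.
### **Next Steps**
As the automotive industry develops rapidly, efficient and reliable software development and testing has become a priority. This walkthrough designed and applied a scenario for hardware-in-the-loop (HiL) validation. The next step is to explore software-in-the-loop (SiL) scenarios and to use the platform's ability to schedule cloud and on-premises resources to support development in virtual environments with parity to the vehicle for software-defined vehicles.
The solution will be explored for software development, testing, system integration, and validation. Simulating real environments reduces the need for physical hardware testing, which lowers development and testing cost and shortens time to market. Virtual environments can also reproduce extreme conditions, which helps improve the quality and safety of vehicle systems and their performance in complex, changing real-world conditions.
The solution will also make full use of cloud computing to allocate resources on demand and raise utilization. Managing test data centrally in the cloud makes it easier to analyze and share, which improves development efficiency. Flexible virtualization makes it possible to build the required test environments quickly and run tests in parallel, shortening the development cycle.
### **Glossary**
Transformer is a neural network model based on the self-attention mechanism for processing sequence data. It achieved major breakthroughs in natural language processing, and in recent years it has increasingly been applied to sensor data processing, map data processing, prediction, and planning, giving autonomous driving systems perception, decision-making, and planning capabilities.
SOAFEE (Scalable Open Architecture For Embedded Edge) is a scalable, open framework for embedded systems, used to build and deploy scalable edge computing solutions and enable efficient, flexible edge applications.
Drive AGX Orin Kit is a development kit from NVIDIA for autonomous and driverless vehicle R&D. It provides high-performance compute, perception, and control capabilities that help developers build and deploy autonomous driving applications quickly.
OpenVPN is an open-source virtual private network (VPN) solution that establishes secure remote access connections and encrypts network traffic to protect user privacy and data.
SiL (Software-in-the-Loop) is an automotive test method that evaluates and validates software functions by running the vehicle software in a simulated computer environment.
HiL (Hardware-in-the-Loop) is an automotive test method that connects a vehicle electronic control unit (ECU) to real hardware components and simulates a realistic vehicle environment to validate and test the ECU's functions and performance.
OpenPBS (Open Portable Batch System) is an open-source job scheduling and management system for high-performance computing clusters. It lets users submit, schedule, and track jobs and allocates compute resources so that the cluster is used efficiently.
YOLO V5 is a deep-learning object detection algorithm that uses a single neural network to detect and localize objects in real time with high accuracy and speed.
### **References**
- https://aws.amazon.com/cn/solutions/implementations/scale-out-computing-on-aws/?trk=cndc-detail
- https://github.com/awslabs/scale-out-computing-on-aws?trk=cndc-detail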