使用 PB 级数据仓库 Amazon Redshift 构建并对比自然语言查询

SQL

Amazon Redshift

Amazon SageMaker

大语言模型

生成式人工智能

生成式 AI 代表了 AI 技术的一大飞跃，不仅能理解和处理数据，还能创造新内容和解决方案。这类生成式 AI 的特点是能自动产生高质量和高效率的内容及数据处理方式，大幅拓宽了生成式 AI 的应用领域，尤其在自然语言处理（NLP）方面影响深远。使用自然语言进行数据查询是其关键应用之一，允许用户用自然语言查询数据库，简化了非技术人员的数据访问和分析。传统上需用复杂 SQL 语句操作数据库，这对非编程背景者门槛较高。自然语言生成 SQL 通过解析自然语言查询并转换为 SQL 命令，降低了数据访问难度，提升了工作效率。在数据量激增的今天，该技术在数据处理中显得尤为重要，**它使数据分析变得更直观、容易，让数据分析师和业务人员都能通过简单的自然语言查询执行复杂数据操作，使得数据驱动决策更高效、准确。** ### **背景和挑战** 基于自然语言的查询面临多项挑战，包括语言的多样性与歧义性，需准确理解用户意图并映射到复杂的数据库结构。此外，处理复杂查询时，如嵌套查询与多表连接，需要高级的语义理解与逻辑推理能力。数据库的不同架构和命名习惯也增加了理解和转换的难度。性能优化、准确性提升以及适应不同领域的数据库也是关键挑战。此外，缺乏大量高质量、领域特定的训练数据限制了模型的泛化能力。在大数据时代，使用 PB （Petabytes，拍字节）级别数据仓库能够存储和分析海量数据，**[Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) 就是一款行业内出色的 PB 级数据仓库，可以显著提升决策制定和运营效率**。但用户直接编写查询语句面临挑战，包括复杂性高和技术要求严格，易出错且耗时。引入生成式 AI 技术，通过自然语言自动生成 SQL 进行数据查询，为这一问题提供了革命性解决方案。**这项技术简化了数据查询过程，使得没有深入 SQL 技术知识的更广泛用户群体也能轻松访问和分析海量数据，推动了数据访问与分析的普惠化，大幅加速了洞察获取速度**。大语言模型（LLM）自然语言理解的准确性对于生成有效的查询至关重要，但语言的歧义性和复杂性增加了理解的难度。 ### **使用自然语言进行 Amazon Redshift 数据查询的不同方法** 在生成式 AI 领域，使用自然语言对 [Amazon Redshift](https://aws.amazon.com/cn/redshift/?trk=cndc-detail) 以及其他的数据源进行数据查询可以采用不同的方法： 1. 可以直接通过 LLM 处理查询需求生成 SQL 然后再到数据库中执行查询返回结果； 2. 也可以通过 LangChain 的原生的 Chain 将数据查询 SQL 语句生成以及连接数据库和返回结果链接起来； 3. 也可以使用 Fine-tuning 的方法，基于历史知识积累训练 LLM，增加准确性，在此基础上进行数据查询。每一种方法的成本、性能，以及复杂性都各不相同。本实验将**以亚马逊云科技的 PB 级数据仓库 Amazon Redshift 举例说明，对使用自然语言进行数据查询的三种方法展开介绍与对比分析，帮助用户根据不同的需求场景选择合适的方法。** #### **基于 LLM 直接进行 Amazon Redshift 数据查询** 该方法通过将数据查询的问题和 prompt 发送给部署 [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) 大语言模型，并通过 LLM 查询出结果后，发送给 Amazon Redshift，或者其他的数据源，执行查询后返回结果。本实验重点以 Amazon Redshift 为例进行说明，采用该方法除了需要准备好数据查询相关的问题之外，还需要准备数据查询相关的 schema。本实验选择的 LLM SQLCoder 进行数据查询相关的 prediction。 ![image.png](https://dev-media.amazoncloud.cn/7ae0bb786ecd46dea4ed9025f4d48102_image.png "image.png") 核心样例代码如下：安装所需的 packages： ```js import ast import json import logging import os import psycopg2 import boto3 import pandas as pd import yaml from botocore.exceptions import ClientError ``` 配置 [Amazon SageMaker](https://aws.amazon.com/cn/sagemaker/?trk=cndc-detail) 相关的信息，包括 runtime、 region、以及部署 LLM 产生的 endpoint： ```js # sagemaker region and endpoint info sagemaker_region = "Sagemaker region" ##Fill in your sagemaker region. sagemaker_endpoint_name = "Sagemaker endpoint name." ##Fill in the deployed LLM sagemaker endpoint name. # create the SageMaker runtime client client = boto3.client('sagemaker-runtime') ``` 设置问题，数据查询模版，以及进行 Tokenize，此处可以替换为客户自己的问题和查询 schema 模版： ```js # prepare the question and the prompt template question = "How many artworks are by the artist 'Pablo Picasso'?" prompt = """### Task Generate a SQL query to answer the following question: `{question}` ### Database Schema This query will run on a database whose schema is represented in this string: CREATE TABLE artists ( artist_id integer NOT NULL, full_name character varying(200), nationality character varying(50), gender character varying(25), birth_year integer, death_year integer, CONSTRAINT artists_pk PRIMARY KEY (artist_id) ); CREATE TABLE artworks ( artwork_id integer NOT NULL, title character varying(500), artist_id integer NOT NULL, date integer, medium character varying(250), dimensions text, acquisition_date text, credit text, catalogue character varying(250), department character varying(250), classification character varying(250), object_number text, diameter_cm text, circumference_cm text, height_cm text, length_cm text, width_cm text, depth_cm text, weight_kg text, durations integer, CONSTRAINT artworks_pk PRIMARY KEY (artwork_id) ); -- artists.artist_id can be joined with artworks.artist_id ### SQL Given the database schema, here is the SQL query that answers `{question}`: ```sql """.format(question=question) ``` Note：采用基于 LLM 直接查询的方式，可以根据业务场景，通过 prompt engineering 添加提示信息，增加准确性，例如通过如下的提示信息代码，指示 LLM 可以进行哪些表的 Join 操作。 ```js -- artists.artist_id can be joined with artworks.artist_id ``` 调用一个 Amazon SageMaker 的模型端点，发送数据并接收模型的预测结果： ```js def invoke_endpoint(payload, smclient, endpoint_name): res = smclient.invoke_endpoint( EndpointName=endpoint_name, Body=json.dumps(payload), ContentType="application/json", CustomAttributes="accept_eula=true") return res["Body"].read().decode("utf8") ``` 指定 LLM 预期的输入数据和文本生成数据的行为参数： ```js payload = { "inputs": prompt, "parameters": { "max_new_tokens": 512, "top_p": 0.9, "temperature": 0.1, "return_full_text": False, }, } ``` max_new_tokens：指定生成的最大新 token （通常是单词或字符）数量，这里设置为 512，意味着 LLM 输出的文本长度不会超过 512 个 token。 top_p：设置为 0.9，意味着 LLM 将仅考虑累计概率达到 90% 的最可能的数据。 temperature：控制生成的随机性。较低的值（如 0.1）使生成的文本更确定、更一致，而较高的值会增加文本的随机性和创造性。该场景下设置为 0.1，希望 LLM 提供更高的准确性。 return_full_text：指定是否返回输入文本的完整文本加上生成的文本。设置为 False 意味着仅返回模型生成的文本部分；如果为 Ture 则在返回结果中会提供更多的解释信息和背景参考信息。调用部署 LLM 后生成的 Amazon SageMaker 的 endpoint，然后处理该调用的响应： ```js response = invoke_endpoint(payload, client, sagemaker_endpoint_name) response_json = json.loads(response) sql_query = response_json[0]["generated_text"] ``` 生成的 SQL 语句如下： ```js SELECT COUNT(artwork_id) FROM artworks WHERE artist_id IN (SELECT artist_id FROM artists WHERE full_name ilike '%Pablo%Picasso%'); ``` 连接 Amazon Redshift，将得到的结果执行查询，得到结果为 1278。 ```js # Create Redshift database connection conn = psycopg2.connect( dbname='Database name', user='User name', password='Password', host='redshift host name', port='5439' ) # Create cursor cur = conn.cursor() # Execute SQL query cur.execute(sql_query) # Get and print the query result rows = cur.fetchall() for row in rows: print(row) # Close cursor and connection cur.close() conn.close() ``` 运行该代码，实际耗时为 1.963s。 ```js sh-4.2$ time python m1_LLM_redshift.py (1278,) real 0m1.963s user 0m0.968s sys 0m0.377s ``` 采用该方案进行数据查询，会面临两个问题：第一，**如果 Schema 的数目过多，很容易超过 Token 的上限**；第二，**采用该方案进行数据查询，针对更为复杂的查询场景，在准确性方面会面临挑战**。针对以上的问题，可以结合自定义 RAG 和 Few-shot 进行优化，其中 RAG 是一种结合了信息检索和文本生成技术的自然语言处理方法，仅召回和查询问题相关性最高的 Top K 条相关信息，同时还可以提高模型的准确性，Few-shot 学习能使模型仅用少量样本适应新任务，提高学习效率和泛化能力。 #### **通过 LangChain 的 Chain 进行 Amazon Redshift 数据查询** 该解决方案用户的前端托管在 Amazon ECS 中，问题通过 Amazon SageMaker 的 Endpoint 发给 LangChain 的 SQLDBChain 进行进一步的处理，SQLDBChain 将 prompt examples 通过开源向量数据库 Chroma 处理后返回相似性最高的 examples，LLM 根据返回的 Few shot 结果进行预测提高准确性，并再次通过 SQLDBChain 将生成的 SQL 发送到 Amazon Redshift 进行查询后返回结果。 ![image.png](https://dev-media.amazoncloud.cn/88f9cf1c53da40c5b6733d85bd971096_image.png "image.png") ##### **LangChain 是什么？** LangChain 是一个用于开发由语言模型驱动的应用程序的框架。它使得应用程序能够： * **感知上下文**：将语言模型与上下文源连接起来（提示说明、少量示例、用于基础回应的内容等）。 * **推理**：依赖语言模型进行推理（关于如何基于提供的上下文回答问题、采取什么行动等）。 ##### **LangChain 的 Chain 是什么？** LLM 是 LangChain 中的基本单元。然而，正如 LangChain 的名字所暗示的，你可以根据特定任务将 LLM 调用链接起来。SQLDatabaseChain 是一个简单的链，允许对数据库执行 SQL 查询。它接收一个 SQLDatabase 对象，并依次调用 sql_query 和 sql_print_result 等工具来运行和打印查询结果。 ##### **Few-Shot 是什么？** Few-shot prompting 是一种技术，旨在通过提供少量示例来增强 LLM 的上下文学习能力，从而提高其在复杂任务上的表现。这种方法通过在提示中加入示例来指导模型更好地理解和执行特定任务。 ##### **Chroma 是什么？** Chroma DB 是一款用于存储和检索向量嵌入的开源向量存储库。其主要用途是保存嵌入及其元数据，以便稍后由大型语言模型使用。此外，它还可以用于文本数据的语义搜索引擎，便于实现简易、快速的数据分析及自托管服务。如果您希望选择托管向量数据库，推荐使用 Pinecone。Chroma 进行 embedding 和 RAG 默认使用的是 all-MiniLM-L6-v2 模型，也可以更换为其他的模型。该方案通过 redshift-connector 和 SQLAlchemy 构建出 Redshift 的 engine，发送给 LangChain 的 SQLDB Chain 从而实现数据查询并返回查询结果。核心代码如下：执行前请先安装 redshift_connector 和 sqlalchemy-redshift。Amazon Redshift SQLAlchemy dialect 使用 Amazon Redshift Python 驱动程序（redshift_connector），允许您安全地连接到您的 Amazon Redshift 数据库： ```js pip install sqlalchemy-redshift pip install redshift_connector ``` 安装其他需要的 packages： ```js import ast import json import logging import os import re import boto3 import pandas as pd import yaml from botocore.exceptions import ClientError from langchain import FewShotPromptTemplate, PromptTemplate, SQLDatabase from langchain.chains.sql_database.prompt import PROMPT_SUFFIX, _postgres_prompt from langchain.embeddings.huggingface import HuggingFaceEmbeddings from langchain.prompts.example_selector.semantic_similarity import ( SemanticSimilarityExampleSelector, ) from langchain.vectorstores import Chroma from langchain_experimental.sql import SQLDatabaseChain from langchain import LLMChain from langchain import SagemakerEndpoint from langchain.llms.sagemaker_endpoint import LLMContentHandler import sqlalchemy from sqlalchemy import create_engine ``` 配置 Amazon SageMaker 所在的 region 和部署 LLM 产生的 Amazon SageMaker endpoint： ```js REGION_NAME = os.environ.get("REGION_NAME", {region_name}) ##Replace with your region name # sagemaker region and endpoint info sagemaker_region = {region_name} ##Replace with your sagemaker region sagemaker_endpoint_name = {sagemaker_endpoint_name} ##Replace with your sagemaker endpoint name ``` 在此处定义 LangChain 的 Content handler class，指定 LLM 处理数据的 encode 和 decode 的格式，注意不同的 LLM 由于 prediction 之后产生的输出格式不同，在此处的处理方式会有所差异，有可能需要使用正则表达式做进一步的处理。 ```js ##Define content handler class for SQLColder model class ContentHandler(LLMContentHandler): content_type = "application/json" accepts = "application/json" def transform_input(self, prompt: str, model_kwargs={}) -> bytes: input_str = json.dumps({"inputs": prompt, "parameters": model_kwargs}) return input_str.encode("utf-8") def transform_output(self, output: bytes) -> str: # def transform_output(self, output: bytes): response_json = json.loads(output.read().decode("utf-8")) if "generated_text" not in response_json[0]: raise ValueError("'generated_text' not found in the response.") return response_json[0]["generated_text"] ``` 通过 SQL Alchemy 连接 Amazon Redshift 数据库：在底层，LangChain 使用 SQLAlchemy 连接到 SQL 数据库。因此，SQLDatabaseChain 可以用于 SQLAlchemy 支持的任何 SQL 语言，如 MS SQL、MySQL、MariaDB、PostgreSQL、Oracle SQL 和 SQLite。 ```js ##connect to redshift def get_rds_uri(region_name): # URI format: redshift+redshift_connector://user:pwd@hostname:port/dbname ### Create the redshift connection string returnf"redshift+redshift_connector://{rds_username}:{rds_password}@{rds_endpoint}:{rds_port}/{rds_db_name}" rds_uri = get_rds_uri(REGION_NAME) engine_redshift = create_engine(rds_uri) ``` 通过 customize prompt、 RAG、 few shot、 Vector Database 相结合的方式提升结果的准确性： ```js def load_samples(): # Load the sql examples for few-shot prompting examples sql_samples = None # with open("examples.yaml", "r") as stream, replace with your few-shot prompting examples with open("/path/to/examples.yaml", "r") as stream: sql_samples = yaml.safe_load(stream) return sql_samples def load_few_shot_chain(llm, db, examples): example_prompt = PromptTemplate( input_variables=["table_info", "input", "sql_cmd", "sql_result", "answer"], template=( "{table_info}\n\nQuestion: {input}\nSQLQuery: {sql_cmd}\nSQLResult:" " {sql_result}\nAnswer: {answer}" ), ) # Can replace the default embedding LLM all-MiniLM-L6-v2 for Chroma as needed local_embeddings = HuggingFaceEmbeddings( model_name="sentence-transformers/all-MiniLM-L6-v2" ) example_selector = SemanticSimilarityExampleSelector.from_examples( examples, local_embeddings, Chroma, k=min(3, len(examples)), # k=min(3, len(examples)), ) few_shot_prompt = FewShotPromptTemplate( example_selector=example_selector, example_prompt=example_prompt, prefix=_postgres_prompt + "Here are some examples:", suffix=PROMPT_SUFFIX, input_variables=["table_info", "input", "top_k"], ) return SQLDatabaseChain.from_llm( llm, db, prompt=few_shot_prompt, use_query_checker=False, verbose=True, return_intermediate_steps=False, ) ``` * **Customize Prompt**：为了进一步改善结果，我们还可以使用 LangChain 的提示模板。具体细节可以参考 Langchain 的官方链接 [Prompt Template](https://python.langchain.com/v0.1/docs/modules/model_io/prompts/?trk=article-ssr-frontend-pulse_little-text-block?trk=cndc-detail)。 * **few_shot_prompt**：为了提高 SQL 查询的准确性，LangChain 允许我们使用 few_shot_prompt 学习。根据维基百科的说法，“在自然语言处理中，上下文学习、小样本学习或小样本提示是一种提示技术，它允许模型在尝试任务之前处理示例。” * **Retrieval-Augmented Generation（RAG）**：该实验使用了向量数据库 Chroma，和 Embeddings 模型 all-MiniLM-L6-v2，根据 few shot 模版进返回和结果相似性最高的 3 条数据。 Note：在本实验中，通过在 examples.yaml 中添加 few shot 参考信息，提升 LLM 回答的准确性，example 参考信息如下，在实际使用中，可以根据历史的数据查询，替换为相应的内容。 ```js - answer: There are 121211 rows in the artworks table. input: How many rows are in the artworks table? sql_cmd: SELECT count(*) FROM artworks; sql_result: '[(121211,)]' table_info: | CREATE TABLE artworks ( artwork_id integer NOT NULL, title character varying(500), artist_id integer NOT NULL, date integer, medium character varying(250), dimensions text, acquisition_date text, credit text, catalogue character varying(250), department character varying(250), classification character varying(250), object_number text, diameter_cm text, circumference_cm text, height_cm text, length_cm text, width_cm text, depth_cm text, weight_kg text, durations integer, CONSTRAINT artworks_pk PRIMARY KEY (artwork_id) ``` 定义问题，通过 LangChain 预先定义好的框架，基于 Chain 将 LLM 进行 prediction，去 Amazon Redshift 中进行查询，返回结果整个过程链接起来： ```js if __name__ == "__main__": NO_ANSWER_MSG = "Sorry, I was unable to answer your question." content_handler = ContentHandler() parameters = { "max_new_tokens": 512, "top_p": 0.9, "return_full_text": False, "temperature": 0.1, } llm=SagemakerEndpoint( endpoint_name=sagemaker_endpoint_name, region_name=sagemaker_region, model_kwargs=parameters, content_handler=content_handler ) examples = load_samples() db = SQLDatabase(engine_redshift) sql_db_chain = load_few_shot_chain(llm, db, examples) sql_db_chain("query") # Replace the query with customer own question. ``` 通过执行和方法一中相同的问题，替换 query 的内容如下： ```js How many artworks are by the artist 'Pablo Picasso'? ``` 运行代码，实际时长为 14.047s。 ```js sh-4.2$ time python3 m2_langchain_redshift.py > Entering new SQLDatabaseChain chain... How many artworks are by the artist 'Pablo Picasso'? SQLQuery:SELECT COUNT(*) FROM artworks WHERE artist_id IN (SELECT artist_id FROM artists WHERE full_name ilike '%Pablo%Picasso%'); SQLResult: [(1278,)] Answer:There are 1278 artworks by Pablo Picasso. > Finished chain. real 0m14.047s user 0m8.864s sys 0m1.793s ``` 结合 RAG 和 Few shot 可以提升准确性，支持更为复杂的查询，更换 sql_db_chain(“query”) 中的 query 为如下的问题： ```js What is the ratio of male to female artists? Return as a ratio. ``` 生成的 SQL 语句如下： ```js SELECT CAST(COUNT(CASE WHEN gender = 'Male' THEN 1 END) AS FLOAT) / NULLIF(COUNT(CASE WHEN gender = 'Female' THEN 1 END), 0) AS male_to_female_ratio FROM artists; ``` 运行代码，实际时长为 19.681s。 ```js sh-4.2$ time python3 m2_langchain_redshift.py > Entering new SQLDatabaseChain chain... What is the ratio of male to female artists? Return as a ratio. SQLQuery:SELECT CAST(COUNT(CASE WHEN gender = 'Male' THEN 1 END) AS FLOAT) / NULLIF(COUNT(CASE WHEN gender = 'Female' THEN 1 END), 0) AS male_to_female_ratio FROM artists; SQLResult: [(4.479708162334702,)] Answer:There is a 4.479708162334702 ratio of male to female artists. > Finished chain. real 0m19.681s user 0m10.155s sys 0m2.077s ``` #### **通过 Fine-Tuning 优化 Amazon Redshift 的数据查询生成结果** 如果希望降低 prediction 的成本，则考虑使用小模型。由于小模型的处理能力相较于大模型会弱一些，如果想降低成本的同时，提升准确性，可以考虑对小模型进行微调，因为小模型它无法在不进行微调的情况下执行复杂任务。因此，除了 Few shot 和 RAG 之外，微调通过在一小批手动标记的数据上进行，以使模型适应特定的目标任务。例如，通过历史积累的数据查询的数据集，可能需要在一小批手动标记的文本示例上重新训练模型，以达到合理的准确性。代码实现可以参考 [GitHub](https://github.com/haozhx23/Llama2-QLoRA-on-SageMaker?trk=cndc-detail)。 ![image.png](https://dev-media.amazoncloud.cn/50235996979941388a407566f1d44f5c_image.png "image.png") 不同的 LLM 进行数据查询时，prompt 的格式会有所不同，因此准备 FT 数据集时，需要考虑 LLM 的 input 格式要求，本实验以 HuggingFace 官方提供的用来进行数据查询的 FT 数据集进行相应的格式化做为输入，数据集由 prompt 和 completion 两部分组成，FT 的部分 sample data 具体如下： ```js {"prompt":"### Task\nGenerate a SQL query to answer the following question:\n`Which event had Carlos Alexandre Pereira as opponent?`\n\n### Database Schema\nThe query will run on a database with the following schema:\n```\\nCREATE TABLE table_1_11585313_1 (\\n\\t\\"number of goals\\" numeric,\\n\\t\\"number of caps\\" numeric,\\n\\t\\"date of birth\\" text,\\n\\tname text,\\n\\t\\"date of death\\u2020\\" text,\\n\\t\\"date of debut\\" text,\\n\\tnumber numeric\\n);\\nCREATE TABLE table_2_18981955_2 (\\n\\tRound numeric,\\n\\tEvent text,\\n\\tMethod text,\\n\\tOpponent text,\\n\\tLocation text,\\n\\tRecord text,\\n\\tTime text,\\n\\t\\"Res.\\" text\\n);\\n\\n```\n\n### Answer\nHere is the SQL query that answers the question:\n`Which event had Carlos Alexandre Pereira as opponent?`\n```sql\\n","completion":"SELECT Event FROM table_2_18981955_2 WHERE \\"Opponent\\" ILIKE '%carlos%alexandre%pereira%'"} {"prompt":"### Task\\nGenerate a SQL query to answer the following question:\\n`Which total is highest with 0 gold and more than 0 silver, in Ukraine?`\\n\\n### Database Schema\\nThe query will run on a database with the following schema:\\n```\nCREATE TABLE table_1_1140105_6 (\n\t\"Winning driver\" text,\n\t\"Race Name\" text,\n\tReport text,\n\tCircuit text,\n\tConstructor text,\n\tDate text\n);\nCREATE TABLE table_1_11585313_2 (\n\tnumber numeric,\n\t\"date of debut\" text,\n\t\"date of birth\" text,\n\t\"number of goals\" numeric,\n\t\"date of death\u2020\" text,\n\t\"number of caps\" numeric,\n\tname text\n);\nCREATE TABLE table_2_12392804_3 (\n\tSilver numeric,\n\tRank numeric,\n\tTotal numeric,\n\tNation text,\n\tBronze numeric,\n\tGold numeric\n);\nCREATE TABLE table_1_12251936_1 (\n\t\"Whites as % of Pop.\" text,\n\t\"TFR 2006\" text,\n\t\"Live births 2006\" numeric,\n\tCounty text,\n\t\"GFR 2006\" text\n);\nCREATE TABLE table_1_1231316_7 (\n\tLocation text,\n\tDate text,\n\t\"Wind (m\/s)\" text,\n\t\"Fastest time (s)\" text,\n\tAthlete text\n);\nCREATE TABLE table_1_12148018_2 (\n\tWriter text,\n\t\"#\" numeric,\n\tViewers text,\n\tEpisode text,\n\t\"Original airdate\" text,\n\tDirector text\n);\n\n```\\n\\n### Answer\\nHere is the SQL query that answers the question:\\n`Which total is highest with 0 gold and more than 0 silver, in Ukraine?`\\n```sql\n","completion":"SELECT MAX(Total) FROM table_2_12392804_3 WHERE \"Gold\" = '0' AND \"Silver\" > '0' AND \"Nation\" ILIKE '%ukraine%'"} ``` #### **三种基于生成式 AI 的方法对Amazon Redshift 进行数据查询的成本性能对比分析** 本实验将从成本、性能、复杂性等几个维度，针对介绍的三种对 Amazon Redshift 进行数据查询的方法进行对比分析，从而帮助客户更好的选择合适的解决方案。 ![image.png](https://dev-media.amazoncloud.cn/0182b6ffda584f4899f2d825aa31ba0f_image.png "image.png") 综合分析： * 采用基于 LLM 的方法，进行数据查询，在性能上具有明显的优势，同时支持用户根据需求进行自定义开发，比如进行自定义的 RAG 开发，可以和现有的云基础设施架构的服务更灵活的集成。 * 采用 LangChain 的框架，使得使用大语言模型构建复杂应用变得更加容易，它允许用户将 LLM 连接到其他的数据源，从而使应用程序处理更广泛的信息。同时 LangChain 拥有一个强大，活跃的社区，开发团队持续增强库的速度，确保用户可以访问最新的 LLM 功能，同时在需要时可以随时寻求帮助。 * 基于 Fine-tuning 进行数据查询的优化是三种方法中复杂度最高的方法，建议在数据查询项目运行一段时间，有了一定的数据 QA 对积累之后，联系相关的生成式 AI 技术专家提供支持，进行该方案的部署和使用。 ### **结论** **本实验研究了针对 PB 级数据仓库进行自然语言查询的不同方法，并对比了不同方法的特点，可以根据实际业务场景下的性能、成本、复杂性等需求，选择合适的方案**。随着生成式 AI 的不断发展，DataGenAI 作为其中一种帮助客户降本增效的方法，相信在未来会有更多的探索和发展。我们也看到其他一些支持数据查询的框架，比如 [DBGPT](https://github.com/eosphoros-ai/DB-GPT?trk=cndc-detail) 开源数据查询的社区，提供基于多种数据平台的生成式 AI 查询框架，以及 [LlamaIndex](https://github.com/run-llama?trk=cndc-detail) 作为和 LangChain 平行的生成式 AI 框架，在未来可能也会为用户在该领域提供更多的可能性。 ![image.png](https://dev-media.amazoncloud.cn/ba741cf72df844eeb1cd32f6dbc92f0c_image.png "image.png") ![开发者尾巴.gif](https://dev-media.amazoncloud.cn/1385de1e2de9452ea9f7e34fee7cde77_%E5%BC%80%E5%8F%91%E8%80%85%E5%B0%BE%E5%B7%B4.gif "开发者尾巴.gif")