
Vector DB tool

Introduced 2.13

The VectorDBTool performs dense vector retrieval. For more information about OpenSearch vector database capabilities, see Neural search.

Step 1: Register and deploy a text embedding model

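OpenSearch supports several pretrained models. You can use one of those models, use your own custom model, or create a connector for an externally hosted model. For a list of supported pretrained models, see OpenSearch-provided pretrained models. For more information about custom models, see Custom local models. For information about integrating an externally hosted model, see Connecting to externally hosted models.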

In this example, you'll use the huggingface/sentence-transformers/all-MiniLM-L12-v2 pretrained model for both ingestion and search. To register and deploy the model to OpenSearch, send the following request:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
  "version": "1.0.2",
  "model_format": "TORCH_SCRIPT"
}

OpenSearch responds with a task ID for the model registration and deployment task:

{
  "task_id": "M_9KY40Bk4MTqirc5lP8",
  "status": "CREATED"
}

You can monitor the status of the task by calling the Tasks API:

GET _plugins/_ml/tasks/M_9KY40Bk4MTqirc5lP8

Once the model is registered and deployed, the task state changes to COMPLETED and OpenSearch returns a model ID for the model:

{
  "model_id": "Hv_PY40Bk4MTqircAVmm",
  "task_type": "REGISTER_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "UyQSTQ3nTFa3IP6IdFKoug"
  ],
  "create_time": 1706767869692,
  "last_update_time": 1706767935556,
  "is_async": true
}
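
The polling described above can be scripted. The following is a minimal Python sketch, not an official client. The names wait_for_model_id and fetch_task are hypothetical: fetch_task stands in for whatever code issues GET /_plugins/_ml/tasks/<task_id> against your cluster and returns the parsed JSON body. The sketch assumes the response shape shown above; the RUNNING and FAILED state names are assumptions, since this page only shows COMPLETED.

```python
import time
from typing import Callable, Dict


def wait_for_model_id(
    fetch_task: Callable[[str], Dict],
    task_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
) -> str:
    """Poll a task until its state is COMPLETED, then return its model_id.

    fetch_task is a hypothetical callable that returns the parsed JSON body
    of GET /_plugins/_ml/tasks/<task_id>.
    """
    for _ in range(max_attempts):
        task = fetch_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task["model_id"]
        if state == "FAILED":  # assumed failure state; not shown on this page
            raise RuntimeError(f"task {task_id} failed: {task}")
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not complete after {max_attempts} polls")


# Stand-in for a real HTTP call: the first poll is unfinished, the second is done.
_responses = iter([
    {"state": "RUNNING"},
    {"state": "COMPLETED", "model_id": "Hv_PY40Bk4MTqircAVmm"},
])
model_id = wait_for_model_id(
    lambda _task_id: next(_responses), "M_9KY40Bk4MTqirc5lP8", interval_s=0.0
)
print(model_id)
```

Injecting fetch_task keeps the polling logic independent of any particular HTTP library and makes it easy to exercise without a cluster.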

Step 2: Ingest data into an index

First, set up an ingest pipeline that encodes documents using the text embedding model registered in the previous step:

PUT /_ingest/pipeline/test-pipeline-local-model
{
  "description": "text embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "Hv_PY40Bk4MTqircAVmm",
        "field_map": {
          "text": "embedding"
        }
      }
    }
  ]
}
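
Conceptually, the field_map entry tells the text_embedding processor to read the text field of each incoming document and write the resulting vector to the embedding field. The Python sketch below only illustrates that mapping. It is not how OpenSearch implements the processor, and apply_field_map and toy_embed are hypothetical names. toy_embed is a placeholder that returns a two-element vector, whereas the model used in this example produces 384-dimensional vectors.

```python
from typing import Callable, Dict, List


def apply_field_map(
    doc: Dict,
    field_map: Dict[str, str],
    embed: Callable[[str], List[float]],
) -> Dict:
    """Return a copy of doc with one vector field added per field_map entry."""
    out = dict(doc)
    for source_field, vector_field in field_map.items():
        if source_field in doc:
            out[vector_field] = embed(doc[source_field])
    return out


def toy_embed(text: str) -> List[float]:
    # Placeholder only: a real embedding model returns a dense vector
    # (384 dimensions for the model used in this example).
    return [float(len(text)), float(text.count(" "))]


enriched = apply_field_map(
    {"text": "hello vector world"}, {"text": "embedding"}, toy_embed
)
print(enriched)
```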

Next, create a k-NN index that specifies the pipeline as its default pipeline:

PUT my_test_data
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "embedding": {
        "type": "knn_vector",
        "dimension": 384
      }
    }
  },
  "settings": {
    "index": {
      "knn.space_type": "cosinesimil",
      "default_pipeline": "test-pipeline-local-model",
      "knn": "true"
    }
  }
}
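
The knn.space_type setting of cosinesimil means that vectors are compared by the angle between them rather than by their magnitude. As a minimal sketch of that measure, the Python function below computes plain cosine similarity. The function name is illustrative, and the sketch makes no claim about how OpenSearch converts this similarity into the _score it returns, which can depend on the k-NN engine in use.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal vectors
```

The dimension check mirrors the index mapping: the knn_vector field is declared with dimension 384, so vectors of any other length do not fit the field.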

Finally, ingest data into the index by sending a bulk request:

POST _bulk
{"index": {"_index": "my_test_data", "_id": "1"}}
{"text": "Chart and table of population level and growth rate for the Ogden-Layton metro area from 1950 to 2023. United Nations population projections are also included through the year 2035.\nThe current metro area population of Ogden-Layton in 2023 is 750,000, a 1.63% increase from 2022.\nThe metro area population of Ogden-Layton in 2022 was 738,000, a 1.79% increase from 2021.\nThe metro area population of Ogden-Layton in 2021 was 725,000, a 1.97% increase from 2020.\nThe metro area population of Ogden-Layton in 2020 was 711,000, a 2.16% increase from 2019."}
{"index": {"_index": "my_test_data", "_id": "2"}}
{"text": "Chart and table of population level and growth rate for the New York City metro area from 1950 to 2023. United Nations population projections are also included through the year 2035.\\nThe current metro area population of New York City in 2023 is 18,937,000, a 0.37% increase from 2022.\\nThe metro area population of New York City in 2022 was 18,867,000, a 0.23% increase from 2021.\\nThe metro area population of New York City in 2021 was 18,823,000, a 0.1% increase from 2020.\\nThe metro area population of New York City in 2020 was 18,804,000, a 0.01% decline from 2019."}
{"index": {"_index": "my_test_data", "_id": "3"}}
{"text": "Chart and table of population level and growth rate for the Chicago metro area from 1950 to 2023. United Nations population projections are also included through the year 2035.\\nThe current metro area population of Chicago in 2023 is 8,937,000, a 0.4% increase from 2022.\\nThe metro area population of Chicago in 2022 was 8,901,000, a 0.27% increase from 2021.\\nThe metro area population of Chicago in 2021 was 8,877,000, a 0.14% increase from 2020.\\nThe metro area population of Chicago in 2020 was 8,865,000, a 0.03% increase from 2019."}
{"index": {"_index": "my_test_data", "_id": "4"}}
{"text": "Chart and table of population level and growth rate for the Miami metro area from 1950 to 2023. United Nations population projections are also included through the year 2035.\\nThe current metro area population of Miami in 2023 is 6,265,000, a 0.8% increase from 2022.\\nThe metro area population of Miami in 2022 was 6,215,000, a 0.78% increase from 2021.\\nThe metro area population of Miami in 2021 was 6,167,000, a 0.74% increase from 2020.\\nThe metro area population of Miami in 2020 was 6,122,000, a 0.71% increase from 2019."}
{"index": {"_index": "my_test_data", "_id": "5"}}
{"text": "Chart and table of population level and growth rate for the Austin metro area from 1950 to 2023. United Nations population projections are also included through the year 2035.\\nThe current metro area population of Austin in 2023 is 2,228,000, a 2.39% increase from 2022.\\nThe metro area population of Austin in 2022 was 2,176,000, a 2.79% increase from 2021.\\nThe metro area population of Austin in 2021 was 2,117,000, a 3.12% increase from 2020.\\nThe metro area population of Austin in 2020 was 2,053,000, a 3.43% increase from 2019."}
{"index": {"_index": "my_test_data", "_id": "6"}}
{"text": "Chart and table of population level and growth rate for the Seattle metro area from 1950 to 2023. United Nations population projections are also included through the year 2035.\\nThe current metro area population of Seattle in 2023 is 3,519,000, a 0.86% increase from 2022.\\nThe metro area population of Seattle in 2022 was 3,489,000, a 0.81% increase from 2021.\\nThe metro area population of Seattle in 2021 was 3,461,000, a 0.82% increase from 2020.\\nThe metro area population of Seattle in 2020 was 3,433,000, a 0.79% increase from 2019."}
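
When the documents come from a program rather than a hand-written request, the bulk body can be generated. The following Python sketch (build_bulk_body is a hypothetical helper, not part of any OpenSearch client) serializes documents into the newline-delimited format used above. Using json.dumps for each line takes care of escaping line breaks inside the text values, and the body ends with a newline, as the Bulk API requires.

```python
import json
from typing import Dict, Iterable, Tuple


def build_bulk_body(index: str, docs: Iterable[Tuple[str, Dict]]) -> str:
    """Serialize (id, document) pairs into an NDJSON body for POST _bulk."""
    lines = []
    for doc_id, doc in docs:
        # Action line, followed by the document source on the next line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "my_test_data",
    [
        ("1", {"text": "first line\nsecond line"}),
        ("2", {"text": "another document"}),
    ],
)
print(body, end="")
```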

Step 3: Register a flow agent that will run the VectorDBTool

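A flow agent runs a sequence of tools in order and returns the output of the last tool. To create a flow agent, send the following request, providing the model ID of the model set up in Step 1. This model encodes your queries into vector embeddings: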

POST /_plugins/_ml/agents/_register
{
  "name": "Test_Agent_For_VectorDB",
  "type": "flow",
  "description": "this is a test agent",
  "tools": [
    {
      "type": "VectorDBTool",
      "parameters": {
        "model_id": "Hv_PY40Bk4MTqircAVmm",
        "index": "my_test_data",
        "embedding_field": "embedding",
        "source_field": ["text"],
        "input": "${parameters.question}"
      }
    }
  ]
}

For parameter descriptions, see Register parameters.

OpenSearch responds with an agent ID:

{
  "agent_id": "9X7xWI0Bpc3sThaJdY9i"
}

Step 4: Run the agent

Before you run the agent, make sure that you add the sample OpenSearch Dashboards Sample web logs dataset. To learn more, see Adding sample data.

Then, run the agent by sending the following request:

POST /_plugins/_ml/agents/9X7xWI0Bpc3sThaJdY9i/_execute
{
  "parameters": {
    "question": "what's the population increase of Seattle from 2021 to 2023"
  }
}

OpenSearch performs vector search and returns the relevant documents:

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "result": """{"_index":"my_test_data","_source":{"text":"Chart and table of population level and growth rate for the Seattle metro area from 1950 to 2023. United Nations population projections are also included through the year 2035.\\n
          The current metro area population of Seattle in 2023 is 3,519,000, a 0.86% increase from 2022.\\n
          The metro area population of Seattle in 2022 was 3,489,000, a 0.81% increase from 2021.\\n
          The metro area population of Seattle in 2021 was 3,461,000, a 0.82% increase from 2020.\\n
          The metro area population of Seattle in 2020 was 3,433,000, a 0.79% increase from 2019."},"_id":"6","_score":0.8173238}
        {"_index":"my_test_data","_source":{"text":"Chart and table of population level and growth rate for the New York City metro area from 1950 to 2023. United Nations population projections are also included through the year 2035.\\n
        The current metro area population of New York City in 2023 is 18,937,000, a 0.37% increase from 2022.\\n
        The metro area population of New York City in 2022 was 18,867,000, a 0.23% increase from 2021.\\n
        The metro area population of New York City in 2021 was 18,823,000, a 0.1% increase from 2020.\\n
        The metro area population of New York City in 2020 was 18,804,000, a 0.01% decline from 2019."},"_id":"2","_score":0.6641471}
        """
        }
      ]
    }
  ]
}
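
In the response above, the result value is a single string that holds one JSON object per matching document. To work with the hits individually, that string has to be split back into objects. The following Python sketch shows one way to do this with the standard library, assuming the objects are separated only by whitespace as displayed above. parse_tool_result is a hypothetical helper, and the sample string is abbreviated from the response on this page.

```python
import json
from typing import Dict, List


def parse_tool_result(result: str) -> List[Dict]:
    """Split a string of back-to-back JSON objects into a list of dicts."""
    decoder = json.JSONDecoder()
    hits = []
    pos = 0
    while pos < len(result):
        # raw_decode does not skip leading whitespace, so skip it here.
        if result[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(result, pos)
        hits.append(obj)
    return hits


# Abbreviated sample: the text values are shortened for readability.
sample = (
    '{"_index":"my_test_data","_source":{"text":"Seattle ..."},"_id":"6","_score":0.8173238}\n'
    '{"_index":"my_test_data","_source":{"text":"New York City ..."},"_id":"2","_score":0.6641471}\n'
)
hits = parse_tool_result(sample)
for hit in hits:
    print(hit["_id"], hit["_score"])
```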

Register parameters

The following table lists all tool parameters that are available when registering an agent.

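| Parameter | Type | Required/Optional | Description |
| :--- | :--- | :--- | :--- |
| model_id | String | Required | The model ID of the model to use at search time. |
| index | String | Required | The index to search. |
| embedding_field | String | Required | When the model encodes raw text documents, the encoding result is saved in a field. Specify this field as the embedding_field. Neural search matches documents to the query by calculating the similarity score between the encoded query text and the vectors stored in each document's embedding_field. |
| source_field | String | Required | The document field or fields to return. You can provide a list of multiple fields as an array of strings, for example, ["field1", "field2"]. |
| input | String | Required for flow agent | Runtime input sourced from flow agent parameters. If using a large language model (LLM), this field is populated with the LLM response. |
| doc_size | Integer | Optional | The number of documents to fetch. Default is 2. |
| k | Integer | Optional | The number of nearest neighbors to search for when performing neural search. Default is 10. |
| nested_path | String | Optional | The path to the nested object for the nested query. Only used for nested fields. Default is null. |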
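
To illustrate how the required and optional parameters fit together, the following Python sketch assembles the tools entry for an agent registration request. build_vectordb_tool is a hypothetical helper, not part of any OpenSearch client. Optional parameters are included only when set, so the documented defaults apply otherwise. The sketch passes doc_size and k through as integers because the table lists them that way; check how your OpenSearch version expects these values to be encoded.

```python
import json
from typing import Dict, List, Optional


def build_vectordb_tool(
    model_id: str,
    index: str,
    embedding_field: str,
    source_field: List[str],
    input_template: str = "${parameters.question}",
    doc_size: Optional[int] = None,
    k: Optional[int] = None,
    nested_path: Optional[str] = None,
) -> Dict:
    """Build one VectorDBTool entry for the "tools" array of an agent."""
    required = {
        "model_id": model_id,
        "index": index,
        "embedding_field": embedding_field,
        "input": input_template,
    }
    for name, value in required.items():
        if not value:
            raise ValueError(f"{name} is required")
    if not source_field:
        raise ValueError("source_field is required")

    parameters = dict(required, source_field=list(source_field))
    # Add optional parameters only when the caller sets them.
    for name, value in (("doc_size", doc_size), ("k", k)):
        if value is not None:
            if value <= 0:
                raise ValueError(f"{name} must be a positive integer")
            parameters[name] = value
    if nested_path is not None:
        parameters["nested_path"] = nested_path
    return {"type": "VectorDBTool", "parameters": parameters}


tool = build_vectordb_tool(
    "Hv_PY40Bk4MTqircAVmm", "my_test_data", "embedding", ["text"], doc_size=3
)
print(json.dumps(tool, indent=2))
```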

Execute parameters

The following table lists all tool parameters that are available when running the agent.

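| Parameter | Type | Required/Optional | Description |
| :--- | :--- | :--- | :--- |
| question | String | Required | The natural language question to send to the LLM. |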