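Neural Sparse Search tool

Introduced 2.13

The NeuralSparseSearchTool performs sparse vector retrieval. For more information about neural sparse search, see Neural sparse search.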

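Step 1: Register and deploy a sparse encoding model

OpenSearch supports several pretrained sparse encoding models. You can use one of those models or your own custom model. For a list of supported pretrained models, see Sparse encoding models. For more information, see OpenSearch-provided pretrained models and Custom local models.

In this example, you'll use the amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill pretrained model for both ingestion and search. To register the model and deploy it to OpenSearch, send the following request: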

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT"
}

OpenSearch responds with a task ID for the model registration and deployment task:

{
  "task_id": "M_9KY40Bk4MTqirc5lP8",
  "status": "CREATED"
}

You can monitor the status of the task by calling the Tasks API:

GET _plugins/_ml/tasks/M_9KY40Bk4MTqirc5lP8

Once the model is registered and deployed, the task state changes to COMPLETED and OpenSearch returns a model ID for the model:

{
  "model_id": "Nf9KY40Bk4MTqirc6FO7",
  "task_type": "REGISTER_MODEL",
  "function_name": "SPARSE_ENCODING",
  "state": "COMPLETED",
  "worker_node": [
    "UyQSTQ3nTFa3IP6IdFKoug"
  ],
  "create_time": 1706767869692,
  "last_update_time": 1706767935556,
  "is_async": true
}
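The polling described above can be automated. The following is a minimal sketch, not part of the original walkthrough. It assumes a hypothetical `fetch_task` callable that performs `GET /_plugins/_ml/tasks/<task_id>` and returns the parsed JSON body. The `FAILED` state and the `error` field are assumptions about what an unsuccessful task reports.

```python
import time
from typing import Callable, Dict


def wait_for_model_id(
    fetch_task: Callable[[str], Dict],
    task_id: str,
    poll_seconds: float = 2.0,
    max_polls: int = 60,
) -> str:
    """Poll a registration task until it completes, then return its model_id.

    fetch_task is a hypothetical helper supplied by the caller; it should call
    GET /_plugins/_ml/tasks/<task_id> and return the decoded JSON response.
    """
    for _ in range(max_polls):
        task = fetch_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            # The completed task carries the ID of the deployed model.
            return task["model_id"]
        if state == "FAILED":  # assumed failure state
            raise RuntimeError(f"task {task_id} failed: {task.get('error')}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not complete after {max_polls} polls")
```

Passing the HTTP call in as a function keeps the loop independent of any particular client library, so the same sketch works with whichever client you already use.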

Step 2: Ingest data into an index

First, set up an ingest pipeline that encodes documents using the sparse encoding model from the previous step:

PUT /_ingest/pipeline/pipeline-sparse
{
  "description": "A sparse encoding ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "Nf9KY40Bk4MTqirc6FO7",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}

Next, create an index that specifies the pipeline as its default pipeline:

PUT index_for_neural_sparse
{
  "settings": {
    "default_pipeline": "pipeline-sparse"
  },
  "mappings": {
    "properties": {
      "passage_embedding": {
        "type": "rank_features"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}

Last, ingest data into the index by sending a bulk request:

POST _bulk
{ "index" : { "_index" : "index_for_neural_sparse", "_id" : "1" } }
{ "passage_text" : "company AAA has a history of 123 years" }
{ "index" : { "_index" : "index_for_neural_sparse", "_id" : "2" } }
{ "passage_text" : "company AAA has over 7000 employees" }
{ "index" : { "_index" : "index_for_neural_sparse", "_id" : "3" } }
{ "passage_text" : "Jack and Mark established company AAA" }
{ "index" : { "_index" : "index_for_neural_sparse", "_id" : "4" } }
{ "passage_text" : "company AAA has a net profit of 13 millions in 2022" }
{ "index" : { "_index" : "index_for_neural_sparse", "_id" : "5" } }
{ "passage_text" : "company AAA focus on the large language models domain" }
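A bulk request body is newline-delimited JSON: one action line followed by one document line, with a trailing newline at the end. As an illustration only, the sketch below builds such a body from a list of passages. The helper name `build_bulk_body` is hypothetical, and sequential numeric IDs are an assumption that mirrors the example above.

```python
import json
from typing import Iterable


def build_bulk_body(index: str, passages: Iterable[str], first_id: int = 1) -> str:
    """Return an NDJSON body for POST _bulk that indexes each passage."""
    lines = []
    for doc_id, text in enumerate(passages, start=first_id):
        # Action line: which index and document ID to write to.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        # Source line: the document itself; the default pipeline adds the embedding.
        lines.append(json.dumps({"passage_text": text}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "index_for_neural_sparse",
    ["company AAA has a history of 123 years", "company AAA has over 7000 employees"],
)
```

Because the index sets `pipeline-sparse` as its default pipeline, the documents only need the `passage_text` field; the `passage_embedding` field is filled in at ingest time.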

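Step 3: Register a flow agent that will run the NeuralSparseSearchTool

A flow agent runs a sequence of tools in order and returns the output of the last tool. To create a flow agent, send the following request, providing the model ID of the model set up in Step 1. This model encodes your queries into sparse vector embeddings: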

POST /_plugins/_ml/agents/_register
{
  "name": "Test_Neural_Sparse_Agent_For_RAG",
  "type": "flow",
  "tools": [
    {
      "type": "NeuralSparseSearchTool",
      "parameters": {
        "description":"use this tool to search data from the knowledge base of company AAA",
        "model_id": "Nf9KY40Bk4MTqirc6FO7",
        "index": "index_for_neural_sparse",
        "embedding_field": "passage_embedding",
        "source_field": ["passage_text"],
        "input": "${parameters.question}",
        "doc_size":2
      }
    }
  ]
}

For parameter descriptions, see Register parameters.

OpenSearch responds with an agent ID:

{
  "agent_id": "9X7xWI0Bpc3sThaJdY9i"
}

Step 4: Run the agent

Before you run the agent, make sure that the documents from Step 2 have been ingested into index_for_neural_sparse, because that is the index the agent searches.

Then, run the agent by sending the following request:

POST /_plugins/_ml/agents/9X7xWI0Bpc3sThaJdY9i/_execute
{
  "parameters": {
    "question":"how many employees does AAA have?"
  }
}
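The same call can be made from a script. The sketch below uses only the Python standard library to build the request. It assumes an unsecured cluster at `http://localhost:9200`; the `OPENSEARCH_URL` constant and the `build_execute_request` helper are illustrative names, and a secured cluster would also need authentication headers.

```python
import json
import urllib.request

# Assumption: a local cluster without security enabled; change this for your setup.
OPENSEARCH_URL = "http://localhost:9200"


def build_execute_request(agent_id: str, question: str) -> urllib.request.Request:
    """Build a POST request for the agent _execute endpoint."""
    # The flow agent substitutes parameters.question into the tool's "input".
    body = json.dumps({"parameters": {"question": question}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{OPENSEARCH_URL}/_plugins/_ml/agents/{agent_id}/_execute",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_execute_request("9X7xWI0Bpc3sThaJdY9i", "how many employees does AAA have?")
```

To send it, pass `request` to `urllib.request.urlopen` and decode the response body as JSON.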

OpenSearch returns the inference results:

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "result": """{"_index":"index_for_neural_sparse","_source":{"passage_text":"company AAA has over 7000 employees"},"_id":"2","_score":30.586042}
{"_index":"index_for_neural_sparse","_source":{"passage_text":"company AAA has a history of 123 years"},"_id":"1","_score":16.088133}
"""
        }
      ]
    }
  ]
}
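In the response above, `result` is a single string that holds one JSON document per line (the triple quotes are how the console displays a multi-line string). The following sketch splits that string into individual hits. It assumes the line-per-hit layout shown in the sample; `parse_search_hits` is a hypothetical helper name.

```python
import json
from typing import Dict, List


def parse_search_hits(response: Dict) -> List[Dict]:
    """Extract the matched documents from an agent _execute response."""
    hits = []
    for inference in response.get("inference_results", []):
        for output in inference.get("output", []):
            # Each non-empty line of the result string is one JSON-encoded hit.
            for line in output.get("result", "").splitlines():
                line = line.strip()
                if line:
                    hits.append(json.loads(line))
    return hits


sample_response = {
    "inference_results": [
        {
            "output": [
                {
                    "name": "response",
                    "result": (
                        '{"_index":"index_for_neural_sparse","_source":{"passage_text":"company AAA has over 7000 employees"},"_id":"2","_score":30.586042}\n'
                        '{"_index":"index_for_neural_sparse","_source":{"passage_text":"company AAA has a history of 123 years"},"_id":"1","_score":16.088133}\n'
                    ),
                }
            ]
        }
    ]
}
hits = parse_search_hits(sample_response)
```

Each entry in `hits` is then a dictionary with `_index`, `_source`, `_id`, and `_score` keys, in the order the tool returned them.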

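Register parameters

The following table lists all tool parameters that are available when registering an agent.

Parameter	Type	Required/Optional	Description
`model_id`	String	Required	The model ID of the sparse encoding model to use at search time.
`index`	String	Required	The index to search.
`embedding_field`	String	Required	When the neural sparse model encodes raw text documents, the encoding result is saved in a field. Specify this field as the `embedding_field`. Neural sparse search matches documents to the query by calculating the similarity score between the query text and the text in the document's `embedding_field`.
`source_field`	String	Required	The document field or fields to return. You can provide a list of multiple fields as an array of strings, for example, `["field1", "field2"]`.
`input`	String	Required for flow agent	Runtime input sourced from flow agent parameters. If using a large language model (LLM), this field is populated with the LLM response.
`name`	String	Optional	The tool name. Useful when an LLM needs to select an appropriate tool for a task.
`description`	String	Optional	A description of the tool. Useful when an LLM needs to select an appropriate tool for a task.
`doc_size`	Integer	Optional	The number of documents to fetch. Default is `2`.
`nested_path`	String	Optional	The path to the nested object for the nested query. Only used for nested fields. Default is `null`.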
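As a rough illustration of how the required and optional parameters in the table fit together, the sketch below assembles the tool entry for a flow agent and rejects a configuration that omits a required parameter. The `build_tool` helper and its client-side validation are assumptions made for this example; OpenSearch performs its own validation when the agent is registered.

```python
from typing import Any, Dict

# Parameters the table marks as required (input is required for flow agents).
REQUIRED_PARAMETERS = ("model_id", "index", "embedding_field", "source_field", "input")


def build_tool(parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Return a NeuralSparseSearchTool entry for the agent's "tools" array."""
    missing = [key for key in REQUIRED_PARAMETERS if key not in parameters]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    # Spell out the documented default for doc_size unless the caller overrides it.
    merged = {"doc_size": 2, **parameters}
    return {"type": "NeuralSparseSearchTool", "parameters": merged}


tool = build_tool(
    {
        "model_id": "Nf9KY40Bk4MTqirc6FO7",
        "index": "index_for_neural_sparse",
        "embedding_field": "passage_embedding",
        "source_field": ["passage_text"],
        "input": "${parameters.question}",
    }
)
```

The resulting `tool` dictionary can be placed in the `tools` array of the register request shown in Step 3.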

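Execute parameters

The following table lists all tool parameters that are available when running the agent.

Parameter	Type	Required/Optional	Description
`question`	String	Required	The natural language question to send to the LLM.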

