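Neural sparse search using custom configuration

Neural sparse search using automatically generated vector embeddings operates in two modes: doc-only and bi-encoder. For more information, see Generating sparse vector embeddings automatically.

At query time, you can use custom models in the following ways:

Bi-encoder mode: Use your deployed sparse encoding model to generate embeddings from the query text. This model must be the same one you used at ingestion time.
Doc-only mode with a custom tokenizer: Use your deployed tokenizer model to tokenize the query text. The token weights are obtained from a precomputed lookup table.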
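
To make the doc-only query path concrete, the following Python sketch builds a sparse query vector from a lookup table and scores it against a document vector. It is an illustration only: the whitespace tokenizer, the `TOKEN_WEIGHTS` table, and its values are hypothetical stand-ins for the deployed tokenizer and its precomputed weights, and the score is assumed to be the inner product of the two sparse vectors.

```python
from typing import Dict, List

# Hypothetical lookup table: token -> precomputed query-side weight.
TOKEN_WEIGHTS: Dict[str, float] = {"hi": 1.2, "world": 2.5, "planet": 2.8}


def tokenize(text: str) -> List[str]:
    """Toy stand-in for the deployed tokenizer: lowercase whitespace split."""
    return text.lower().split()


def encode_query_doc_only(text: str) -> Dict[str, float]:
    """Build a sparse query vector by looking up each token's weight."""
    return {tok: TOKEN_WEIGHTS[tok] for tok in tokenize(text) if tok in TOKEN_WEIGHTS}


def score(query_vec: Dict[str, float], doc_vec: Dict[str, float]) -> float:
    """Inner product over the tokens that the query and the document share."""
    return sum(weight * doc_vec.get(tok, 0.0) for tok, weight in query_vec.items())
```

No model inference happens on the query side in this sketch, which is why doc-only mode is cheaper at search time than bi-encoder mode.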

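The following is a complete example of using a custom model for neural sparse search.

Step 1: Configure a sparse encoding model/tokenizer

Both bi-encoder mode and doc-only mode with a custom tokenizer require you to configure a sparse encoding model for ingestion. Bi-encoder mode uses the same model at search time; doc-only mode uses a separate tokenizer at search time.

Step 1(a): Choose the search mode

Choose the search mode and the appropriate model/tokenizer combination:

Bi-encoder: Use the amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill model during both ingestion and search.
Doc-only with a custom tokenizer: Use the amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v3-distill model during ingestion and the amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 tokenizer during search.

The following tables compare search relevance for all available combinations of the two search modes so that you can choose the best combination for your use case.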

English language models

Mode	Ingestion model	Search model	Average search relevance on BEIR	Model parameters
Doc-only	`amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1`	`amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1`	0.49	133M
Doc-only	`amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill`	`amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1`	0.504	67M
Doc-only	`amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-mini`	`amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1`	0.497	23M
Doc-only	`amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v3-distill`	`amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1`	0.517	67M
Bi-encoder	`amazon/neural-sparse/opensearch-neural-sparse-encoding-v1`	`amazon/neural-sparse/opensearch-neural-sparse-encoding-v1`	0.524	133M
Bi-encoder	`amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill`	`amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill`	0.528	67M

Multilingual models

Mode	Ingestion model	Search model	Average search relevance on MIRACL	Model parameters
Doc-only	`amazon/neural-sparse/opensearch-neural-sparse-encoding-multilingual-v1`	`amazon/neural-sparse/opensearch-neural-sparse-tokenizer-multilingual-v1`	0.629	168M
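
If you want to pick a combination programmatically, the English language table can be encoded as data and filtered by mode and parameter budget. The following Python sketch is one way to do that; the `Combo` type and the `best_combo` helper are hypothetical names introduced for illustration, and the relevance and parameter figures are copied from the table above.

```python
from typing import List, NamedTuple, Optional

PREFIX = "amazon/neural-sparse/opensearch-neural-sparse-"


class Combo(NamedTuple):
    mode: str
    ingest_model: str
    relevance: float      # average search relevance on BEIR
    params_millions: int  # model size in millions of parameters


# Rows copied from the English language models table.
ENGLISH_COMBOS: List[Combo] = [
    Combo("doc-only", PREFIX + "encoding-doc-v1", 0.49, 133),
    Combo("doc-only", PREFIX + "encoding-doc-v2-distill", 0.504, 67),
    Combo("doc-only", PREFIX + "encoding-doc-v2-mini", 0.497, 23),
    Combo("doc-only", PREFIX + "encoding-doc-v3-distill", 0.517, 67),
    Combo("bi-encoder", PREFIX + "encoding-v1", 0.524, 133),
    Combo("bi-encoder", PREFIX + "encoding-v2-distill", 0.528, 67),
]


def best_combo(mode: str, max_params_millions: Optional[int] = None) -> Optional[Combo]:
    """Return the most relevant combination for a mode within a size budget."""
    candidates = [
        c for c in ENGLISH_COMBOS
        if c.mode == mode
        and (max_params_millions is None or c.params_millions <= max_params_millions)
    ]
    return max(candidates, key=lambda c: c.relevance) if candidates else None
```

With no size budget, this selects the same two ingestion models that Step 1(a) recommends.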

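Step 1(b): Register the model/tokenizer

For both modes, register a sparse encoding model. For doc-only mode with a custom tokenizer, you also need to register a custom tokenizer.

Bi-encoder mode

In bi-encoder mode, you only need to register the amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill model.

Register a sparse encoding model: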

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT"
}

Registering a model is an asynchronous task. OpenSearch returns a task ID for every model you register:

{
  "task_id": "aFeif4oB5Vm0Tdw8yoN7",
  "status": "CREATED"
}

You can check the status of the task by calling the Tasks API:

GET /_plugins/_ml/tasks/aFeif4oB5Vm0Tdw8yoN7

Once the task is complete, the task state changes to COMPLETED and the Tasks API response contains the model ID of the registered model:

{
  "model_id": "<bi-encoder model ID>",
  "task_type": "REGISTER_MODEL",
  "function_name": "SPARSE_ENCODING",
  "state": "COMPLETED",
  "worker_node": [
    "4p6FVOmJRtu3wehDD74hzQ"
  ],
  "create_time": 1694358489722,
  "last_update_time": 1694358499139,
  "is_async": true
}

Note the model_id of the model you've created; you'll need it for the following steps.
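
Because registration is asynchronous, a client typically polls the Tasks API until the task completes. The following Python sketch shows that loop. To keep it independent of any particular HTTP client, it takes a `get_task` callable (a hypothetical helper you would implement to send `GET /_plugins/_ml/tasks/<task_id>` and return the parsed JSON). Treating a `FAILED` state as an error is an assumption, and so are the polling interval and retry limit.

```python
import time
from typing import Any, Callable, Dict


def wait_for_model_id(
    get_task: Callable[[str], Dict[str, Any]],
    task_id: str,
    poll_seconds: float = 2.0,
    max_polls: int = 30,
) -> str:
    """Poll a registration task and return its model_id once it is COMPLETED."""
    for _ in range(max_polls):
        task = get_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task["model_id"]
        if state == "FAILED":  # assumed failure state; adjust to your cluster
            raise RuntimeError(f"Task {task_id} failed: {task.get('error')}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Task {task_id} did not complete after {max_polls} polls")
```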

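Doc-only mode with a custom tokenizer

In doc-only mode with a custom tokenizer, you need to register the amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v3-distill model, which you'll use at ingestion time, and the amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 tokenizer, which you'll use at search time.

Register a sparse encoding model: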

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v3-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT"
}

Register a tokenizer:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

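As in bi-encoder mode, use the Tasks API to check the status of each registration task. When registration finishes, the task state changes to COMPLETED. Note the model_id of the model and of the tokenizer you've created; you'll need them for the following steps.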
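
The registration requests for the two modes differ only in which pretrained names and versions they send. The following Python sketch generates them from one table. The `registration_requests` helper and the `PRETRAINED` mapping are hypothetical names introduced for illustration; the paths, names, and versions are copied from the requests in this step.

```python
import json
from typing import Dict, List, Tuple

REGISTER_PATH = "/_plugins/_ml/models/_register?deploy=true"
PREFIX = "amazon/neural-sparse/opensearch-neural-sparse-"

# (name, version) pairs copied from the registration requests in this step.
PRETRAINED: Dict[str, List[Tuple[str, str]]] = {
    "bi-encoder": [(PREFIX + "encoding-v2-distill", "1.0.0")],
    "doc-only": [
        (PREFIX + "encoding-doc-v3-distill", "1.0.0"),  # used at ingestion time
        (PREFIX + "tokenizer-v1", "1.0.1"),             # used at search time
    ],
}


def registration_requests(mode: str) -> List[Tuple[str, str, str]]:
    """Return (method, path, JSON body) triples to send for the given mode."""
    requests = []
    for name, version in PRETRAINED[mode]:
        body = {"name": name, "version": version, "model_format": "TORCH_SCRIPT"}
        requests.append(("POST", REGISTER_PATH, json.dumps(body)))
    return requests
```

Each request returns its own task ID, so doc-only mode produces two tasks to track.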

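Step 2: Ingest data

In both bi-encoder and doc-only modes, you use a sparse encoding model at ingestion time to generate sparse vector embeddings.

Step 2(a): Create an ingest pipeline

To generate sparse vector embeddings, create an ingest pipeline that contains a sparse_encoding processor, which converts the text in a document field to vector embeddings. The processor's field_map determines the input fields from which to generate vector embeddings and the output fields in which to store the embeddings.

The following example request creates an ingest pipeline in which the text from passage_text is converted into sparse vector embeddings and stored in passage_embedding. Provide the model ID of the registered model in the request: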

PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
{
  "description": "A sparse encoding ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<bi-encoder or doc-only model ID>",
        "prune_type": "max_ratio",
        "prune_ratio": 0.1,
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}
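
The prune_type and prune_ratio settings in the preceding request control how many tokens are stored per document. The following Python sketch illustrates the idea under one assumption: that max_ratio keeps a token only when its weight is at least prune_ratio times the largest weight in the vector. The `prune_max_ratio` function and the sample weights are hypothetical and are not the processor's implementation.

```python
from typing import Dict


def prune_max_ratio(vec: Dict[str, float], prune_ratio: float) -> Dict[str, float]:
    """Keep tokens whose weight is at least prune_ratio * (largest weight)."""
    if not vec:
        return {}
    threshold = max(vec.values()) * prune_ratio
    return {token: weight for token, weight in vec.items() if weight >= threshold}


# Hypothetical weights: with a ratio of 0.1 the cutoff is 0.6, so "come" is dropped.
pruned = prune_max_ratio({"hello": 6.0, "world": 4.0, "so": 0.9, "come": 0.3}, 0.1)
```

Dropping low-weight tokens reduces index size, usually at a small cost in relevance.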

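To split long text into passages, use the text_chunking ingest processor before the sparse_encoding processor. For more information, see Text chunking.

Step 2(b): Create an index for ingestion

To use the sparse encoding processor defined in your pipeline, create a rank features index and add the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the field_map are mapped as the correct types. Continuing with the example, the passage_embedding field must be mapped as rank_features. Similarly, the passage_text field must be mapped as text.

The following example request creates a rank features index configured with a default ingest pipeline: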

PUT /my-nlp-index
{
  "settings": {
    "default_pipeline": "nlp-ingest-pipeline-sparse"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "rank_features"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}

To save disk space, you can exclude the embedding vector from the source as follows:

PUT /my-nlp-index
{
  "settings": {
    "default_pipeline": "nlp-ingest-pipeline-sparse"
  },
  "mappings": {
    "_source": {
      "excludes": [
        "passage_embedding"
      ]
    },
    "properties": {
      "id": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "rank_features"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}

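Once the <token, weight> pairs are excluded from the source, they cannot be recovered. Before applying this optimization, make sure your application does not need the <token, weight> pairs.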
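
The two index definitions in this step differ only in the _source exclusion, so a client can build both from one function. The following Python sketch does that; `build_index_body` and its parameters are hypothetical names introduced for illustration, and the output mirrors the request bodies shown in this step.

```python
from typing import Any, Dict


def build_index_body(
    pipeline: str,
    text_field: str,
    embedding_field: str,
    exclude_embedding_from_source: bool = False,
) -> Dict[str, Any]:
    """Build the create-index body for a rank features index."""
    mappings: Dict[str, Any] = {
        "properties": {
            "id": {"type": "text"},
            embedding_field: {"type": "rank_features"},
            text_field: {"type": "text"},
        }
    }
    if exclude_embedding_from_source:
        # Saves disk space, but the stored token weights cannot be recovered later.
        mappings["_source"] = {"excludes": [embedding_field]}
    return {"settings": {"default_pipeline": pipeline}, "mappings": mappings}
```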

Step 2(c): Ingest documents into the index

To ingest documents into the index created in the previous step, send the following requests:

PUT /my-nlp-index/_doc/1
{
  "passage_text": "Hello world",
  "id": "s1"
}

PUT /my-nlp-index/_doc/2
{
  "passage_text": "Hi planet",
  "id": "s2"
}

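Before a document is ingested into the index, the ingest pipeline runs the sparse_encoding processor on the document, generating vector embeddings for the passage_text field. The indexed document includes the passage_text field, which contains the original text, and the passage_embedding field, which contains the vector embeddings.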
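
For more than a handful of documents, you would normally send one Bulk API request instead of one PUT per document. The following Python sketch builds the newline-delimited body for `POST /_bulk`; the `build_bulk_body` helper is a hypothetical name introduced for illustration. The index's default pipeline is expected to apply to bulk-indexed documents as well, but verify this on your cluster.

```python
import json
from typing import Any, Dict, Iterable, Tuple


def build_bulk_body(index: str, docs: Iterable[Tuple[str, Dict[str, Any]]]) -> str:
    """Build an NDJSON body: one action line followed by one source line per document."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "my-nlp-index",
    [
        ("1", {"passage_text": "Hello world", "id": "s1"}),
        ("2", {"passage_text": "Hi planet", "id": "s2"}),
    ],
)
```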

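Step 3: Search the data

To perform a neural sparse search on your index, use the neural_sparse query clause in Query DSL queries.

The following example request uses a neural_sparse query to search for relevant documents using a raw text query. Provide the model ID for bi-encoder mode or the tokenizer ID for doc-only mode with a custom tokenizer: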

GET my-nlp-index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "Hi world",
        "model_id": "<bi-encoder or tokenizer ID>"
      }
    }
  }
}

The response contains the matching documents:

{
  "took" : 688,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 30.0029,
    "hits" : [
      {
        "_index" : "my-nlp-index",
        "_id" : "1",
        "_score" : 30.0029,
        "_source" : {
          "passage_text" : "Hello world",
          "passage_embedding" : {
            "!" : 0.8708904,
            "door" : 0.8587369,
            "hi" : 2.3929274,
            "worlds" : 2.7839446,
            "yes" : 0.75845814,
            "##world" : 2.5432441,
            "born" : 0.2682308,
            "nothing" : 0.8625516,
            "goodbye" : 0.17146169,
            "greeting" : 0.96817183,
            "birth" : 1.2788506,
            "come" : 0.1623208,
            "global" : 0.4371151,
            "it" : 0.42951578,
            "life" : 1.5750692,
            "thanks" : 0.26481047,
            "world" : 4.7300377,
            "tiny" : 0.5462298,
            "earth" : 2.6555297,
            "universe" : 2.0308156,
            "worldwide" : 1.3903781,
            "hello" : 6.696973,
            "so" : 0.20279501,
            "?" : 0.67785245
          },
          "id" : "s1"
        }
      },
      {
        "_index" : "my-nlp-index",
        "_id" : "2",
        "_score" : 16.480486,
        "_source" : {
          "passage_text" : "Hi planet",
          "passage_embedding" : {
            "hi" : 4.338913,
            "planets" : 2.7755864,
            "planet" : 5.0969057,
            "mars" : 1.7405145,
            "earth" : 2.6087382,
            "hello" : 3.3210192
          },
          "id" : "s2"
        }
      }
    ]
  }
}
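
When debugging relevance, it can help to see which expanded tokens dominate a document's embedding. The following Python sketch extracts the highest-weighted tokens from a search hit shaped like the ones in the preceding response. The `top_tokens` helper is a hypothetical name introduced for illustration, and it returns an empty list if the embedding was excluded from the source.

```python
from typing import Any, Dict, List, Tuple


def top_tokens(
    hit: Dict[str, Any], field: str = "passage_embedding", k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k highest-weighted (token, weight) pairs stored for a hit."""
    embedding = hit.get("_source", {}).get(field, {})
    return sorted(embedding.items(), key=lambda item: item[1], reverse=True)[:k]


# The second hit from the response above.
hit = {
    "_id": "2",
    "_source": {
        "passage_text": "Hi planet",
        "passage_embedding": {
            "hi": 4.338913, "planets": 2.7755864, "planet": 5.0969057,
            "mars": 1.7405145, "earth": 2.6087382, "hello": 3.3210192,
        },
    },
}
strongest = top_tokens(hit)
```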

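Configuring a default model for search

When using custom models, you can configure a default model ID at the index level to simplify your queries. This eliminates the need to specify the model_id in every query.

First, create a search pipeline with a neural_query_enricher processor: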

PUT /_search/pipeline/neural_search_pipeline
{
  "request_processors": [
    {
      "neural_query_enricher" : {
        "default_model_id": "<bi-encoder model/tokenizer ID>"
      }
    }
  ]
}

Then set this pipeline as the default search pipeline for your index:

PUT /my-nlp-index/_settings 
{
  "index.search.default_pipeline" : "neural_search_pipeline"
}

After configuring the default model, you can omit the model_id when running queries.
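
In client code, this means the model_id becomes an optional argument. The following Python sketch builds a neural_sparse query body with or without it; `build_neural_sparse_query` is a hypothetical helper name introduced for illustration, and the output follows the query shape shown in Step 3.

```python
from typing import Any, Dict, Optional


def build_neural_sparse_query(
    field: str, query_text: str, model_id: Optional[str] = None
) -> Dict[str, Any]:
    """Build a neural_sparse search body, omitting model_id when a default is configured."""
    clause: Dict[str, Any] = {"query_text": query_text}
    if model_id is not None:
        clause["model_id"] = model_id
    return {"query": {"neural_sparse": {field: clause}}}


# With the default search pipeline in place, the query needs only the text.
default_query = build_neural_sparse_query("passage_embedding", "Hi world")
```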

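For more information about setting a default model on an index, or to learn how to set a default model on a specific field, see Setting a default model on an index or field.

Next steps

Explore our tutorials to learn how to build AI search applications.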
