
Semantic search with text chunking

This tutorial shows you how to use text chunking to run semantic search on long documents in OpenSearch 2.19 or later.

In this tutorial, you'll use the following OpenSearch components:

- Text chunking processor
- ML inference ingest processor
- ML inference search request processor
- Template query

Replace the placeholders beginning with the prefix your_ with your own values.

Step 1: Create an embedding model

In this tutorial, you'll use the Amazon Bedrock Titan Text Embeddings model.

If using Python, you can create an Amazon Bedrock Titan embedding connector and test the model using the opensearch-py-ml client CLI. The CLI automates many configuration steps, making setup faster and reducing the chance of errors. For more information about using the CLI, see the CLI documentation.

If using self-managed OpenSearch, use the blueprint to create the model.

If using Amazon OpenSearch Service, use this Python notebook to create the model. Alternatively, you can follow this tutorial to create the connector manually.

Step 1.1: Create a connector

To create a connector, send the following request. Because you'll use an ML inference processor in this tutorial, you don't need to specify a pre- or post-processing function in the connector:

POST _plugins/_ml/connectors/_create
{
  "name": "Amazon Bedrock Connector: embedding",
  "description": "The connector to bedrock Titan embedding model",
  "version": 1,
  "protocol": "aws_sigv4",
  "parameters": {
    "region": "us-west-2",
    "service_name": "bedrock",
    "model": "amazon.titan-embed-text-v2:0",
    "dimensions": 1024,
    "normalize": true,
    "embeddingTypes": ["float"]
  },
  "credential": {
    "access_key": "your_aws_access_key",
    "secret_key": "your_aws_secret_key",
    "session_token": "your_aws_session_token"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://bedrock-runtime.${parameters.region}.amazonaws.com/model/${parameters.model}/invoke",
      "headers": {
        "content-type": "application/json",
        "x-amz-content-sha256": "required"
      },
      "request_body": "{ \"inputText\": \"${parameters.inputText}\", \"dimensions\": ${parameters.dimensions}, \"normalize\": ${parameters.normalize}, \"embeddingTypes\": ${parameters.embeddingTypes} }"
    }
  ]
}

The response contains a connector ID:

{
  "connector_id": "vhR15JQBLopfJ2xsx9p5"
}

Note the connector ID; you'll use it in the next step.

Step 1.2: Register the model

To register the model, send the following request:

POST _plugins/_ml/models/_register?deploy=true
{
  "name": "Bedrock embedding model",
  "function_name": "remote",
  "description": "Bedrock text embedding model v2",
  "connector_id": "vhR15JQBLopfJ2xsx9p5"
}

The response contains the model ID:

{
  "task_id": "xRR35JQBLopfJ2xsO9pU",
  "status": "CREATED",
  "model_id": "xhR35JQBLopfJ2xsO9pr"
}

Note the model ID; you'll use it in the following steps.
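Because the model was registered with deploy=true, it deploys automatically. Optionally, you can confirm this by retrieving the model and checking that its model_state is DEPLOYED:

GET /_plugins/_ml/models/xhR35JQBLopfJ2xsO9pr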

Step 1.3: Test the model

To test the model, send the following request:

POST /_plugins/_ml/models/xhR35JQBLopfJ2xsO9pr/_predict
{
    "parameters": {
        "inputText": "hello world"
    }
}

The response contains the embeddings generated by the model:

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "embedding": [
              -0.020442573353648186,...
            ],
            "embeddingsByType": {
              "float": [
                -0.020442573353648186, ...
              ]
            },
            "inputTextTokenCount": 3.0
          }
        }
      ],
      "status_code": 200
    }
  ]
}
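Because the connector's request body references ${parameters.dimensions}, you can override the default embedding size at prediction time. The following optional sketch assumes Amazon Titan Text Embeddings V2's supported output dimensions of 256, 512, and 1024. Note that the index created later in this tutorial expects 1024-dimensional vectors, so use this override for experimentation only:

POST /_plugins/_ml/models/xhR35JQBLopfJ2xsO9pr/_predict
{
    "parameters": {
        "inputText": "hello world",
        "dimensions": 256
    }
}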

Step 2: Create an ingest pipeline

Many text embedding models have input size limits. The Amazon Titan Text Embeddings V2 model supports a maximum of 8,192 text tokens. To process long documents, you need to split them into smaller chunks and send each chunk to the model. The text chunking processor splits the original document into smaller pieces, and the ML inference processor generates an embedding for each chunk. In the following pipeline, token_limit is set to 100 and overlap_rate is set to 0.2, so each chunk contains at most 100 tokens and consecutive chunks overlap by about 20 tokens, preserving context across chunk boundaries. To create an ingest pipeline containing both processors, send the following request:

PUT _ingest/pipeline/bedrock-text-embedding-pipeline
{
  "description": "ingest reviews, generate embedding, and format chunks",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 100,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "foreach": {
        "field": "passage_chunk",
        "processor": {
          "set": {
            "field": "_ingest._value",
            "value": {
              "text": ""
            }
          }
        }
      }
    },
    {
      "foreach": {
        "field": "passage_chunk",
        "processor": {
          "ml_inference": {
            "model_id": "xhR35JQBLopfJ2xsO9pr",
            "input_map": [
              {
                "inputText": "_ingest._value.text"
              }
            ],
            "output_map": [
              {
                "_ingest._value.embedding": "embedding"
              }
            ]
          }
        }
      }
    }
  ]
}

In this pipeline, the first foreach processor uses a set processor to wrap each chunk string in an object with a text field (the mustache template {{_ingest._value}} refers to the chunk currently being processed), and the second foreach processor stores each chunk's embedding next to its text. To test the pipeline, send the following request:

POST _ingest/pipeline/bedrock-text-embedding-pipeline/_simulate
{
  "docs": [
    {
      "_index": "testindex",
      "_id": "1",
      "_source":{
         "passage_text": "Ingest pipelines\nAn ingest pipeline is a sequence of processors that are applied to documents as they are ingested into an index. Each processor in a pipeline performs a specific task, such as filtering, transforming, or enriching data.\n\nProcessors are customizable tasks that run in a sequential order as they appear in the request body. This order is important, as each processor depends on the output of the previous processor. The modified documents appear in your index after the processors are applied.\n\nOpenSearch ingest pipelines compared to OpenSearch Data Prepper\nOpenSeach ingest pipelines run within the OpenSearch cluster, whereas OpenSearch Data Prepper is an external component that runs on the OpenSearch cluster.\n\nOpenSearch ingest pipelines perform actions on indexes and are preferred for use cases involving pre-processing simple datasets, machine learning (ML) processors, and vector embedding processors. OpenSearch ingest pipelines are recommended for simple data pre-processing and small datasets.\n\nOpenSearch Data Prepper is recommended for any data processing tasks it supports, particularly when dealing with large datasets and complex data pre-processing requirements. It streamlines the process of transferring and fetching large datasets while providing robust capabilities for intricate data preparation and transformation operations. Refer to the OpenSearch Data Prepper documentation for more information.\n\nOpenSearch ingest pipelines can only be managed using Ingest API operations.\n\nPrerequisites\nThe following are prerequisites for using OpenSearch ingest pipelines:\n\nWhen using ingestion in a production environment, your cluster should contain at least one node with the node roles permission set to ingest. For information about setting up node roles within a cluster, see Cluster Formation.\nIf the OpenSearch Security plugin is enabled, you must have the cluster_manage_pipelines permission to manage ingest pipelines.\nDefine a pipeline\nA pipeline definition describes the sequence of an ingest pipeline and can be written in JSON format. An ingest pipeline consists of the following:\n\n{\n    \"description\" : \"...\"\n    \"processors\" : [...]\n}\nRequest body fields\nField\tRequired\tType\tDescription\nprocessors\tRequired\tArray of processor objects\tA component that performs a specific data processing task as the data is being ingested into OpenSearch.\ndescription\tOptional\tString\tA description of the ingest pipeline.\n"
      }
    }
  ]
}

The response shows the processed document, which has been split into chunks, each with its own embedding. Note the effect of overlap_rate: the second chunk begins by repeating the end of the first chunk:

{
  "docs": [
    {
      "doc": {
        "_index": "testindex",
        "_id": "1",
        "_source": {
          "passage_text": """Ingest pipelines
An ingest pipeline is a sequence of processors that are applied to documents as they are ingested into an index. Each processor in a pipeline performs a specific task, such as filtering, transforming, or enriching data.

Processors are customizable tasks that run in a sequential order as they appear in the request body. This order is important, as each processor depends on the output of the previous processor. The modified documents appear in your index after the processors are applied.

OpenSearch ingest pipelines compared to OpenSearch Data Prepper
OpenSeach ingest pipelines run within the OpenSearch cluster, whereas OpenSearch Data Prepper is an external component that runs on the OpenSearch cluster.

OpenSearch ingest pipelines perform actions on indexes and are preferred for use cases involving pre-processing simple datasets, machine learning (ML) processors, and vector embedding processors. OpenSearch ingest pipelines are recommended for simple data pre-processing and small datasets.

OpenSearch Data Prepper is recommended for any data processing tasks it supports, particularly when dealing with large datasets and complex data pre-processing requirements. It streamlines the process of transferring and fetching large datasets while providing robust capabilities for intricate data preparation and transformation operations. Refer to the OpenSearch Data Prepper documentation for more information.

OpenSearch ingest pipelines can only be managed using Ingest API operations.

Prerequisites
The following are prerequisites for using OpenSearch ingest pipelines:

When using ingestion in a production environment, your cluster should contain at least one node with the node roles permission set to ingest. For information about setting up node roles within a cluster, see Cluster Formation.
If the OpenSearch Security plugin is enabled, you must have the cluster_manage_pipelines permission to manage ingest pipelines.
Define a pipeline
A pipeline definition describes the sequence of an ingest pipeline and can be written in JSON format. An ingest pipeline consists of the following:

{
    "description" : "..."
    "processors" : [...]
}
Request body fields
Field	Required	Type	Description
processors	Required	Array of processor objects	A component that performs a specific data processing task as the data is being ingested into OpenSearch.
description	Optional	String	A description of the ingest pipeline.
""",
          "passage_chunk": [
            {
              "text": """Ingest pipelines\nAn ingest pipeline is a sequence of processors that are applied to documents as they are ingested into an index. Each processor in a pipeline performs a specific task, such as filtering, transforming, or enriching data.\n\nProcessors are customizable tasks that run in a sequential order as they appear in the request body. This order is important, as each processor depends on the output of the previous processor. The modified documents appear in your index after the processors are applied.\n\nOpenSearch ingest pipelines compared to OpenSearch Data Prepper\nOpenSeach ingest pipelines run within the OpenSearch cluster, whereas OpenSearch Data Prepper is an external component that runs on the OpenSearch cluster.\n\nOpenSearch ingest pipelines perform actions on indexes and are preferred for use cases involving pre-processing simple datasets, machine learning (ML) processors, and vector embedding processors. OpenSearch ingest pipelines are recommended for simple data pre-processing and small datasets.\n\nOpenSearch Data Prepper is recommended for any data processing tasks it supports, particularly when dealing with large datasets and complex data pre-processing requirements. It streamlines the process of transferring and fetching large datasets while providing robust capabilities for intricate data preparation and transformation operations. Refer to the OpenSearch """,
              "embedding": [
                0.04044651612639427,
                ...
              ]
            },
            {
              "text": """tasks it supports, particularly when dealing with large datasets and complex data pre-processing requirements. It streamlines the process of transferring and fetching large datasets while providing robust capabilities for intricate data preparation and transformation operations. Refer to the OpenSearch Data Prepper documentation for more information.\n\nOpenSearch ingest pipelines can only be managed using Ingest API operations.\n\nPrerequisites\nThe following are prerequisites for using OpenSearch ingest pipelines:\n\nWhen using ingestion in a production environment, your cluster should contain at least one node with the node roles permission set to ingest. For information about setting up node roles within a cluster, see Cluster Formation.\nIf the OpenSearch Security plugin is enabled, you must have the cluster_manage_pipelines permission to manage ingest pipelines.\nDefine a pipeline\nA pipeline definition describes the sequence of an ingest pipeline and can be written in JSON format. An ingest pipeline consists of the following:\n\n{\n    \"description\" : \"...\"\n    \"processors\" : [...]\n}\nRequest body fields\nField\tRequired\tType\tDescription\nprocessors\tRequired\tArray of processor objects\tA component that performs a specific data processing task as the data is being ingested into OpenSearch.\ndescription\tOptional\tString\tA description of the ingest pipeline.\n""",
              "embedding": [
                0.02055041491985321,
                ...
              ]
            }
          ]
        },
        "_ingest": {
          "_value": null,
          "timestamp": "2025-02-08T07:49:43.484543119Z"
        }
      }
    }
  ]
}

Step 3: Create an index and ingest data

To create a vector index, send the following request. The passage_chunk field is mapped as a nested field so that each chunk's text stays paired with its embedding:

PUT opensearch_docs
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "bedrock-text-embedding-pipeline"
  },
  "mappings": {
    "properties": {
      "passage_chunk": {
        "type": "nested",
        "properties": {
          "text": {
            "type": "text"
          },
          "embedding": {
            "type": "knn_vector",
            "dimension": 1024
          }
        }
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}

Ingest test data into the index:

POST _bulk
{"index": {"_index": "opensearch_docs"}}
{"passage_text": "Ingest pipelines\nAn ingest pipeline is a sequence of processors that are applied to documents as they are ingested into an index. Each processor in a pipeline performs a specific task, such as filtering, transforming, or enriching data.\n\nProcessors are customizable tasks that run in a sequential order as they appear in the request body. This order is important, as each processor depends on the output of the previous processor. The modified documents appear in your index after the processors are applied.\n\nOpenSearch ingest pipelines compared to OpenSearch Data Prepper\nOpenSeach ingest pipelines run within the OpenSearch cluster, whereas OpenSearch Data Prepper is an external component that runs on the OpenSearch cluster.\n\nOpenSearch ingest pipelines perform actions on indexes and are preferred for use cases involving pre-processing simple datasets, machine learning (ML) processors, and vector embedding processors. OpenSearch ingest pipelines are recommended for simple data pre-processing and small datasets.\n\nOpenSearch Data Prepper is recommended for any data processing tasks it supports, particularly when dealing with large datasets and complex data pre-processing requirements. It streamlines the process of transferring and fetching large datasets while providing robust capabilities for intricate data preparation and transformation operations. Refer to the OpenSearch Data Prepper documentation for more information.\n\nOpenSearch ingest pipelines can only be managed using Ingest API operations.\n\nPrerequisites\nThe following are prerequisites for using OpenSearch ingest pipelines:\n\nWhen using ingestion in a production environment, your cluster should contain at least one node with the node roles permission set to ingest. For information about setting up node roles within a cluster, see Cluster Formation.\nIf the OpenSearch Security plugin is enabled, you must have the cluster_manage_pipelines permission to manage ingest pipelines.\nDefine a pipeline\nA pipeline definition describes the sequence of an ingest pipeline and can be written in JSON format. An ingest pipeline consists of the following:\n\n{\n    \"description\" : \"...\"\n    \"processors\" : [...]\n}\nRequest body fields\nField\tRequired\tType\tDescription\nprocessors\tRequired\tArray of processor objects\tA component that performs a specific data processing task as the data is being ingested into OpenSearch.\ndescription\tOptional\tString\tA description of the ingest pipeline.\n"}
{"index": {"_index": "opensearch_docs"}}
{"passage_text": "Monitors\nProactively monitor your data in OpenSearch with features available in Alerting and Anomaly Detection. For example, you can pair Anomaly Detection with Alerting to ensure that you’re notified as soon as an anomaly is detected. You can do this by setting up a detector to automatically detect outliers in your streaming data and monitors to alert you through notifications when data exceeds certain thresholds.\n\nMonitor types\nThe Alerting plugin provides the following monitor types:\n\nper query: Runs a query and generates alert notifications based on the matching criteria. See Per query monitors for information about creating and using this monitor type.\nper bucket: Runs a query that evaluates trigger criteria based on aggregated values in the dataset. See Per bucket monitors for information about creating and using this monitor type.\nper cluster metrics: Runs API requests on the cluster to monitor its health. See Per cluster metrics monitors for information about creating and using this monitor type.\nper document: Runs a query (or multiple queries combined by a tag) that returns individual documents that match the alert notification trigger condition. See Per document monitors for information about creating and using this monitor type.\ncomposite monitor: Runs multiple monitors in a single workflow and generates a single alert based on multiple trigger conditions. See Composite monitors for information about creating and using this monitor type.\nThe maximum number of monitors you can create is 1,000. You can change the default maximum number of alerts for your cluster by updating the plugins.alerting.monitor.max_monitors setting using the cluster settings API."}
{"index": {"_index": "opensearch_docs"}}
{"passage_text": "Search pipelines\nYou can use search pipelines to build new or reuse existing result rerankers, query rewriters, and other components that operate on queries or results. Search pipelines make it easier for you to process search queries and search results within OpenSearch. Moving some of your application functionality into an OpenSearch search pipeline reduces the overall complexity of your application. As part of a search pipeline, you specify a list of processors that perform modular tasks. You can then easily add or reorder these processors to customize search results for your application.\n\nTerminology\nThe following is a list of search pipeline terminology:\n\nSearch request processor: A component that intercepts a search request (the query and the metadata passed in the request), performs an operation with or on the search request, and returns the search request.\nSearch response processor: A component that intercepts a search response and search request (the query, results, and metadata passed in the request), performs an operation with or on the search response, and returns the search response.\nSearch phase results processor: A component that runs between search phases at the coordinating node level. A search phase results processor intercepts the results retrieved from one search phase and transforms them before passing them to the next search phase.\nProcessor: Either a search request processor or a search response processor.\nSearch pipeline: An ordered list of processors that is integrated into OpenSearch. The pipeline intercepts a query, performs processing on the query, sends it to OpenSearch, intercepts the results, performs processing on the results, and returns them to the calling application, as shown in the following diagram.\n"}

To verify that the documents were processed correctly, search the index to view the generated chunks and embeddings:

GET opensearch_docs/_search
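Because the returned embeddings are long, you may want to view only the chunk text. For example, the following request uses standard _source filtering to return the generated chunks without their embeddings:

GET opensearch_docs/_search
{
  "size": 1,
  "_source": {
    "includes": ["passage_text", "passage_chunk.text"]
  }
}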

Step 4: Search using an ML inference processor

Create a search pipeline containing an ML inference processor that converts the input text into an embedding:

PUT _search/pipeline/bedrock_semantic_search_pipeline
{
  "request_processors": [
    {
      "ml_inference": {
        "model_id": "xhR35JQBLopfJ2xsO9pr",
        "input_map": [
          {
            "inputText": "ext.ml_inference.params.text"
          }
        ],
        "output_map": [
          {
            "ext.ml_inference.params.vector": "embedding"
          }
        ]
      }
    }
  ]
}

To run a semantic search, use the following template query:

GET opensearch_docs/_search?search_pipeline=bedrock_semantic_search_pipeline
{
  "query": {
    "template": {
      "nested": {
        "path": "passage_chunk",
        "query": {
          "knn": {
            "passage_chunk.embedding": {
              "vector": "${ext.ml_inference.params.vector}",
              "k": 5
            }
          }
        }
      }
    }
  },
  "ext": {
    "ml_inference": {
      "params": {
        "text": "What's OpenSearch ingest pipeline"
      }
    }
  },
  "_source": {
    "excludes": [
      "passage_chunk"
    ]
  },
  "size": 1
}

The pipeline maps inputText to ext.ml_inference.params.text. During input processing, the pipeline retrieves the value at the path ext.ml_inference.params.text in the search request. In this example, the value at this path is "What's OpenSearch ingest pipeline", and it is passed to the model in the inputText parameter.

During the search, the search query references "vector": "${ext.ml_inference.params.vector}". This vector value is not provided in the initial search request; instead, the ML inference processor generates it by calling the Amazon Bedrock Titan Embeddings model. The model creates an embedding vector from your search text and stores the vector in ext.ml_inference.params.vector. OpenSearch then uses this generated vector to find similar documents. The response contains the best-matching document:

{
  "took": 398,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.78014797,
    "hits": [
      {
        "_index": "opensearch_docs",
        "_id": "rj2T5JQBg4dihuRifxJT",
        "_score": 0.78014797,
        "_source": {
          "passage_text": """Ingest pipelines
An ingest pipeline is a sequence of processors that are applied to documents as they are ingested into an index. Each processor in a pipeline performs a specific task, such as filtering, transforming, or enriching data.

Processors are customizable tasks that run in a sequential order as they appear in the request body. This order is important, as each processor depends on the output of the previous processor. The modified documents appear in your index after the processors are applied.

OpenSearch ingest pipelines compared to OpenSearch Data Prepper
OpenSeach ingest pipelines run within the OpenSearch cluster, whereas OpenSearch Data Prepper is an external component that runs on the OpenSearch cluster.

OpenSearch ingest pipelines perform actions on indexes and are preferred for use cases involving pre-processing simple datasets, machine learning (ML) processors, and vector embedding processors. OpenSearch ingest pipelines are recommended for simple data pre-processing and small datasets.

OpenSearch Data Prepper is recommended for any data processing tasks it supports, particularly when dealing with large datasets and complex data pre-processing requirements. It streamlines the process of transferring and fetching large datasets while providing robust capabilities for intricate data preparation and transformation operations. Refer to the OpenSearch Data Prepper documentation for more information.

OpenSearch ingest pipelines can only be managed using Ingest API operations.

Prerequisites
The following are prerequisites for using OpenSearch ingest pipelines:

When using ingestion in a production environment, your cluster should contain at least one node with the node roles permission set to ingest. For information about setting up node roles within a cluster, see Cluster Formation.
If the OpenSearch Security plugin is enabled, you must have the cluster_manage_pipelines permission to manage ingest pipelines.
Define a pipeline
A pipeline definition describes the sequence of an ingest pipeline and can be written in JSON format. An ingest pipeline consists of the following:

{
    "description" : "..."
    "processors" : [...]
}
Request body fields
Field	Required	Type	Description
processors	Required	Array of processor objects	A component that performs a specific data processing task as the data is being ingested into OpenSearch.
description	Optional	String	A description of the ingest pipeline.
"""
        }
      }
    ]
  }
}
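Optionally, to clean up the resources created in this tutorial, you can delete the index, the pipelines, the model, and the connector. Note that the model must be undeployed before it can be deleted:

DELETE opensearch_docs
DELETE _ingest/pipeline/bedrock-text-embedding-pipeline
DELETE /_search/pipeline/bedrock_semantic_search_pipeline
POST /_plugins/_ml/models/xhR35JQBLopfJ2xsO9pr/_undeploy
DELETE /_plugins/_ml/models/xhR35JQBLopfJ2xsO9pr
DELETE /_plugins/_ml/connectors/vhR15JQBLopfJ2xsx9p5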