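Semantic search using an asymmetric embedding model
This tutorial shows how to perform semantic search by generating text embeddings with an asymmetric embedding model. It uses the multilingual intfloat/multilingual-e5-small model from Hugging Face. For more information, see Semantic search.
Replace the placeholders prefixed with your_ with your own values.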
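Step 1: Update cluster settings
To configure the cluster so that you can register models using external URLs and run models on non-machine learning (ML) nodes, send the following request: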
PUT _cluster/settings
{
"persistent": {
"plugins.ml_commons.allow_registering_model_via_url": "true",
"plugins.ml_commons.only_run_on_ml_node": "false",
"plugins.ml_commons.model_access_control_enabled": "true",
"plugins.ml_commons.native_memory_threshold": "99"
}
}
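Step 2: Prepare the model for use in OpenSearch
In this tutorial, you'll use the Hugging Face intfloat/multilingual-e5-small model. Follow these steps to prepare the model and compress it into a zip file for use in OpenSearch.
Step 2.1: Download the model from Hugging Face
To download the model, use the following steps:
- If you haven't already set up Git Large File Storage (LFS), install the git-lfs package for your platform and then initialize it: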
git lfs install
- Clone the model repository:
git clone https://huggingface.co/intfloat/multilingual-e5-small
The model files are now downloaded into a directory on your local machine.
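Step 2.2: Compress the model files
To upload the model to OpenSearch, you must compress the required model files (model.onnx, sentencepiece.bpe.model, and tokenizer.json). You can find these files in the onnx directory of the cloned repository.
To compress the files, run the following command in the directory that contains them: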
zip -r intfloat-multilingual-e5-small-onnx.zip model.onnx tokenizer.json sentencepiece.bpe.model
The files are now archived in a zip file named intfloat-multilingual-e5-small-onnx.zip.
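If the zip command is not available on your machine, the same archive layout can be produced with Python's standard zipfile module. The following is a minimal sketch, not part of the original tutorial; the function name zip_model_files and the MODEL_FILES constant are hypothetical names chosen for illustration.

```python
import zipfile
from pathlib import Path

# The three files the tutorial says must be in the archive.
MODEL_FILES = ["model.onnx", "tokenizer.json", "sentencepiece.bpe.model"]


def zip_model_files(source_dir, archive_path, file_names=MODEL_FILES):
    """Write the named files from source_dir into a flat zip archive."""
    source_dir = Path(source_dir)
    missing = [name for name in file_names if not (source_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing model files: {missing}")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in file_names:
            # arcname keeps each file at the archive root, with no directory prefix.
            archive.write(source_dir / name, arcname=name)
    return Path(archive_path)
```

An archive written this way is not byte-for-byte identical to one produced by the zip command, so compute the hash on whichever archive you actually serve to OpenSearch.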
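Step 2.3: Calculate the hash of the model file
Before registering the model, you must calculate the SHA-256 hash of the zip file. Run the following command to generate the hash: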
shasum -a 256 intfloat-multilingual-e5-small-onnx.zip
Note the hash value; you'll need it during model registration.
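On systems without shasum, the same SHA-256 digest can be computed with Python's standard hashlib module. This is a small sketch; the function name sha256_of_file is a hypothetical helper, not part of any OpenSearch tooling.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large model archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha256_of_file("intfloat-multilingual-e5-small-onnx.zip") in the directory that contains the archive returns the 64-character hexadecimal string to use as the content hash.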
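Step 2.4: Serve the model file using a Python HTTP server
For OpenSearch to access the model file, you can serve it over HTTP. Because this tutorial uses a local development environment, you can use Python's built-in HTTP server command.
Navigate to the directory containing the zip file and run the following command: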
python3 -m http.server 8080 --bind 0.0.0.0
This serves the zip file at http://0.0.0.0:8080/intfloat-multilingual-e5-small-onnx.zip. After registering the model, you can stop the server by pressing Ctrl+C.
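Step 3: Register a model group
Before registering the model itself, you need to create a model group, which helps organize models in OpenSearch. Run the following request to create a new model group: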
POST /_plugins/_ml/model_groups/_register
{
"name": "Asymmetric Model Group",
"description": "A model group for local asymmetric models"
}
Note the model group ID returned in the response; you'll use it to register the model.
Step 4: Register the model
Now that you have the model zip file and the model group ID, you can register the model in OpenSearch:
POST /_plugins/_ml/models/_register
{
"name": "e5-small-onnx",
"version": "1.0.0",
"description": "Asymmetric multilingual-e5-small model",
"model_format": "ONNX",
"model_group_id": "your_group_id",
"model_content_hash_value": "your_model_zip_content_hash_value",
"model_config": {
"model_type": "bert",
"embedding_dimension": 384,
"framework_type": "sentence_transformers",
"query_prefix": "query: ",
"passage_prefix": "passage: ",
"all_config": "{ \"_name_or_path\": \"intfloat/multilingual-e5-small\", \"architectures\": [ \"BertModel\" ], \"attention_probs_dropout_prob\": 0.1, \"hidden_size\": 384, \"num_attention_heads\": 12, \"num_hidden_layers\": 12, \"tokenizer_class\": \"XLMRobertaTokenizer\" }"
},
"url": "http://your_http_server_host:8080/intfloat-multilingual-e5-small-onnx.zip"
}
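Replace your_group_id and your_model_zip_content_hash_value with the values from the previous steps. This starts the model registration process, and you'll receive a task ID in the response.
To check the registration status, run the following request: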
GET /_plugins/_ml/tasks/your_task_id
After the task completes, note the model ID; you'll need it for deployment and inference.
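Rather than rerunning the task request by hand, you can poll it from a script. The sketch below assumes that the task response contains a state field that eventually becomes COMPLETED or FAILED and, on success, a model_id field; verify these names against the responses from your own cluster. The helper names fetch_task and wait_for_task are hypothetical, and fetch_task assumes an unsecured local endpoint.

```python
import json
import time
import urllib.request


def fetch_task(base_url, task_id):
    """GET /_plugins/_ml/tasks/<task_id> and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/_plugins/_ml/tasks/{task_id}") as response:
        return json.load(response)


def wait_for_task(task_id, fetch, interval_seconds=2.0, max_attempts=60, sleep=time.sleep):
    """Call fetch(task_id) repeatedly until the task completes, fails, or times out."""
    for _ in range(max_attempts):
        task = fetch(task_id)
        state = task.get("state")  # assumed field name
        if state == "COMPLETED":
            return task
        if state == "FAILED":
            raise RuntimeError(f"Task {task_id} failed: {task.get('error')}")
        sleep(interval_seconds)
    raise TimeoutError(f"Task {task_id} did not complete after {max_attempts} checks")


# Example usage against a local cluster (assumed URL, not executed here):
# task = wait_for_task(
#     "your_task_id",
#     fetch=lambda task_id: fetch_task("http://localhost:9200", task_id),
# )
# print(task.get("model_id"))
```

The same loop can be reused for the deployment task in the next step because only the task ID changes.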
Step 5: Deploy the model
After the model is registered, deploy it by running the following request:
POST /_plugins/_ml/models/your_model_id/_deploy
Use the task ID to check the deployment status:
GET /_plugins/_ml/tasks/your_task_id
After the model is successfully deployed, its status changes to DEPLOYED and it is ready to use.
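Step 6: Generate embeddings
Now that your model is deployed, you can use it to generate text embeddings for both queries and passages.
Generating passage embeddings
To generate embeddings for a passage, use the following request: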
POST /_plugins/_ml/_predict/text_embedding/your_model_id
{
"parameters": {
"content_type": "passage"
},
"text_docs": [
"Today is Friday, tomorrow will be my break day. After that, I will go to the library. When is lunch?"
],
"target_response": ["sentence_embedding"]
}
The response contains the generated embeddings:
{
"inference_results": [
{
"output": [
{
"name": "sentence_embedding",
"data_type": "FLOAT32",
"shape": [384],
"data": [0.0419328, 0.047480892, ..., 0.31158513, 0.21784715]
}
]
}
]
}
Generating query embeddings
Similarly, you can generate embeddings for a query:
POST /_plugins/_ml/_predict/text_embedding/your_model_id
{
"parameters": {
"content_type": "query"
},
"text_docs": ["What day is it today?"],
"target_response": ["sentence_embedding"]
}
The response contains the generated embeddings:
{
"inference_results": [
{
"output": [
{
"name": "sentence_embedding",
"data_type": "FLOAT32",
"shape": [384],
"data": [0.2338349, -0.13603798, ..., 0.37335885, 0.10653384]
}
]
}
]
}
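The two requests above differ only in content_type, which is what makes the model asymmetric: the E5 models expect query text and passage text to carry different prefixes, and the model was registered with query_prefix and passage_prefix for that purpose. The following conceptual sketch illustrates the idea under the assumption that the matching prefix is prepended to the input text before it is embedded; apply_prefix is a hypothetical function for illustration, not an OpenSearch API.

```python
# Prefixes copied from the model_config used when registering the model.
PREFIXES = {"query": "query: ", "passage": "passage: "}


def apply_prefix(text, content_type):
    """Return the text with the prefix that matches its content type."""
    if content_type not in PREFIXES:
        raise ValueError(f"content_type must be one of {sorted(PREFIXES)}")
    return PREFIXES[content_type] + text


print(apply_prefix("What day is it today?", "query"))
print(apply_prefix("Today is Friday, tomorrow will be my break day.", "passage"))
```

Because the prefixes differ, the same sentence produces different vectors depending on whether it is embedded as a query or as a passage.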
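Step 7: Run semantic search
Now you'll run semantic search using the generated embeddings. First, you'll create an ingest pipeline that uses an ML inference processor to create document embeddings during ingestion. Then you'll create a search pipeline that generates query embeddings using the same asymmetric embedding model.
Step 7.1: Create a vector index
To create a vector index, send the following request: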
PUT nyc_facts
{
"settings": {
"index": {
"default_pipeline": "asymmetric_embedding_ingest_pipeline",
"knn": true,
"knn.algo_param.ef_search": 100
}
},
"mappings": {
"properties": {
"fact_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "nmslib",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
}
}
}
}
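The index above uses the l2 space type, so documents are ranked by squared Euclidean distance between the query vector and each stored vector, and the k-NN plugin reports a score of 1 / (1 + distance), where higher means closer. The following sketch reproduces that calculation for two vectors; the function names are hypothetical and the short vectors are made-up sample data, not real embeddings.

```python
def squared_l2_distance(a, b):
    """Sum of squared element-wise differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2_score(a, b):
    """Convert a squared L2 distance into a score in (0, 1], where 1 is an exact match."""
    return 1.0 / (1.0 + squared_l2_distance(a, b))


print(l2_score([1.0, 0.0], [0.0, 0.0]))  # squared distance 1, so the score is 0.5
```

This explains why the _score values in a search response are small positive numbers rather than similarity values close to 1.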
Step 7.2: Create an ingest pipeline
To create an ingest pipeline that generates document embeddings, send the following request:
PUT _ingest/pipeline/asymmetric_embedding_ingest_pipeline
{
"description": "ingest passage text and generate an embedding using an asymmetric model",
"processors": [
{
"ml_inference": {
"model_input": "{\"text_docs\":[\"${input_map.text_docs}\"],\"target_response\":[\"sentence_embedding\"],\"parameters\":{\"content_type\":\"passage\"}}",
"function_name": "text_embedding",
"model_id": "your_model_id",
"input_map": [
{
"text_docs": "description"
}
],
"output_map": [
{
"fact_embedding": "$.inference_results[0].output[0].data",
"embedding_size": "$.inference_results.*.output.*.shape[0]"
}
]
}
}
]
}
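The output_map entry $.inference_results[0].output[0].data selects the vector from a prediction response shaped like the ones shown in Step 6. The sketch below performs the same lookup in Python and checks that the vector length matches the dimension of 384 declared for the fact_embedding field; extract_embedding is a hypothetical helper, and the sample response in the usage lines is shortened, made-up data.

```python
def extract_embedding(predict_response, expected_dimension=384):
    """Return the embedding at inference_results[0].output[0].data after a size check."""
    output = predict_response["inference_results"][0]["output"][0]
    embedding = output["data"]
    if output.get("shape") != [expected_dimension] or len(embedding) != expected_dimension:
        raise ValueError(
            f"Expected a {expected_dimension}-dimensional embedding, "
            f"got shape {output.get('shape')} with {len(embedding)} values"
        )
    return embedding


# Usage with a shortened, made-up response (3 dimensions instead of 384):
sample = {
    "inference_results": [
        {"output": [{"name": "sentence_embedding", "shape": [3], "data": [0.1, 0.2, 0.3]}]}
    ]
}
print(extract_embedding(sample, expected_dimension=3))
```

A mismatch between the model's output size and the index mapping is a common cause of ingestion errors, so a check like this can be useful before indexing data.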
Step 7.3: Test the pipeline
Test the pipeline by running the following request:
POST /_ingest/pipeline/asymmetric_embedding_ingest_pipeline/_simulate
{
"docs": [
{
"_index": "my-index",
"_id": "1",
"_source": {
"title": "Central Park",
"description": "A large public park in the heart of New York City, offering a wide range of recreational activities."
}
}
]
}
The response contains the embeddings generated by the model:
{
"docs": [
{
"doc": {
"_index": "my-index",
"_id": "1",
"_source": {
"description": "A large public park in the heart of New York City, offering a wide range of recreational activities.",
"fact_embedding": [
[
0.06344555,
0.30067796,
...
0.014804064,
-0.022822019
]
],
"title": "Central Park",
"embedding_size": [
384.0
]
},
"_ingest": {
"timestamp": "2024-12-16T20:59:07.152169Z"
}
}
}
]
}
Step 7.4: Ingest data
When you perform bulk ingestion, the ingest pipeline generates an embedding for each document:
POST /_bulk
{ "index": { "_index": "nyc_facts" } }
{ "title": "Central Park", "description": "A large public park in the heart of New York City, offering a wide range of recreational activities." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Empire State Building", "description": "An iconic skyscraper in New York City offering breathtaking views from its observation deck." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Statue of Liberty", "description": "A colossal neoclassical sculpture on Liberty Island, symbolizing freedom and democracy in the United States." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Brooklyn Bridge", "description": "A historic suspension bridge connecting Manhattan and Brooklyn, offering pedestrian walkways with great views." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Times Square", "description": "A bustling commercial and entertainment hub in Manhattan, known for its neon lights and Broadway theaters." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Yankee Stadium", "description": "Home to the New York Yankees, this baseball stadium is a historic landmark in the Bronx." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "The Bronx Zoo", "description": "One of the largest zoos in the world, located in the Bronx, featuring diverse animal exhibits and conservation efforts." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "New York Botanical Garden", "description": "A large botanical garden in the Bronx, known for its diverse plant collections and stunning landscapes." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Flushing Meadows-Corona Park", "description": "A major park in Queens, home to the USTA Billie Jean King National Tennis Center and the Unisphere." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Citi Field", "description": "The home stadium of the New York Mets, located in Queens, known for its modern design and fan-friendly atmosphere." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Rockefeller Center", "description": "A famous complex of commercial buildings in Manhattan, home to the NBC studios and the annual ice skating rink." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Queens Botanical Garden", "description": "A peaceful, beautiful botanical garden located in Flushing, Queens, featuring seasonal displays and plant collections." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Arthur Ashe Stadium", "description": "The largest tennis stadium in the world, located in Flushing Meadows-Corona Park, Queens, hosting the U.S. Open." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Wave Hill", "description": "A public garden and cultural center in the Bronx, offering stunning views of the Hudson River and a variety of nature programs." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Louis Armstrong House", "description": "The former home of jazz legend Louis Armstrong, located in Corona, Queens, now a museum celebrating his life and music." }
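The bulk request body is newline-delimited JSON: each document is preceded by an action line, and the body must end with a newline. If your facts live in a Python list, a body in this format can be generated as in the following sketch; build_bulk_body is a hypothetical helper name.

```python
import json


def build_bulk_body(index_name, documents):
    """Return an NDJSON _bulk body with one index action line per document."""
    lines = []
    for document in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(document))
    # The bulk API requires the body to end with a newline character.
    return "\n".join(lines) + "\n"


facts = [
    {"title": "Central Park", "description": "A large public park in the heart of New York City, offering a wide range of recreational activities."},
    {"title": "Yankee Stadium", "description": "Home to the New York Yankees, this baseball stadium is a historic landmark in the Bronx."},
]
print(build_bulk_body("nyc_facts", facts), end="")
```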
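Step 7.5: Create a search pipeline
Create a search pipeline that converts your query into an embedding and runs a vector search on the index to return the best-matching documents: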
PUT /_search/pipeline/asymmetric_embedding_search_pipeline
{
"description": "generate a query embedding using an asymmetric model and run a vector search",
"request_processors": [
{
"ml_inference": {
"query_template": "{\"size\": 3,\"query\": {\"knn\": {\"fact_embedding\": {\"vector\": ${query_embedding},\"k\": 4}}}}",
"function_name": "text_embedding",
"model_id": "your_model_id",
"model_input": "{ \"text_docs\": [\"${input_map.query}\"], \"target_response\": [\"sentence_embedding\"], \"parameters\" : {\"content_type\" : \"query\" } }",
"input_map": [
{
"query": "query.term.fact_embedding.value"
}
],
"output_map": [
{
"query_embedding": "$.inference_results[0].output[0].data",
"embedding_size": "$.inference_results.*.output.*.shape[0]"
}
]
}
}
]
}
Step 7.6: Run a query
Run a query using the search pipeline created in the previous step:
GET /nyc_facts/_search?search_pipeline=asymmetric_embedding_search_pipeline
{
"query": {
"term": {
"fact_embedding": {
"value": "What are some places for sports in NYC?",
"boost": 1
}
}
}
}
The response contains the top three matching documents:
{
"took": 22,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 0.12496973,
"hits": [
{
"_index": "nyc_facts",
"_id": "hb9X0ZMBICPs-TP0ijZX",
"_score": 0.12496973,
"_source": {
"fact_embedding": [
...
],
"embedding_size": [
384.0
],
"description": "A large public park in the heart of New York City, offering a wide range of recreational activities.",
"title": "Central Park"
}
},
{
"_index": "nyc_facts",
"_id": "ir9X0ZMBICPs-TP0ijZX",
"_score": 0.114651985,
"_source": {
"fact_embedding": [
...
],
"embedding_size": [
384.0
],
"description": "Home to the New York Yankees, this baseball stadium is a historic landmark in the Bronx.",
"title": "Yankee Stadium"
}
},
{
"_index": "nyc_facts",
"_id": "j79X0ZMBICPs-TP0ijZX",
"_score": 0.110090025,
"_source": {
"fact_embedding": [
...
],
"embedding_size": [
384.0
],
"description": "A famous complex of commercial buildings in Manhattan, home to the NBC studios and the annual ice skating rink.",
"title": "Rockefeller Center"
}
}
]
}
}
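To make the rewrite performed by the search pipeline concrete, the following sketch imitates it in Python: it reads the question text from the path named in input_map (query.term.fact_embedding.value), obtains an embedding, and substitutes it for ${query_embedding} in the query template. This is an illustration of the data flow only, not the processor's actual implementation; get_by_path and rewrite_query are hypothetical names, and the embed argument stands in for the model call.

```python
import json

# Query template copied from the search pipeline definition.
QUERY_TEMPLATE = (
    '{"size": 3,"query": {"knn": {"fact_embedding": '
    '{"vector": ${query_embedding},"k": 4}}}}'
)


def get_by_path(document, dotted_path):
    """Follow a dot-separated path of keys through nested dictionaries."""
    value = document
    for key in dotted_path.split("."):
        value = value[key]
    return value


def rewrite_query(search_request, embed, input_path="query.term.fact_embedding.value"):
    """Turn the term query carrying the question text into a knn query."""
    query_text = get_by_path(search_request, input_path)
    embedding = embed(query_text)  # stand-in for the text_embedding model call
    rendered = QUERY_TEMPLATE.replace("${query_embedding}", json.dumps(embedding))
    return json.loads(rendered)


request = {"query": {"term": {"fact_embedding": {"value": "What are some places for sports in NYC?", "boost": 1}}}}
# A fake two-dimensional embedding keeps the example short; the real model returns 384 values.
print(rewrite_query(request, embed=lambda text: [0.1, 0.2]))
```

The template asks for the four nearest neighbors (k is 4) but returns only three hits (size is 3), which matches the three documents in the response above.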
References
- Wang, Liang, et al. (2024). Multilingual E5 Text Embeddings: A Technical Report. arXiv preprint arXiv:2402.05672.