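Semantic search using byte-quantized vectors
This tutorial shows how to build semantic search using the Cohere Embed model and byte-quantized vectors. For more information about using byte-quantized vectors, see Byte vectors and Semantic search.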
The Cohere Embed v3 model supports several embedding_types. In this tutorial, you'll use the INT8 type to encode byte-quantized vectors.
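As a rough illustration of why the INT8 type is attractive, the following sketch compares raw vector storage for float and byte vectors. It assumes 4 bytes per float32 value and 1 byte per int8 value; the function name is hypothetical, and the figures ignore HNSW graph and other index overhead.

```python
def raw_vector_bytes(dimension: int, bytes_per_value: int, num_vectors: int = 1) -> int:
    """Return the storage needed for the vector values only (no index overhead)."""
    return dimension * bytes_per_value * num_vectors


DIMENSION = 1024  # vector dimension used by the index mapping in this tutorial

float32_bytes = raw_vector_bytes(DIMENSION, 4)  # 4 bytes per float32 value
int8_bytes = raw_vector_bytes(DIMENSION, 1)     # 1 byte per int8 value

print(f"float32: {float32_bytes} bytes per vector")
print(f"int8:    {int8_bytes} bytes per vector")
print(f"ratio:   {float32_bytes // int8_bytes}x")
```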
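The Cohere Embed v3 model supports several input types. This tutorial uses the following input types:
- search_document: Use this input type when you have text (in the form of documents) that you want to store in a vector database.
- search_query: Use this input type when structuring search queries to find the most relevant documents in a vector database.
For more information about input types, see the Cohere documentation.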
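In this tutorial, you'll create two models:
- A model used for ingestion, whose input_type is search_document
- A model used for search, whose input_type is search_query
Replace the placeholders beginning with the prefix your_ with your own values.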
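The two connectors created in this tutorial differ only in their input_type (plus the descriptive name). One way to keep them in sync is to generate both request bodies from a single helper, as in the minimal sketch below. The helper name build_connector_body is hypothetical, and the post-process script is passed in as a string rather than repeated here.

```python
import json

# Request template copied from the connector definition; ${...} placeholders
# are resolved by the connector at request time, not by Python.
REQUEST_BODY = (
    '{ "model": "${parameters.model}", "texts": ${parameters.texts}, '
    '"input_type":"${parameters.input_type}", '
    '"embedding_types": ${parameters.embedding_types} }'
)


def build_connector_body(input_type: str, post_process_function: str) -> dict:
    """Build a connector request body for the given Cohere input type."""
    if input_type not in ("search_document", "search_query"):
        raise ValueError(f"unsupported input_type for this tutorial: {input_type}")
    purpose = "ingestion" if input_type == "search_document" else "search"
    return {
        "name": f"Cohere embedding connector with int8 embedding type for {purpose}",
        "description": "Test connector for Cohere embedding model",
        "version": 1,
        "protocol": "http",
        "credential": {"cohere_key": "your_cohere_api_key"},
        "parameters": {
            "model": "embed-english-v3.0",
            "embedding_types": ["int8"],
            "input_type": input_type,
        },
        "actions": [
            {
                "action_type": "predict",
                "method": "POST",
                "headers": {
                    "Authorization": "Bearer ${credential.cohere_key}",
                    "Request-Source": "unspecified:opensearch",
                },
                "url": "https://api.cohere.ai/v1/embed",
                "request_body": REQUEST_BODY,
                "pre_process_function": "connector.pre_process.cohere.embedding",
                "post_process_function": post_process_function,
            }
        ],
    }


# Placeholder script text; substitute the full post-process function.
ingest_body = build_connector_body("search_document", "your_post_process_function")
search_body = build_connector_body("search_query", "your_post_process_function")
print(json.dumps(search_body["parameters"], indent=2))
```

Each resulting dictionary can be serialized with json.dumps and sent as the body of POST /_plugins/_ml/connectors/_create.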
Step 1: Create an embedding model for ingestion
Create a connector for the Cohere model, specifying the search_document input type:
POST /_plugins/_ml/connectors/_create
{
"name": "Cohere embedding connector with int8 embedding type for ingestion",
"description": "Test connector for Cohere embedding model",
"version": 1,
"protocol": "http",
"credential": {
"cohere_key": "your_cohere_api_key"
},
"parameters": {
"model": "embed-english-v3.0",
"embedding_types": ["int8"],
"input_type": "search_document"
},
"actions": [
{
"action_type": "predict",
"method": "POST",
"headers": {
"Authorization": "Bearer ${credential.cohere_key}",
"Request-Source": "unspecified:opensearch"
},
"url": "https://api.cohere.ai/v1/embed",
"request_body": "{ \"model\": \"${parameters.model}\", \"texts\": ${parameters.texts}, \"input_type\":\"${parameters.input_type}\", \"embedding_types\": ${parameters.embedding_types} }",
"pre_process_function": "connector.pre_process.cohere.embedding",
"post_process_function": "\n def name = \"sentence_embedding\";\n def data_type = \"FLOAT32\";\n def result;\n if (params.embeddings.int8 != null) {\n data_type = \"INT8\";\n result = params.embeddings.int8;\n } else if (params.embeddings.uint8 != null) {\n data_type = \"UINT8\";\n result = params.embeddings.uint8;\n } else if (params.embeddings.float != null) {\n data_type = \"FLOAT32\";\n result = params.embeddings.float;\n }\n \n if (result == null) {\n return \"Invalid embedding result\";\n }\n \n def embedding_list = new StringBuilder(\"[\");\n \n for (int m=0; m<result.length; m++) {\n def embedding_size = result[m].length;\n def embedding = new StringBuilder(\"[\");\n def shape = [embedding_size];\n for (int i=0; i<embedding_size; i++) {\n def val;\n if (\"FLOAT32\".equals(data_type)) {\n val = result[m][i].floatValue();\n } else if (\"INT8\".equals(data_type) || \"UINT8\".equals(data_type)) {\n val = result[m][i].intValue();\n }\n embedding.append(val);\n if (i < embedding_size - 1) {\n embedding.append(\",\"); \n }\n }\n embedding.append(\"]\"); \n \n // workaround for compatible with neural-search\n def dummy_data_type = 'FLOAT32';\n \n def json = '{' +\n '\"name\":\"' + name + '\",' +\n '\"data_type\":\"' + dummy_data_type + '\",' +\n '\"shape\":' + shape + ',' +\n '\"data\":' + embedding +\n '}';\n embedding_list.append(json);\n if (m < result.length - 1) {\n embedding_list.append(\",\"); \n }\n }\n embedding_list.append(\"]\"); \n return embedding_list.toString();\n "
}
]
}
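To ensure compatibility with OpenSearch, the post-process function must set data_type (output in the inference_results.output.data_type field of the response) to FLOAT32, even though the actual embedding type will be INT8.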
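The post-process function in the connector is written as an escaped one-line script, which makes it hard to read. The following Python sketch mirrors its logic for illustration only; it is not what runs inside OpenSearch. The function name is hypothetical, and, unlike the connector script (which returns the string "Invalid embedding result"), this sketch raises an exception when no embeddings are found.

```python
def to_opensearch_tensors(embeddings: dict) -> list:
    """Convert a Cohere-style embeddings object into the tensor list the connector emits."""
    # Pick the first available embedding type, in the same order as the connector script.
    if embeddings.get("int8") is not None:
        vectors, convert = embeddings["int8"], int
    elif embeddings.get("uint8") is not None:
        vectors, convert = embeddings["uint8"], int
    elif embeddings.get("float") is not None:
        vectors, convert = embeddings["float"], float
    else:
        raise ValueError("Invalid embedding result")

    return [
        {
            "name": "sentence_embedding",
            # Always reported as FLOAT32 for compatibility, even for int8 values.
            "data_type": "FLOAT32",
            "shape": [len(vector)],
            "data": [convert(value) for value in vector],
        }
        for vector in vectors
    ]


# Shortened 4-dimensional example vectors, for readability only.
sample = {"int8": [[20, -11, -60, -91], [58, -30, 9, -51]]}
for tensor in to_opensearch_tensors(sample):
    print(tensor)
```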
Note the connector ID in the response; you'll use it to register the model.
Register the model, providing its connector ID:
POST /_plugins/_ml/models/_register?deploy=true
{
"name": "Cohere embedding model for INT8 with search_document input type",
"function_name": "remote",
"description": "test model",
"connector_id": "your_connector_id"
}
Note the model ID in the response; you'll use it in the following steps.
Test the model, providing the model ID:
POST /_plugins/_ml/models/your_embedding_model_id/_predict
{
"parameters": {
"texts": ["hello", "goodbye"]
}
}
The response contains inference results:
{
"inference_results": [
{
"output": [
{
"name": "sentence_embedding",
"data_type": "FLOAT32",
"shape": [
1024
],
"data": [
20,
-11,
-60,
-91,
...
]
},
{
"name": "sentence_embedding",
"data_type": "FLOAT32",
"shape": [
1024
],
"data": [
58,
-30,
9,
-51,
...
]
}
],
"status_code": 200
}
]
}
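Because the response labels the vectors as FLOAT32, it can be worth confirming that the values really fit in a signed byte before indexing them into a byte field. The sketch below assumes the response has already been parsed into a Python dictionary with the shape shown above; the function name is hypothetical, and the sample vectors are shortened to four dimensions.

```python
BYTE_MIN, BYTE_MAX = -128, 127  # signed byte range


def extract_byte_vectors(response: dict) -> list:
    """Return all embeddings in a predict response, checking shape and byte range."""
    vectors = []
    for result in response["inference_results"]:
        for tensor in result["output"]:
            data = tensor["data"]
            if len(data) != tensor["shape"][0]:
                raise ValueError("shape does not match the number of values")
            if any(not BYTE_MIN <= value <= BYTE_MAX for value in data):
                raise ValueError("value outside the signed byte range")
            vectors.append([int(value) for value in data])
    return vectors


sample_response = {
    "inference_results": [
        {
            "output": [
                {"name": "sentence_embedding", "data_type": "FLOAT32",
                 "shape": [4], "data": [20, -11, -60, -91]},
                {"name": "sentence_embedding", "data_type": "FLOAT32",
                 "shape": [4], "data": [58, -30, 9, -51]},
            ],
            "status_code": 200,
        }
    ]
}

print(extract_byte_vectors(sample_response))
```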
Step 2: Ingest data
First, create an ingest pipeline:
PUT /_ingest/pipeline/pipeline-cohere
{
"description": "Cohere embedding ingest pipeline",
"processors": [
{
"text_embedding": {
"model_id": "your_embedding_model_id_created_in_step1",
"field_map": {
"passage_text": "passage_embedding"
}
}
}
]
}
Next, create a vector index, setting the data_type of the passage_embedding field to byte so that it can store byte-quantized vectors:
PUT my_test_data
{
"settings": {
"index": {
"knn": true,
"knn.algo_param.ef_search": 100,
"default_pipeline": "pipeline-cohere"
}
},
"mappings": {
"properties": {
"passage_text": {
"type": "text"
},
"passage_embedding": {
"type": "knn_vector",
"dimension": 1024,
"data_type": "byte",
"method": {
"name": "hnsw",
"space_type": "l2",
"engine": "lucene",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
}
}
}
}
Last, ingest test data:
POST _bulk
{ "index" : { "_index" : "my_test_data" } }
{ "passage_text" : "OpenSearch is the flexible, scalable, open-source way to build solutions for data-intensive applications. Explore, enrich, and visualize your data with built-in performance, developer-friendly tools, and powerful integrations for machine learning, data processing, and more." }
{ "index" : { "_index" : "my_test_data"} }
{ "passage_text" : "BM25 is a keyword-based algorithm that performs well on queries containing keywords but fails to capture the semantic meaning of the query terms. Semantic search, unlike keyword-based search, takes into account the meaning of the query in the search context. Thus, semantic search performs well when a query requires natural language understanding." }
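For more than a handful of documents, the bulk body is usually generated rather than written by hand. The following minimal sketch builds the newline-delimited body for the _bulk request from a list of passages. The function name is hypothetical; the index and field names are the ones used in this tutorial.

```python
import json


def build_bulk_body(index_name: str, passages: list) -> str:
    """Build an NDJSON bulk body: one action line and one document line per passage."""
    lines = []
    for passage in passages:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps({"passage_text": passage}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "my_test_data",
    ["First example passage.", "Second example passage."],
)
print(body, end="")
```

Because the index sets default_pipeline to pipeline-cohere, documents sent this way pass through the text_embedding processor without naming the pipeline in the request.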
Step 3: Configure semantic search
Create a connector to the embedding model with the search_query input type:
POST /_plugins/_ml/connectors/_create
{
"name": "Cohere embedding connector with int8 embedding type for search",
"description": "Test connector for Cohere embedding model. Use this connector for search.",
"version": 1,
"protocol": "http",
"credential": {
"cohere_key": "your_cohere_api_key"
},
"parameters": {
"model": "embed-english-v3.0",
"embedding_types": ["int8"],
"input_type": "search_query"
},
"actions": [
{
"action_type": "predict",
"method": "POST",
"headers": {
"Authorization": "Bearer ${credential.cohere_key}",
"Request-Source": "unspecified:opensearch"
},
"url": "https://api.cohere.ai/v1/embed",
"request_body": "{ \"model\": \"${parameters.model}\", \"texts\": ${parameters.texts}, \"input_type\":\"${parameters.input_type}\", \"embedding_types\": ${parameters.embedding_types} }",
"pre_process_function": "connector.pre_process.cohere.embedding",
"post_process_function": "\n def name = \"sentence_embedding\";\n def data_type = \"FLOAT32\";\n def result;\n if (params.embeddings.int8 != null) {\n data_type = \"INT8\";\n result = params.embeddings.int8;\n } else if (params.embeddings.uint8 != null) {\n data_type = \"UINT8\";\n result = params.embeddings.uint8;\n } else if (params.embeddings.float != null) {\n data_type = \"FLOAT32\";\n result = params.embeddings.float;\n }\n \n if (result == null) {\n return \"Invalid embedding result\";\n }\n \n def embedding_list = new StringBuilder(\"[\");\n \n for (int m=0; m<result.length; m++) {\n def embedding_size = result[m].length;\n def embedding = new StringBuilder(\"[\");\n def shape = [embedding_size];\n for (int i=0; i<embedding_size; i++) {\n def val;\n if (\"FLOAT32\".equals(data_type)) {\n val = result[m][i].floatValue();\n } else if (\"INT8\".equals(data_type) || \"UINT8\".equals(data_type)) {\n val = result[m][i].intValue();\n }\n embedding.append(val);\n if (i < embedding_size - 1) {\n embedding.append(\",\"); \n }\n }\n embedding.append(\"]\"); \n \n // workaround for compatible with neural-search\n def dummy_data_type = 'FLOAT32';\n \n def json = '{' +\n '\"name\":\"' + name + '\",' +\n '\"data_type\":\"' + dummy_data_type + '\",' +\n '\"shape\":' + shape + ',' +\n '\"data\":' + embedding +\n '}';\n embedding_list.append(json);\n if (m < result.length - 1) {\n embedding_list.append(\",\"); \n }\n }\n embedding_list.append(\"]\"); \n return embedding_list.toString();\n "
}
]
}
Note the connector ID in the response; you'll use it to register the model.
Register the model, providing its connector ID:
POST /_plugins/_ml/models/_register?deploy=true
{
"name": "Cohere embedding model for INT8 with search_query input type",
"function_name": "remote",
"description": "test model",
"connector_id": "your_connector_id"
}
Note the model ID in the response; you'll use it to run queries.
Run a vector search, providing the model ID:
POST /my_test_data/_search
{
"query": {
"neural": {
"passage_embedding": {
"query_text": "semantic search",
"model_id": "your_embedding_model_id",
"k": 100
}
}
},
"size": 1,
"_source": ["passage_text"]
}
The response contains the query results:
{
"took": 143,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 9.345969e-7,
"hits": [
{
"_index": "my_test_data",
"_id": "_IXCuY0BJr_OiKWden7i",
"_score": 9.345969e-7,
"_source": {
"passage_text": "BM25 is a keyword-based algorithm that performs well on queries containing keywords but fails to capture the semantic meaning of the query terms. Semantic search, unlike keyword-based search, takes into account the meaning of the query in the search context. Thus, semantic search performs well when a query requires natural language understanding."
}
}
]
}
}
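The very small _score is expected with byte vectors and the l2 space type. Assuming the k-NN plugin's documented scoring for l2, score = 1 / (1 + d), where d is the squared Euclidean distance, integer-valued vectors with 1024 dimensions produce large distances and therefore tiny scores. The sketch below inverts that formula to recover the distance from a score; the function name is hypothetical.

```python
import math


def squared_l2_from_score(score: float) -> float:
    """Invert score = 1 / (1 + d) to recover the squared L2 distance d."""
    if not 0.0 < score <= 1.0:
        raise ValueError("score must be in the interval (0, 1]")
    return 1.0 / score - 1.0


score = 9.345969e-7  # _score from the response above
squared_distance = squared_l2_from_score(score)
print(f"squared L2 distance: {squared_distance:.0f}")
print(f"L2 distance:         {math.sqrt(squared_distance):.1f}")
```

Scores are only meaningful relative to one another within the same query, so a low absolute value does not by itself indicate a poor match.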