
Semantic search using byte-quantized vectors

This tutorial shows you how to build semantic search using the Cohere Embed model and byte-quantized vectors. For more information about using byte-quantized vectors, see Semantic search using byte vectors.

The Cohere Embed v3 model supports several embedding_types. In this tutorial, you'll use the INT8 type to encode byte-quantized vectors.

The Cohere Embed v3 model supports several input types. This tutorial uses the following input types:

  • search_document: Use this input type when you have text (in the form of documents) that you want to store in a vector database.
  • search_query: Use this input type when structuring search queries to find the most relevant documents in the vector database.

For more information about input types, see the Cohere documentation.
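
For reference, the request_body template used in the connectors you'll create in this tutorial expands to a Cohere /v1/embed request similar to the following (the text value is only an illustration):

{
  "model": "embed-english-v3.0",
  "texts": ["OpenSearch is a search engine."],
  "input_type": "search_document",
  "embedding_types": ["int8"]
}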

In this tutorial, you'll create two models:

  • A model used for ingestion, whose input_type is search_document
  • A model used for search, whose input_type is search_query

Replace the placeholders beginning with the prefix your_ with your own values.

Step 1: Create an embedding model for ingestion

Create a connector for the Cohere model, specifying the search_document input type:

POST /_plugins/_ml/connectors/_create
{
    "name": "Cohere embedding connector with int8 embedding type for ingestion",
    "description": "Test connector for Cohere embedding model",
    "version": 1,
    "protocol": "http",
    "credential": {
        "cohere_key": "your_cohere_api_key"
    },
    "parameters": {
        "model": "embed-english-v3.0",
        "embedding_types": ["int8"],
        "input_type": "search_document"
    },
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "headers": {
                "Authorization": "Bearer ${credential.cohere_key}",
                "Request-Source": "unspecified:opensearch"
            },
            "url": "https://api.cohere.ai/v1/embed",
            "request_body": "{ \"model\": \"${parameters.model}\", \"texts\": ${parameters.texts}, \"input_type\":\"${parameters.input_type}\", \"embedding_types\": ${parameters.embedding_types} }",
            "pre_process_function": "connector.pre_process.cohere.embedding",
            "post_process_function": "\n    def name = \"sentence_embedding\";\n    def data_type = \"FLOAT32\";\n    def result;\n    if (params.embeddings.int8 != null) {\n      data_type = \"INT8\";\n      result = params.embeddings.int8;\n    } else if (params.embeddings.uint8 != null) {\n      data_type = \"UINT8\";\n      result = params.embeddings.uint8;\n    } else if (params.embeddings.float != null) {\n      data_type = \"FLOAT32\";\n      result = params.embeddings.float;\n    }\n    \n    if (result == null) {\n      return \"Invalid embedding result\";\n    }\n    \n    def embedding_list = new StringBuilder(\"[\");\n    \n    for (int m=0; m<result.length; m++) {\n      def embedding_size = result[m].length;\n      def embedding = new StringBuilder(\"[\");\n      def shape = [embedding_size];\n      for (int i=0; i<embedding_size; i++) {\n        def val;\n        if (\"FLOAT32\".equals(data_type)) {\n          val = result[m][i].floatValue();\n        } else if (\"INT8\".equals(data_type) || \"UINT8\".equals(data_type)) {\n          val = result[m][i].intValue();\n        }\n        embedding.append(val);\n        if (i < embedding_size - 1) {\n          embedding.append(\",\");  \n        }\n      }\n      embedding.append(\"]\");  \n      \n      // workaround for compatible with neural-search\n      def dummy_data_type = 'FLOAT32';\n      \n      def json = '{' +\n                   '\"name\":\"' + name + '\",' +\n                   '\"data_type\":\"' + dummy_data_type + '\",' +\n                   '\"shape\":' + shape + ',' +\n                   '\"data\":' + embedding +\n                   '}';\n      embedding_list.append(json);\n      if (m < result.length - 1) {\n        embedding_list.append(\",\");  \n      }\n    }\n    embedding_list.append(\"]\");  \n    return embedding_list.toString();\n    "
        }
    ]
}

To ensure compatibility with OpenSearch, data_type (the output of the inference_results.output.data_type field in the response) must be set to FLOAT32 in the post-processing function, even though the actual embedding type will be INT8.
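
For readability, the post_process_function above is shown below as an unescaped Painless script. It converts the int8 (or uint8 or float) embeddings returned by the Cohere API into the response format expected by OpenSearch, labeling the data type as FLOAT32 for the reason described above:

// Same logic as the escaped post_process_function above, shown here for readability.
def name = "sentence_embedding";
def data_type = "FLOAT32";
def result;
// Pick whichever embedding type Cohere returned.
if (params.embeddings.int8 != null) {
  data_type = "INT8";
  result = params.embeddings.int8;
} else if (params.embeddings.uint8 != null) {
  data_type = "UINT8";
  result = params.embeddings.uint8;
} else if (params.embeddings.float != null) {
  data_type = "FLOAT32";
  result = params.embeddings.float;
}

if (result == null) {
  return "Invalid embedding result";
}

def embedding_list = new StringBuilder("[");

// Build one JSON object per input text.
for (int m = 0; m < result.length; m++) {
  def embedding_size = result[m].length;
  def embedding = new StringBuilder("[");
  def shape = [embedding_size];
  for (int i = 0; i < embedding_size; i++) {
    def val;
    if ("FLOAT32".equals(data_type)) {
      val = result[m][i].floatValue();
    } else if ("INT8".equals(data_type) || "UINT8".equals(data_type)) {
      val = result[m][i].intValue();
    }
    embedding.append(val);
    if (i < embedding_size - 1) {
      embedding.append(",");
    }
  }
  embedding.append("]");

  // Workaround for compatibility with the Neural Search plugin:
  // always report FLOAT32 even when the values are int8.
  def dummy_data_type = 'FLOAT32';

  def json = '{' +
               '"name":"' + name + '",' +
               '"data_type":"' + dummy_data_type + '",' +
               '"shape":' + shape + ',' +
               '"data":' + embedding +
               '}';
  embedding_list.append(json);
  if (m < result.length - 1) {
    embedding_list.append(",");
  }
}
embedding_list.append("]");
return embedding_list.toString();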

Note the connector ID in the response; you'll use it to register the model.
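
The create connector response typically looks like the following (the ID shown is a placeholder):

{
  "connector_id": "your_connector_id"
}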

Register the model, providing its connector ID:

POST /_plugins/_ml/models/_register?deploy=true
{
    "name": "Cohere embedding model for INT8 with search_document input type",
    "function_name": "remote",
    "description": "test model",
    "connector_id": "your_connector_id"
}

Note the model ID in the response; you'll use it in the following steps.
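
Because the model is registered with deploy=true, the response typically contains a task ID, a status, and the model ID (the values shown are placeholders). If needed, you can check the deployment task by calling the Tasks API:

{
  "task_id": "your_task_id",
  "status": "CREATED",
  "model_id": "your_embedding_model_id"
}

GET /_plugins/_ml/tasks/your_task_id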

Test the model, providing the model ID:

POST /_plugins/_ml/models/your_embedding_model_id/_predict
{
    "parameters": {
        "texts": ["hello", "goodbye"]
    }
}

The response contains inference results:

{
    "inference_results": [
        {
            "output": [
                {
                    "name": "sentence_embedding",
                    "data_type": "FLOAT32",
                    "shape": [
                        1024
                    ],
                    "data": [
                        20,
                        -11,
                        -60,
                        -91,
                        ...
                    ]
                },
                {
                    "name": "sentence_embedding",
                    "data_type": "FLOAT32",
                    "shape": [
                        1024
                    ],
                    "data": [
                        58,
                        -30,
                        9,
                        -51,
                        ...
                    ]
                }
            ],
            "status_code": 200
        }
    ]
}

Step 2: Ingest data

First, create an ingest pipeline:

PUT /_ingest/pipeline/pipeline-cohere
{
  "description": "Cohere embedding ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "your_embedding_model_id_created_in_step1",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}
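
Optionally, before creating the index, you can verify that the pipeline invokes the model correctly by using the Simulate Pipeline API (the sample text is arbitrary):

POST /_ingest/pipeline/pipeline-cohere/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "hello world"
      }
    }
  ]
}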

Next, create a vector index, setting the data_type of the passage_embedding field to byte so that it can store byte-quantized vectors:

PUT my_test_data
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100,
      "default_pipeline": "pipeline-cohere"
    }
  },
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "lucene",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}

Last, ingest the test data:

POST _bulk
{ "index" : { "_index" : "my_test_data" } }
{ "passage_text" : "OpenSearch is the flexible, scalable, open-source way to build solutions for data-intensive applications. Explore, enrich, and visualize your data with built-in performance, developer-friendly tools, and powerful integrations for machine learning, data processing, and more." }
{ "index" : { "_index" : "my_test_data"} }
{ "passage_text" : "BM25 is a keyword-based algorithm that performs well on queries containing keywords but fails to capture the semantic meaning of the query terms. Semantic search, unlike keyword-based search, takes into account the meaning of the query in the search context. Thus, semantic search performs well when a query requires natural language understanding." }

Step 3: Search the data

Create a connector to the embedding model, specifying the search_query input type:

POST /_plugins/_ml/connectors/_create
{
    "name": "Cohere embedding connector with int8 embedding type for search",
    "description": "Test connector for Cohere embedding model. Use this connector for search.",
    "version": 1,
    "protocol": "http",
    "credential": {
        "cohere_key": "your_cohere_api_key"
    },
    "parameters": {
        "model": "embed-english-v3.0",
        "embedding_types": ["int8"],
        "input_type": "search_query"
    },
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "headers": {
                "Authorization": "Bearer ${credential.cohere_key}",
                "Request-Source": "unspecified:opensearch"
            },
            "url": "https://api.cohere.ai/v1/embed",
            "request_body": "{ \"model\": \"${parameters.model}\", \"texts\": ${parameters.texts}, \"input_type\":\"${parameters.input_type}\", \"embedding_types\": ${parameters.embedding_types} }",
            "pre_process_function": "connector.pre_process.cohere.embedding",
            "post_process_function": "\n    def name = \"sentence_embedding\";\n    def data_type = \"FLOAT32\";\n    def result;\n    if (params.embeddings.int8 != null) {\n      data_type = \"INT8\";\n      result = params.embeddings.int8;\n    } else if (params.embeddings.uint8 != null) {\n      data_type = \"UINT8\";\n      result = params.embeddings.uint8;\n    } else if (params.embeddings.float != null) {\n      data_type = \"FLOAT32\";\n      result = params.embeddings.float;\n    }\n    \n    if (result == null) {\n      return \"Invalid embedding result\";\n    }\n    \n    def embedding_list = new StringBuilder(\"[\");\n    \n    for (int m=0; m<result.length; m++) {\n      def embedding_size = result[m].length;\n      def embedding = new StringBuilder(\"[\");\n      def shape = [embedding_size];\n      for (int i=0; i<embedding_size; i++) {\n        def val;\n        if (\"FLOAT32\".equals(data_type)) {\n          val = result[m][i].floatValue();\n        } else if (\"INT8\".equals(data_type) || \"UINT8\".equals(data_type)) {\n          val = result[m][i].intValue();\n        }\n        embedding.append(val);\n        if (i < embedding_size - 1) {\n          embedding.append(\",\");  \n        }\n      }\n      embedding.append(\"]\");  \n      \n      // workaround for compatible with neural-search\n      def dummy_data_type = 'FLOAT32';\n      \n      def json = '{' +\n                   '\"name\":\"' + name + '\",' +\n                   '\"data_type\":\"' + dummy_data_type + '\",' +\n                   '\"shape\":' + shape + ',' +\n                   '\"data\":' + embedding +\n                   '}';\n      embedding_list.append(json);\n      if (m < result.length - 1) {\n        embedding_list.append(\",\");  \n      }\n    }\n    embedding_list.append(\"]\");  \n    return embedding_list.toString();\n    "
        }
    ]
}

Note the connector ID in the response; you'll use it to register the model.

Register the model, providing its connector ID:

POST /_plugins/_ml/models/_register?deploy=true
{
    "name": "Cohere embedding model for INT8 with search_document input type",
    "function_name": "remote",
    "description": "test model",
    "connector_id": "your_connector_id"
}

Note the model ID in the response; you'll use it to run queries.

Run a vector search, providing the model ID:

POST /my_test_data/_search
{
  "query": {
    "neural": {
      "passage_embedding": {
        "query_text": "semantic search",
        "model_id": "your_embedding_model_id",
        "k": 100
      }
    }
  },
  "size": "1",
  "_source": ["passage_text"]
}

The response contains the query results:

{
  "took": 143,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 9.345969e-7,
    "hits": [
      {
        "_index": "my_test_data",
        "_id": "_IXCuY0BJr_OiKWden7i",
        "_score": 9.345969e-7,
        "_source": {
          "passage_text": "BM25 is a keyword-based algorithm that performs well on queries containing keywords but fails to capture the semantic meaning of the query terms. Semantic search, unlike keyword-based search, takes into account the meaning of the query in the search context. Thus, semantic search performs well when a query requires natural language understanding."
        }
      }
    ]
  }
}
