
Nested field search

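Using nested fields in a vector index, you can store multiple vectors in a single document. For example, if your document consists of several components, you can generate a vector value for each component and store each vector in a nested field.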

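Vector search operates at the field level. For a document with nested fields, OpenSearch examines only the vector nearest to the query vector to decide whether to include the document in the results. For example, consider an index containing documents A and B. Document A is represented by vectors A1 and A2, and document B is represented by vector B1. Further, the similarity order for a query Q is A1, A2, B1. If you search using query Q with a k value of 2, the search returns both documents A and B instead of only document A.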
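The document-level behavior described above can be modeled with a short Python sketch. This is a conceptual illustration only, not how OpenSearch implements the search internally: the `NestedVector` tuple, the `top_k_documents` function, and the distance values are hypothetical and chosen only to reproduce the similarity order A1, A2, B1.

```python
from collections import namedtuple

# Hypothetical record: one nested vector and its distance to the query Q.
NestedVector = namedtuple("NestedVector", ["doc_id", "name", "distance"])


def top_k_documents(nested_vectors, k):
    """Return up to k distinct parent document IDs, nearest first.

    Each parent document is represented only by its nearest nested vector,
    so several close vectors in one document occupy a single result slot.
    """
    best = {}
    for v in nested_vectors:
        if v.doc_id not in best or v.distance < best[v.doc_id]:
            best[v.doc_id] = v.distance
    ranked = sorted(best, key=lambda doc_id: best[doc_id])
    return ranked[:k]


# Made-up distances that give the similarity order A1, A2, B1.
vectors = [
    NestedVector("A", "A1", 0.1),
    NestedVector("A", "A2", 0.2),
    NestedVector("B", "B1", 0.3),
]

print(top_k_documents(vectors, k=2))
```

With k set to 2, the sketch returns both A and B, because A1 and A2 collapse into a single entry for document A.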

Note that in the case of an approximate search, the results are approximations and not exact matches.

Nested field vector search is supported by the HNSW algorithm for the Lucene and Faiss engines.

Indexing and searching nested fields

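To use vector search with nested fields, you must create a vector index by setting index.knn to true. Create a nested field by setting its type to nested, and specify one or more fields of the knn_vector data type within the nested field. In this example, the knn_vector field my_vector is nested inside the nested_field field: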

PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "nested_field": {
        "type": "nested",
        "properties": {
          "my_vector": {
            "type": "knn_vector",
            "dimension": 3,
            "space_type": "l2",
            "method": {
              "name": "hnsw",
              "engine": "lucene",
              "parameters": {
                "ef_construction": 100,
                "m": 16
              }
            }
          },
          "color": {
            "type": "text",
            "index": false
          }
        }
      }
    }
  }
}

After you create the index, add some data to it:

PUT _bulk?refresh=true
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{"nested_field":[{"my_vector":[1,1,1], "color": "blue"},{"my_vector":[2,2,2], "color": "yellow"},{"my_vector":[3,3,3], "color": "white"}]}
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{"nested_field":[{"my_vector":[10,10,10], "color": "red"},{"my_vector":[20,20,20], "color": "green"},{"my_vector":[30,30,30], "color": "black"}]}

Then run a vector search on the data by using the knn query type:

GET my-knn-index-1/_search
{
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [1,1,1],
            "k": 2
          }
        }
      }
    }
  }
}

Even though all three vectors nearest to the query vector are in document 1, the query returns both documents 1 and 2 because k is set to 2:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my-knn-index-1",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "nested_field": [
            {
              "my_vector": [
                1,
                1,
                1
              ],
              "color": "blue"
            },
            {
              "my_vector": [
                2,
                2,
                2
              ],
              "color": "yellow"
            },
            {
              "my_vector": [
                3,
                3,
                3
              ],
              "color": "white"
            }
          ]
        }
      },
      {
        "_index": "my-knn-index-1",
        "_id": "2",
        "_score": 0.0040983604,
        "_source": {
          "nested_field": [
            {
              "my_vector": [
                10,
                10,
                10
              ],
              "color": "red"
            },
            {
              "my_vector": [
                20,
                20,
                20
              ],
              "color": "green"
            },
            {
              "my_vector": [
                30,
                30,
                30
              ],
              "color": "black"
            }
          ]
        }
      }
    ]
  }
}
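The `_score` values in this response can be checked by hand. The following Python sketch assumes the conversion commonly documented for the l2 space, score = 1 / (1 + d), where d is the squared Euclidean distance between the query vector and the stored vector. The `l2_score` helper is a hypothetical name used only for illustration.

```python
def l2_score(query, vector):
    """Convert squared Euclidean distance into a similarity score.

    Assumes score = 1 / (1 + d), with d the squared L2 distance.
    """
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


query = [1, 1, 1]

# Nearest nested vector of document 1: distance 0, so the score is 1.0.
print(l2_score(query, [1, 1, 1]))

# Nearest nested vector of document 2: squared distance 3 * 9^2 = 243,
# so the score is 1 / 244, roughly 0.0041.
print(l2_score(query, [10, 10, 10]))
```

Under this assumption, each document's score equals the score of its nearest nested vector, which agrees with the 1.0 and 0.0040983604 shown in the response (the latter is a single-precision rendering of 1/244).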

Inner hits

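When you retrieve documents based on matches in nested fields, by default, the response does not contain information about which inner objects matched the query. Thus, it is not clear why the document is a match. To include information about the matching nested fields in the response, you can provide the inner_hits object in your query. To return only certain fields of the matching documents within inner_hits, specify the document fields in the fields array. Generally, you should also exclude _source from the results to avoid returning the entire document. The following example returns only the color inner field of nested_field: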

GET my-knn-index-1/_search
{
  "_source": false,
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [1,1,1],
            "k": 2
          }
        }
      },
      "inner_hits": {
        "_source": false,
        "fields":["nested_field.color"]
      }
    }
  }
}

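The response contains the matching documents. For each matching document, the inner_hits object contains only the nested_field.color field of the matching nested documents, returned in the fields array: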

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my-knn-index-1",
        "_id": "1",
        "_score": 1.0,
        "inner_hits": {
          "nested_field": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 1.0,
              "hits": [
                {
                  "_index": "my-knn-index-1",
                  "_id": "1",
                  "_nested": {
                    "field": "nested_field",
                    "offset": 0
                  },
                  "_score": 1.0,
                  "fields": {
                    "nested_field.color": [
                      "blue"
                    ]
                  }
                }
              ]
            }
          }
        }
      },
      {
        "_index": "my-knn-index-1",
        "_id": "2",
        "_score": 0.0040983604,
        "inner_hits": {
          "nested_field": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 0.0040983604,
              "hits": [
                {
                  "_index": "my-knn-index-1",
                  "_id": "2",
                  "_nested": {
                    "field": "nested_field",
                    "offset": 0
                  },
                  "_score": 0.0040983604,
                  "fields": {
                    "nested_field.color": [
                      "red"
                    ]
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
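Client code usually needs to flatten this structure to see which nested object matched in each document. The following Python sketch walks a response shaped like the one above. The `matched_nested_objects` function name is hypothetical, and `sample_response` is a trimmed copy of the response containing only the keys the function reads.

```python
def matched_nested_objects(response, path="nested_field",
                           field="nested_field.color"):
    """Return (doc_id, offset, score, field_values) for every inner hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        inner = (hit.get("inner_hits", {})
                    .get(path, {})
                    .get("hits", {})
                    .get("hits", []))
        for nested_hit in inner:
            rows.append((
                hit["_id"],
                nested_hit["_nested"]["offset"],  # position in the nested array
                nested_hit["_score"],
                nested_hit.get("fields", {}).get(field, []),
            ))
    return rows


# Trimmed copy of the response shown above.
sample_response = {
    "hits": {"hits": [
        {"_id": "1", "inner_hits": {"nested_field": {"hits": {"hits": [
            {"_nested": {"field": "nested_field", "offset": 0},
             "_score": 1.0,
             "fields": {"nested_field.color": ["blue"]}}]}}}},
        {"_id": "2", "inner_hits": {"nested_field": {"hits": {"hits": [
            {"_nested": {"field": "nested_field", "offset": 0},
             "_score": 0.0040983604,
             "fields": {"nested_field.color": ["red"]}}]}}}},
    ]}
}

for row in matched_nested_objects(sample_response):
    print(row)
```

The `offset` value identifies the position of the matching object within the nested_field array of the parent document, so offset 0 in document 1 is the object with color blue.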

Retrieving all nested hits

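By default, only the highest-scoring nested document is considered when you query nested fields. To retrieve the scores for all nested field documents within each parent document, set expand_nested_docs to true in your query. In that case, the parent document's score is calculated as the average of the nested documents' scores. To use the highest score among the nested field documents as the parent document's score instead, set score_mode to max: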

GET my-knn-index-1/_search
{
  "_source": false,
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [1,1,1],
            "k": 2,
            "expand_nested_docs": true
          }
        }
      },
      "inner_hits": {
        "_source": false,
        "fields":["nested_field.color"]
      },
      "score_mode": "max"
    }
  }
}

The response contains all matching documents:

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my-knn-index-1",
        "_id": "1",
        "_score": 1.0,
        "inner_hits": {
          "nested_field": {
            "hits": {
              "total": {
                "value": 3,
                "relation": "eq"
              },
              "max_score": 1.0,
              "hits": [
                {
                  "_index": "my-knn-index-1",
                  "_id": "1",
                  "_nested": {
                    "field": "nested_field",
                    "offset": 0
                  },
                  "_score": 1.0,
                  "fields": {
                    "nested_field.color": [
                      "blue"
                    ]
                  }
                },
                {
                  "_index": "my-knn-index-1",
                  "_id": "1",
                  "_nested": {
                    "field": "nested_field",
                    "offset": 1
                  },
                  "_score": 0.25,
                  "fields": {
                    "nested_field.color": [
                      "yellow"
                    ]
                  }
                },
                {
                  "_index": "my-knn-index-1",
                  "_id": "1",
                  "_nested": {
                    "field": "nested_field",
                    "offset": 2
                  },
                  "_score": 0.07692308,
                  "fields": {
                    "nested_field.color": [
                      "white"
                    ]
                  }
                }
              ]
            }
          }
        }
      },
      {
        "_index": "my-knn-index-1",
        "_id": "2",
        "_score": 0.0040983604,
        "inner_hits": {
          "nested_field": {
            "hits": {
              "total": {
                "value": 3,
                "relation": "eq"
              },
              "max_score": 0.0040983604,
              "hits": [
                {
                  "_index": "my-knn-index-1",
                  "_id": "2",
                  "_nested": {
                    "field": "nested_field",
                    "offset": 0
                  },
                  "_score": 0.0040983604,
                  "fields": {
                    "nested_field.color": [
                      "red"
                    ]
                  }
                },
                {
                  "_index": "my-knn-index-1",
                  "_id": "2",
                  "_nested": {
                    "field": "nested_field",
                    "offset": 1
                  },
                  "_score": 9.2250924E-4,
                  "fields": {
                    "nested_field.color": [
                      "green"
                    ]
                  }
                },
                {
                  "_index": "my-knn-index-1",
                  "_id": "2",
                  "_nested": {
                    "field": "nested_field",
                    "offset": 2
                  },
                  "_score": 3.9619653E-4,
                  "fields": {
                    "nested_field.color": [
                      "black"
                    ]
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
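To see how score_mode affects the parent score once all nested documents are expanded, the following Python sketch combines the three inner hit scores reported for document 1. The `parent_score` helper is hypothetical and models only the two modes discussed here, avg and max.

```python
def parent_score(nested_scores, score_mode="avg"):
    """Combine nested document scores into a parent document score."""
    if not nested_scores:
        raise ValueError("at least one nested score is required")
    if score_mode == "max":
        return max(nested_scores)
    if score_mode == "avg":
        return sum(nested_scores) / len(nested_scores)
    raise ValueError("unsupported score_mode: " + score_mode)


# Inner hit scores of document 1, copied from the response above.
doc1_scores = [1.0, 0.25, 0.07692308]

print(parent_score(doc1_scores, "max"))  # highest nested score
print(parent_score(doc1_scores, "avg"))  # mean of the three nested scores
```

With max, document 1 keeps the score of its nearest nested vector, 1.0, which matches the `_score` in the response. With avg, the same document would score roughly 0.44, because the two more distant vectors pull the mean down.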

Vector search with filtering on nested fields

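You can apply a filter to a vector search with nested fields. A filter can be applied to either a top-level field or a field inside a nested field.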

The following example applies a filter to a top-level field.

First, create a vector index with a nested field:

PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "nested_field": {
        "type": "nested",
        "properties": {
          "my_vector": {
            "type": "knn_vector",
            "dimension": 3,
            "space_type": "l2",
            "method": {
              "name": "hnsw",
              "engine": "lucene",
              "parameters": {
                "ef_construction": 100,
                "m": 16
              }
            }
          }
        }
      }
    }
  }
}

After you create the index, add some data to it:

PUT _bulk?refresh=true
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{"parking": false, "nested_field":[{"my_vector":[1,1,1]},{"my_vector":[2,2,2]},{"my_vector":[3,3,3]}]}
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{"parking": true, "nested_field":[{"my_vector":[10,10,10]},{"my_vector":[20,20,20]},{"my_vector":[30,30,30]}]}
{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
{"parking": true, "nested_field":[{"my_vector":[100,100,100]},{"my_vector":[200,200,200]},{"my_vector":[300,300,300]}]}

Then run a vector search on the data by using the knn query type with a filter. The following query returns documents whose parking field is set to true:

GET my-knn-index-1/_search
{
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [
              1,
              1,
              1
            ],
            "k": 3,
            "filter": {
              "term": {
                "parking": true
              }
            }
          }
        }
      }
    }
  }
}

Even though all three vectors nearest to the query vector are in document 1, the query returns documents 2 and 3 because document 1 is filtered out:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.0040983604,
    "hits": [
      {
        "_index": "my-knn-index-1",
        "_id": "2",
        "_score": 0.0040983604,
        "_source": {
          "parking": true,
          "nested_field": [
            {
              "my_vector": [
                10,
                10,
                10
              ]
            },
            {
              "my_vector": [
                20,
                20,
                20
              ]
            },
            {
              "my_vector": [
                30,
                30,
                30
              ]
            }
          ]
        }
      },
      {
        "_index": "my-knn-index-1",
        "_id": "3",
        "_score": 3.400898E-5,
        "_source": {
          "parking": true,
          "nested_field": [
            {
              "my_vector": [
                100,
                100,
                100
              ]
            },
            {
              "my_vector": [
                200,
                200,
                200
              ]
            },
            {
              "my_vector": [
                300,
                300,
                300
              ]
            }
          ]
        }
      }
    ]
  }
}
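The filtered result can be reproduced with a small exact (brute-force) model in Python. This is a sketch under stated assumptions, not a description of how OpenSearch executes filtered searches: it assumes the l2 score conversion 1 / (1 + squared distance), and the `filtered_nested_search` and `l2_score` names are hypothetical.

```python
def l2_score(query, vector):
    """Assumed l2 conversion: 1 / (1 + squared Euclidean distance)."""
    return 1.0 / (1.0 + sum((q - v) ** 2 for q, v in zip(query, vector)))


def filtered_nested_search(docs, query, k, predicate):
    """Exact search: filter parents, score each by its nearest nested vector."""
    scored = []
    for doc_id, doc in docs.items():
        if not predicate(doc):
            continue  # the filter removes the whole parent document
        best = max(l2_score(query, obj["my_vector"])
                   for obj in doc["nested_field"])
        scored.append((doc_id, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# The same three documents that were indexed above.
docs = {
    "1": {"parking": False, "nested_field": [
        {"my_vector": [1, 1, 1]}, {"my_vector": [2, 2, 2]},
        {"my_vector": [3, 3, 3]}]},
    "2": {"parking": True, "nested_field": [
        {"my_vector": [10, 10, 10]}, {"my_vector": [20, 20, 20]},
        {"my_vector": [30, 30, 30]}]},
    "3": {"parking": True, "nested_field": [
        {"my_vector": [100, 100, 100]}, {"my_vector": [200, 200, 200]},
        {"my_vector": [300, 300, 300]}]},
}

results = filtered_nested_search(docs, [1, 1, 1], k=3,
                                 predicate=lambda d: d["parking"] is True)
for doc_id, score in results:
    print(doc_id, score)
```

Only two documents pass the filter, so the search returns two hits even though k is 3. Their scores, 1/244 and 1/29404, agree with the 0.0040983604 and 3.400898E-5 shown in the response.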