Nested field search
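Using nested fields in a vector index, you can store multiple vectors in a single document. For example, if your document consists of various components, you can generate a vector value for each component and store each vector in a nested field.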
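Vector search operates at the field level. For a document with nested fields, OpenSearch examines only the vector nearest to the query vector to decide whether to include the document in the results. For example, consider an index containing documents A and B. Document A is represented by vectors A1 and A2, and document B is represented by vector B1. Further, the vectors ordered by similarity to query Q are A1, A2, B1. If you search using query Q with a k value of 2, the search returns both documents A and B rather than only document A, because A1 and A2 belong to the same document and count only once toward k.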
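Note that in the case of approximate search, the results are approximations and not exact matches.
Nested field vector search is supported by the HNSW algorithm for the Lucene and Faiss engines.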
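The document-level behavior described above for query Q can be modeled with a short Python sketch. This is a brute-force illustration, not OpenSearch's implementation: the NestedVector type, the top_k_parents helper, and the distance values are all hypothetical and were chosen only to reproduce the ordering A1, A2, B1.

```python
from typing import NamedTuple


class NestedVector(NamedTuple):
    name: str        # label of the nested vector, for example "A1"
    parent: str      # ID of the document that owns the vector
    distance: float  # hypothetical distance to query Q (smaller is closer)


def top_k_parents(vectors, k):
    """Return up to k distinct parent IDs, visiting vectors nearest-first."""
    parents = []
    for vec in sorted(vectors, key=lambda v: v.distance):
        # A parent counts once, no matter how many of its vectors are close.
        if vec.parent not in parents:
            parents.append(vec.parent)
        if len(parents) == k:
            break
    return parents


vectors = [
    NestedVector("A1", "A", 0.1),
    NestedVector("A2", "A", 0.2),
    NestedVector("B1", "B", 0.3),
]

result = top_k_parents(vectors, k=2)
print(result)  # ['A', 'B']
```

A plain top-2 over vectors would stop at A1 and A2 and return only document A. Counting each parent once is what brings document B into the results.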
Indexing and searching nested fields
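To use vector search with nested fields, you must create a vector index by setting index.knn to true. Create a nested field by setting its type to nested, and specify one or more fields of the knn_vector data type within the nested field. In this example, the knn_vector field my_vector is nested inside the nested_field field: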
PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "nested_field": {
        "type": "nested",
        "properties": {
          "my_vector": {
            "type": "knn_vector",
            "dimension": 3,
            "space_type": "l2",
            "method": {
              "name": "hnsw",
              "engine": "lucene",
              "parameters": {
                "ef_construction": 100,
                "m": 16
              }
            }
          },
          "color": {
            "type": "text",
            "index": false
          }
        }
      }
    }
  }
}
After you create the index, add some data to it:
PUT _bulk?refresh=true
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{"nested_field":[{"my_vector":[1,1,1], "color": "blue"},{"my_vector":[2,2,2], "color": "yellow"},{"my_vector":[3,3,3], "color": "white"}]}
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{"nested_field":[{"my_vector":[10,10,10], "color": "red"},{"my_vector":[20,20,20], "color": "green"},{"my_vector":[30,30,30], "color": "black"}]}
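If you generate the documents in code, the same bulk body can be assembled programmatically. The following is a minimal sketch that only builds the newline-delimited JSON payload and does not send it anywhere. The build_bulk_body helper and the docs structure are hypothetical names used for illustration.

```python
import json

# Hypothetical in-memory form of the two example documents:
# document ID -> list of (vector, color) pairs, one pair per nested object.
docs = {
    "1": [([1, 1, 1], "blue"), ([2, 2, 2], "yellow"), ([3, 3, 3], "white")],
    "2": [([10, 10, 10], "red"), ([20, 20, 20], "green"), ([30, 30, 30], "black")],
}


def build_bulk_body(index_name, docs):
    """Build an NDJSON bulk body: one action line plus one source line per document."""
    lines = []
    for doc_id, nested in docs.items():
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(
            {"nested_field": [{"my_vector": vec, "color": color} for vec, color in nested]}
        ))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


body = build_bulk_body("my-knn-index-1", docs)
print(body)
```

Each nested object carries its own vector and color, so one parent document contributes three vectors to the index.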
Then run a vector search on the data using the knn query type:
GET my-knn-index-1/_search
{
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [1, 1, 1],
            "k": 2
          }
        }
      }
    }
  }
}
Even though all three vectors nearest to the query vector are in document 1, the query returns both documents 1 and 2 because k is set to 2:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my-knn-index-1",
"_id": "1",
"_score": 1.0,
"_source": {
"nested_field": [
{
"my_vector": [
1,
1,
1
],
"color": "blue"
},
{
"my_vector": [
2,
2,
2
],
"color": "yellow"
},
{
"my_vector": [
3,
3,
3
],
"color": "white"
}
]
}
},
{
"_index": "my-knn-index-1",
"_id": "2",
"_score": 0.0040983604,
"_source": {
"nested_field": [
{
"my_vector": [
10,
10,
10
],
"color": "red"
},
{
"my_vector": [
20,
20,
20
],
"color": "green"
},
{
"my_vector": [
30,
30,
30
],
"color": "black"
}
]
}
}
]
}
}
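The _score values in this response can be checked by hand. The sketch below assumes that, for the l2 space type, a score is derived from the squared Euclidean distance d² as 1 / (1 + d²), and that each document is scored by its nearest nested vector. The conversion formula is not stated on this page, so treat it as an assumption. The l2_score helper is a hypothetical name.

```python
def l2_score(query, vector):
    """Convert a squared L2 distance into a score: 1 / (1 + d^2) (assumed formula)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


query = [1, 1, 1]
doc1_vectors = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
doc2_vectors = [[10, 10, 10], [20, 20, 20], [30, 30, 30]]

# Each document is represented by its best-scoring (nearest) nested vector.
doc1_score = max(l2_score(query, v) for v in doc1_vectors)
doc2_score = max(l2_score(query, v) for v in doc2_vectors)

print(doc1_score)  # 1.0, because [1,1,1] is identical to the query
print(doc2_score)  # 1 / 244, which rounds to the 0.0040983604 shown in the response
```

For document 2, the nearest vector is [10,10,10], whose squared distance from the query is 3 × 9² = 243, giving 1 / 244.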
Inner hits
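When you retrieve documents based on matches in nested fields, by default, the response does not contain information about which inner objects matched the query. Thus, it is not clear why the document is a match. To include information about the matching nested fields in the response, you can provide the inner_hits object in your query. To return only certain fields of the matching documents within inner_hits, specify the document fields in the fields array. Generally, you should also exclude _source from the results to avoid returning the entire document. The following example returns only the color inner field of nested_field: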
GET my-knn-index-1/_search
{
  "_source": false,
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [1, 1, 1],
            "k": 2
          }
        }
      },
      "inner_hits": {
        "_source": false,
        "fields": ["nested_field.color"]
      }
    }
  }
}
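The response contains the matching documents. For each matching document, the inner_hits object contains only the nested_field.color field of the matching nested documents, returned in the fields array: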
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my-knn-index-1",
"_id": "1",
"_score": 1.0,
"inner_hits": {
"nested_field": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my-knn-index-1",
"_id": "1",
"_nested": {
"field": "nested_field",
"offset": 0
},
"_score": 1.0,
"fields": {
"nested_field.color": [
"blue"
]
}
}
]
}
}
}
},
{
"_index": "my-knn-index-1",
"_id": "2",
"_score": 0.0040983604,
"inner_hits": {
"nested_field": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.0040983604,
"hits": [
{
"_index": "my-knn-index-1",
"_id": "2",
"_nested": {
"field": "nested_field",
"offset": 0
},
"_score": 0.0040983604,
"fields": {
"nested_field.color": [
"red"
]
}
}
]
}
}
}
}
]
}
}
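On the client side, the inner_hits section can be flattened into one row per matching nested object. The following sketch works on a trimmed copy of the response above, parsed into a Python dictionary. The matched_nested helper is a hypothetical name, and the sketch assumes every top-level hit carries an inner_hits entry for the requested path.

```python
# Trimmed copy of the response above; only the keys the helper reads are kept.
response = {
    "hits": {
        "hits": [
            {
                "_id": "1",
                "inner_hits": {"nested_field": {"hits": {"hits": [
                    {"_nested": {"field": "nested_field", "offset": 0},
                     "_score": 1.0,
                     "fields": {"nested_field.color": ["blue"]}},
                ]}}},
            },
            {
                "_id": "2",
                "inner_hits": {"nested_field": {"hits": {"hits": [
                    {"_nested": {"field": "nested_field", "offset": 0},
                     "_score": 0.0040983604,
                     "fields": {"nested_field.color": ["red"]}},
                ]}}},
            },
        ]
    }
}


def matched_nested(response, path, field):
    """Return (parent ID, nested offset, first field value) for every inner hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        for inner in hit["inner_hits"][path]["hits"]["hits"]:
            rows.append((hit["_id"], inner["_nested"]["offset"], inner["fields"][field][0]))
    return rows


rows = matched_nested(response, "nested_field", "nested_field.color")
print(rows)  # [('1', 0, 'blue'), ('2', 0, 'red')]
```

The offset identifies the position of the matching object within the nested_field array of the parent document.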
Retrieving all nested hits
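By default, only the highest-scoring nested document is considered when you query nested fields. To retrieve the scores of all nested field documents within each parent document, set expand_nested_docs to true in your query. The parent document's score is then calculated as the average of the scores of its nested documents. To use the highest score among the nested field documents as the parent document's score instead, set score_mode to max: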
GET my-knn-index-1/_search
{
  "_source": false,
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [1, 1, 1],
            "k": 2,
            "expand_nested_docs": true
          }
        }
      },
      "inner_hits": {
        "_source": false,
        "fields": ["nested_field.color"]
      },
      "score_mode": "max"
    }
  }
}
The response contains all matching nested documents for each parent document:
{
"took": 13,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my-knn-index-1",
"_id": "1",
"_score": 1.0,
"inner_hits": {
"nested_field": {
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "my-knn-index-1",
"_id": "1",
"_nested": {
"field": "nested_field",
"offset": 0
},
"_score": 1.0,
"fields": {
"nested_field.color": [
"blue"
]
}
},
{
"_index": "my-knn-index-1",
"_id": "1",
"_nested": {
"field": "nested_field",
"offset": 1
},
"_score": 0.25,
"fields": {
"nested_field.color": [
"yellow"
]
}
},
{
"_index": "my-knn-index-1",
"_id": "1",
"_nested": {
"field": "nested_field",
"offset": 2
},
"_score": 0.07692308,
"fields": {
"nested_field.color": [
"white"
]
}
}
]
}
}
}
},
{
"_index": "my-knn-index-1",
"_id": "2",
"_score": 0.0040983604,
"inner_hits": {
"nested_field": {
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 0.0040983604,
"hits": [
{
"_index": "my-knn-index-1",
"_id": "2",
"_nested": {
"field": "nested_field",
"offset": 0
},
"_score": 0.0040983604,
"fields": {
"nested_field.color": [
"red"
]
}
},
{
"_index": "my-knn-index-1",
"_id": "2",
"_nested": {
"field": "nested_field",
"offset": 1
},
"_score": 9.2250924E-4,
"fields": {
"nested_field.color": [
"green"
]
}
},
{
"_index": "my-knn-index-1",
"_id": "2",
"_nested": {
"field": "nested_field",
"offset": 2
},
"_score": 3.9619653E-4,
"fields": {
"nested_field.color": [
"black"
]
}
}
]
}
}
}
}
]
}
}
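The effect of score_mode on the parent score can be illustrated with the nested scores from this response. The sketch below is a plain arithmetic model of the avg and max modes described above, not OpenSearch code. The parent_score helper is a hypothetical name.

```python
from statistics import mean

# Nested scores copied from the inner_hits section of the response above.
nested_scores = {
    "1": [1.0, 0.25, 0.07692308],
    "2": [0.0040983604, 9.2250924e-4, 3.9619653e-4],
}


def parent_score(scores, score_mode="avg"):
    """Combine nested scores into one parent score for the given mode."""
    if score_mode == "avg":
        return mean(scores)
    if score_mode == "max":
        return max(scores)
    raise ValueError(f"unsupported score_mode in this sketch: {score_mode}")


for doc_id, scores in nested_scores.items():
    print(doc_id, parent_score(scores, "avg"), parent_score(scores, "max"))
```

With max, document 1 keeps the score 1.0 shown in the response. With avg, the same document would score roughly 0.4423 because the two more distant vectors pull the average down.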
Vector search with filtering on nested fields
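You can apply a filter to a vector search with nested fields. A filter can be applied to either a top-level field or a field inside a nested field.
The following example applies a filter to a top-level field.
First, create a vector index with a nested field. If my-knn-index-1 still exists from the previous example, delete it first (DELETE my-knn-index-1); otherwise, the create request fails because the index already exists: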
PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "nested_field": {
        "type": "nested",
        "properties": {
          "my_vector": {
            "type": "knn_vector",
            "dimension": 3,
            "space_type": "l2",
            "method": {
              "name": "hnsw",
              "engine": "lucene",
              "parameters": {
                "ef_construction": 100,
                "m": 16
              }
            }
          }
        }
      }
    }
  }
}
After you create the index, add some data to it:
PUT _bulk?refresh=true
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{"parking": false, "nested_field":[{"my_vector":[1,1,1]},{"my_vector":[2,2,2]},{"my_vector":[3,3,3]}]}
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{"parking": true, "nested_field":[{"my_vector":[10,10,10]},{"my_vector":[20,20,20]},{"my_vector":[30,30,30]}]}
{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
{"parking": true, "nested_field":[{"my_vector":[100,100,100]},{"my_vector":[200,200,200]},{"my_vector":[300,300,300]}]}
Then run a vector search on the data using the knn query type with a filter. The following query returns documents whose parking field is set to true:
GET my-knn-index-1/_search
{
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [1, 1, 1],
            "k": 3,
            "filter": {
              "term": {
                "parking": true
              }
            }
          }
        }
      }
    }
  }
}
Even though all three vectors nearest to the query vector are in document 1, the query returns documents 2 and 3 because document 1 is filtered out:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.0040983604,
"hits": [
{
"_index": "my-knn-index-1",
"_id": "2",
"_score": 0.0040983604,
"_source": {
"parking": true,
"nested_field": [
{
"my_vector": [
10,
10,
10
]
},
{
"my_vector": [
20,
20,
20
]
},
{
"my_vector": [
30,
30,
30
]
}
]
}
},
{
"_index": "my-knn-index-1",
"_id": "3",
"_score": 3.400898E-5,
"_source": {
"parking": true,
"nested_field": [
{
"my_vector": [
100,
100,
100
]
},
{
"my_vector": [
200,
200,
200
]
},
{
"my_vector": [
300,
300,
300
]
}
]
}
}
]
}
}
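The filtered result can be reproduced with a small brute-force model. This is a sketch under stated assumptions, not OpenSearch's filtering strategy: it assumes the filter removes parent documents before ranking, that each remaining document is scored by its nearest nested vector, and that the l2 score is 1 / (1 + d²). The l2_score and filtered_search helpers are hypothetical names.

```python
def l2_score(query, vector):
    """Convert a squared L2 distance into a score: 1 / (1 + d^2) (assumed formula)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


# The three example documents: a top-level parking flag plus nested vectors.
docs = {
    "1": {"parking": False, "vectors": [[1, 1, 1], [2, 2, 2], [3, 3, 3]]},
    "2": {"parking": True, "vectors": [[10, 10, 10], [20, 20, 20], [30, 30, 30]]},
    "3": {"parking": True, "vectors": [[100, 100, 100], [200, 200, 200], [300, 300, 300]]},
}


def filtered_search(docs, query, k):
    """Keep documents with parking == True, then rank them by their best nested score."""
    candidates = []
    for doc_id, doc in docs.items():
        if not doc["parking"]:
            continue  # the top-level filter excludes the whole parent document
        best = max(l2_score(query, v) for v in doc["vectors"])
        candidates.append((doc_id, best))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


results = filtered_search(docs, [1, 1, 1], k=3)
print(results)
```

Only two documents pass the filter, so the result contains two hits even though k is 3. The computed scores agree with the 0.0040983604 and 3.400898E-5 values in the response.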