Composite aggregation
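The composite aggregation creates buckets based on one or more document fields or sources. It creates one bucket for every combination of individual source values. By default, combinations in which one or more of the individual fields is missing a value are omitted from the results.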
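Each source has one of four aggregation types:
- The terms type groups by unique (typically String) values.
- The histogram type groups numeric values into buckets of a specified width.
- The date_histogram type groups by date or time ranges of a specified width.
- The geotile_grid type groups geopoints into a grid with a specified resolution.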
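The composite aggregation works by combining its source keys into buckets. The resulting buckets are ordered both across sources and within each source: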
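- Across sources: Buckets are nested in the order in which the sources are listed in the aggregation request.
- Within a source: The order of the values in each source determines the bucket order for that source. The ordering is alphabetical, numerical, date-time, or geotile, depending on the source type.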
Consider the following fields from an index of marathon participants:
{... "city": "Albuquerque", "place": "Bronze" ...}
{... "city": "Boston", ...}
{... "city": "Chicago", "place": "Bronze" ...}
{... "city": "Albuquerque", "place": "Gold" ...}
{... "city": "Chicago", "place": "Silver" ...}
{... "city": "Boston", "place": "Bronze" ...}
{... "city": "Chicago", "place": "Gold" ...}
Suppose the request specifies the sources as follows:
...
"sources": [
{ "marathon_city": { "terms": { "field": "city" }}},
{ "participant_medal": { "terms": { "field": "place" }}}
],
...
You must assign each source a unique key name.
The resulting composite contains the following buckets, in order:
{ "city": "Albuquerque", "place": "Bronze" }
{ "city": "Albuquerque", "place": "Gold" }
{ "city": "Boston", "place": "Bronze" }
{ "city": "Chicago", "place": "Bronze" }
{ "city": "Chicago", "place": "Gold" }
{ "city": "Chicago", "place": "Silver" }
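Note that both the city and place fields are sorted alphabetically. The Boston document that has no place value does not produce a bucket because combinations with missing values are omitted by default.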
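The ordering rule described above can be reproduced outside OpenSearch. The following Python sketch is only an illustration of that rule, not how OpenSearch implements it: it assumes single-valued string fields, and the helper name composite_keys is hypothetical.

```python
# Sample documents from the marathon example; one Boston document has no "place".
docs = [
    {"city": "Albuquerque", "place": "Bronze"},
    {"city": "Boston"},
    {"city": "Chicago", "place": "Bronze"},
    {"city": "Albuquerque", "place": "Gold"},
    {"city": "Chicago", "place": "Silver"},
    {"city": "Boston", "place": "Bronze"},
    {"city": "Chicago", "place": "Gold"},
]


def composite_keys(docs, fields):
    """Return the unique value combinations for `fields`, in bucket order.

    Documents missing any of the fields are skipped, which mirrors the
    default behavior (missing_bucket set to false).
    """
    keys = set()
    for doc in docs:
        if all(field in doc for field in fields):
            keys.add(tuple(doc[field] for field in fields))
    # Sorting tuples compares the first source first, then the second,
    # which gives the "nested" order: across sources, then within a source.
    return [dict(zip(fields, key)) for key in sorted(keys)]


buckets = composite_keys(docs, ["city", "place"])
for bucket in buckets:
    print(bucket)
```

Swapping the field order to ["place", "city"] would nest cities inside medals instead, which is the across-source ordering at work.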
Parameters
The composite aggregation accepts the following parameters:
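Parameter | Required/Optional | Data type | Description |
---|---|---|---|
sources | Required | Array | An array of source objects. Valid types are terms, histogram, date_histogram, and geotile_grid. |
size | Optional | Numeric | The number of composite buckets to return in the result. Default is 10. See Paginating composite results. |
after | Optional | Object | A key that specifies where to resume displaying paginated composite buckets. See Paginating composite results. |
order | Optional | String | For each source, whether to order values in ascending or descending order. Valid values are asc and desc. Default is asc. |
missing_bucket | Optional | Boolean | For each source, whether to include documents that are missing a value. Default is false. If set to true, OpenSearch includes these documents and uses null as the key for the field. Null values sort first in ascending order. |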
For aggregation-specific parameters, see the documentation for the corresponding aggregation.
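Because order and missing_bucket apply per source, they are set inside each source definition rather than at the top level of the composite object. The following Python sketch builds such a request body for the marathon example. The aggregation name marathon_buckets and the helper build_sources are hypothetical, and the body is only printed, not sent to a cluster.

```python
import json


def build_sources(order="asc", missing_bucket=False):
    """Build the marathon sources with per-source order and missing_bucket."""
    return [
        {"marathon_city": {"terms": {
            "field": "city", "order": order, "missing_bucket": missing_bucket}}},
        {"participant_medal": {"terms": {
            "field": "place", "order": order, "missing_bucket": missing_bucket}}},
    ]


# Descending order, and keep combinations in which a field has no value.
body = {
    "size": 0,
    "aggs": {
        "marathon_buckets": {  # hypothetical aggregation name
            "composite": {
                "sources": build_sources(order="desc", missing_bucket=True),
                "size": 10,
            }
        }
    },
}

print(json.dumps(body, indent=2))
```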
Terms
Use the terms source to aggregate string or Boolean data. For more information, see Terms aggregation.
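You can use the terms source to create composite buckets for any type of data. However, because the terms source creates a bucket for every unique value, you typically use the histogram source for numeric data instead.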
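The following example request returns the first 4 composite buckets for day of the week and customer gender in the OpenSearch Dashboards sample e-commerce data: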
GET opensearch_dashboards_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"composite_buckets": {
"composite": {
"sources": [
{ "day": { "terms": { "field": "day_of_week" }}},
{ "gender": { "terms": { "field": "customer_gender" }}}
],
"size": 4
}
}
}
}
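Because the dataset in this example contains valid data for every bucket, the aggregation produces a bucket for every combination of gender and day of the week, for a total of 14 buckets (7 days × 2 genders).
Because the request specifies a size of 4, the response contains the first four composite buckets. Because the sources are of type terms, the buckets are sorted in ascending alphabetical order both across sources and within each source: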
{
"took": 51,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4675,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"composite_buckets": {
"after_key": {
"day": "Monday",
"gender": "MALE"
},
"buckets": [
{
"key": {
"day": "Friday",
"gender": "FEMALE"
},
"doc_count": 399
},
{
"key": {
"day": "Friday",
"gender": "MALE"
},
"doc_count": 371
},
{
"key": {
"day": "Monday",
"gender": "FEMALE"
},
"doc_count": 320
},
{
"key": {
"day": "Monday",
"gender": "MALE"
},
"doc_count": 259
}
]
}
}
}
You can use the after_key returned in the response to view more results. See the example in the next section.
Histogram
Use the histogram source to create composite aggregations of numeric data. For more information, see Histogram aggregation.
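For histogram sources, the name used in each composite bucket key is the lowest value in that key's histogram interval. Each source histogram interval contains values in the range [lower_bound, lower_bound + interval). For sources sorted in ascending order, the first interval is the one that contains the lowest value in the source field.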
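The interval rule above amounts to rounding each value down to a multiple of the interval. The following Python sketch illustrates the arithmetic under the assumption that no offset is applied; the function name histogram_key is hypothetical.

```python
import math


def histogram_key(value, interval):
    """Return the inclusive lower bound of the interval that contains `value`.

    Assumes intervals start at multiples of `interval` (no offset), so the
    interval is [key, key + interval).
    """
    return math.floor(value / interval) * interval


# With an interval of 50, every price in [100, 150) shares the key 100.
for price in (5.98, 49.99, 50, 149.99, 150):
    print(price, "->", histogram_key(price, 50))
```

Under this rule a value exactly on a boundary, such as 150, starts a new interval rather than closing the previous one.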
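The following example request returns the first 6 composite buckets for quantity and base unit price in the OpenSearch Dashboards sample e-commerce data, with bucket widths of 1 and 50, respectively: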
GET opensearch_dashboards_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"composite_buckets": {
"composite": {
"sources": [
{ "quantity": { "histogram": { "field": "products.quantity", "interval": 1 }}},
{ "unit_price": { "histogram": { "field": "products.base_unit_price", "interval": 50 }}}
],
"size": 6
}
}
}
}
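The aggregation returns the first 6 bucket keys and document counts for the two histogram sources. As in the terms example, the buckets are ordered both across source fields and within each source. In this case, however, the order is numeric and is based on the inclusive lower bound of each histogram interval: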
{
"took": 11,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4675,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"composite_buckets": {
"after_key": {
"quantity": 2,
"unit_price": 150
},
"buckets": [
{
"key": {
"quantity": 1,
"unit_price": 0
},
"doc_count": 17691
},
{
"key": {
"quantity": 1,
"unit_price": 50
},
"doc_count": 5014
},
{
"key": {
"quantity": 1,
"unit_price": 100
},
"doc_count": 482
},
{
"key": {
"quantity": 1,
"unit_price": 150
},
"doc_count": 148
},
{
"key": {
"quantity": 1,
"unit_price": 200
},
"doc_count": 32
},
{
"key": {
"quantity": 2,
"unit_price": 150
},
"doc_count": 4
}
]
}
}
}
The bucket key for each field is the lower bound of the field's interval. For example, the unit_price key of the first composite bucket is 0.
To retrieve the next 6 buckets, supply the after_key object from the response in the after parameter, as follows:
GET opensearch_dashboards_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"composite_buckets": {
"composite": {
"sources": [
{ "quantity": { "histogram": { "field": "products.quantity", "interval": 1 }}},
{ "unit_price": { "histogram": { "field": "products.base_unit_price", "interval": 50 }}}
],
"size": 6,
"after": {
"quantity": 2,
"unit_price": 150
}
}
}
}
}
Only two buckets remain:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4675,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"composite_buckets": {
"after_key": {
"quantity": 2,
"unit_price": 500
},
"buckets": [
{
"key": {
"quantity": 2,
"unit_price": 200
},
"doc_count": 8
},
{
"key": {
"quantity": 2,
"unit_price": 500
},
"doc_count": 4
}
]
}
}
}
Date histogram
To create composite aggregations of date ranges, use the date_histogram source. For more information, see Date histogram aggregation.
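OpenSearch represents dates, including date_histogram bucket keys, as long integers that give the number of milliseconds since the Unix epoch. You can format the date output using the format parameter. Formatting does not change the order of the keys.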
OpenSearch stores date-times in UTC. You can use the time_zone parameter to display the output in a different time zone.
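To make the relationship between the raw key and the formatted key concrete, the following Python sketch converts an epoch-millisecond value to a date string in UTC and at a fixed UTC offset. It only illustrates the conversion and is not OpenSearch code: the helper format_key is hypothetical, and Python strftime patterns such as %Y-%m-%d differ from the yyyy-MM-dd style patterns used in the format parameter.

```python
from datetime import datetime, timedelta, timezone


def format_key(epoch_millis, fmt="%Y-%m-%d", tz=timezone.utc):
    """Format a millisecond epoch key as a date string in the given time zone."""
    return datetime.fromtimestamp(epoch_millis / 1000, tz=tz).strftime(fmt)


key = 1740009600000  # 2025-02-20T00:00:00Z expressed in epoch milliseconds

print(format_key(key))                                    # rendered in UTC
print(format_key(key, tz=timezone(timedelta(hours=-8))))  # rendered at UTC-08:00
print(format_key(key, fmt="%Y"))                          # year only
```

The same instant falls on a different calendar day at UTC-08:00, which is why the chosen time zone matters when reading daily keys.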
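The following example request returns the first 4 composite buckets for the creation year and sale date of each product sold in the OpenSearch Dashboards sample e-commerce data, with bucket widths of 1 year and 1 day, respectively: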
GET opensearch_dashboards_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"composite_buckets": {
"composite": {
"sources": [
{ "product_creation_date": { "date_histogram": { "field": "products.created_on", "calendar_interval": "1y", "format": "yyyy" }}},
{ "order_date": { "date_histogram": { "field": "order_date", "calendar_interval": "1d", "format": "yyyy-MM-dd" }}}
],
"size": 4
}
}
}
}
The aggregation returns the formatted date-based bucket keys and counts. For date_histogram sources, values are ordered by date:
{
"took": 21,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4675,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"composite_buckets": {
"after_key": {
"product_creation_date": "2016",
"order_date": "2025-02-23"
},
"buckets": [
{
"key": {
"product_creation_date": "2016",
"order_date": "2025-02-20"
},
"doc_count": 146
},
{
"key": {
"product_creation_date": "2016",
"order_date": "2025-02-21"
},
"doc_count": 153
},
{
"key": {
"product_creation_date": "2016",
"order_date": "2025-02-22"
},
"doc_count": 143
},
{
"key": {
"product_creation_date": "2016",
"order_date": "2025-02-23"
},
"doc_count": 140
}
]
}
}
}
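Geotile grid
Use the geotile_grid source to aggregate geo_point values into buckets that represent map tiles. As with other composite aggregation sources, by default the results include only buckets that contain data. For more information, see Geotile grid aggregation.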
Each cell corresponds to a map tile. Cell labels use the {zoom}/{x}/{y} format.
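The {zoom}/{x}/{y} label can be derived from a latitude and longitude. The following Python sketch assumes the standard Web Mercator tiling scheme used by most web maps and is an illustration rather than OpenSearch's own code. The function name geotile_key is hypothetical, and the coordinates are arbitrary examples, not values taken from the sample data.

```python
import math


def geotile_key(lat, lon, zoom):
    """Return the {zoom}/{x}/{y} label of the tile containing (lat, lon).

    Assumes standard Web Mercator tiling: 2**zoom tiles per axis, x counted
    eastward from longitude -180 and y counted southward from the top edge.
    """
    tiles = 2 ** zoom
    x = math.floor((lon + 180.0) / 360.0 * tiles)
    lat_rad = math.radians(lat)
    y = math.floor((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * tiles)
    # Clamp so that points on the outer edges still map to a valid tile.
    x = min(max(x, 0), tiles - 1)
    y = min(max(y, 0), tiles - 1)
    return f"{zoom}/{x}/{y}"


print(geotile_key(40.7128, -74.0060, 8))  # a point in New York City
print(geotile_key(0.0, 0.0, 1))           # the equator at the prime meridian
```

Each additional zoom level doubles the number of tiles per axis, so higher precision values produce smaller cells and potentially many more buckets.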
The following example request returns the first 6 tiles that contain locations from the geoip.location field at a precision of 8:
GET opensearch_dashboards_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"composite_buckets": {
"composite": {
"sources": [
{ "tile": { "geotile_grid": { "field": "geoip.location", "precision": 8 } } }
],
"size": 6
}
}
}
}
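The aggregation returns the matching geotiles and point counts. Only four tiles contain data, so the response has four buckets even though the requested size is 6: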
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4675,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"composite_buckets": {
"after_key": {
"tile": "8/122/104"
},
"buckets": [
{
"key": {
"tile": "8/43/102"
},
"doc_count": 310
},
{
"key": {
"tile": "8/75/96"
},
"doc_count": 896
},
{
"key": {
"tile": "8/75/124"
},
"doc_count": 178
},
{
"key": {
"tile": "8/122/104"
},
"doc_count": 408
}
]
}
}
}
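Combining sources
You can combine two or more sources of different types in a single composite aggregation.
The following example request returns buckets composed of three different source types: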
GET opensearch_dashboards_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"composite_buckets": {
"composite": {
"sources": [
{ "order_date": { "date_histogram": { "field": "order_date", "calendar_interval": "1M", "format": "yyyy-MM" }}},
{ "gender": { "terms": { "field": "customer_gender" }}},
{ "unit_price": { "histogram": { "field": "products.base_unit_price", "interval": 200 }}}
],
"size": 10
}
}
}
}
The aggregation returns the mixed-type composite buckets and document counts:
{
"took": 11,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4675,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"composite_buckets": {
"after_key": {
"order_date": "2025-03",
"gender": "MALE",
"unit_price": 200
},
"buckets": [
{
"key": {
"order_date": "2025-02",
"gender": "FEMALE",
"unit_price": 0
},
"doc_count": 1517
},
{
"key": {
"order_date": "2025-02",
"gender": "MALE",
"unit_price": 0
},
"doc_count": 1369
},
{
"key": {
"order_date": "2025-02",
"gender": "MALE",
"unit_price": 200
},
"doc_count": 6
},
{
"key": {
"order_date": "2025-02",
"gender": "MALE",
"unit_price": 400
},
"doc_count": 1
},
{
"key": {
"order_date": "2025-03",
"gender": "FEMALE",
"unit_price": 0
},
"doc_count": 3656
},
{
"key": {
"order_date": "2025-03",
"gender": "FEMALE",
"unit_price": 200
},
"doc_count": 1
},
{
"key": {
"order_date": "2025-03",
"gender": "MALE",
"unit_price": 0
},
"doc_count": 3530
},
{
"key": {
"order_date": "2025-03",
"gender": "MALE",
"unit_price": 200
},
"doc_count": 7
}
]
}
}
}
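Subaggregations
Composite aggregations are most useful when combined with subaggregations, which reveal information about the documents in each composite bucket.
The following example request compares average spending by gender for each day of the week in the OpenSearch Dashboards sample e-commerce data: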
GET opensearch_dashboards_sample_data_ecommerce/_search
{
"size": 0,
"aggs": {
"composite_buckets": {
"composite": {
"sources": [
{ "weekday": { "terms": { "field": "day_of_week" }}},
{ "gender": { "terms": { "field": "customer_gender" }}}
],
"size": 6
},
"aggs": {
"avg_spend": {
"avg": { "field": "taxful_total_price" }
}
}
}
}
}
The aggregation returns the average taxful_total_price for the first 6 buckets:
{
"took": 30,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4675,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"composite_buckets": {
"after_key": {
"weekday": "Saturday",
"gender": "MALE"
},
"buckets": [
{
"key": {
"weekday": "Friday",
"gender": "FEMALE"
},
"doc_count": 399,
"avg_spend": {
"value": 71.7733395989975
}
},
{
"key": {
"weekday": "Friday",
"gender": "MALE"
},
"doc_count": 371,
"avg_spend": {
"value": 79.72514108827494
}
},
{
"key": {
"weekday": "Monday",
"gender": "FEMALE"
},
"doc_count": 320,
"avg_spend": {
"value": 72.1588623046875
}
},
{
"key": {
"weekday": "Monday",
"gender": "MALE"
},
"doc_count": 259,
"avg_spend": {
"value": 86.1754946911197
}
},
{
"key": {
"weekday": "Saturday",
"gender": "FEMALE"
},
"doc_count": 365,
"avg_spend": {
"value": 73.53236301369863
}
},
{
"key": {
"weekday": "Saturday",
"gender": "MALE"
},
"doc_count": 371,
"avg_spend": {
"value": 72.78092360175202
}
}
]
}
}
}
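Paginating composite results
If a request produces more than size buckets, only size buckets are returned. In that case, the result includes an after_key object that marks where the next page of buckets begins. To retrieve the next size buckets, send the request again and supply the after_key in the after parameter. For an example, see the requests in Histogram.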
Always use the after_key, rather than copying the key of the last bucket, to continue a paginated response. The two are sometimes different.
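The pagination procedure can be wrapped in a small loop. The following Python sketch is one possible way to do it, not an official client helper: iter_composite_buckets is a hypothetical function, and search stands for any callable that accepts a request body and returns the response as a dictionary (for example, a thin wrapper around a client's search call).

```python
import copy


def iter_composite_buckets(search, body, agg_name):
    """Yield every bucket of a composite aggregation, page by page.

    `search` is any callable that takes a request body (dict) and returns
    the parsed response (dict). The caller's `body` is not modified.
    """
    body = copy.deepcopy(body)
    composite = body["aggs"][agg_name]["composite"]
    while True:
        response = search(body)
        agg = response["aggregations"][agg_name]
        buckets = agg.get("buckets", [])
        if not buckets:
            break  # an empty page means there is nothing left to read
        yield from buckets
        after_key = agg.get("after_key")
        if after_key is None:
            break
        # Continue from the returned after_key, not from the last bucket's key.
        composite["after"] = after_key
```

With the opensearch-py client, for instance, search could be a lambda that forwards the body to client.search for the target index; that wiring is an assumption about your setup and is left out so the sketch runs without a cluster.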
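Improving performance with index sorting
To speed up composite aggregations on large datasets, you can sort the index using the same fields and order as the aggregation sources. When index.sort.field and index.sort.order match the source fields and order used in the composite aggregation, OpenSearch can return results more efficiently and with less memory. Index sorting adds a small amount of overhead during indexing, but the query performance gain for composite aggregations can be significant.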
The following example request sets the sort fields and sort order for the my-sorted-index index:
PUT /my-sorted-index
{
"settings": {
"index": {
"sort.field": ["customer_id", "timestamp"],
"sort.order": ["asc", "desc"]
}
},
"mappings": {
"properties": {
"customer_id": {
"type": "keyword"
},
"timestamp": {
"type": "date"
},
"price": {
"type": "double"
}
}
}
}
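The following request runs a composite aggregation on the my-sorted-index index. Because the index is sorted by customer_id in ascending order and by timestamp in descending order, and the aggregation sources match that sort order, this query runs faster and with less memory pressure: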
GET /my-sorted-index/_search
{
"size": 0,
"aggs": {
"my_buckets": {
"composite": {
"size": 1000,
"sources": [
{ "customer": { "terms": { "field": "customer_id", "order": "asc" } } },
{ "time": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } }
]
}
}
}
}