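Term vector
The term_vector mapping parameter controls whether term-level information is stored for individual text fields during indexing. This information includes term frequencies, positions, and character offsets, which support advanced features such as custom scoring and highlighting.
By default, term_vector is disabled. When it is enabled, term vectors are stored and can be retrieved using the _termvectors API.
Enabling term_vector increases index size. Use it only when you need detailed term-level data.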
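Configuration options
The term_vector parameter supports the following valid values:
no (default): Does not store term vectors.
yes: Stores the terms and their frequencies (the number of times each term occurs in a given document).
with_positions: Stores term positions, that is, the order in which terms occur in the field.
with_offsets: Stores character offsets, that is, the exact start and end character positions of each term in the field text.
with_positions_offsets: Stores both positions and offsets.
with_positions_payloads: Stores term positions along with payloads. A payload is optional custom metadata (for example, a tag or a numeric value) that can be attached to an individual term during indexing. Payloads are used in advanced scenarios such as custom scoring or tagging, but they can be set only by a special analyzer.
with_positions_offsets_payloads: Stores all term vector data.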
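The option names themselves encode what is stored, which can be made explicit with a small helper. The following Python sketch is illustrative only: describe_term_vector and VALID_OPTIONS are hypothetical names, not part of OpenSearch or any client library.

```python
# The seven values accepted by the term_vector mapping parameter.
VALID_OPTIONS = (
    "no",
    "yes",
    "with_positions",
    "with_offsets",
    "with_positions_offsets",
    "with_positions_payloads",
    "with_positions_offsets_payloads",
)


def describe_term_vector(option):
    """Return which components a term_vector value stores (hypothetical helper)."""
    if option not in VALID_OPTIONS:
        raise ValueError(f"unsupported term_vector value: {option!r}")
    # For "with_*" values, the name lists the extra components after the prefix.
    extras = option[len("with_"):].split("_") if option.startswith("with_") else []
    return {
        "stored": option != "no",
        "positions": "positions" in extras,
        "offsets": "offsets" in extras,
        "payloads": "payloads" in extras,
    }


print(describe_term_vector("with_positions_offsets"))
```

A check like this can be useful before creating an index, for example to confirm that a field intended for offset-based highlighting actually stores both positions and offsets.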
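Enabling term_vector on a field
The following request creates an index named articles in which the content field is configured to store term vectors, including positions and offsets: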
PUT /articles
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
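When the mapping is generated programmatically, the same request body can be built as a plain dictionary. This is a minimal sketch: build_term_vector_mapping is a hypothetical helper, and sending the body to the cluster (as the body of PUT /articles) is left to whichever HTTP or OpenSearch client you use.

```python
import json


def build_term_vector_mapping(field, term_vector="with_positions_offsets"):
    """Build an index-creation body with term vectors enabled on one text field."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "term_vector": term_vector,
                }
            }
        }
    }


body = build_term_vector_mapping("content")
# Serialize to JSON; this matches the body of the PUT /articles request above.
print(json.dumps(body, indent=2))
```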
Index a sample document:
PUT /articles/_doc/1
{
  "content": "OpenSearch is an open-source search and analytics suite."
}
Retrieve term-level statistics using the _termvectors API:
POST /articles/_termvectors/1
{
  "fields": ["content"],
  "term_statistics": true,
  "positions": true,
  "offsets": true
}
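The following response contains detailed term-level statistics for the content field of the document with ID 1, including term frequency, document frequency, token positions, and character offsets: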
{
  "_index": "articles",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 4,
  "term_vectors": {
    "content": {
      "field_statistics": {
        "sum_doc_freq": 9,
        "doc_count": 1,
        "sum_ttf": 9
      },
      "terms": {
        "an": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "start_offset": 14,
              "end_offset": 16
            }
          ]
        },
        "analytics": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 7,
              "start_offset": 40,
              "end_offset": 49
            }
          ]
        },
        "and": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 6,
              "start_offset": 36,
              "end_offset": 39
            }
          ]
        },
        "is": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 11,
              "end_offset": 13
            }
          ]
        },
        "open": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "start_offset": 17,
              "end_offset": 21
            }
          ]
        },
        "opensearch": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 10
            }
          ]
        },
        "search": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 5,
              "start_offset": 29,
              "end_offset": 35
            }
          ]
        },
        "source": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "start_offset": 22,
              "end_offset": 28
            }
          ]
        },
        "suite": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 8,
              "start_offset": 50,
              "end_offset": 55
            }
          ]
        }
      }
    }
  }
}
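Because the terms object is keyed by term rather than ordered by occurrence, client code usually has to sort by position to recover the original token sequence. The following Python sketch shows one way to do that on an already-parsed response; tokens_in_order is a hypothetical helper, and the sample dictionary is a trimmed copy of the response above (three of the nine terms).

```python
def tokens_in_order(term_vectors_response, field):
    """Flatten a _termvectors response into (position, term, start, end) tuples."""
    terms = term_vectors_response["term_vectors"][field]["terms"]
    tokens = []
    for term, info in terms.items():
        # "tokens" is present only when positions or offsets were requested.
        for tok in info.get("tokens", []):
            tokens.append(
                (tok["position"], term, tok["start_offset"], tok["end_offset"])
            )
    # Tuples sort by their first element, so this orders tokens by position.
    return sorted(tokens)


# Trimmed copy of the response shown above.
response = {
    "term_vectors": {
        "content": {
            "terms": {
                "an": {"term_freq": 1, "tokens": [
                    {"position": 2, "start_offset": 14, "end_offset": 16}]},
                "is": {"term_freq": 1, "tokens": [
                    {"position": 1, "start_offset": 11, "end_offset": 13}]},
                "opensearch": {"term_freq": 1, "tokens": [
                    {"position": 0, "start_offset": 0, "end_offset": 10}]},
            }
        }
    }
}

for position, term, start, end in tokens_in_order(response, "content"):
    print(position, term, start, end)
```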
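Highlighting using term vectors
Use the following request to search for the term "analytics" and highlight it using the term vectors stored for the field. Setting the highlighter type to fvh (fast vector highlighter) makes the highlighter read the stored term vectors; this type requires the field to be mapped with term_vector set to with_positions_offsets: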
POST /articles/_search
{
  "query": {
    "match": {
      "content": "analytics"
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "type": "fvh"
      }
    }
  }
}
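The following response shows one matching document in which the term "analytics" was found in the content field. The highlight section contains the matched term wrapped in <em> tags; the stored term vectors allow the highlighting to be performed efficiently and accurately: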
{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "articles",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "content": "OpenSearch is an open-source search and analytics suite."
        },
        "highlight": {
          "content": [
            "OpenSearch is an open-source search and <em>analytics</em> suite."
          ]
        }
      }
    ]
  }
}
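To see why stored offsets make this kind of highlighting cheap, consider the following Python sketch. It is not the highlighter's actual implementation, only an illustration of the idea: given the character offsets of a matched term (40 and 49 for "analytics" in the _termvectors response), the text can be wrapped in tags without analyzing it again. wrap_offsets is a hypothetical helper and assumes the spans do not overlap.

```python
def wrap_offsets(text, spans, pre="<em>", post="</em>"):
    """Wrap each (start, end) character span of text in pre/post tags.

    Assumes spans do not overlap; they are processed in ascending order.
    """
    parts = []
    cursor = 0
    for start, end in sorted(spans):
        parts.append(text[cursor:start])             # text before the match
        parts.append(pre + text[start:end] + post)   # the wrapped match
        cursor = end
    parts.append(text[cursor:])                      # text after the last match
    return "".join(parts)


text = "OpenSearch is an open-source search and analytics suite."
# start_offset and end_offset of "analytics" from the term vector.
highlighted = wrap_offsets(text, [(40, 49)])
print(highlighted)
# OpenSearch is an open-source search and <em>analytics</em> suite.
```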