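Keyword repeat token filter
The keyword_repeat token filter emits a keyword version of each token into the token stream. It is typically used when you want to keep both the original token and its modified version after further token transformations, such as stemming or synonym expansion. The duplicated tokens allow the original, unchanged token to remain in the final analysis alongside its modified version.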
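Place the keyword_repeat token filter before any stemming filter. Stemming does not change every token, so after stemming you may end up with duplicate tokens at the same position. To remove these duplicates, use the remove_duplicates token filter after the stemmer.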
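To see the duplicates described above without creating an index, one option is to pass the filter chain inline to the _analyze API. The following is a minimal sketch rather than part of the original example: the built-in lowercase and kstem filters and the sample text are chosen here for illustration, on the assumption that kstem leaves the word many unchanged.

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "keyword_repeat", "kstem"],
  "text": "many"
}
```

If the stemmer does not alter the token, the response should list the same token twice at position 0, which is the case the remove_duplicates filter is meant to clean up.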
Example
The following example request creates a new index named my_index and configures an analyzer that includes the keyword_repeat filter:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_kstem": {
          "type": "kstem"
        },
        "my_lowercase": {
          "type": "lowercase"
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "my_lowercase",
            "keyword_repeat",
            "my_kstem"
          ]
        }
      }
    }
  }
}
Generated tokens
Use the following request to examine the tokens generated using the analyzer:
POST /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Stopped quickly"
}
The response contains the generated tokens:
{
  "tokens": [
    {
      "token": "stopped",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "stop",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "quickly",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 8,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
You can further examine the effect of the keyword_repeat token filter by adding the following parameters to the _analyze query:
POST /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Stopped quickly",
  "explain": true,
  "attributes": "keyword"
}
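The response includes detailed information, such as the tokenization, the filtering, and the application of specific token filters: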
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "standard",
      "tokens": [
        {"token": "Stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0},
        {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1}
      ]
    },
    "tokenfilters": [
      {
        "name": "my_lowercase",
        "tokens": [
          {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0},
          {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1}
        ]
      },
      {
        "name": "keyword_repeat",
        "tokens": [
          {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": true},
          {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": false},
          {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": true},
          {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": false}
        ]
      },
      {
        "name": "my_kstem",
        "tokens": [
          {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": true},
          {"token": "stop","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": false},
          {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": true},
          {"token": "quick","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": false}
        ]
      }
    ]
  }
}
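As noted at the start of this page, tokens that the stemmer leaves unchanged end up duplicated at the same position, and the remove_duplicates filter can be placed after the stemmer to drop them. The following is a minimal sketch of such a configuration, not a definitive setup. The index name my_dedup_index, the analyzer name my_dedup_analyzer, and the sample text are hypothetical and chosen only for illustration. The lines starting with # are annotations; omit them if your client does not accept comments.

```json
# Hypothetical index and analyzer names, used for illustration only
PUT /my_dedup_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_dedup_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "kstem",
            "remove_duplicates"
          ]
        }
      }
    }
  }
}

# Sample text is an assumption: "many" is expected to pass through kstem unchanged
POST /my_dedup_index/_analyze
{
  "analyzer": "my_dedup_analyzer",
  "text": "Stopped many"
}
```

With this ordering, keyword_repeat still runs before the stemmer, so a word that the stemmer changes keeps both its original and stemmed forms, while a word that the stemmer leaves alone should appear only once in the output.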