Kuromoji completion token filter
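The kuromoji_completion token filter is used to stem Katakana words in Japanese, which are often used to represent foreign or borrowed words. This filter is especially useful for autocompletion or suggest queries, because a partial match on a Katakana word can be expanded to include its full form. In the example response in this section, the filter emits romanized (Romaji) variants of each token alongside the original token.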
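To use this token filter, you must first install the analysis-kuromoji plugin on all nodes by running bin/opensearch-plugin install analysis-kuromoji and then restart the cluster. For more information about installing additional plugins, see Additional plugins.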
Example
The following example request creates a new index named kuromoji_sample and configures an analyzer with a kuromoji_completion filter:
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_completion"
          }
        }
      }
    }
  }
}
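If you create the index from a script rather than from a console, the same request can be built in code. The following is a minimal Python sketch using only the standard library. The helper name build_create_index_request and the http://localhost:9200 address are assumptions made for illustration (an unsecured local node), not part of the plugin or of OpenSearch. The sketch only builds the request; sending it is left as a commented-out step.

```python
import json
import urllib.request

# The same analysis settings as the request above, expressed as a Python dict.
INDEX_SETTINGS = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "kuromoji_tokenizer",
                        "filter": ["my_katakana_stemmer"],
                    }
                },
                "filter": {
                    "my_katakana_stemmer": {"type": "kuromoji_completion"}
                },
            }
        }
    }
}


def build_create_index_request(base_url, index_name, body):
    """Build (but do not send) a PUT request that creates the index.

    Hypothetical helper for illustration only.
    """
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumption: an unsecured node listening locally. Adjust the URL and add
# authentication or TLS settings to match your cluster.
request = build_create_index_request(
    "http://localhost:9200", "kuromoji_sample", INDEX_SETTINGS
)

# To send the request against a running cluster, uncomment the next line:
# urllib.request.urlopen(request)
```

Keeping the settings in a single dictionary makes it easy to reuse the same analyzer definition across several indexes.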
Generated tokens
Use the following request to examine the tokens that the analyzer generates for text that translates to "use a computer":
POST /kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "コンピューターを使う"
}
The response contains the generated tokens:
{
  "tokens": [
    {
      "token": "コンピューター", // The original Katakana word "computer".
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "konpyuーtaー", // Romanized version (Romaji) of "コンピューター".
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "konnpyuーtaー", // Another possible romanized version of "コンピューター" (with a slight variation in the spelling).
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "を", // A Japanese particle, "wo" or "o"
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "wo", // Romanized form of the particle "を" (often pronounced as "o").
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "o", // Another version of the romanization.
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "使う", // The verb "use" in Kanji.
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 2
    },
    {
      "token": "tukau", // Romanized version of "使う"
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 2
    },
    {
      "token": "tsukau", // Another romanized version of "使う", where "tsu" is more phonetically correct
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 2
    }
  ]
}
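To see why these variants help with autocompletion, it can be useful to simulate prefix matching over the tokens in the response. The following Python sketch is an illustration only, not how OpenSearch executes queries internally. The function names group_by_position and complete are hypothetical, and the sketch assumes that the first token emitted at each position is the original surface form, as in the response above.

```python
from collections import defaultdict

# (token, position) pairs copied from the response above.
TOKENS = [
    ("コンピューター", 0), ("konpyuーtaー", 0), ("konnpyuーtaー", 0),
    ("を", 1), ("wo", 1), ("o", 1),
    ("使う", 2), ("tukau", 2), ("tsukau", 2),
]


def group_by_position(tokens):
    """Map each position to its token variants, in the order they were emitted."""
    groups = defaultdict(list)
    for token, position in tokens:
        groups[position].append(token)
    return dict(groups)


def complete(prefix, tokens):
    """Return the original word at every position where a variant starts with prefix.

    Assumption: the first variant at each position is the original surface form.
    """
    matches = []
    for _position, variants in sorted(group_by_position(tokens).items()):
        if any(variant.startswith(prefix) for variant in variants):
            matches.append(variants[0])
    return matches


print(complete("konp", TOKENS))  # ['コンピューター']
print(complete("tsu", TOKENS))   # ['使う']
print(complete("コン", TOKENS))   # ['コンピューター']
```

Because every variant shares the position and offsets of the original token, a partial input typed in Romaji (such as "konp") and a partial input typed in Katakana (such as "コン") both resolve to the same original word.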