# Creating a custom analyzer

To create a custom analyzer, specify a combination of the following components:

- Character filters (zero or more)
- Tokenizer (one)
- Token filters (zero or more)
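These three components always run in a fixed order: character filters transform the raw text, the single tokenizer splits the filtered text into tokens, and token filters transform the resulting token stream. A minimal sketch of that ordering in plain Python (the filter implementations here are illustrative stand-ins, not OpenSearch internals):

```python
def char_filter_mapping(text):
    # Character filter: operates on the raw text before tokenization
    # (here, a mapping-style filter replacing underscores with spaces).
    return text.replace("_", " ")

def tokenizer_whitespace(text):
    # Tokenizer: splits the filtered text into tokens (exactly one per analyzer).
    return text.split()

def token_filter_lowercase(tokens):
    # Token filter: operates on the token stream after tokenization.
    return [t.lower() for t in tokens]

def analyze(text):
    # Components run in order: char filters -> tokenizer -> token filters.
    text = char_filter_mapping(text)
    tokens = tokenizer_whitespace(text)
    return token_filter_lowercase(tokens)

print(analyze("Slow_Green Turtle"))  # ['slow', 'green', 'turtle']
```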
## Configuration

The following parameters can be used to configure a custom analyzer.
Parameter | Required/Optional | Description |
---|---|---|
`type` | Optional | The analyzer type. Default is `custom`. You can also use this parameter to specify a prebuilt analyzer. |
`tokenizer` | Required | The tokenizer to be included in the analyzer. |
`char_filter` | Optional | A list of character filters to be included in the analyzer. |
`filter` | Optional | A list of token filters to be included in the analyzer. |
`position_increment_gap` | Optional | The extra spacing applied between values when indexing text fields that have multiple values. For more information, see Position increment gap. Default is `100`. |
## Examples

The following examples demonstrate various custom analyzer configurations.
### Custom analyzer with an HTML strip character filter

The following example analyzer removes HTML tags from text before tokenization:
```json
PUT simple_html_strip_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```
Use the following request to examine the tokens generated using the analyzer:
```json
GET simple_html_strip_analyzer_index/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<p>OpenSearch is <strong>awesome</strong>!</p>"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {
      "token": "opensearch",
      "start_offset": 3,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 1
    },
    {
      "token": "awesome!",
      "start_offset": 25,
      "end_offset": 42,
      "type": "word",
      "position": 2
    }
  ]
}
```
### Custom analyzer with a mapping character filter for synonym replacement

The following example analyzer replaces specific characters and patterns before applying the synonym filter:
```json
PUT mapping_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym_mapping_analyzer": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "synonym_filter"]
        }
      },
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": ["_ => ' '"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "quick, fast, speedy",
            "big, large, huge"
          ]
        }
      }
    }
  }
}
```
Use the following request to examine the tokens generated using the analyzer:
```json
GET mapping_analyzer_index/_analyze
{
  "analyzer": "synonym_mapping_analyzer",
  "text": "The slow_green_turtle is very large"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {"token": "slow","start_offset": 4,"end_offset": 8,"type": "<ALPHANUM>","position": 1},
    {"token": "green","start_offset": 9,"end_offset": 14,"type": "<ALPHANUM>","position": 2},
    {"token": "turtle","start_offset": 15,"end_offset": 21,"type": "<ALPHANUM>","position": 3},
    {"token": "very","start_offset": 25,"end_offset": 29,"type": "<ALPHANUM>","position": 5},
    {"token": "large","start_offset": 30,"end_offset": 35,"type": "<ALPHANUM>","position": 6},
    {"token": "big","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6},
    {"token": "huge","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6}
  ]
}
```
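The chain above can be approximated locally. This sketch makes several simplifying assumptions: whitespace splitting stands in for the `standard` tokenizer, the stopword set contains only `the` and `is` (the real `stop` filter uses the full English default list), and synonyms are expanded in one direction only (the real `synonym` filter treats each comma-separated group as equivalent in all directions):

```python
def analyze(text):
    # Mapping character filter: replace underscores with spaces.
    text = text.replace("_", " ")
    # Simplified standard tokenization plus the lowercase filter.
    tokens = [t.lower() for t in text.split()]
    # Stop filter: reduced stopword set for this sketch.
    tokens = [t for t in tokens if t not in {"the", "is"}]
    # Synonym filter: emit synonyms alongside the original token.
    synonyms = {"large": ["big", "huge"], "quick": ["fast", "speedy"]}
    out = []
    for t in tokens:
        out.append(t)
        out.extend(synonyms.get(t, []))
    return out

print(analyze("The slow_green_turtle is very large"))
# ['slow', 'green', 'turtle', 'very', 'large', 'big', 'huge']
```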
### Custom analyzer with a custom pattern-based character filter for number normalization

The following example analyzer normalizes phone numbers by removing dashes and spaces and applies edge n-grams to the normalized text to support partial matching:
```json
PUT advanced_pattern_replace_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone_number_analyzer": {
          "type": "custom",
          "char_filter": ["phone_normalization"],
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram"]
        }
      },
      "char_filter": {
        "phone_normalization": {
          "type": "pattern_replace",
          "pattern": "[-\\s]",
          "replacement": ""
        }
      },
      "filter": {
        "edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  }
}
```
Use the following request to examine the tokens generated using the analyzer:
```json
GET advanced_pattern_replace_analyzer_index/_analyze
{
  "analyzer": "phone_number_analyzer",
  "text": "123-456 7890"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {"token": "123","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "12345","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "123456","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234567","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "12345678","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "123456789","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234567890","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0}
  ]
}
```
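The normalization and edge n-gram steps can be sketched locally as follows (the `standard` tokenizer step is omitted here because the normalized input is a single term):

```python
import re

def analyze_phone(text, min_gram=3, max_gram=10):
    # pattern_replace character filter: strip dashes and whitespace.
    normalized = re.sub(r"[-\s]", "", text)
    # edge_ngram token filter: emit prefixes of min_gram..max_gram characters.
    return [normalized[:n]
            for n in range(min_gram, min(max_gram, len(normalized)) + 1)]

print(analyze_phone("123-456 7890"))
# ['123', '1234', '12345', '123456', '1234567', '12345678', '123456789', '1234567890']
```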
## Handling special characters in regex patterns

When using custom regex patterns in your analyzer, make sure that special or non-English characters are handled correctly. By default, Java's regex considers only `[A-Za-z0-9_]` to be word characters (`\w`). This can cause unexpected behavior when using `\w` or `\b`, which match the boundary between word and non-word characters.

For example, the following analyzer attempts to use the pattern `(\b\p{L}+\b)` to match one or more letter characters from any language (`\p{L}`) surrounded by word boundaries:
```json
PUT /buggy_custom_analyzer
{
  "settings": {
    "analysis": {
      "filter": {
        "capture_words": {
          "type": "pattern_capture",
          "patterns": [
            "(\\b\\p{L}+\\b)"
          ]
        }
      },
      "analyzer": {
        "filter_only_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_words"
          ]
        }
      }
    }
  }
}
```
However, this analyzer incorrectly tokenizes `él-empezó-a-reír` as `l`, `empez`, `a`, and `reír` because `\b` does not match the boundary between an accented character and the start or end of a string.
To handle special characters correctly, add the Unicode case flag `(?U)` to your pattern:
```json
PUT /fixed_custom_analyzer
{
  "settings": {
    "analysis": {
      "filter": {
        "capture_words": {
          "type": "pattern_capture",
          "patterns": [
            "(?U)(\\b\\p{L}+\\b)"
          ]
        }
      },
      "analyzer": {
        "filter_only_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_words"
          ]
        }
      }
    }
  }
}
```
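Python's `re` module can illustrate both behaviors, because its `\w` and `\b` are Unicode-aware by default. Passing `re.ASCII` roughly reproduces Java's default ASCII word boundaries (not an exact match, since `re.ASCII` also restricts `\w` itself, where Java's `\p{L}` stays Unicode-aware), while the default mode corresponds to what `(?U)` enables in Java:

```python
import re

text = "él-empezó-a-reír"

# ASCII word semantics: accented letters are not word characters,
# so \b falls in the wrong places and words are split apart.
print(re.findall(r"\b\w+\b", text, re.ASCII))  # ['l', 'empez', 'a', 're', 'r']

# Unicode word semantics (Python's default; what (?U) enables in Java).
print(re.findall(r"\b\w+\b", text))  # ['él', 'empezó', 'a', 'reír']
```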
## Position increment gap

The `position_increment_gap` parameter sets a positional gap between terms when indexing a multi-valued field, such as an array. This gap ensures that phrase queries don't match terms across separate values unless explicitly allowed. For example, the default gap of 100 means that terms in different array entries are 100 positions apart, preventing unintended matches in phrase searches. You can adjust this value, or set it to `0` to allow phrases to span array values.

The following example demonstrates the effect of `position_increment_gap` using a `match_phrase` query.
1. Index a document in `test-index`:

   ```json
   PUT test-index/_doc/1
   {
     "names": [ "Slow green", "turtle swims"]
   }
   ```

2. Query the document using a `match_phrase` query:

   ```json
   GET test-index/_search
   {
     "query": {
       "match_phrase": {
         "names": {
           "query": "green turtle"
         }
       }
     }
   }
   ```

   The response returns no matches because the terms `green` and `turtle` are `100` positions apart (the default `position_increment_gap`).

3. Now query the document using a `match_phrase` query with a `slop` parameter that is higher than the `position_increment_gap`:

   ```json
   GET test-index/_search
   {
     "query": {
       "match_phrase": {
         "names": {
           "query": "green turtle",
           "slop": 101
         }
       }
     }
   }
   ```
The response contains the matching document:

```json
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.010358453,
    "hits": [
      {
        "_index": "test-index",
        "_id": "1",
        "_score": 0.010358453,
        "_source": {
          "names": [
            "Slow green",
            "turtle swims"
          ]
        }
      }
    ]
  }
}
```