
Creating a custom analyzer

To create a custom analyzer, specify a combination of the following components:

  • Character filters (zero or more)

  • Tokenizer (one)

  • Token filters (zero or more)

Configuration

The following parameters can be used to configure a custom analyzer.

Parameter | Required/Optional | Description
type | Optional | The analyzer type. Default is custom. You can also use this parameter to specify a prebuilt analyzer.
tokenizer | Required | A tokenizer to be included in the analyzer.
char_filter | Optional | A list of character filters to be included in the analyzer.
filter | Optional | A list of token filters to be included in the analyzer.
position_increment_gap | Optional | The extra spacing applied between values when indexing text fields that have multiple values. For more information, see Position increment gap. Default is 100.
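
For reference, the following request shows how these parameters fit together in the index settings. It is an illustrative sketch only: the index name my_custom_index and analyzer name my_custom_analyzer are not part of the examples below, and position_increment_gap is set to its default value of 100 purely for demonstration.

PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "position_increment_gap": 100
        }
      }
    }
  }
}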

Examples

The following examples demonstrate various custom analyzer configurations.

Custom analyzer with an HTML stripping character filter

The following example analyzer removes HTML tags from the text before tokenization:

PUT simple_html_strip_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

GET simple_html_strip_analyzer_index/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<p>OpenSearch is <strong>awesome</strong>!</p>"
}

The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "opensearch",
      "start_offset": 3,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 1
    },
    {
      "token": "awesome!",
      "start_offset": 25,
      "end_offset": 42,
      "type": "word",
      "position": 2
    }
  ]
}

Custom analyzer with a mapping character filter for synonym replacement

The following example analyzer replaces specific characters and patterns before applying the synonym filter:

PUT mapping_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym_mapping_analyzer": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "synonym_filter"]
        }
      },
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": ["_ => ' '"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "quick, fast, speedy",
            "big, large, huge"
          ]
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

GET mapping_analyzer_index/_analyze
{
  "analyzer": "synonym_mapping_analyzer",
  "text": "The slow_green_turtle is very large"
}

The response contains the generated tokens:

{
  "tokens": [
    {"token": "slow","start_offset": 4,"end_offset": 8,"type": "<ALPHANUM>","position": 1},
    {"token": "green","start_offset": 9,"end_offset": 14,"type": "<ALPHANUM>","position": 2},
    {"token": "turtle","start_offset": 15,"end_offset": 21,"type": "<ALPHANUM>","position": 3},
    {"token": "very","start_offset": 25,"end_offset": 29,"type": "<ALPHANUM>","position": 5},
    {"token": "large","start_offset": 30,"end_offset": 35,"type": "<ALPHANUM>","position": 6},
    {"token": "big","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6},
    {"token": "huge","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6}
  ]
}

Custom analyzer with a custom pattern-based character filter for number normalization

The following example analyzer normalizes phone numbers by removing dashes and spaces and applies edge n-grams to the normalized text in order to support partial matching:

PUT advanced_pattern_replace_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone_number_analyzer": {
          "type": "custom",
          "char_filter": ["phone_normalization"],
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram"]
        }
      },
      "char_filter": {
        "phone_normalization": {
          "type": "pattern_replace",
          "pattern": "[-\\s]",
          "replacement": ""
        }
      },
      "filter": {
        "edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  }
}

Use the following request to examine the tokens generated using the analyzer:

GET advanced_pattern_replace_analyzer_index/_analyze
{
  "analyzer": "phone_number_analyzer",
  "text": "123-456 7890"
}

The response contains the generated tokens:

{
  "tokens": [
    {"token": "123","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "12345","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "123456","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234567","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "12345678","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "123456789","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234567890","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0}
  ]
}
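
Because the edge n-grams are produced at index time, partial matching works when the analyzer is applied to a field and query text is analyzed without n-gramming. The following sketch is illustrative and not part of the original example: it assumes a hypothetical phone field on the same index and uses the keyword analyzer as the search analyzer.

PUT advanced_pattern_replace_analyzer_index/_mapping
{
  "properties": {
    "phone": {
      "type": "text",
      "analyzer": "phone_number_analyzer",
      "search_analyzer": "keyword"
    }
  }
}

With this mapping, a match query for a partial number such as 123456 is compared directly against the stored edge n-grams, so a document containing 123-456 7890 would match.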

Handling special characters in regular expression patterns

When using custom regular expression patterns in your analyzer, make sure that special or non-English characters are handled correctly. By default, Java's regular expressions treat only [A-Za-z0-9_] as word characters (\w). This can cause unexpected behavior when using \w or \b, which match the boundary between word and non-word characters.

For example, the following analyzer attempts to use the pattern (\b\p{L}+\b) to match one or more letter characters from any language (\p{L}) surrounded by word boundaries:

PUT /buggy_custom_analyzer
{
  "settings": {
    "analysis": {
      "filter": {
        "capture_words": {
          "type": "pattern_capture",
          "patterns": [
            "(\\b\\p{L}+\\b)"
          ]
        }
      },
      "analyzer": {
        "filter_only_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_words"
          ]
        }
      }
    }
  }
}

However, this analyzer incorrectly tokenizes él-empezó-a-reír as l, empez, a, and reír because \b does not match the boundary between an accented character and the start or end of the string.

To handle special characters correctly, add the Unicode character class flag (?U) to the pattern:

PUT /fixed_custom_analyzer
{
  "settings": {
    "analysis": {
      "filter": {
        "capture_words": {
          "type": "pattern_capture",
          "patterns": [
            "(?U)(\\b\\p{L}+\\b)"
          ]
        }
      },
      "analyzer": {
        "filter_only_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_words"
          ]
        }
      }
    }
  }
}
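
To verify the fix, you can analyze the same text with the corrected analyzer, following the same _analyze pattern used in the preceding examples; the accented words such as él and reír should now be captured as separate tokens:

GET /fixed_custom_analyzer/_analyze
{
  "analyzer": "filter_only_analyzer",
  "text": "él-empezó-a-reír"
}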

Position increment gap

The position_increment_gap parameter sets a positional gap between terms when indexing multi-valued fields, such as arrays. This gap ensures that phrase queries don't match terms across separate values unless explicitly allowed. For example, the default gap of 100 means that terms in different array entries are 100 positions apart, preventing unintended matches in phrase searches. You can adjust this value, or set it to 0, to allow phrases to span array values.

The following example demonstrates the effect of position_increment_gap using a match_phrase query.

  1. Index a document into test-index:

      PUT test-index/_doc/1
      {
        "names": [ "Slow green", "turtle swims"]
      }
    

  2. Query the document using a match_phrase query:

     GET test-index/_search
     {
       "query": {
         "match_phrase": {
           "names": {
             "query": "green turtle" 
           }
         }
       }
     }
    

    The response returns no matches because the terms green and turtle are 100 positions apart (the default position_increment_gap).

  3. Now query the document using a match_phrase query with a slop value higher than the position_increment_gap:

     GET test-index/_search
     {
       "query": {
         "match_phrase": {
           "names": {
             "query": "green turtle",
             "slop": 101
           }
         }
       }
     }
    

    The response contains the matching document:

     {
       "took": 4,
       "timed_out": false,
       "_shards": {
         "total": 1,
         "successful": 1,
         "skipped": 0,
         "failed": 0
       },
       "hits": {
         "total": {
           "value": 1,
           "relation": "eq"
         },
         "max_score": 0.010358453,
         "hits": [
           {
             "_index": "test-index",
             "_id": "1",
             "_score": 0.010358453,
             "_source": {
               "names": [
                 "Slow green",
                 "turtle swims"
               ]
             }
           }
         ]
       }
     }
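
Alternatively, instead of raising slop, you can remove the gap entirely by setting position_increment_gap to 0 in the field mapping. The following request is an illustrative sketch (the index name test-index-no-gap is not part of the original example):

PUT test-index-no-gap
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 0
      }
    }
  }
}

With a gap of 0, the phrase green turtle matches across the two array values without any slop.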