# Creating a custom analyzer

To create a custom analyzer, specify a combination of the following components:

- Character filters (zero or more)
- Tokenizer (one)
- Token filters (zero or more)
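These three components always run in a fixed order: character filters transform the raw text, the single tokenizer splits the filtered text into tokens, and token filters transform the resulting token stream. A minimal sketch of that ordering in plain Python (the filter implementations here are illustrative stand-ins, not OpenSearch internals):

```python
def char_filter_mapping(text):
    # Character filter: operates on the raw text before tokenization
    # (here, a mapping-style filter replacing underscores with spaces).
    return text.replace("_", " ")

def tokenizer_whitespace(text):
    # Tokenizer: splits the filtered text into tokens (exactly one per analyzer).
    return text.split()

def token_filter_lowercase(tokens):
    # Token filter: operates on the token stream after tokenization.
    return [t.lower() for t in tokens]

def analyze(text):
    # Components run in order: char filters -> tokenizer -> token filters.
    text = char_filter_mapping(text)
    tokens = tokenizer_whitespace(text)
    return token_filter_lowercase(tokens)

print(analyze("Slow_Green Turtle"))  # ['slow', 'green', 'turtle']
```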
## Configuration

The following parameters can be used to configure a custom analyzer.
Parameter | Required/Optional | Description |
---|---|---|
`type` | Optional | The analyzer type. Default is `custom`. You can also use this parameter to specify a prebuilt analyzer. |
`tokenizer` | Required | The tokenizer to be included in the analyzer. |
`char_filter` | Optional | A list of character filters to be included in the analyzer. |
`filter` | Optional | A list of token filters to be included in the analyzer. |
`position_increment_gap` | Optional | The extra spacing applied between values when indexing text fields that have multiple values. For more information, see Position increment gap. Default is `100`. |
## Examples

The following examples demonstrate various custom analyzer configurations.
### Custom analyzer with an HTML strip character filter

The following example analyzer removes HTML tags from text before tokenization:
```json
PUT simple_html_strip_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_strip_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```
Use the following request to examine the tokens generated using the analyzer:
```json
GET simple_html_strip_analyzer_index/_analyze
{
  "analyzer": "html_strip_analyzer",
  "text": "<p>OpenSearch is <strong>awesome</strong>!</p>"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {
      "token": "opensearch",
      "start_offset": 3,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 1
    },
    {
      "token": "awesome!",
      "start_offset": 25,
      "end_offset": 42,
      "type": "word",
      "position": 2
    }
  ]
}
```
### Custom analyzer with a mapping character filter for synonym replacement

The following example analyzer replaces specific characters and patterns before applying the synonym filter:
```json
PUT mapping_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym_mapping_analyzer": {
          "type": "custom",
          "char_filter": ["underscore_to_space"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "synonym_filter"]
        }
      },
      "char_filter": {
        "underscore_to_space": {
          "type": "mapping",
          "mappings": ["_ => ' '"]
        }
      },
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "quick, fast, speedy",
            "big, large, huge"
          ]
        }
      }
    }
  }
}
```
Use the following request to examine the tokens generated using the analyzer:
```json
GET mapping_analyzer_index/_analyze
{
  "analyzer": "synonym_mapping_analyzer",
  "text": "The slow_green_turtle is very large"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {"token": "slow","start_offset": 4,"end_offset": 8,"type": "<ALPHANUM>","position": 1},
    {"token": "green","start_offset": 9,"end_offset": 14,"type": "<ALPHANUM>","position": 2},
    {"token": "turtle","start_offset": 15,"end_offset": 21,"type": "<ALPHANUM>","position": 3},
    {"token": "very","start_offset": 25,"end_offset": 29,"type": "<ALPHANUM>","position": 5},
    {"token": "large","start_offset": 30,"end_offset": 35,"type": "<ALPHANUM>","position": 6},
    {"token": "big","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6},
    {"token": "huge","start_offset": 30,"end_offset": 35,"type": "SYNONYM","position": 6}
  ]
}
```
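The chain above can be approximated locally. This sketch makes several simplifying assumptions: whitespace splitting stands in for the `standard` tokenizer, the stopword set contains only `the` and `is` (the real `stop` filter uses the full English default list), and synonyms are expanded in one direction only (the real `synonym` filter treats each comma-separated group as equivalent in all directions):

```python
def analyze(text):
    # Mapping character filter: replace underscores with spaces.
    text = text.replace("_", " ")
    # Simplified standard tokenization plus the lowercase filter.
    tokens = [t.lower() for t in text.split()]
    # Stop filter: reduced stopword set for this sketch.
    tokens = [t for t in tokens if t not in {"the", "is"}]
    # Synonym filter: emit synonyms alongside the original token.
    synonyms = {"large": ["big", "huge"], "quick": ["fast", "speedy"]}
    out = []
    for t in tokens:
        out.append(t)
        out.extend(synonyms.get(t, []))
    return out

print(analyze("The slow_green_turtle is very large"))
# ['slow', 'green', 'turtle', 'very', 'large', 'big', 'huge']
```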
### Custom analyzer with a custom pattern-based character filter for number normalization

The following example analyzer normalizes phone numbers by removing dashes and spaces and applies edge n-grams to the normalized text to support partial matching:
```json
PUT advanced_pattern_replace_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "phone_number_analyzer": {
          "type": "custom",
          "char_filter": ["phone_normalization"],
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_ngram"]
        }
      },
      "char_filter": {
        "phone_normalization": {
          "type": "pattern_replace",
          "pattern": "[-\\s]",
          "replacement": ""
        }
      },
      "filter": {
        "edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  }
}
```
Use the following request to examine the tokens generated using the analyzer:
```json
GET advanced_pattern_replace_analyzer_index/_analyze
{
  "analyzer": "phone_number_analyzer",
  "text": "123-456 7890"
}
```
The response contains the generated tokens:
```json
{
  "tokens": [
    {"token": "123","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "12345","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "123456","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234567","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "12345678","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "123456789","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0},
    {"token": "1234567890","start_offset": 0,"end_offset": 12,"type": "<NUM>","position": 0}
  ]
}
```
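The normalization and edge n-gram steps can be sketched locally as follows (the `standard` tokenizer step is omitted here because the normalized input is a single term):

```python
import re

def analyze_phone(text, min_gram=3, max_gram=10):
    # pattern_replace character filter: strip dashes and whitespace.
    normalized = re.sub(r"[-\s]", "", text)
    # edge_ngram token filter: emit prefixes of min_gram..max_gram characters.
    return [normalized[:n]
            for n in range(min_gram, min(max_gram, len(normalized)) + 1)]

print(analyze_phone("123-456 7890"))
# ['123', '1234', '12345', '123456', '1234567', '12345678', '123456789', '1234567890']
```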
## Handling special characters in regex patterns

When using custom regex patterns in your analyzer, make sure that special or non-English characters are handled correctly. By default, Java's regex considers only `[A-Za-z0-9_]` to be word characters (`\w`). This can cause unexpected behavior when using `\w` or `\b`, which match the boundary between word and non-word characters.

For example, the following analyzer attempts to use the pattern `(\b\p{L}+\b)` to match one or more letter characters from any language (`\p{L}`) surrounded by word boundaries:
```json
PUT /buggy_custom_analyzer
{
  "settings": {
    "analysis": {
      "filter": {
        "capture_words": {
          "type": "pattern_capture",
          "patterns": [
            "(\\b\\p{L}+\\b)"
          ]
        }
      },
      "analyzer": {
        "filter_only_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_words"
          ]
        }
      }
    }
  }
}
```
However, this analyzer incorrectly tokenizes `él-empezó-a-reír` as `l`, `empez`, `a`, and `reír` because `\b` does not match the boundary between an accented character and the start or end of a string.
To handle special characters correctly, add the Unicode case flag `(?U)` to your pattern:
```json
PUT /fixed_custom_analyzer
{
  "settings": {
    "analysis": {
      "filter": {
        "capture_words": {
          "type": "pattern_capture",
          "patterns": [
            "(?U)(\\b\\p{L}+\\b)"
          ]
        }
      },
      "analyzer": {
        "filter_only_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "capture_words"
          ]
        }
      }
    }
  }
}
```
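Python's `re` module can illustrate both behaviors, because its `\w` and `\b` are Unicode-aware by default. Passing `re.ASCII` roughly reproduces Java's default ASCII word boundaries (not an exact match, since `re.ASCII` also restricts `\w` itself, where Java's `\p{L}` stays Unicode-aware), while the default mode corresponds to what `(?U)` enables in Java:

```python
import re

text = "él-empezó-a-reír"

# ASCII word semantics: accented letters are not word characters,
# so \b falls in the wrong places and words are split apart.
print(re.findall(r"\b\w+\b", text, re.ASCII))  # ['l', 'empez', 'a', 're', 'r']

# Unicode word semantics (Python's default; what (?U) enables in Java).
print(re.findall(r"\b\w+\b", text))  # ['él', 'empezó', 'a', 'reír']
```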
## Position increment gap

The `position_increment_gap` parameter sets a positional gap between terms when indexing a multi-valued field, such as an array. This gap ensures that phrase queries don't match terms across separate values unless explicitly allowed. For example, the default gap of 100 means that terms in different array entries are 100 positions apart, preventing unintended matches in phrase searches. You can adjust this value, or set it to `0` to allow phrases to span array values.

The following example demonstrates the effect of `position_increment_gap` using a `match_phrase` query.
1. Index a document in `test-index`:

   ```json
   PUT test-index/_doc/1
   {
     "names": [ "Slow green", "turtle swims"]
   }
   ```

2. Query the document using a `match_phrase` query:

   ```json
   GET test-index/_search
   {
     "query": {
       "match_phrase": {
         "names": {
           "query": "green turtle"
         }
       }
     }
   }
   ```

   The response returns no matches because the terms `green` and `turtle` are `100` positions apart (the default `position_increment_gap`).

3. Now query the document using a `match_phrase` query with a `slop` parameter that is higher than the `position_increment_gap`:

   ```json
   GET test-index/_search
   {
     "query": {
       "match_phrase": {
         "names": {
           "query": "green turtle",
           "slop": 101
         }
       }
     }
   }
   ```
The response contains the matching document:

```json
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.010358453,
    "hits": [
      {
        "_index": "test-index",
        "_id": "1",
        "_score": 0.010358453,
        "_source": {
          "names": [
            "Slow green",
            "turtle swims"
          ]
        }
      }
    ]
  }
}
```