
Mapping character filter

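The mapping character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string that matches a key, it replaces that string with the corresponding value. Replacement values can be empty strings.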

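The filter applies greedy matching: when more than one key matches at the same position, the longest matching key is used.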
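
The greedy behavior can be illustrated with a small Python sketch. This is not how OpenSearch implements the filter; it is a simplified model under the assumption that rules are plain `key => value` strings without escape sequences, and the function names `parse_rules` and `apply_mappings` are hypothetical.

```python
def parse_rules(rules):
    """Turn ["I => 1", ...] into a dict {"I": "1", ...} (simplified parsing)."""
    table = {}
    for rule in rules:
        key, _, value = rule.partition("=>")
        table[key.strip()] = value.strip()
    return table


def apply_mappings(text, rules):
    """Replace keys in text, preferring the longest key at each position."""
    table = parse_rules(rules)
    # Try longer keys first so that "III" wins over "II" and "I".
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches here, so copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


rules = ["I => 1", "II => 2", "III => 3", "IV => 4", "V => 5"]
print(apply_mappings("I have III apples and IV oranges", rules))
```

Because longer keys are tried first, `III` is replaced as a whole rather than as three separate `I` matches.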

The mapping character filter is useful when specific text replacements are required before tokenization.

Example

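The following request configures a mapping character filter that converts Roman numerals (such as I, II, or III) into their corresponding Arabic numerals (1, 2, and 3):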

GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "I => 1",
        "II => 2",
        "III => 3",
        "IV => 4",
        "V => 5"
      ]
    }
  ],
  "text": "I have III apples and IV oranges"
}

The response contains a single token in which the Roman numerals have been replaced with Arabic numerals:

{
  "tokens": [
    {
      "token": "1 have 3 apples and 4 oranges",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}
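
Note that `end_offset` is 32 even though the returned token is shorter. This is consistent with offsets being reported against the original input text rather than the filtered text. The check below is only an arithmetic illustration of that observation; the variable names are introduced here.

```python
original = "I have III apples and IV oranges"
token = "1 have 3 apples and 4 oranges"

# The token is shorter than the input because "III" and "IV"
# each collapse to a single digit.
print(len(original))  # 32, matches end_offset in the response
print(len(token))     # 29
```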

Parameters

You can use either of the following parameters to configure the key-value map.

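| Parameter | Required/Optional | Data type | Description |
| --- | --- | --- | --- |
| mappings | Optional | Array | An array of key-value pairs in the format key => value. Each key found in the input text is replaced with its corresponding value. |
| mappings_path | Optional | String | The path to a UTF-8 encoded file containing key-value mappings. Each mapping should appear on a new line in the format key => value. The path can be absolute or relative to the OpenSearch configuration directory. |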
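
As a rough sketch of how `mappings_path` might be used, the Python snippet below builds an index settings body that points the filter at a file instead of listing the rules inline. The path `analysis/abbreviations.txt`, the filter name `abbr_from_file`, and the sample file contents are all hypothetical and only illustrate the expected shape.

```python
import json

# Hypothetical file contents: one "key => value" rule per line, UTF-8 encoded.
SAMPLE_FILE_TEXT = "BTW => By the way\nIDK => I don't know\n"

# Hypothetical path, relative to the OpenSearch configuration directory.
MAPPINGS_PATH = "analysis/abbreviations.txt"


def build_settings(path):
    """Return a settings body with a mapping character filter read from a file."""
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "abbr_from_file": {  # hypothetical filter name
                        "type": "mapping",
                        "mappings_path": path,
                    }
                }
            }
        }
    }


body = build_settings(MAPPINGS_PATH)
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent as the body of an index creation request, provided the file exists at that path on the node.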

Using a custom mapping character filter

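You can create a custom mapping character filter by defining your own set of mappings. The following request creates a custom character filter that replaces common abbreviations in text: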

PUT /test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_abbr_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "custom_abbr_filter"
          ]
        }
      },
      "char_filter": {
        "custom_abbr_filter": {
          "type": "mapping",
          "mappings": [
            "BTW => By the way",
            "IDK => I don't know",
            "FYI => For your information"
          ]
        }
      }
    }
  }
}

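Use the following request to examine the tokens produced with the custom character filter. The request applies the filter with the keyword tokenizer, so the entire text is returned as a single token: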

GET /test-index/_analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "custom_abbr_filter" ],
  "text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
}

The response shows that the abbreviations were replaced:

{
  "tokens": [
    {
      "token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.",
      "start_offset": 0,
      "end_offset": 153,
      "type": "word",
      "position": 0
    }
  ]
}