Mapping character filter
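The mapping character filter accepts a map of key-value pairs for character replacement. Whenever the filter encounters a string that matches a key, it replaces that string with the corresponding value. A replacement value can be an empty string.
The filter uses greedy matching: when more than one key matches at the same position, the longest matching key is applied.
The mapping character filter is useful when specific text replacements are required before tokenization.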
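To illustrate greedy matching, the following Python sketch applies a list of key => value rules to a string, always preferring the longest key at the current position. It is a simplified model for illustration only, not the filter's actual implementation. The function name apply_mappings and the trimming of white space around => are assumptions made for this example.

```python
def apply_mappings(text, rules):
    """Replace keys with values, preferring the longest key at each position.

    Simplified model of greedy matching (an assumption for illustration).
    """
    table = {}
    for rule in rules:
        key, arrow, value = rule.partition("=>")
        key = key.strip()
        if not arrow or not key:
            raise ValueError(f"expected 'key => value', got {rule!r}")
        table[key] = value.strip()  # the value may be an empty string

    # Try longer keys first so that "III" wins over "II" and "I".
    keys = sorted(table, key=len, reverse=True)

    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


rules = ["I => 1", "II => 2", "III => 3", "IV => 4", "V => 5"]
print(apply_mappings("I have III apples and IV oranges", rules))
# prints: 1 have 3 apples and 4 oranges
```

Without the longest-first ordering, the rule for I would fire three times on III and produce 111 instead of 3, which is why greedy matching matters when keys share a prefix.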
Example
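The following request configures a mapping character filter that converts Roman numerals (such as I, II, or III) into their corresponding Arabic numerals (1, 2, and 3):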
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "I => 1",
        "II => 2",
        "III => 3",
        "IV => 4",
        "V => 5"
      ]
    }
  ],
  "text": "I have III apples and IV oranges"
}
The response contains a single token in which the Roman numerals have been replaced with Arabic numerals:
{
  "tokens": [
    {
      "token": "1 have 3 apples and 4 oranges",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}
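Parameters
You can use either of the following parameters to configure the key-value map.
Parameter | Required/Optional | Data type | Description |
---|---|---|---|
mappings | Optional | Array | An array of key-value pairs in the format key => value. Each key found in the input text is replaced with its corresponding value. |
mappings_path | Optional | String | The path to a UTF-8 encoded file containing key-value mappings. Each mapping must appear on its own line in the format key => value. The path can be absolute or relative to the OpenSearch config directory. |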
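As an illustration of the mappings_path file format, the following Python sketch reads a UTF-8 file containing one key => value rule per line and builds a character filter definition that points to such a file. The helper names read_mapping_file and char_filter_settings, the file name abbreviations.txt, and the relative path analysis/abbreviations.txt are hypothetical. Skipping blank lines is an assumption of this sketch, not documented filter behavior.

```python
import json
import os
import tempfile


def read_mapping_file(path):
    """Read 'key => value' rules, one per line, from a UTF-8 file."""
    rules = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are skipped in this sketch
            key, arrow, value = line.partition("=>")
            if not arrow or not key.strip():
                raise ValueError(f"line {number}: expected 'key => value'")
            rules.append((key.strip(), value.strip()))
    return rules


def char_filter_settings(mappings_path):
    """Build a mapping character filter definition that references a file."""
    return {"type": "mapping", "mappings_path": mappings_path}


if __name__ == "__main__":
    # Write a small sample file, then check that every line parses.
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "abbreviations.txt")  # hypothetical name
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("BTW => By the way\nIDK => I don't know\n")
        print(read_mapping_file(path))

    # Hypothetical path, relative to the OpenSearch config directory.
    print(json.dumps(char_filter_settings("analysis/abbreviations.txt"), indent=2))
```

Checking the file locally before referencing it in the index settings can catch malformed lines early, because the file itself is read on the OpenSearch node rather than by the client.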
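Using a custom mapping character filter
You can create a custom mapping character filter by defining your own set of mappings. The following request creates a custom character filter that replaces common abbreviations in text: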
PUT /test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_abbr_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "custom_abbr_filter"
          ]
        }
      },
      "char_filter": {
        "custom_abbr_filter": {
          "type": "mapping",
          "mappings": [
            "BTW => By the way",
            "IDK => I don't know",
            "FYI => For your information"
          ]
        }
      }
    }
  }
}
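Use the following request to examine the tokens generated when the custom character filter is applied with the keyword tokenizer: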
GET /test-index/_analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "custom_abbr_filter" ],
  "text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
}
The response shows that the abbreviations were replaced:
{
  "tokens": [
    {
      "token": "For your information, updates to the workout schedule are posted. I don't know when it takes effect, but we have some details. By the way, the finalized schedule will be released Monday.",
      "start_offset": 0,
      "end_offset": 153,
      "type": "word",
      "position": 0
    }
  ]
}