Classic tokenizer

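The classic tokenizer parses text by applying English grammar rules to break it into tokens. It includes specific logic for handling the following patterns:

Acronyms
Email addresses
Domain names
Certain types of punctuation

This tokenizer works best with English. For other languages, especially those with a different grammatical structure, it may not produce optimal results.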

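The classic tokenizer parses text as follows:

Punctuation: Splits text on most punctuation marks and removes the punctuation characters. A period that is not followed by a space is treated as part of the token.
Hyphens: Splits words at hyphens, except when a number is present. A token that contains a number is not split and is treated as a product number.
Email: Recognizes email addresses and hostnames and keeps each one as a single token.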
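
The hyphen rule can be illustrated with a small Python sketch. This only models the rule as it is worded here, applied to a single whitespace-delimited word; it is not the tokenizer's actual grammar, which may behave differently on edge cases. The function name `apply_hyphen_rule` and the sample words are hypothetical and chosen for illustration.

```python
def apply_hyphen_rule(word: str) -> list:
    """Split one whitespace-delimited word per the hyphen rule as worded above.

    Illustrative approximation only; the real classic tokenizer is grammar-based.
    """
    if "-" not in word:
        return [word]
    # A digit anywhere in the word keeps it whole (treated like a product number).
    if any(ch.isdigit() for ch in word):
        return [word]
    # Otherwise split at each hyphen and drop any empty pieces.
    return [part for part in word.split("-") if part]


# Hypothetical sample words: one without digits, two with digits.
for example in ["state-of-the-art", "1-800-555-1234", "XL-5000"]:
    print(example, "->", apply_hyphen_rule(example))
```

To see what the tokenizer itself does with such input, send the same words to the `_analyze` API and compare.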

Example usage

The following example request creates a new index named my_index and configures an analyzer that uses the classic tokenizer:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_classic_analyzer": {
          "type": "custom",
          "tokenizer": "classic"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_classic_analyzer"
      }
    }
  }
}

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

POST /my_index/_analyze
{
  "analyzer": "my_classic_analyzer",
  "text": "For product AB3423, visit X&Y at example.com, email info@example.com, or call the operator's phone number 1-800-555-1234. P.S. 你好."
}

The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "For",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "product",
      "start_offset": 4,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "AB3423",
      "start_offset": 12,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "visit",
      "start_offset": 20,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "X&Y",
      "start_offset": 26,
      "end_offset": 29,
      "type": "<COMPANY>",
      "position": 4
    },
    {
      "token": "at",
      "start_offset": 30,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "example.com",
      "start_offset": 33,
      "end_offset": 44,
      "type": "<HOST>",
      "position": 6
    },
    {
      "token": "email",
      "start_offset": 46,
      "end_offset": 51,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "info@example.com",
      "start_offset": 52,
      "end_offset": 68,
      "type": "<EMAIL>",
      "position": 8
    },
    {
      "token": "or",
      "start_offset": 70,
      "end_offset": 72,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "call",
      "start_offset": 73,
      "end_offset": 77,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "the",
      "start_offset": 78,
      "end_offset": 81,
      "type": "<ALPHANUM>",
      "position": 11
    },
    {
      "token": "operator's",
      "start_offset": 82,
      "end_offset": 92,
      "type": "<APOSTROPHE>",
      "position": 12
    },
    {
      "token": "phone",
      "start_offset": 93,
      "end_offset": 98,
      "type": "<ALPHANUM>",
      "position": 13
    },
    {
      "token": "number",
      "start_offset": 99,
      "end_offset": 105,
      "type": "<ALPHANUM>",
      "position": 14
    },
    {
      "token": "1-800-555-1234",
      "start_offset": 106,
      "end_offset": 120,
      "type": "<NUM>",
      "position": 15
    },
    {
      "token": "P.S.",
      "start_offset": 122,
      "end_offset": 126,
      "type": "<ACRONYM>",
      "position": 16
    },
    {
      "token": "你",
      "start_offset": 127,
      "end_offset": 128,
      "type": "<CJ>",
      "position": 17
    },
    {
      "token": "好",
      "start_offset": 128,
      "end_offset": 129,
      "type": "<CJ>",
      "position": 18
    }
  ]
}
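
When reviewing a long response like this one, it can help to group the tokens by their type. The following is a minimal Python sketch, assuming the response has already been parsed into a dictionary (for example, with `json.loads`). The function name `group_tokens_by_type` is hypothetical, and the sample is a shortened excerpt of the response shown here.

```python
from collections import defaultdict
from typing import Dict, List


def group_tokens_by_type(response: dict) -> Dict[str, List[str]]:
    """Map each token type to the tokens of that type, in response order."""
    grouped = defaultdict(list)
    for entry in response.get("tokens", []):
        grouped[entry["type"]].append(entry["token"])
    return dict(grouped)


# Shortened excerpt of the response; the full response has 19 tokens.
sample = {
    "tokens": [
        {"token": "X&Y", "start_offset": 26, "end_offset": 29,
         "type": "<COMPANY>", "position": 4},
        {"token": "example.com", "start_offset": 33, "end_offset": 44,
         "type": "<HOST>", "position": 6},
        {"token": "你", "start_offset": 127, "end_offset": 128,
         "type": "<CJ>", "position": 17},
        {"token": "好", "start_offset": 128, "end_offset": 129,
         "type": "<CJ>", "position": 18},
    ]
}

for token_type, tokens in group_tokens_by_type(sample).items():
    print(token_type, tokens)
```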

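Token types

The classic tokenizer produces the following token types.

Token type	Description
`<ALPHANUM>`	Alphanumeric tokens consisting of letters, numbers, or a combination of both.
`<APOSTROPHE>`	Tokens containing an apostrophe, commonly used in possessives or contractions (for example, `John's`).
`<ACRONYM>`	Acronyms or abbreviations, often identified by a trailing period (for example, `P.S.` or `U.S.A.`).
`<COMPANY>`	Tokens representing company names (for example, `X&Y`). If these tokens aren't generated automatically, you may need custom configurations or filters.
`<EMAIL>`	Tokens matching email addresses, containing an `@` symbol and a domain (for example, `support@widgets.co` or `info@example.com`).
`<HOST>`	Tokens matching website names or hostnames, often containing `www.` or a domain suffix like `.com` (for example, `www.example.com` or `example.org`).
`<NUM>`	Tokens containing only numbers or number-like sequences (for example, `1-800`, `12345`, or `3.14`).
`<CJ>`	Tokens representing Chinese or Japanese characters.
`<ACRONYM_DEP>`	Deprecated acronym handling (for example, acronyms with different parsing rules in older versions). Rarely used; it exists primarily for backward compatibility with legacy tokenizer rules.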
