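Analyze API

Introduced 1.0

The Analyze API lets you perform text analysis, the process of converting unstructured text into individual tokens (usually words) that are optimized for search. For more information about common analysis components such as character filters, tokenizers, token filters, and normalizers, see Analyzers.

The Analyze API analyzes a text string and returns the resulting tokens.

If you use the Security plugin, you must have the manage index privilege. If you only want to analyze text, you must have the manage cluster privilege.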

Endpoints

GET /_analyze
GET /{index}/_analyze
POST /_analyze
POST /{index}/_analyze

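Although you can issue an analyze request using either GET or POST, the two have important distinctions. A GET request causes data to be cached in the index so that the next time the data is requested, it is retrieved faster. A POST request sends a string that does not already exist to the analyzer to be compared with data that is already in the index. POST requests are not cached.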

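Path parameters

You can include the following optional path parameter in your request.

Parameter	Data type	Description
index	String	The index used to derive the analyzer.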
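Because the index path parameter is optional, a client has to choose between two path shapes. The following is a minimal sketch of that choice; the helper name analyze_path is hypothetical and is not part of any OpenSearch client.

```python
from typing import Optional
from urllib.parse import quote


def analyze_path(index: Optional[str] = None) -> str:
    """Return the Analyze API path, with or without an index."""
    if index is None:
        return "/_analyze"
    # Percent-encode the index name so it is safe inside a URL path.
    return f"/{quote(index, safe='')}/_analyze"


print(analyze_path())          # /_analyze
print(analyze_path("books2"))  # /books2/_analyze
```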

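Request body fields

The following table lists the available request body fields.

Field	Data type	Description
`text`	String or array of strings	The text to analyze. If you provide an array of strings, the text is analyzed as a multi-value field. Required.
`analyzer`	String	The name of the analyzer to apply to the `text` field. The analyzer can be built in or configured in the index. If `analyzer` is not specified, the Analyze API uses the analyzer defined in the mapping of the `field` field. If the `field` field is not specified, the Analyze API uses the default analyzer for the index. If no index is specified, or the index does not have a default analyzer, the Analyze API uses the standard analyzer. Optional. See Analyzers.
`attributes`	Array of strings	Array of token attributes used to filter the output of the `explain` field.
`char_filter`	Array of strings	Array of character filters that preprocess characters before the `tokenizer` field is applied. Optional. See Character filters.
`explain`	Boolean	If `true`, the response includes token attributes and additional details. Optional. Default is `false`.
`field`	String	The field used to derive the analyzer. If you specify `field`, you must also specify the `index` path parameter. If you specify the `analyzer` field, it overrides the value of `field`. If you do not specify `field`, the Analyze API uses the default analyzer for the index. If you do not specify the `index` path parameter, or the index does not have a default analyzer, the Analyze API uses the standard analyzer. Optional.
`filter`	Array of strings or objects	Array of token filters to apply after the `tokenizer` field. An entry can be a filter name or an inline filter definition, as in the stop filter example on this page. Optional. See Token filters.
`normalizer`	String	The normalizer used to convert the text into a single token. Optional. See Normalizers.
`tokenizer`	String	The tokenizer used to convert the `text` field into tokens. Optional. See Tokenizers.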
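The fallback order described for the `analyzer` and `field` fields can be written out as a small function. This is only a sketch of the documented order, not the server's implementation; resolve_analyzer, field_analyzers, and index_default are hypothetical names that stand in for what the index mapping and settings would provide.

```python
from typing import Dict, Optional


def resolve_analyzer(
    analyzer: Optional[str] = None,
    field: Optional[str] = None,
    field_analyzers: Optional[Dict[str, str]] = None,
    index_default: Optional[str] = None,
) -> str:
    """Pick an analyzer name using the fallback order from the table."""
    # 1. An explicit `analyzer` overrides everything else.
    if analyzer is not None:
        return analyzer
    # 2. Otherwise use the analyzer mapped to `field`, if there is one.
    if field is not None and field_analyzers and field in field_analyzers:
        return field_analyzers[field]
    # 3. Otherwise use the index's default analyzer, if it has one.
    if index_default is not None:
        return index_default
    # 4. With no index or no default analyzer, fall back to `standard`.
    return "standard"


mapping = {"name": "lowercase_ascii_folding"}  # hypothetical mapping
print(resolve_analyzer(analyzer="keyword", field="name", field_analyzers=mapping))
print(resolve_analyzer(field="name", field_analyzers=mapping))
print(resolve_analyzer(index_default="english"))
print(resolve_analyzer())
```

The four calls print keyword, lowercase_ascii_folding, english, and standard, one per fallback level.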

Example requests

Analyze array of text strings

When you pass an array of strings to the text field, it is analyzed as a multi-value field.

GET /_analyze
{
  "analyzer" : "standard",
  "text" : ["first array element", "second array element"]
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "first",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "array",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "element",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "second",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "array",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "element",
      "start_offset" : 33,
      "end_offset" : 40,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
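In this response the offsets keep counting across array elements: the first element ends at offset 19 and the second starts at 20. The sketch below checks that observation against a trimmed copy of the response. Joining the values with a single space is an assumption inferred from these numbers, not a documented rule.

```python
import json

values = ["first array element", "second array element"]

# Trimmed copy of the response above: token text plus offsets only.
response = json.loads("""
{"tokens": [
  {"token": "first",   "start_offset": 0,  "end_offset": 5},
  {"token": "array",   "start_offset": 6,  "end_offset": 11},
  {"token": "element", "start_offset": 12, "end_offset": 19},
  {"token": "second",  "start_offset": 20, "end_offset": 26},
  {"token": "array",   "start_offset": 27, "end_offset": 32},
  {"token": "element", "start_offset": 33, "end_offset": 40}
]}
""")

# Assumption: offsets behave as if the values were joined by one character.
joined = " ".join(values)

for tok in response["tokens"]:
    # end_offset is exclusive, so it can be used directly as a slice bound.
    source = joined[tok["start_offset"]:tok["end_offset"]]
    print(f'{tok["token"]:8} <- {source!r}')
```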

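Apply a built-in analyzer

If you omit the index path parameter, you can apply any of the built-in analyzers to the text string.

The following request analyzes text using the standard built-in analyzer: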

GET /_analyze
{
  "analyzer" : "standard",
  "text" : "OpenSearch text analysis"
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "opensearch",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "text",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "analysis",
      "start_offset" : 16,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
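Outside of a REST console, the same request can be prepared with Python's standard library. The sketch below only builds the request and does not send it. The base URL is an assumption (a local cluster without TLS or authentication), and POST is used because many HTTP clients do not send a body with GET.

```python
import json
import urllib.request

# Assumption: a local cluster without TLS or authentication.
BASE_URL = "http://localhost:9200"


def build_analyze_request(body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for /_analyze."""
    return urllib.request.Request(
        BASE_URL + "/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request(
    {"analyzer": "standard", "text": "OpenSearch text analysis"}
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to urllib.request.urlopen and parse the response body with json.load.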

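Apply a custom analyzer

You can create your own analyzer and specify it in an analyze request.

In this scenario, a custom analyzer named lowercase_ascii_folding has been created and associated with the books2 index. The analyzer converts text to lowercase and converts non-ASCII characters to ASCII.

The following request applies the custom analyzer to the provided text: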

GET /books2/_analyze
{
  "analyzer": "lowercase_ascii_folding",
  "text" : "Le garçon m'a SUIVI."
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "le",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "garcon",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "m'a",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "suivi",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
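This page names the lowercase_ascii_folding analyzer but does not show its definition. One plausible definition, given the `<ALPHANUM>` token types in the response, combines the standard tokenizer with the lowercase and asciifolding token filters. The body below is an assumption for illustration, not the actual configuration of the books2 index.

```python
import json

# Hypothetical definition: the page names the analyzer but does not show it.
create_index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "lowercase_ascii_folding": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}

# Print the request in the same console style used on this page.
print("PUT /books2")
print(json.dumps(create_index_body, indent=2))
```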

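Apply a custom transient analyzer

You can build a custom transient analyzer from tokenizers, token filters, or character filters. Use the filter parameter to specify token filters.

The following request uses the uppercase token filter to convert the text to uppercase: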

GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["uppercase"],
  "text" : "OpenSearch filter"
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "OPENSEARCH FILTER",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}

The following request uses the html_strip character filter to remove HTML characters from the text:

GET /_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["html_strip"],
  "text" : "<b>Leave</b> right now!"
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "leave right now!",
      "start_offset" : 3,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}
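The three request fields run in a fixed order: character filters first, then the tokenizer, then token filters. The following local imitation illustrates that order only. The functions are simplified stand-ins written for this example and are not how OpenSearch implements html_strip, keyword, or lowercase.

```python
import re


def strip_html(text: str) -> str:
    """Very rough stand-in for html_strip: drop anything shaped like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def keyword_tokenizer(text: str) -> list:
    """Emit the whole input as a single token, like the keyword tokenizer."""
    return [text]


def analyze(text, char_filters, tokenizer, token_filters):
    # 1. Character filters rewrite the raw text.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. The tokenizer splits the filtered text into tokens.
    tokens = tokenizer(text)
    # 3. Token filters transform each token, in the order given.
    for token_filter in token_filters:
        tokens = [token_filter(t) for t in tokens]
    return tokens


text = "<b>Leave</b> right now!"
print(analyze(text, [strip_html], keyword_tokenizer, [str.lower]))
# The original text, tags included, is 23 characters long.
print(len(text))
```

The length of 23 matches the end_offset in the response, which is consistent with offsets referring to the original text rather than the stripped text.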

You can combine filters using an array.

The following request combines the lowercase filter with a stop filter that removes the words listed in the stopwords array:

GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["lowercase", {"type": "stop", "stopwords": [ "to", "in"]}],
  "text" : "how to train your dog in five steps"
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "how",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "train",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "your",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "dog",
      "start_offset" : 18,
      "end_offset" : 21,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "five",
      "start_offset" : 25,
      "end_offset" : 29,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "steps",
      "start_offset" : 30,
      "end_offset" : 35,
      "type" : "word",
      "position" : 7
    }
  ]
}
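Note that positions 1 and 5 are absent from the response: the removed stop words "to" and "in" leave gaps in the position sequence. A short sketch for finding such gaps, using only the positions copied from the response above:

```python
# Token text and positions copied from the response above.
tokens = [
    {"token": "how", "position": 0},
    {"token": "train", "position": 2},
    {"token": "your", "position": 3},
    {"token": "dog", "position": 4},
    {"token": "five", "position": 6},
    {"token": "steps", "position": 7},
]


def missing_positions(tokens):
    """Return positions below the highest one that no token occupies."""
    seen = {t["position"] for t in tokens}
    if not seen:
        return []
    return [p for p in range(max(seen) + 1) if p not in seen]


print(missing_positions(tokens))  # [1, 5]
```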

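Specify an index

You can analyze text using an index's default analyzer, or you can specify a different analyzer.

The following request analyzes the provided text using the default analyzer associated with the books index: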

GET /books/_analyze
{
  "text" : "OpenSearch analyze test"
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "opensearch",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "analyze",
      "start_offset" : 11,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

The following request analyzes the provided text using the keyword analyzer, which returns the entire text value as a single token:

GET /books/_analyze
{
  "analyzer" : "keyword",
  "text" : "OpenSearch analyze test"
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "OpenSearch analyze test",
      "start_offset" : 0,
      "end_offset" : 23,
      "type" : "word",
      "position" : 0
    }
  ]
}

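Derive the analyzer from an index field

You can pass text and a field in the index. The API looks up the field's analyzer and uses it to analyze the text.

If the mapping does not exist, the API uses the standard analyzer, which converts all text to lowercase and splits it into tokens at word boundaries.

The following request bases the analysis on the mapping for name: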

GET /books2/_analyze
{
  "field" : "name",
  "text" : "OpenSearch analyze test"
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "opensearch",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "analyze",
      "start_offset" : 11,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test",
      "start_offset" : 19,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

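Specify a normalizer

Instead of using a keyword field, you can use the normalizer associated with the index. A normalizer applies its filters to the text but produces a single token.

In this example, the books2 index includes a normalizer called to_lower_fold_ascii that converts text to lowercase and converts non-ASCII text to ASCII.

The following request applies to_lower_fold_ascii to the text: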

GET /books2/_analyze
{
  "normalizer" : "to_lower_fold_ascii",
  "text" : "C'est le garçon qui m'a suivi."
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "c'est le garcon qui m'a suivi.",
      "start_offset" : 0,
      "end_offset" : 30,
      "type" : "word",
      "position" : 0
    }
  ]
}
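To see why the result is a single lowercase ASCII token, the same transformation can be approximated locally with Python's standard library. This is a rough stand-in, not the normalizer's implementation: it only handles characters that decompose into an ASCII base letter plus combining marks, and it silently drops any other non-ASCII character.

```python
import unicodedata


def to_lower_fold_ascii(text: str) -> str:
    """Rough local stand-in for a lowercase + ASCII-folding normalizer."""
    # Decompose accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that is not plain ASCII, which removes the marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


# \u00e7 is the precomposed "c with cedilla" from the request text.
text = "C'est le gar\u00e7on qui m'a suivi."
token = to_lower_fold_ascii(text)
print(token)
# The whole input becomes one token, so its offsets span the full text.
print(0, len(text))
```

The printed offsets, 0 and 30, match the start_offset and end_offset in the response.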

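You can create a custom transient normalizer with token and character filters.

The following request uses the uppercase token filter to convert the given text to all uppercase: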

GET /_analyze
{
  "filter" : ["uppercase"],
  "text" : "That is the boy who followed me."
}

The previous request returns the following fields:

{
  "tokens" : [
    {
      "token" : "THAT IS THE BOY WHO FOLLOWED ME.",
      "start_offset" : 0,
      "end_offset" : 32,
      "type" : "word",
      "position" : 0
    }
  ]
}

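Get token details

You can obtain additional details for all tokens by setting the explain attribute to true.

The following request provides detailed token information for the reverse filter used with the standard tokenizer: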

GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["reverse"],
  "text" : "OpenSearch analyze test",
  "explain" : true,
  "attributes" : ["keyword"] 
}

The previous request returns the following fields:

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "standard",
      "tokens" : [
        {
          "token" : "OpenSearch",
          "start_offset" : 0,
          "end_offset" : 10,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "analyze",
          "start_offset" : 11,
          "end_offset" : 18,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "test",
          "start_offset" : 19,
          "end_offset" : 23,
          "type" : "<ALPHANUM>",
          "position" : 2
        }
      ]
    },
    "tokenfilters" : [
      {
        "name" : "reverse",
        "tokens" : [
          {
            "token" : "hcraeSnepO",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "<ALPHANUM>",
            "position" : 0
          },
          {
            "token" : "ezylana",
            "start_offset" : 11,
            "end_offset" : 18,
            "type" : "<ALPHANUM>",
            "position" : 1
          },
          {
            "token" : "tset",
            "start_offset" : 19,
            "end_offset" : 23,
            "type" : "<ALPHANUM>",
            "position" : 2
          }
        ]
      }
    ]
  }
}
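The detail object records the tokens after each stage, so a client can replay the pipeline stage by stage. The sketch below walks a trimmed copy of the response above; the helper name stages is hypothetical.

```python
import json

# Trimmed copy of the response above: stage names and token text only.
response = json.loads("""
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "standard",
      "tokens": [{"token": "OpenSearch"}, {"token": "analyze"}, {"token": "test"}]
    },
    "tokenfilters": [
      {
        "name": "reverse",
        "tokens": [{"token": "hcraeSnepO"}, {"token": "ezylana"}, {"token": "tset"}]
      }
    ]
  }
}
""")


def stages(detail):
    """Yield (stage name, token texts) for the tokenizer, then each token filter."""
    tokenizer = detail["tokenizer"]
    yield tokenizer["name"], [t["token"] for t in tokenizer["tokens"]]
    for token_filter in detail.get("tokenfilters", []):
        yield token_filter["name"], [t["token"] for t in token_filter["tokens"]]


for name, tokens in stages(response["detail"]):
    print(f"{name}: {tokens}")
```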

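Set a token limit

You can set a limit on the number of tokens generated. Setting a lower value reduces a node's memory usage. The default value is 10000.

The following request limits the tokens to four: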

PUT /books2
{
  "settings" : {
    "index.analyze.max_token_count" : 4
  }
}

The preceding request is an index API rather than an analyze API. For more details, see Dynamic index-level index settings.

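Response body fields

The text analysis endpoints return the following response fields.

Field	Data type	Description
tokens	Array	Array of tokens derived from the `text`. See Token object.
detail	Object	Details about the analysis and each token. Included only when you request token details. See Detail object.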

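Token object

Field	Data type	Description
token	String	The token's text.
start_offset	Integer	The token's starting position within the original text string. Offsets are zero-based.
end_offset	Integer	The token's ending position within the original text string. In the examples on this page, the end offset is exclusive.
type	String	Classification of the token: `<ALPHANUM>`, `<NUM>`, and so on. The tokenizer usually sets the type, but some filters define their own types. For example, the synonym filter defines the `<SYNONYM>` type.
position	Integer	The token's position within the token stream. Positions are zero-based, and gaps can appear where a filter has removed tokens, as in the stop filter example on this page.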
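On the client side, the token object maps naturally onto a small typed record. The following is a sketch under the assumption that each token carries exactly the five fields in the table; a response requested with explain and attributes may carry extra keys, which this strict parser would reject. Token and parse_tokens are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Token:
    token: str
    start_offset: int
    end_offset: int
    type: str
    position: int


def parse_tokens(response: Dict[str, Any]) -> List[Token]:
    """Convert the `tokens` array of a response into Token objects."""
    return [Token(**item) for item in response.get("tokens", [])]


# Response of the keyword analyzer example on this page.
response = {
    "tokens": [
        {
            "token": "OpenSearch analyze test",
            "start_offset": 0,
            "end_offset": 23,
            "type": "word",
            "position": 0,
        }
    ]
}

for tok in parse_tokens(response):
    print(tok.token, tok.start_offset, tok.end_offset, tok.type, tok.position)
```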

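Detail object

Field	Data type	Description
custom_analyzer	Boolean	Whether the analyzer applied to the text is custom or built in.
charfilters	Array	List of character filters applied to the text.
tokenizer	Object	Name of the tokenizer applied to the text and a list of tokens with content before the token filters are applied.
tokenfilters	Array	List of token filters applied to the text. Each token filter includes the filter's name and a list of tokens with content after the filter is applied. Token filters are listed in the order in which they are specified in the request.

For descriptions of the token fields, see Token object.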
