More like this

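Use the more_like_this query to find documents that are similar to one or more given documents. This is useful for recommendation engines, content discovery, and identifying related items in a dataset.

The more_like_this query analyzes the input documents or text and selects the terms that best characterize them. It then searches for other documents that contain those significant terms.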
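The term selection step described above can be illustrated with a small, self-contained sketch. This is a simplified TF-IDF ranking written for illustration only, not OpenSearch's actual implementation. The `select_terms` function, its whitespace tokenizer, and the exact scoring formula are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, stripping basic punctuation (illustrative tokenizer)."""
    return [t.strip(".,!?").lower() for t in text.split() if t.strip(".,!?")]


def select_terms(like_text, corpus, max_query_terms=25, min_term_freq=1, min_doc_freq=1):
    """Rank the terms of like_text by a simple TF-IDF score and keep the top ones."""
    term_freq = Counter(tokenize(like_text))
    doc_tokens = [set(tokenize(doc)) for doc in corpus]
    scored = []
    for term, tf in term_freq.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for tokens in doc_tokens if term in tokens)
        if tf < min_term_freq or df < min_doc_freq:
            continue
        idf = math.log(1 + len(corpus) / df)
        scored.append((tf * idf, term))
    # Highest score first; ties are broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


corpus = [
    "Sand dunes and vast landscapes.",
    "Dense jungle and exotic wildlife.",
    "Snowy peaks and hiking trails.",
]
terms = select_terms("jungle and wildlife", corpus)
print(terms)
```

In this sketch, rare terms such as jungle and wildlife rank above the common word and, which mirrors the idea that the query favors terms that characterize the input.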

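Prerequisites

Before using a more_like_this query, make sure that the fields you target are indexed and have a data type of text or keyword.

If you reference documents in the like section, OpenSearch needs access to their content. This is typically done through the _source field, which is enabled by default. If _source is disabled, you must either store the fields separately or configure them to save term_vector data.

Saving term_vector information when indexing documents can greatly speed up more_like_this queries because the engine can retrieve the significant terms directly instead of reanalyzing the field text at query time.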

Example: No term vector optimization

Create an index named articles-basic using the following mapping:

PUT /articles-basic
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" }
    }
  }
}

Add sample documents:

POST /articles-basic/_bulk
{ "index": { "_id": 1 }}
{ "title": "Exploring the Sahara Desert", "content": "Sand dunes and vast landscapes." }
{ "index": { "_id": 2 }}
{ "title": "Amazon Rainforest Tour", "content": "Dense jungle and exotic wildlife." }
{ "index": { "_id": 3 }}
{ "title": "Mountain Adventures", "content": "Snowy peaks and hiking trails." }

Query the index using the following request:

GET /articles-basic/_search
{
  "query": {
    "more_like_this": {
      "fields": ["content"],
      "like": "jungle wildlife",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}

The more_like_this query searches the content field for the terms jungle and wildlife and matches only one document:

{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.9616582,
    "hits": [
      {
        "_index": "articles-basic",
        "_id": "2",
        "_score": 1.9616582,
        "_source": {
          "title": "Amazon Rainforest Tour",
          "content": "Dense jungle and exotic wildlife."
        }
      }
    ]
  }
}

Example: Term vector optimization

Create an index named articles-optimized using the following mapping:

PUT /articles-optimized
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      },
      "alias": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      },
      "quote": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}

Insert sample documents into the optimized index:

POST /articles-optimized/_bulk
{ "index": { "_id": "a1" } }
{ "name": "Diana", "alias": "Wonder Woman", "quote": "Justice will come when it is deserved." }
{ "index": { "_id": "a2" } }
{ "name": "Clark", "alias": "Superman", "quote": "Even in the darkest times, hope cuts through." }
{ "index": { "_id": "a3" } }
{ "name": "Bruce", "alias": "Batman", "quote": "I am vengeance. I am the night. I am Batman!" }

Find documents whose quote field contains terms similar to "dark" and "night":

GET /articles-optimized/_search
{
  "query": {
    "more_like_this": {
      "fields": ["quote"],
      "like": "dark night",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}

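The more_like_this query searches for the terms dark and night and returns the following match. Only the third document matches because its quote contains night. The word darkest in the second document is a different term from dark when the default analyzer is used, so that document is not returned: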

{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.2363393,
    "hits": [
      {
        "_index": "articles-optimized",
        "_id": "a3",
        "_score": 1.2363393,
        "_source": {
          "name": "Bruce",
          "alias": "Batman",
          "quote": "I am vengeance. I am the night. I am Batman!"
        }
      }
    ]
  }
}

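Example: Using multiple documents and text input

The more_like_this query lets you provide multiple sources in the like parameter. You can combine free text with documents from an index. This is useful when you want the search to combine relevance signals from several examples.

In the following example, a custom document is provided directly. In addition, the existing document with ID 5 from the heroes index is included: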

GET /articles-optimized/_search
{
  "query": {
    "more_like_this": {
      "fields": ["name", "alias"],
      "like": [
        {
          "doc": {
            "name": "Diana",
            "alias": "Wonder Woman",
            "quote": "Courage is not the absence of fear, but the triumph over it."
          }
        },
        {
          "_index": "heroes",
          "_id": "5"
        }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "max_query_terms": 25
    }
  }
}

The results contain the documents that are most similar to the name and alias fields provided in the query:

{
  ...
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 2.140194,
    "hits": [
      {
        "_index": "articles-optimized",
        "_id": "a1",
        "_score": 2.140194,
        "_source": {
          "name": "Diana",
          "alias": "Wonder Woman",
          "quote": "Justice will come when it is deserved."
        }
      },
      {
        "_index": "articles-optimized",
        "_id": "a2",
        "_score": 1.1596459,
        "_source": {
          "name": "Clark",
          "alias": "Superman",
          "quote": "Even in the darkest times, hope cuts through."
        }
      }
    ]
  }
}

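Use this pattern when you want to boost results based on a new concept that is not yet fully indexed and combine it with knowledge from existing indexed documents.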

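Parameters

The only required parameter for the more_like_this query is like. The remaining parameters have default values but allow fine-tuning. The main parameter categories are described in the following sections.

Document input parameters

The following table lists the document input parameters.

Parameter	Required/Optional	Data type	Description
`like`	Required	Array of strings or objects	Defines the text or documents for which to find similar documents. You can provide free text, real documents from the index, or artificial documents. The text is processed by the analyzer associated with the field unless overridden.
`unlike`	Optional	Array of strings or objects	Provides text or documents whose terms should be excluded from influencing the query. Useful for specifying negative examples.
`fields`	Optional	Array of strings	Lists the fields to use when analyzing the text. If not specified, all fields are used.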
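As a rough illustration of how the document input parameters fit together, the following sketch assembles a request body that combines like, unlike, and fields. The `build_mlt_query` helper is hypothetical and exists only for this example. The field name and the like text come from the articles-basic example, while the unlike text "desert sand" is an assumed negative example.

```python
import json


def build_mlt_query(fields, like, unlike=None, **options):
    """Assemble a more_like_this request body as a plain dictionary (hypothetical helper)."""
    mlt = {"fields": list(fields), "like": like}
    if unlike is not None:
        # Terms taken from the unlike input should not influence the query.
        mlt["unlike"] = unlike
    # Any remaining keyword arguments are passed through as query parameters.
    mlt.update(options)
    return {"query": {"more_like_this": mlt}}


body = build_mlt_query(
    ["content"],
    "jungle wildlife",
    unlike="desert sand",
    min_term_freq=1,
    min_doc_freq=1,
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a GET /articles-basic/_search request, in the same way as the earlier examples.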

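Term selection parameters

The following table lists the term selection parameters.

Parameter	Required/Optional	Data type	Description
`max_query_terms`	Optional	Integer	Sets the maximum number of terms selected from the input. Higher values increase precision but slow down execution. Default is `25`.
`min_term_freq`	Optional	Integer	Terms that appear in the input fewer times than this value are ignored. Default is `2`.
`min_doc_freq`	Optional	Integer	Terms that appear in fewer documents than this value are ignored. Default is `5`.
`max_doc_freq`	Optional	Integer	Terms that appear in more documents than this limit are ignored. Useful for avoiding very common words. Default is unlimited (2³¹ - 1).
`min_word_length`	Optional	Integer	Words shorter than this value are ignored. Default is `0`.
`max_word_length`	Optional	Integer	Words longer than this value are ignored. Default is unlimited.
`stop_words`	Optional	Array of strings	Defines a list of words that are ignored entirely when selecting terms.
`analyzer`	Optional	String	A custom analyzer used to process the input text. Defaults to the analyzer of the first field listed in `fields`.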
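The thresholds in this table act as filters on candidate terms. The following sketch, which is an illustration rather than the engine's actual code, applies them with the documented defaults. The `keep_term` function is hypothetical, and treating a `max_word_length` of `0` as "no limit" is a convention assumed for this example.

```python
def keep_term(term, term_freq, doc_freq, *,
              min_term_freq=2, min_doc_freq=5, max_doc_freq=2**31 - 1,
              min_word_length=0, max_word_length=0, stop_words=()):
    """Return True if a candidate term passes all selection thresholds (illustrative)."""
    if term in stop_words:
        return False
    if term_freq < min_term_freq:
        return False
    if doc_freq < min_doc_freq or doc_freq > max_doc_freq:
        return False
    if len(term) < min_word_length:
        return False
    # In this sketch, 0 means that no upper length limit is applied.
    if max_word_length and len(term) > max_word_length:
        return False
    return True


# "jungle" appears once in the input and in one indexed document.
with_defaults = keep_term("jungle", term_freq=1, doc_freq=1)
relaxed = keep_term("jungle", term_freq=1, doc_freq=1, min_term_freq=1, min_doc_freq=1)
print(with_defaults, relaxed)
```

With the default values, a term that occurs once in a three-document index is filtered out. This is why the examples on this page set `min_term_freq` and `min_doc_freq` to `1`.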

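Query formation parameters

The following table lists the query formation parameters.

Parameter	Required/Optional	Data type	Description
`minimum_should_match`	Optional	String	Specifies the minimum number of terms that must match in the final query. The value can be a percentage or a fixed number. Helps fine-tune the balance between recall and precision. Default is `30%`.
`fail_on_unsupported_field`	Optional	Boolean	Determines whether an error is thrown if one of the target fields is not a compatible type (`text` or `keyword`). Set to `false` to silently skip unsupported fields. Default is `true`.
`boost_terms`	Optional	Float	Applies a boost to the selected terms based on their term frequency-inverse document frequency (TF-IDF) weight. Any value greater than `0` activates term boosting with the specified factor. Default is `0`.
`include`	Optional	Boolean	If `true`, the source documents provided in `like` are included in the result hits. Default is `false`.
`boost`	Optional	Float	Multiplies the relevance score of the entire `more_like_this` query. Default is `1.0`.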
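To make the effect of `minimum_should_match` more concrete, the following sketch converts a percentage or a fixed number into a required term count. It assumes that percentages are rounded down and handles only non-negative values. The `required_matches` function is hypothetical and does not cover every syntax the parameter accepts.

```python
def required_matches(num_terms, minimum_should_match="30%"):
    """Convert a minimum_should_match value into a required term count (illustrative)."""
    spec = minimum_should_match.strip()
    if spec.endswith("%"):
        # Assumption: the number computed from a percentage is rounded down.
        return num_terms * int(spec[:-1]) // 100
    # A fixed number cannot exceed the number of selected terms.
    return min(int(spec), num_terms)


default_for_25_terms = required_matches(25)
fixed_two_of_ten = required_matches(10, "2")
print(default_for_25_terms, fixed_two_of_ten)
```

Under these assumptions, the default of `30%` applied to the maximum of 25 selected terms requires 7 matching terms, so raising the percentage trades recall for precision.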
