Min hash token filter

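The `min_hash` token filter generates hashes for tokens based on the MinHash approximation algorithm, which is useful for detecting similarity between documents. It operates on a set of tokens, typically the tokens produced from an analyzed field.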
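The idea behind MinHash can be illustrated with a short, self-contained sketch. This is the textbook variant (one minimum per seeded hash function) and is not the filter's internal implementation, whose bucketing scheme is controlled by its own parameters. The helper names `token_hash`, `minhash_signature`, and `estimate_jaccard` are hypothetical and exist only for this illustration.

```python
import hashlib


def token_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes: int = 64):
    """For each seeded hash function, keep the smallest hash over all tokens."""
    unique_tokens = set(tokens)
    if not unique_tokens:
        raise ValueError("at least one token is required")
    return [
        min(token_hash(token, seed) for token in unique_tokens)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(signature_a, signature_b) -> float:
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    if len(signature_a) != len(signature_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(signature_a, signature_b))
    return matches / len(signature_a)


# Illustrative token sets (assumed to be lowercased, whitespace-split words).
tokens_a = "opensearch is a powerful search engine".split()
tokens_b = "opensearch is a very powerful search engine".split()

signature_a = minhash_signature(tokens_a)
signature_b = minhash_signature(tokens_b)
estimate = estimate_jaccard(signature_a, signature_b)
print(f"Estimated Jaccard similarity: {estimate:.3f}")
```

Two documents whose token sets overlap heavily tend to agree in most signature positions, so comparing the short signatures stands in for comparing the full token sets.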

Parameters

The `min_hash` token filter can be configured with the following parameters.

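Parameter	Required/Optional	Data type	Description
`hash_count`	Optional	Integer	The number of hash values to generate for each token. Increasing this value generally improves the accuracy of similarity estimation but increases the computational cost. Default is `1`.
`bucket_count`	Optional	Integer	The number of hash buckets to use. This affects the granularity of the hashing. A larger number of buckets provides finer granularity and reduces hash collisions but requires more memory. Default is `512`.
`hash_set_size`	Optional	Integer	The number of hashes to retain in each bucket. This affects the hashing quality. Larger set sizes may lead to better similarity detection but consume more memory. Default is `1`.
`with_rotation`	Optional	Boolean	When set to `true`, and provided that `hash_set_size` is `1`, the filter populates each empty bucket with the value from the first non-empty bucket found to its circular right. If `bucket_count` is greater than `1`, this setting defaults to `true`; otherwise, it defaults to `false`.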
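The rotation behavior described for `with_rotation` can be pictured with a small sketch. It follows only the wording of the parameter description (fill each empty bucket from the first non-empty bucket to its circular right, with one value per bucket) and is not taken from the filter's source code. The function name `fill_empty_buckets` and the use of `None` to mark an empty bucket are assumptions made for this illustration.

```python
from typing import List, Optional


def fill_empty_buckets(buckets: List[Optional[int]]) -> List[Optional[int]]:
    """Fill each empty bucket (None) with the first non-empty value to its
    circular right, looking only at the original bucket contents."""
    size = len(buckets)
    filled = list(buckets)
    for index in range(size):
        if buckets[index] is not None:
            continue  # bucket already holds a hash
        # Walk right, wrapping around to the start of the list.
        for step in range(1, size):
            candidate = buckets[(index + step) % size]
            if candidate is not None:
                filled[index] = candidate
                break
    return filled


# Buckets 0, 2, and 3 are empty; bucket 0 takes the value of bucket 1,
# buckets 2 and 3 take the value of bucket 4.
example = fill_empty_buckets([None, 5, None, None, 9])
```

Without rotation, short inputs can leave many buckets empty, so fewer positions are available for comparison between documents.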

Example

The following example request creates a new index named `minhash_index` and configures an analyzer with a `min_hash` filter:

PUT /minhash_index
{
  "settings": {
    "analysis": {
      "filter": {
        "minhash_filter": {
          "type": "min_hash",
          "hash_count": 3,
          "bucket_count": 512,
          "hash_set_size": 1,
          "with_rotation": false
        }
      },
      "analyzer": {
        "minhash_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "minhash_filter"
          ]
        }
      }
    }
  }
}
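When experimenting with different parameter values, it can be convenient to build the same request body programmatically. The following is a minimal sketch; the helper `build_minhash_settings` is hypothetical, and the filter and analyzer names simply mirror the request above.

```python
def build_minhash_settings(hash_count=1, bucket_count=512,
                           hash_set_size=1, with_rotation=None):
    """Return an index-creation body that mirrors the example request.

    with_rotation is omitted from the body when None so that the
    filter's own default applies.
    """
    filter_config = {
        "type": "min_hash",
        "hash_count": hash_count,
        "bucket_count": bucket_count,
        "hash_set_size": hash_set_size,
    }
    if with_rotation is not None:
        filter_config["with_rotation"] = with_rotation

    return {
        "settings": {
            "analysis": {
                "filter": {"minhash_filter": filter_config},
                "analyzer": {
                    "minhash_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["minhash_filter"],
                    }
                },
            }
        }
    }


# Same values as the example request.
body = build_minhash_settings(hash_count=3, with_rotation=False)

# Assuming an already configured opensearch-py client named `client`,
# the body could be sent with:
# client.indices.create(index="minhash_index", body=body)
```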

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

POST /minhash_index/_analyze
{
  "analyzer": "minhash_analyzer",
  "text": "OpenSearch is very powerful."
}

The response contains the generated tokens (the tokens are not human readable because they represent hashes):

{
  "tokens" : [
    {
      "token" : "\u0000\u0000㳠锯ੲ걌䐩䉵",
      "start_offset" : 0,
      "end_offset" : 27,
      "type" : "MIN_HASH",
      "position" : 0
    },
    {
      "token" : "\u0000\u0000㳠锯ੲ걌䐩䉵",
      "start_offset" : 0,
      "end_offset" : 27,
      "type" : "MIN_HASH",
      "position" : 0
    },
    ...

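To demonstrate the usefulness of the `min_hash` token filter, you can use the following Python script to compare two strings using the previously created analyzer: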

from opensearchpy import OpenSearch

# Initialize the OpenSearch client with authentication
host = 'https://:9200'  # Update if using a different host/port
auth = ('admin', 'admin')  # Username and password

# Create the OpenSearch client with SSL verification turned off
client = OpenSearch(
    hosts=[host],
    http_auth=auth,
    use_ssl=True,
    verify_certs=False,  # Disable SSL certificate validation
    ssl_show_warn=False  # Suppress SSL warnings in the output
)

# Analyzes text and returns the minhash tokens
def analyze_text(index, text):
    response = client.indices.analyze(
        index=index,
        body={
            "analyzer": "minhash_analyzer",
            "text": text
        }
    )
    return [token['token'] for token in response['tokens']]

# Analyze two similar texts
tokens_1 = analyze_text('minhash_index', 'OpenSearch is a powerful search engine.')
tokens_2 = analyze_text('minhash_index', 'OpenSearch is a very powerful search engine.')

# Calculate Jaccard similarity
set_1 = set(tokens_1)
set_2 = set(tokens_2)
shared_tokens = set_1.intersection(set_2)
jaccard_similarity = len(shared_tokens) / len(set_1.union(set_2))

print(f"Jaccard Similarity: {jaccard_similarity}")

The output should contain the Jaccard similarity score:

Jaccard Similarity: 0.8571428571428571
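The Jaccard similarity computed by the script is the size of the intersection of the two token sets divided by the size of their union. A score of about 0.857 is consistent with, for example, six shared tokens out of seven distinct tokens (6/7), although the actual token counts are not shown here. The sketch below isolates that calculation so it can be tried without a cluster; the function name `jaccard_similarity` and the placeholder token values are assumptions for illustration, and the empty-input guard is an addition that the script does not include.

```python
def jaccard_similarity(tokens_a, tokens_b) -> float:
    """Return |A ∩ B| / |A ∪ B| for two collections of tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # both inputs empty: avoid division by zero
    return len(set_a & set_b) / len(union)


# Placeholder values standing in for MinHash tokens.
tokens_a = ["h1", "h2", "h3", "h4", "h5", "h6"]
tokens_b = ["h1", "h2", "h3", "h4", "h5", "h6", "h7"]

# Six shared tokens out of seven distinct tokens.
score = jaccard_similarity(tokens_a, tokens_b)
print(f"Jaccard Similarity: {score}")
```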
