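Using semantic highlighting
Semantic highlighting enhances search results by identifying and emphasizing the sentences or passages in a document that are most semantically relevant to the query. Unlike traditional highlighters, which rely on exact keyword matches, semantic highlighting uses machine learning (ML) models to understand the context and relevance of text segments. This lets you pinpoint the most pertinent information in a document even when the exact search terms do not appear in the highlighted passage. For more information, see Using the semantic highlighter.
This tutorial guides you through setting up and using semantic highlighting with neural search queries.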
Replace the placeholders prefixed with your- (for example, your-text-embedding-model-id) with your own values.
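Prerequisites
To ensure that the basic local setup works, specify the following cluster settings:
PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_url": "true",
    "plugins.ml_commons.only_run_on_ml_node": "false",
    "plugins.ml_commons.model_access_control_enabled": "true"
  }
}
This example uses a simple setup with no dedicated ML nodes and allows models to run on non-ML nodes. On a cluster with dedicated ML nodes, specify "plugins.ml_commons.only_run_on_ml_node": "true" for improved performance. For more information, see ML Commons cluster settings.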
Step 1: Create an index
First, create an index to store your text data and its corresponding vector embeddings. You need a text field for the original content and a knn_vector field for the embeddings:
PUT neural-search-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "text_embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}
The dimension field must match the output dimension of the embedding model you choose.
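As a minimal sketch of keeping the mapping and the model in sync, the following Python helper rebuilds the Step 1 index body from a single dimension value and checks a vector against it. The function names build_index_body and check_embedding are hypothetical, and the value 384 assumes the all-MiniLM-L6-v2 model used later in this tutorial, which produces 384-dimensional vectors.

```python
# Assumption: 384 is the output dimension of all-MiniLM-L6-v2.
# Change it if you choose a different embedding model.
EMBEDDING_DIMENSION = 384


def build_index_body(dimension: int) -> dict:
    """Return the Step 1 index settings and mappings for a given dimension."""
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "text_embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": "faiss",
                        "parameters": {"ef_construction": 128, "m": 24},
                    },
                },
            }
        },
    }


def check_embedding(vector: list, index_body: dict) -> None:
    """Raise ValueError if the vector length differs from the mapped dimension."""
    expected = index_body["mappings"]["properties"]["text_embedding"]["dimension"]
    if len(vector) != expected:
        raise ValueError(f"expected {expected} dimensions, got {len(vector)}")
```

The returned dictionary can be serialized with json.dumps and sent as the body of the PUT neural-search-index request by any HTTP client.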
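Step 2: Register and deploy the ML models
Semantic highlighting requires two types of models:
- Text embedding model: Converts the search query and document text into vectors.
- Sentence highlighting model: Analyzes the text and identifies the most relevant sentences.
First, register and deploy a text embedding model:
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "huggingface/sentence-transformers/all-MiniLM-L6-v2",
  "version": "1.0.2",
  "model_format": "TORCH_SCRIPT"
}
This API returns a task_id for the deployment operation. Use the Tasks API to monitor the deployment status:
GET /_plugins/_ml/tasks/<your-task-id>
Once the state changes to COMPLETED, the Tasks API returns the model ID of the deployed model. Note the text embedding model ID; you will use it in the following steps.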
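Monitoring the task amounts to polling until the state is COMPLETED. The following Python sketch shows one way to do that. It is not part of the OpenSearch API: fetch_task is a hypothetical callable that you supply, expected to return the parsed JSON body of GET /_plugins/_ml/tasks/<your-task-id>, and treating FAILED as the failure state is an assumption about the task response.

```python
import time
from typing import Callable


def wait_for_model_id(
    fetch_task: Callable[[], dict],
    poll_seconds: float = 2.0,
    max_attempts: int = 30,
) -> str:
    """Poll a task until its state is COMPLETED and return its model ID.

    fetch_task is a hypothetical callable returning the parsed Tasks API
    response; how it performs the HTTP request is up to the caller.
    """
    for _ in range(max_attempts):
        task = fetch_task()
        state = task.get("state")
        if state == "COMPLETED":
            return task["model_id"]
        # Assumption: a failed deployment reports the state FAILED.
        if state == "FAILED":
            raise RuntimeError(f"deployment failed: {task.get('error')}")
        time.sleep(poll_seconds)
    raise TimeoutError("task did not reach COMPLETED in time")
```

Passing the fetch function in as an argument keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.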
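Next, register and deploy a pretrained semantic sentence highlighting model:
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/sentence-highlighting/opensearch-semantic-highlighter-v1",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "function_name": "QUESTION_ANSWERING"
}
Use the Tasks API to monitor the deployment status. Note the semantic highlighting model ID; you will use it in the following steps.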
Step 3 (Optional): Configure an ingest pipeline
To generate embeddings automatically during indexing, create an ingest pipeline:
PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "A pipeline to generate text embeddings",
  "processors": [
    {
      "text_embedding": {
        "model_id": "your-text-embedding-model-id",
        "field_map": {
          "text": "text_embedding"
        }
      }
    }
  ]
}
Set this pipeline as the default pipeline for the index:
PUT /neural-search-index/_settings
{
  "index.default_pipeline": "nlp-ingest-pipeline"
}
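Step 4: Index data
Now index some sample documents. If you configured the ingest pipeline, the embeddings are generated automatically. If you skipped Step 3, you need to supply the text_embedding vectors yourself, because the neural query in Step 5 searches that field.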
POST /neural-search-index/_doc/1
{
  "text": "Alzheimer's disease is a progressive neurodegenerative disorder characterized by accumulation of amyloid-beta plaques and neurofibrillary tangles in the brain. Early symptoms include short-term memory impairment, followed by language difficulties, disorientation, and behavioral changes. While traditional treatments such as cholinesterase inhibitors and memantine provide modest symptomatic relief, they do not alter disease progression. Recent clinical trials investigating monoclonal antibodies targeting amyloid-beta, including aducanumab, lecanemab, and donanemab, have shown promise in reducing plaque burden and slowing cognitive decline. Early diagnosis using biomarkers such as cerebrospinal fluid analysis and PET imaging may facilitate timely intervention and improved outcomes."
}
POST /neural-search-index/_doc/2
{
  "text": "Major depressive disorder is characterized by persistent feelings of sadness, anhedonia, and neurovegetative symptoms affecting sleep, appetite, and energy levels. First-line pharmacological treatments include selective serotonin reuptake inhibitors (SSRIs) and serotonin-norepinephrine reuptake inhibitors (SNRIs), with response rates of approximately 60-70%. Cognitive-behavioral therapy demonstrates comparable efficacy to medication for mild to moderate depression and may provide more durable benefits. Treatment-resistant depression may respond to augmentation strategies including atypical antipsychotics, lithium, or thyroid hormone. Electroconvulsive therapy remains the most effective intervention for severe or treatment-resistant depression, while newer modalities such as transcranial magnetic stimulation and ketamine infusion offer promising alternatives with fewer side effects."
}
POST /neural-search-index/_doc/3
{
  "text": "Cardiovascular disease remains the leading cause of mortality worldwide, accounting for approximately one-third of all deaths. Risk factors include hypertension, diabetes mellitus, smoking, obesity, and family history. Recent advancements in preventive cardiology emphasize lifestyle modifications such as Mediterranean diet, regular exercise, and stress reduction techniques. Pharmacological interventions including statins, beta-blockers, and ACE inhibitors have significantly reduced mortality rates. Emerging treatments focus on inflammation modulation and precision medicine approaches targeting specific genetic profiles associated with cardiac pathologies."
}
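When there are many documents, a single request to the Bulk API can replace the individual indexing requests. The following Python sketch builds the newline-delimited body for POST /_bulk. The helper name build_bulk_body and the choice of sequential document IDs starting at 1 are assumptions made for illustration.

```python
import json


def build_bulk_body(index: str, texts: list) -> str:
    """Build a newline-delimited Bulk API body that indexes each text.

    Each document takes two lines: an action line naming the index and ID,
    followed by the document source. IDs are assigned sequentially from 1.
    """
    lines = []
    for doc_id, text in enumerate(texts, start=1):
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps({"text": text}))
    # The Bulk API requires the body to end with a newline character.
    return "\n".join(lines) + "\n"
```

If the default pipeline from Step 3 is set on the index, documents indexed through the Bulk API pass through it as well, so the embeddings are still generated automatically.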
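Step 5: Perform semantic highlighting
To combine a neural search query with the semantic highlighter, do the following:
- Use a neural query to find documents that are semantically similar to your query text, using the text embedding model.
- Add a highlight section.
- In highlight.fields, specify the text field (or another field that contains the content you want to highlight).
- Set the type of this field to semantic.
- Add a global highlight.options object.
- In options, provide the model_id of your deployed sentence highlighting model.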
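Use the following request to retrieve the top two matching documents (specified in the k parameter). Replace the placeholder model IDs (<your-text-embedding-model-id> and <your-semantic-highlighting-model-id>) with the model IDs obtained after successful deployment in Step 2. The request excludes the large text_embedding field from the returned _source: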
POST /neural-search-index/_search
{
  "_source": {
    "excludes": ["text_embedding"]
  },
  "query": {
    "neural": {
      "text_embedding": {
        "query_text": "treatments for neurodegenerative diseases",
        "model_id": "<your-text-embedding-model-id>",
        "k": 2
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "type": "semantic"
      }
    },
    "options": {
      "model_id": "<your-semantic-highlighting-model-id>"
    }
  }
}
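If you send this request from application code, it can help to assemble the body from parameters so that the two model IDs are never confused. The following Python sketch mirrors the request above. The function name build_semantic_highlight_query and its parameter names are hypothetical; only the resulting JSON structure comes from this tutorial.

```python
def build_semantic_highlight_query(
    query_text: str,
    embedding_model_id: str,
    highlight_model_id: str,
    k: int = 2,
    text_field: str = "text",
    vector_field: str = "text_embedding",
) -> dict:
    """Return a search body combining a neural query with semantic highlighting."""
    return {
        # Keep the large embedding vector out of the returned documents.
        "_source": {"excludes": [vector_field]},
        "query": {
            "neural": {
                vector_field: {
                    "query_text": query_text,
                    # The text embedding model encodes the query text.
                    "model_id": embedding_model_id,
                    "k": k,
                }
            }
        },
        "highlight": {
            "fields": {text_field: {"type": "semantic"}},
            # The sentence highlighting model selects the sentences to tag.
            "options": {"model_id": highlight_model_id},
        },
    }
```

Note that the neural query and the highlight options take different model IDs: the first belongs to the text embedding model and the second to the sentence highlighting model.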
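Step 6: Interpret the results
Each hit in the search results contains a highlight object. Within the highlight object, the text field you specified contains the original text, with the most semantically relevant sentences wrapped in <em> tags by default: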
{
  "took": 711,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.52716815,
    "hits": [
      {
        "_index": "neural-search-index",
        "_id": "1",
        "_score": 0.52716815,
        "_source": {
          "text": "Alzheimer's disease is a progressive neurodegenerative disorder ..." // Shortened for brevity
        },
        "highlight": {
          "text": [
            // Highlighted sentence may differ based on the exact model used
            "Alzheimer's disease is a progressive neurodegenerative disorder characterized by accumulation of amyloid-beta plaques and neurofibrillary tangles in the brain. Early symptoms include short-term memory impairment, followed by language difficulties, disorientation, and behavioral changes. While traditional treatments such as cholinesterase inhibitors and memantine provide modest symptomatic relief, they do not alter disease progression. <em>Recent clinical trials investigating monoclonal antibodies targeting amyloid-beta, including aducanumab, lecanemab, and donanemab, have shown promise in reducing plaque burden and slowing cognitive decline.</em> Early diagnosis using biomarkers such as cerebrospinal fluid analysis and PET imaging may facilitate timely intervention and improved outcomes."
          ]
        }
      },
      {
        "_index": "neural-search-index",
        "_id": "2",
        "_score": 0.4364841,
        "_source": {
          "text": "Major depressive disorder is characterized by persistent feelings of sadness ..." // Shortened for brevity
        },
        "highlight": {
          "text": [
            // Highlighted sentence for document 2
            "Major depressive disorder is characterized by persistent feelings of sadness, anhedonia, and neurovegetative symptoms affecting sleep, appetite, and energy levels. First-line pharmacological treatments include selective serotonin reuptake inhibitors (SSRIs) and serotonin-norepinephrine reuptake inhibitors (SNRIs), with response rates of approximately 60-70%. <em>Cognitive-behavioral therapy demonstrates comparable efficacy to medication for mild to moderate depression and may provide more durable benefits.</em> Treatment-resistant depression may respond to augmentation strategies including atypical antipsychotics, lithium, or thyroid hormone. Electroconvulsive therapy remains the most effective intervention for severe or treatment-resistant depression, while newer modalities such as transcranial magnetic stimulation and ketamine infusion offer promising alternatives with fewer side effects."
          ]
        }
      }
    ]
  }
}
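The semantic highlighter identifies the sentences that the model considers semantically relevant to the query ("treatments for neurodegenerative diseases") in the context of each retrieved document. If needed, you can customize the highlight tags using the pre_tags and post_tags parameters. For more information, see Changing the highlighting tags.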
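To use the highlighted sentences in an application, you typically extract the text between the tags. The following Python sketch pulls the tagged sentences out of a parsed search response. The function names extract_highlights and highlights_by_document are hypothetical. The tag arguments default to <em> and </em>; if you customized the tags with pre_tags and post_tags, pass the same values here.

```python
import re


def extract_highlights(fragment: str, pre_tag: str = "<em>", post_tag: str = "</em>") -> list:
    """Return every substring of fragment enclosed by pre_tag and post_tag."""
    pattern = re.escape(pre_tag) + r"(.*?)" + re.escape(post_tag)
    return re.findall(pattern, fragment, flags=re.DOTALL)


def highlights_by_document(
    response: dict,
    field: str = "text",
    pre_tag: str = "<em>",
    post_tag: str = "</em>",
) -> dict:
    """Map each hit's _id to the list of highlighted sentences in the given field."""
    result = {}
    for hit in response["hits"]["hits"]:
        # A hit without a highlight object yields an empty list.
        fragments = hit.get("highlight", {}).get(field, [])
        sentences = []
        for fragment in fragments:
            sentences.extend(extract_highlights(fragment, pre_tag, post_tag))
        result[hit["_id"]] = sentences
    return result
```

Applied to the response in Step 6, highlights_by_document would map each document ID to the single sentence wrapped in <em> tags for that document.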