Elastic Stack: The 24th Level

12th iThome Ironman Contest (鐵人賽)

DAY 24

Elastic Stack on Cloud

Part 24 of the series "Elastic Stack 武學練功坊" (Elastic Stack Martial Arts Training Hall)

沉思者

2020-10-09 14:16:43

Query DSL Part V (query syntax)

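This post covers the remaining queries in the Query DSL.
I focus on the ones I have used or find particularly interesting; a query that is not covered here is not necessarily unimportant.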

Specialized queries

more_like_this query
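Think of it as running a "like" search to find similar documents. It is abbreviated as MLT below.
MLT selects a set of representative terms from the input, builds a query out of those terms, runs it, and returns the results.

Finding similar documents based on a given string: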

GET /_search
{
  "query": {
    "more_like_this" : {
      "fields" : ["title", "description"], (1)
      "like" : "Once upon a time", (2)
      "min_term_freq" : 1, (3)
      "max_query_terms" : 12 (4)
    }
  }
}

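(1): The fields to search
(2): like is the only required parameter of the MLT query. Its syntax is flexible: it accepts free text, one or more documents, or even documents that do not exist in the index, and these forms can be mixed
(3): Terms that occur in the input document fewer times than this value are ignored (default: 2)
(4): The maximum number of query terms to select. Raising this value improves accuracy at the cost of query execution speed (default: 25)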

The request above matches the string "Once upon a time" against the title and description fields and returns the documents that are similar to it.
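Note (2) mentions that like can mix free text with documents. The following is a minimal Python sketch of such a request body, not a definitive client implementation. The helper name build_mlt_query, the index name articles, the document ID 1, and the field values are all hypothetical; the _index / _id / doc keys follow the shape the Elasticsearch documentation describes for indexed and artificial documents.

```python
import json


def build_mlt_query(fields, like_items, min_term_freq=2, max_query_terms=25):
    """Build a more_like_this request body as a plain dict.

    The keyword defaults mirror the default values listed in the notes.
    """
    if not like_items:
        # `like` is the only required parameter of the query.
        raise ValueError("`like` is required and must not be empty")
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": list(like_items),
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        }
    }


# Hypothetical inputs: free text, a document that is already indexed,
# and an "artificial" document that is not stored in any index.
like_items = [
    "Once upon a time",
    {"_index": "articles", "_id": "1"},
    {"_index": "articles", "doc": {"title": "A fairy tale", "description": "Long ago"}},
]

body = build_mlt_query(
    ["title", "description"], like_items, min_term_freq=1, max_query_terms=12
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET /_search in Kibana Console, provided the hypothetical index and ID are replaced with real ones.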

rank_feature query
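Boosts the relevance score of matching documents based on the value of a rank_feature or rank_features field.
A rank_feature query is usually placed in the should clause of a bool query, so that its score is added to the scores of the other clauses in the bool query.
[Note] When the track_total_hits parameter is not set to true, the query skips non-competitive hits, which can dramatically improve query speed.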

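[Example]
Create a test index with the following mappings:

pagerank: a rank_feature field that measures the importance of a website
url_length: a rank_feature field that stores the length of the URL. In this example a longer URL correlates negatively with relevance, so the field's positive_score_impact parameter is set to false
topics: a rank_features field that records a set of topics and how strongly each topic relates to the document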

example mappings

PUT /test
{
  "mappings": {
    "properties": {
      "pagerank": {
        "type": "rank_feature"
      },
      "url_length": {
        "type": "rank_feature",
        "positive_score_impact": false
      },
      "topics": {
        "type": "rank_features"
      }
    }
  }
}

insert sample data

PUT /test/_doc/1?refresh
{
  "url": "https://en.wikipedia.org/wiki/2016_Summer_Olympics",
  "content": "Rio 2016",
  "pagerank": 50.3,
  "url_length": 42,
  "topics": {
    "sports": 50,
    "brazil": 30
  }
}

PUT /test/_doc/2?refresh
{
  "url": "https://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
  "content": "Formula One motor race held on 13 November 2016",
  "pagerank": 50.3,
  "url_length": 47,
  "topics": {
    "sports": 35,
    "formula one": 65,
    "brazil": 20
  }
}

PUT /test/_doc/3?refresh
{
  "url": "https://en.wikipedia.org/wiki/Deadpool_(film)",
  "content": "Deadpool is a 2016 American superhero film",
  "pagerank": 50.3,
  "url_length": 37,
  "topics": {
    "movies": 60,
    "super hero": 65
  }
}

Use rank_feature Query

GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "2016"
          }
        }
      ],
      "should": [
        {
          "rank_feature": {
            "field": "pagerank" (1)
          }
        },
        {
          "rank_feature": {
            "field": "url_length", (1)
            "boost": 0.1 (2)
          }
        },
        {
          "rank_feature": {
            "field": "topics.sports", (1)
            "boost": 0.4 (2)
          }
        }
      ]
    }
  }
}

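Finds documents whose content field contains 2016, and boosts their scores using pagerank, url_length, and topics.sports.

(1): The field parameter (required): the field used to boost the score. It must be a rank_feature or rank_features field
(2): The boost parameter (default: 1.0). A value between 0 and 1.0 decreases the relevance score; a value greater than 1.0 increases it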
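To make the role of boost concrete, here is a small Python sketch of how the clauses in the request add up, assuming each rank_feature clause contributes its function result multiplied by its boost and the bool query sums the matching clauses. The function name combined_score and every number below are hypothetical, chosen only for illustration; real scores depend on the index contents.

```python
def combined_score(must_score, should_clauses):
    """Add the must-clause score to each boosted should-clause score.

    should_clauses is a list of (feature_score, boost) pairs, where
    feature_score stands for the unboosted result of a rank feature function.
    """
    return must_score + sum(boost * score for score, boost in should_clauses)


# Hypothetical values: a match score of 1.0 and three feature scores of 0.5,
# weighted with the boosts from the request (1.0 by default, 0.1, and 0.4).
total = combined_score(1.0, [(0.5, 1.0), (0.5, 0.1), (0.5, 0.4)])
print(total)
```

With equal feature scores, pagerank (default boost) moves the total the most, topics.sports less, and url_length the least, which is the effect the boost values in the request are meant to have.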

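Rank feature functions are the mathematical functions the rank_feature query uses to compute the score.
The main ones are:

Saturation
Logarithm
Sigmoid

If you are not sure which one to choose, use Saturation, which is also the default.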
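As a rough illustration, the sketch below writes out in Python the formulas the Elasticsearch documentation gives for the saturation, log, and sigmoid functions. The Python function names are my own; pivot, scaling_factor, and exponent are the parameter names of the query, and the sample values passed in are hypothetical.

```python
import math


def saturation(value, pivot):
    """S / (S + pivot): always below 1, and exactly 0.5 when S equals pivot."""
    return value / (value + pivot)


def log_score(value, scaling_factor):
    """log(scaling_factor + S): grows without an upper bound."""
    return math.log(scaling_factor + value)


def sigmoid(value, pivot, exponent):
    """S^exp / (S^exp + pivot^exp): an S-shaped curve that is 0.5 at the pivot."""
    powered = value ** exponent
    return powered / (powered + pivot ** exponent)


# Hypothetical feature values, to compare how the three curves behave.
for s in (1.0, 8.0, 64.0):
    print(s, saturation(s, pivot=8), log_score(s, scaling_factor=4), sigmoid(s, pivot=8, exponent=0.6))
```

Saturation and sigmoid stay between 0 and 1, so very large feature values cannot dominate the total score, which is one reason saturation is a reasonable default choice.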

Some additional background:
rank_feature field type
This field stores numbers that a rank_feature query can later use to boost document scores.

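Things to note:

The field value must be a single value (arrays are not accepted), and the number must be positive
This field type can only be used in rank_feature queries; it does not support sorting, aggregations, or other queries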

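This field type has a positive_score_impact parameter that controls whether the field value correlates positively or negatively with relevance. The default is true, meaning a positive correlation.
The correlation affects the formula the rank_feature query uses to compute the relevance score: with a positive correlation the score rises as the field value rises, and with a negative correlation the score falls as the field value grows.

Take the url_length field from the example above: because it is set to a negative correlation, the longer the URL (that is, the larger the url_length value), the lower the score.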
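The following Python sketch illustrates the constraints and the negative correlation described above. It assumes that a field with positive_score_impact set to false is scored through the reciprocal of its value, which is my understanding of the underlying behavior and should be treated as an assumption. The helper names and the pivot value are hypothetical; the three URL lengths come from the sample documents.

```python
def indexed_feature(value, positive_score_impact=True):
    """Return the number a scoring function would see for one field value."""
    if isinstance(value, (list, tuple)):
        raise ValueError("rank_feature accepts a single value, not an array")
    if value <= 0:
        raise ValueError("rank_feature values must be positive")
    # Assumption: a negatively correlated feature is scored via its reciprocal.
    return float(value) if positive_score_impact else 1.0 / value


def saturation(feature, pivot):
    """Saturation function: feature / (feature + pivot)."""
    return feature / (feature + pivot)


PIVOT = 1.0 / 42  # hypothetical pivot, picked only for this illustration

# url_length values of the three sample documents, shortest first.
for url_length in (37, 42, 47):
    feature = indexed_feature(url_length, positive_score_impact=False)
    print(url_length, round(saturation(feature, PIVOT), 3))
```

The printed scores shrink as url_length grows, which matches the behavior described for a negatively correlated field.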