2024 Ironman Contest (鐵人賽) Day15: The combination of two

2024 iThome Ironman Contest (鐵人賽)

Self-Challenge track

"Restarting elasticsearch" series, part 14

16th Ironman Contest

kimcheng

2024-09-30 22:45:19

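The previous two posts implemented "completing the word that is still being typed" and "suggesting a follow-up word based on the completed word". Next we use python to combine the two into the final output. If you are not familiar with python, you can read the comments, or the explanation that follows the code: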

import re

import elasticsearch

es_cli = elasticsearch.Elasticsearch("http://localhost:9200")

kw_phrase = "covid"
index_name = 'covid19_tweets'

def get_token_compelete(kw, index_name):
    """Complete a partially typed word: return up to two tokens, or [kw] if none are found."""
    query_to_compelete = {
              "query": {
                "prefix": {
                  "tweet": {
                    "value": kw
                  }
                }
              },
              "aggs": {
                "find_the_whole_token": {
                  "significant_text": {
                    "field": "tweet",
                    "include": f"{kw}.+",
                    "size": 2,
                    "min_doc_count": 100
                  }
                }
              },
              "fields": ["aggregation"],
              "_source": False
            }
    r = es_cli.search(index=index_name, body=query_to_compelete)
    buckets = r["aggregations"]["find_the_whole_token"]["buckets"]
    tokens = [bucket["key"] for bucket in buckets]
    return tokens or [kw]

def get_suggestion(token, index_name):
    """Return up to three significant terms from tweets that match the given token."""
    suggestion_query = {
        "query": {
          "match": {
            "tweet": token
          }
        },
        "aggs": {
          "tags": {
            "significant_text": {
              "field": "tweet",
              "exclude": token,
              "size": 3,
              "min_doc_count": 100
            }
          }
        }
      }
    r = es_cli.search(index=index_name, body=suggestion_query)
    buckets = r["aggregations"]["tags"]["buckets"]
    suggestions = [bucket["key"] for bucket in buckets]
    return suggestions

def get_auto_complete(search_phrase):
    words = re.findall(r"\S+", search_phrase)
    if not words:  # empty or whitespace-only input: nothing to complete
        return []
    tail = words[-1]  # the last word, which has not been fully typed yet
    pre_search_phrase = " ".join(words[:-1])
    complete_tokens = get_token_compelete(tail, index_name)
    guess = []
    for token in complete_tokens:
        suggestions = get_suggestion(token, index_name)
        for suggestion in suggestions:
            guess.append(" ".join([pre_search_phrase, token, suggestion]).strip())
    return guess

search_phrase = "pandem"
print(get_auto_complete(search_phrase))

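Code walkthrough:

`get_token_compelete` (function): completes the unfinished word and returns at most two tokens. If nothing is found, it falls back to the input itself.

`get_suggestion` (function): given an input token, finds at most three suggestion tokens.

`get_auto_complete`:

(1) Split the search phrase on whitespace. The last piece (the word still being typed) is passed to `get_token_compelete`; if the split yields more than one piece, the earlier ones are kept as a prefix.

(2) Each token produced by the previous step is passed to `get_suggestion` to produce suggestion tokens.

(3) Join the prefix, the completed token, and the suggestion with a space character.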
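The three steps above can be tried without a running Elasticsearch. The following is a minimal sketch, not the code from this post: `combine`, `fake_complete`, `fake_suggest`, and the canned dictionaries are hypothetical stand-ins for the two queries, used only to show how the pieces are split and joined.

```python
import re

def combine(search_phrase, complete, suggest):
    """Mirror steps (1)-(3): split, complete the last word, attach suggestions."""
    words = re.findall(r"\S+", search_phrase)
    if not words:
        return []
    *head, tail = words          # step (1): last piece is the unfinished word
    prefix = " ".join(head)
    results = []
    for token in complete(tail):             # completed tokens for the tail
        for suggestion in suggest(token):    # step (2): suggestions per token
            # step (3): join with spaces; strip() drops the leading space
            # that appears when there is no prefix
            results.append(" ".join([prefix, token, suggestion]).strip())
    return results

# Hypothetical canned data standing in for the two Elasticsearch queries.
FAKE_COMPLETIONS = {"sym": ["symptoms"]}
FAKE_SUGGESTIONS = {"symptoms": ["have", "mild"]}

def fake_complete(tail):
    return FAKE_COMPLETIONS.get(tail, [tail])

def fake_suggest(token):
    return FAKE_SUGGESTIONS.get(token, [])

print(combine("covid sym", fake_complete, fake_suggest))
```

Passing the two lookups in as arguments keeps the joining logic separate from the queries, which makes it easy to check edge cases such as a single-word or empty input.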

Finally, let's test the result:

Input: cov

Output: ['covid 19', 'covid pandemic', 'covid response', 'covid19 stayathome', 'covid19 coronavirus', 'covid19 thelockdown']

Input: covid

Output: ['covid19 stayathome', 'covid19 coronavirus', 'covid19 thelockdown', 'covid_19 coronavirus', 'covid_19 stayhome', 'covid_19 covid19']

Input: covid sym

Output: ['covid symptoms have']
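One detail in these outputs is that the input covid never completes to covid itself, only to longer tokens such as covid19 and covid_19. A likely reason is the `include` pattern `f"{kw}.+"`, which requires at least one character after the typed text. The sketch below approximates that filter with Python's `re.fullmatch`, under the assumption that Elasticsearch matches the pattern against the whole term. Elasticsearch uses Lucene regular expression syntax, so this is only an approximation, and `include_matches` is a hypothetical helper.

```python
import re

def include_matches(kw, term):
    """Approximate the include filter: the whole term must match kw followed by '.+'."""
    # re.escape is used here so that kw is treated literally in Python's regex syntax.
    return re.fullmatch(f"{re.escape(kw)}.+", term) is not None

for term in ["covid", "covid19", "covid_19"]:
    print(term, include_matches("covid", term))
```

With the shorter input cov, the term covid does pass the filter, which is consistent with the first test output.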

Is this what you expected?

I see a few things that could be improved:

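(1) The whole search phrase is not taken into account. In this design, when the search phrase has several tokens, only the last one drives the search and the others are ignored, so searching symptoms covid produces the same suggestions as searching uber covid (only the leading word differs).
(2) Related to the previous point, the results do not consider how "close" the terms are to each other, so even when the search terms differ, the suggestions usually look quite alike.
(3) Sometimes the results are odd: searching hospita returns ['hospital covid', 'hospital 19'], but searching hospital returns nothing at all (???)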
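For the first point, one possible direction is to let the earlier words constrain the query as well. The sketch below only builds the query dictionary and does not contact Elasticsearch. `build_phrase_aware_query` is a hypothetical helper, and combining a `match` clause with the `prefix` clause inside a `bool` query is an assumption about how this could be done, not the approach this series necessarily takes.

```python
import json
import re

def build_phrase_aware_query(search_phrase, field="tweet"):
    """Require every finished word to match, and prefix-match the unfinished last word."""
    words = re.findall(r"\S+", search_phrase)
    if not words:
        return {"match_all": {}}
    *head, tail = words
    # The last word is still being typed, so it is matched as a prefix.
    clauses = [{"prefix": {field: {"value": tail}}}]
    if head:
        # The earlier words are complete, so all of them must appear in the document.
        clauses.append(
            {"match": {field: {"query": " ".join(head), "operator": "and"}}}
        )
    return {"bool": {"must": clauses}}

print(json.dumps(build_phrase_aware_query("symptoms cov"), indent=2))
```

With a query like this, symptoms cov and uber cov would select different sets of tweets, so an aggregation run on top of them could return different suggestions.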

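If you have been following along and noticed any odd results, please leave a comment and let me know. Thank you.

Next, we will go back to the search query and start thinking about what can be improved.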