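The previous two posts implemented "completing a word that hasn't been fully typed" and "suggesting a follow-up word based on the completed word". Next, we use Python to combine the two and produce the final result. If you're not familiar with Python, you can read the comments, or the explanation that follows the code: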
import re

import elasticsearch

# Client for a local Elasticsearch node.
es_cli = elasticsearch.Elasticsearch("http://localhost:9200")
index_name = "covid19_tweets"
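

def get_token_compelete(kw, index_name):
    """Complete the partially typed word `kw`; return at most two tokens."""
    query_to_compelete = {
        # Only look at tweets containing a term that starts with kw.
        "query": {
            "prefix": {
                "tweet": {
                    "value": kw
                }
            }
        },
        "aggs": {
            "find_the_whole_token": {
                "significant_text": {
                    "field": "tweet",
                    # Keep only terms made of kw plus at least one more character.
                    "include": f"{kw}.+",
                    "size": 2,
                    "min_doc_count": 100
                }
            }
        },
        "fields": ["aggregation"],
        "_source": False
    }
    r = es_cli.search(index=index_name, body=query_to_compelete)
    buckets = r["aggregations"]["find_the_whole_token"]["buckets"]
    tokens = [bucket["key"] for bucket in buckets]
    # Fall back to the raw input when no completion is found.
    return tokens or [kw]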


def get_suggestion(token, index_name):
    """Return at most three tokens that are significant alongside `token`."""
    suggestion_query = {
        "query": {
            "match": {
                "tweet": token
            }
        },
        "aggs": {
            "tags": {
                "significant_text": {
                    "field": "tweet",
                    # Do not suggest the token itself.
                    "exclude": token,
                    "size": 3,
                    "min_doc_count": 100
                }
            }
        }
    }
    r = es_cli.search(index=index_name, body=suggestion_query)
    buckets = r["aggregations"]["tags"]["buckets"]
    suggestions = [bucket["key"] for bucket in buckets]
    return suggestions


def get_auto_complete(search_phrase):
    """Complete the last word of the phrase, then append a suggested next word."""
    words = re.findall(r"\S+", search_phrase)
    if not words:
        return []  # nothing has been typed yet
    tail = words[-1]  # the last word, which has not been fully typed yet
    pre_search_phrase = " ".join(words[:-1])  # everything before the last word
    complete_tokens = get_token_compelete(tail, index_name)
    guess = []
    for token in complete_tokens:
        suggestions = get_suggestion(token, index_name)
        for suggestion in suggestions:
            guess.append(" ".join([pre_search_phrase, token, suggestion]).strip())
    return guess


search_phrase = "pandem"
print(get_auto_complete(search_phrase))
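Code walkthrough:
get_token_compelete
: function: completes the word that has not been fully typed, returning at most two tokens.
get_suggestion
: function: takes the input token and finds at most three suggestion tokens.
get_auto_complete
:
(1) Split the search phrase on whitespace. The last piece (the whole phrase when there is only one word) is passed to `get_token_compelete`; any earlier pieces are kept unchanged.
(2) Each token produced by the previous step is passed to `get_suggestion` to produce suggestion tokens.
(3) Join the earlier pieces, the completed token, and the suggestion with a space character.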
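The three steps above can be tried without a running cluster. The following is a minimal sketch, not the code from this post: `combine`, `fake_complete`, and `fake_suggest` are hypothetical names, and the two `fake_*` functions are stand-ins for `get_token_compelete` and `get_suggestion` that return hard-coded illustrative values instead of real query results.

```python
import re
from typing import Callable, List


def combine(search_phrase: str,
            complete: Callable[[str], List[str]],
            suggest: Callable[[str], List[str]]) -> List[str]:
    """Combine token completion and next-word suggestion into full phrases."""
    words = re.findall(r"\S+", search_phrase)
    if not words:
        return []
    # (1) Keep everything before the last word; only the last word is completed.
    prefix, tail = " ".join(words[:-1]), words[-1]
    results = []
    for token in complete(tail):
        # (2) Each completed token gets its own suggestions.
        for suggestion in suggest(token):
            # (3) Join the pieces with single spaces.
            results.append(" ".join([prefix, token, suggestion]).strip())
    return results


# Hypothetical stand-ins with hard-coded data (assumptions, not real query results).
def fake_complete(tail: str) -> List[str]:
    return {"sym": ["symptoms"]}.get(tail, [tail])


def fake_suggest(token: str) -> List[str]:
    return {"symptoms": ["have"]}.get(token, [])


print(combine("covid sym", fake_complete, fake_suggest))
```

Passing the two lookups in as arguments keeps the joining logic separate from Elasticsearch, so it can be checked on its own before wiring in the real queries.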
Finally, let's test the results:
Input: cov
Output: ['covid 19', 'covid pandemic', 'covid response', 'covid19 stayathome', 'covid19 coronavirus', 'covid19 thelockdown']
Input: covid
Output: ['covid19 stayathome', 'covid19 coronavirus', 'covid19 thelockdown', 'covid_19 coronavirus', 'covid_19 stayhome', 'covid_19 covid19']
Input: covid sym
Output: ['covid symptoms have']
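Is this what you expected?
I think there are a few things that could be improved:
Searching `hospita` returns ['hospital covid', 'hospital 19'], but searching `hospital` returns nothing at all (???).
If you have been following along and noticed any strange results, please leave a comment and let me know. Thanks.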
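One thing that may contribute to the `hospital` case, offered as a guess that has not been verified against the index: the `include` pattern in `get_token_compelete` is `f"{kw}.+"`, which requires at least one character after the typed text. A word that is already complete can therefore never be returned as its own completion; if a longer token passes the filter (for example `hospitals`, a hypothetical example), the suggestions are built from that longer token instead. The sketch below only illustrates the pattern. It uses Python's `re.fullmatch` as an approximation of Elasticsearch's whole-term regular expression matching, and `matches_include` is a hypothetical helper, not part of the code above.

```python
import re


def matches_include(kw: str, token: str) -> bool:
    """Approximate the aggregation's `include` filter f"{kw}.+" with Python's re."""
    # re.escape is an addition here so that the typed text is treated literally.
    return re.fullmatch(f"{re.escape(kw)}.+", token) is not None


print(matches_include("hospita", "hospital"))    # True
print(matches_include("hospital", "hospital"))   # False
print(matches_include("hospital", "hospitals"))  # True
```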
Next, we will go back to the search query and start thinking about what can be improved.