iT邦幫忙

第 12 屆 iThome 鐵人賽

DAY 4
1
Elastic Stack on Cloud

Elastic 30天自我修行系列 第 4

利用R進行查詢: 星星數大於2000?

  • 分享至 

  • xImage
  •  

首先安裝套件

install.packages("elastic")

接下來到網頁取得 Elastic Endpoint

https://ithelp.ithome.com.tw/upload/images/20200904/20130033R6B0mssv9Z.png

然後在當初 Elastic Cloud 設定完成時,Elastic 會給你一組帳密進行連線,此處我們假設為
user = "elastic"
pwd = "abc123"

接下來要在 R 進行設定
這邊我們先假設取得的 Elastic Endpoint 是 "https://abc123.asia-east1.gcp.elastic-cloud.com:9243"
如果依照套件官網的說明,你可能會如下編寫

# Connect Error
x <- connect(
  host = "https://abc123.asia-east1.gcp.elastic-cloud.com:9243",
  path = "",
  user = "elastic",
  pwd = "abc123",
  port = 9243,
  transport_schema = "https"
)

或是

# Connect Error
x <- connect(
  host = "https://abc123.asia-east1.gcp.elastic-cloud.com",
  path = "",
  user = "elastic",
  pwd = "abc123",
  port = 9243,
  transport_schema = "https"
)

但都會得到以下的錯誤訊息

Found http or https on es_host, stripping off, see the docs

正確的寫法應該是

# Connect OK!

x <- connect(
  host = "abc123.asia-east1.gcp.elastic-cloud.com",
  path = "",
  user = "elastic",
  pwd = "abc123",
  port = 9243,
  transport_schema = "https"
)

重點就是在 host 設定時,要保持單純,不需要 port 與 https 的資訊。

設定完成後,我們可以先來進行第一次查詢,先試著回傳在 Index "elastic_json"的第一筆資料

library("elastic")

# Connect to Elastic Cloud
 
x <- connect(
  host = "abc123.asia-east1.gcp.elastic-cloud.com",
  path = "",
  user = "elastic",
  pwd = "abc123",
  port = 9243,
  transport_schema = "https"
)
 
# My First Query
query1record <- Search(x, index = "elastic_json", size = 1)$hits$hits

此處我們可以看到資料會以 List 的方式回傳至 R,可以使用 typeof 或 str 進行檢查。

那如果我們要像昨天一樣進行星星數大於2000?的查詢呢?

首先讓我們快速看一下,在 elastic 套件中提供了那些的操作:

ls('package:elastic')

查詢結果如下,可以發現CURD的功能都有提供:

 [1] "alias_create"             "alias_delete"             "alias_exists"            
  [4] "alias_get"                "alias_rename"             "aliases_get"             
  [7] "cat_"                     "cat_aliases"              "cat_allocation"          
 [10] "cat_count"                "cat_fielddata"            "cat_health"              
 [13] "cat_indices"              "cat_master"               "cat_nodeattrs"           
 [16] "cat_nodes"                "cat_pending_tasks"        "cat_plugins"             
 [19] "cat_recovery"             "cat_segments"             "cat_shards"              
 [22] "cat_thread_pool"          "cluster_health"           "cluster_pending_tasks"   
 [25] "cluster_reroute"          "cluster_settings"         "cluster_state"           
 [28] "cluster_stats"            "connect"                  "count"                   
 [31] "docs_bulk"                "docs_bulk_create"         "docs_bulk_delete"        
 [34] "docs_bulk_index"          "docs_bulk_prep"           "docs_bulk_update"        
 [37] "docs_create"              "docs_delete"              "docs_delete_by_query"    
 [40] "docs_get"                 "docs_mget"                "docs_update"             
 [43] "docs_update_by_query"     "es_parse"                 "explain"                 
 [46] "field_caps"               "field_mapping_get"        "field_stats"             
 [49] "index_analyze"            "index_clear_cache"        "index_close"             
 [52] "index_create"             "index_delete"             "index_exists"            
 [55] "index_flush"              "index_forcemerge"         "index_get"               
 [58] "index_open"               "index_optimize"           "index_recovery"          
 [61] "index_recreate"           "index_segments"           "index_settings"          
 [64] "index_settings_update"    "index_shrink"             "index_stats"             
 [67] "index_status"             "index_template_delete"    "index_template_exists"   
 [70] "index_template_get"       "index_template_put"       "index_upgrade"           
 [73] "info"                     "mapping_create"           "mapping_delete"          
 [76] "mapping_get"              "mlt"                      "msearch"                 
 [79] "mtermvectors"             "nodes_hot_threads"        "nodes_info"              
 [82] "nodes_shutdown"           "nodes_stats"              "percolate_count"         
 [85] "percolate_delete"         "percolate_list"           "percolate_match"         
 [88] "percolate_register"       "ping"                     "pipeline_attachment"     
 [91] "pipeline_create"          "pipeline_delete"          "pipeline_get"            
 [94] "pipeline_simulate"        "reindex"                  "scroll"                  
 [97] "scroll_clear"             "Search"                   "search_shards"           
[100] "Search_template"          "Search_template_delete"   "Search_template_get"     
[103] "Search_template_register" "Search_template_render"   "Search_uri"              
[106] "tasks"                    "tasks_cancel"             "termvectors"             
[109] "tokenizer_set"            "type_exists"              "type_remover"            
[112] "validate"                

今天先讓我們以 Search 進行我們我們的操作,操作的方式基本上與昨天相同,在 q 中進行設定。

# 星星數大於2000
star_Over_2000 <- Search(x, index = "elastic_json", q = "stargazers_count:>2000")$hits$hits
 
# 星星數大於199 
star_Over_199 <- Search(x, index = "elastic_json", q = "stargazers_count:>199")$hits$hits
 
# 星星數大於20 
star_Over_20 <- Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$hits

讓我們檢查一下回傳的資料筆數是否正確。我們在昨天已經知道,星星數大於2000的有2筆。

> length(star_Over_2000)
[1] 2
> length(star_Over_199)
[1] 10
> length(star_Over_20)
[1] 10

咦! 為什麼星星數大於 199 與 20的都只有10筆呢,這不太合理吧?
後來到 Elastic Cloud API Console 進行確認,的確是預設只傳前10筆完整的明細資料,而不會回傳全部。

所以如果要更正確的直接取得聚合後的資料,只需要將程式碼改為

Search(x, index = "elastic_json", q = "stargazers_count:>200")$hits$total$value
Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$total$value
Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$total$value

# 星星數大於2000
star_Over_2000 <- Search(x, index = "elastic_json", q = "stargazers_count:>2000")$hits$total$value

# 星星數大於199 
star_Over_199 <- Search(x, index = "elastic_json", q = "stargazers_count:>199")$hits$total$value

# 星星數大於20 
star_Over_20 <- Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$total$value

再檢查一次結果,這下子就合理多了。Repos的星星數大於2000的有2筆,星星數大於199的有13筆,星星數大於20的有33筆:

> star_Over_2000 
[1] 2
> star_Over_199 
[1] 13
> star_Over_20
[1] 33

今天我們透過R語言,直接對 Elastic Cloud 上的資料進行蒐尋,後續我們再找時間實作如何完成其他 CURD 的操作。


上一篇
學習透過 API 進行查詢: 星星數大於2000?
下一篇
新增 MySQL log 資料
系列文
Elastic 30天自我修行31
圖片
  直播研討會
圖片
{{ item.channelVendor }} {{ item.webinarstarted }} |
{{ formatDate(item.duration) }}
直播中

尚未有邦友留言

立即登入留言