利用R進行查詢: 星星數大於2000?

第 12 屆 iThome 鐵人賽

DAY 4

Elastic Stack on Cloud

Elastic 30天自我修行系列第 4 篇

12th鐵人賽

bear999

2020-09-04 20:47:31

1303 瀏覽

分享至

首先安裝套件

install.packages("elastic")

接下來到網頁取得 Elastic Endpoint

然後在當初 Elastic Cloud 設定完成時，Elastic 會給你一組帳密進行連線，此處我們假設為
user = "elastic"
pwd = "abc123"

接下來要在 R 進行設定
這邊我們先假設取得的 Elastic Endpoint 是 "https://abc123.asia-east1.gcp.elastic-cloud.com:9243"
如果依照套件官網的說明，你可能會如下編寫

# Connect Error
x <- connect(
  host = "https://abc123.asia-east1.gcp.elastic-cloud.com:9243",
  path = "",
  user = "elastic",
  pwd = "abc123",
  port = 9243,
  transport_schema = "https"
)

或是

# Connect Error
x <- connect(
  host = "https://abc123.asia-east1.gcp.elastic-cloud.com",
  path = "",
  user = "elastic",
  pwd = "abc123",
  port = 9243,
  transport_schema = "https"
)

但都會得到以下的錯誤訊息

Found http or https on es_host, stripping off, see the docs

正確的寫法應該是

# Connect OK!

x <- connect(
  host = "abc123.asia-east1.gcp.elastic-cloud.com",
  path = "",
  user = "elastic",
  pwd = "abc123",
  port = 9243,
  transport_schema = "https"
)

重點就是在 host 設定時，要保持單純，不需要 port 與 https 的資訊。

設定完成後，我們可以先來進行第一次查詢，先試著回傳在 Index "elastic_json"的第一筆資料

library("elastic")

# Connect to Elastic Cloud
 
x <- connect(
  host = "abc123.asia-east1.gcp.elastic-cloud.com",
  path = "",
  user = "elastic",
  pwd = "abc123",
  port = 9243,
  transport_schema = "https"
)
 
# My First Query
query1record <- Search(x, index = "elastic_json", size = 1)$hits$hits

此處我們可以看到資料會以 List 的方式回傳至 R,可以使用 typeof 或 str 進行檢查。

那如果我們要像昨天一樣進行星星數大於2000?的查詢呢?

首先讓我們快速看一下，在 elastic 套件中提供了那些的操作:

ls('package:elastic')

查詢結果如下，可以發現CURD的功能都有提供：

 [1] "alias_create"             "alias_delete"             "alias_exists"            
  [4] "alias_get"                "alias_rename"             "aliases_get"             
  [7] "cat_"                     "cat_aliases"              "cat_allocation"          
 [10] "cat_count"                "cat_fielddata"            "cat_health"              
 [13] "cat_indices"              "cat_master"               "cat_nodeattrs"           
 [16] "cat_nodes"                "cat_pending_tasks"        "cat_plugins"             
 [19] "cat_recovery"             "cat_segments"             "cat_shards"              
 [22] "cat_thread_pool"          "cluster_health"           "cluster_pending_tasks"   
 [25] "cluster_reroute"          "cluster_settings"         "cluster_state"           
 [28] "cluster_stats"            "connect"                  "count"                   
 [31] "docs_bulk"                "docs_bulk_create"         "docs_bulk_delete"        
 [34] "docs_bulk_index"          "docs_bulk_prep"           "docs_bulk_update"        
 [37] "docs_create"              "docs_delete"              "docs_delete_by_query"    
 [40] "docs_get"                 "docs_mget"                "docs_update"             
 [43] "docs_update_by_query"     "es_parse"                 "explain"                 
 [46] "field_caps"               "field_mapping_get"        "field_stats"             
 [49] "index_analyze"            "index_clear_cache"        "index_close"             
 [52] "index_create"             "index_delete"             "index_exists"            
 [55] "index_flush"              "index_forcemerge"         "index_get"               
 [58] "index_open"               "index_optimize"           "index_recovery"          
 [61] "index_recreate"           "index_segments"           "index_settings"          
 [64] "index_settings_update"    "index_shrink"             "index_stats"             
 [67] "index_status"             "index_template_delete"    "index_template_exists"   
 [70] "index_template_get"       "index_template_put"       "index_upgrade"           
 [73] "info"                     "mapping_create"           "mapping_delete"          
 [76] "mapping_get"              "mlt"                      "msearch"                 
 [79] "mtermvectors"             "nodes_hot_threads"        "nodes_info"              
 [82] "nodes_shutdown"           "nodes_stats"              "percolate_count"         
 [85] "percolate_delete"         "percolate_list"           "percolate_match"         
 [88] "percolate_register"       "ping"                     "pipeline_attachment"     
 [91] "pipeline_create"          "pipeline_delete"          "pipeline_get"            
 [94] "pipeline_simulate"        "reindex"                  "scroll"                  
 [97] "scroll_clear"             "Search"                   "search_shards"           
[100] "Search_template"          "Search_template_delete"   "Search_template_get"     
[103] "Search_template_register" "Search_template_render"   "Search_uri"              
[106] "tasks"                    "tasks_cancel"             "termvectors"             
[109] "tokenizer_set"            "type_exists"              "type_remover"            
[112] "validate"

今天先讓我們以 Search 進行我們我們的操作，操作的方式基本上與昨天相同，在 q 中進行設定。

# 星星數大於2000
star_Over_2000 <- Search(x, index = "elastic_json", q = "stargazers_count:>2000")$hits$hits
 
# 星星數大於199 
star_Over_199 <- Search(x, index = "elastic_json", q = "stargazers_count:>199")$hits$hits
 
# 星星數大於20 
star_Over_20 <- Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$hits

讓我們檢查一下回傳的資料筆數是否正確。我們在昨天已經知道，星星數大於2000的有2筆。

> length(star_Over_2000)
[1] 2
> length(star_Over_199)
[1] 10
> length(star_Over_20)
[1] 10

咦! 為什麼星星數大於 199 與 20的都只有10筆呢，這不太合理吧?
後來到 Elastic Cloud API Console 進行確認，的確是預設只傳前10筆完整的明細資料，而不會回傳全部。

所以如果要更正確的直接取得聚合後的資料，只需要將程式碼改為

Search(x, index = "elastic_json", q = "stargazers_count:>200")$hits$total$value
Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$total$value
Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$total$value

# 星星數大於2000
star_Over_2000 <- Search(x, index = "elastic_json", q = "stargazers_count:>2000")$hits$total$value

# 星星數大於199 
star_Over_199 <- Search(x, index = "elastic_json", q = "stargazers_count:>199")$hits$total$value

# 星星數大於20 
star_Over_20 <- Search(x, index = "elastic_json", q = "stargazers_count:>20")$hits$total$value

再檢查一次結果，這下子就合理多了。Repos的星星數大於2000的有2筆，星星數大於199的有13筆，星星數大於20的有33筆：

> star_Over_2000 
[1] 2
> star_Over_199 
[1] 13
> star_Over_20
[1] 33

今天我們透過R語言，直接對 Elastic Cloud 上的資料進行蒐尋，後續我們再找時間實作如何完成其他 CURD 的操作。

學習透過 API 進行查詢: 星星數大於2000?

新增 MySQL log 資料

系列文

Elastic 30天自我修行共 31 篇

RSS系列文訂閱系列文

10 人訂閱

完整目錄

熱門推薦

{{ item.channelVendor }} | {{ item.webinarstarted }} |

直播中

尚未有邦友留言

立即登入留言

參賽組數

902 組

團體組數

37 組

累計文章數

19642 篇

完賽人數

530 人

15th鐵人賽 16th鐵人賽 13th鐵人賽 14th鐵人賽 17th鐵人賽 12th鐵人賽 11th鐵人賽鐵人賽 2019鐵人賽 javascript 2018鐵人賽 python 2017鐵人賽 windows php c# linux windows server css react

IT邦幫忙

Elastic 30天自我修行系列 第 4 篇