
2023 iThome 鐵人賽

DAY 21

Yesterday we covered metric aggregations and bucket aggregations.
Before moving on to pipeline aggregations, there are a few things worth adding first.

When using aggregations, we can:

  1. Use a query to narrow down the set of documents to aggregate over first
GET /index-name/_search
{
  "query": {
    ...
  },
  "aggs": {
    ...
  }
}
  2. Run multiple aggregations in one request; written as below, their results are computed independently of each other
GET /index_name/_search
{
  "aggs": {
    "first-aggregation-name": {
      "terms": {
        "field": "my-field"
      }
    },
    "second-aggregation-name": {
      "avg": {
        "field": "my-other-field"
      }
    }
  }
}
  3. Within a bucket aggregation, group first and then run another level of bucket or metric aggregation inside each group, with no limit on nesting depth
GET /orders/_search
{
  "size": 0, 
  "aggs": {
    "outside_agg_name": {
      "terms": {
        "field": "status"
      },
      "aggs": { // after grouping above, aggregate again inside each bucket
        "nested_agg_name": {
          "avg": {
            "field": "total_amount"
          }
        }
      }
    }
  }
}
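The bucket-then-metric nesting above behaves like a group-by followed by a per-group aggregate. As a rough plain-Python analogy (the order documents and values here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical order documents, mirroring the status/total_amount fields above.
orders = [
    {"status": "paid", "total_amount": 120},
    {"status": "paid", "total_amount": 80},
    {"status": "pending", "total_amount": 50},
]

# Outer terms aggregation: one bucket per distinct status.
buckets = defaultdict(list)
for doc in orders:
    buckets[doc["status"]].append(doc["total_amount"])

# Inner avg aggregation: a metric computed independently per bucket.
avg_per_status = {status: sum(v) / len(v) for status, v in buckets.items()}
print(avg_per_status)  # {'paid': 100.0, 'pending': 50.0}
```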

https://ithelp.ithome.com.tw/upload/images/20230923/20161866U5YZ2t0qyu.png
The red arrow shows the documents being split into different buckets first; a metric aggregation then runs inside each bucket

  4. Since we can already narrow the scope with a query or filter, runtime fields naturally work here too
GET /index_name/_search?size=0
{
  "runtime_mappings": {
    "message.length": {
      "type": "long",
      "script": ""
    }
  },
  "aggs": {
    "message_length": {
      "histogram": {
        "field": ""
      }
    }
  }
}
  5. For nested fields, we still have to specify the path
GET /index_name/_search?size=0
{
  "aggs": {
    "resellers": {
      "nested": {
        "path": "obj" // nested_field_name
      },
      "aggs": {
        "min_price": {
          "min": {
            "field": "obj.price" // use . to reach the sub-field
          }
        }
      }
    }
  }
}
  6. If the set of fields differs between documents, we can set missing to supply a default for the missing values
GET /index_name/_search
{
  "aggs": {
    "NAME": {
      "terms": {
        "field": "",
        "missing": "N/A" // missing is a parameter of the aggregation itself; documents lacking the field fall into this bucket ("N/A" is an arbitrary default)
      }
    }
  }
}
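The effect of missing can be mimicked by substituting the default before bucketing. In this hypothetical sketch, documents lacking a category field land in an "N/A" bucket:

```python
from collections import Counter

# Hypothetical documents; the third one lacks the field entirely.
docs = [{"category": "a"}, {"category": "b"}, {}, {"category": "a"}]

# Substitute the `missing` default before counting terms,
# so documents without the field still land in a bucket.
counts = Counter(doc.get("category", "N/A") for doc in docs)
print(counts)  # Counter({'a': 2, 'b': 1, 'N/A': 1})
```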

To sum up, these techniques might combine into a flow like this in practice:

  1. Narrow the existing data with a filter or query, or derive values with a scripted runtime field
  2. If documents have missing values, set missing to fill in a default
  3. Run several aggregations at once to get their independent outputs
  4. To examine the data in more detail, first narrow the sample with a bucket aggregation; then either bucket again for an even smaller scope, or compare statistics with a metric aggregation

With that, let's move on to the last kind: pipeline aggregations.

Pipeline aggregation

  • Operates on the output of other aggregations rather than on documents directly, adding the newly computed results to the response
  • Two main families:
    1. parent: computes new values from the output of its parent aggregation, and can add the results back into the existing buckets
    2. sibling: unlike parent, a sibling pipeline produces output at the same level as the aggregation it reads from
  • A pipeline aggregation cannot have sub-aggregations (since its result is added onto the output), but it can consume other aggregations via the buckets_path parameter
    • a pipeline aggregation needs another aggregation as input, and buckets_path specifies which one
    • buckets_path uses a relative-path syntax with the following grammar:
AGG_SEPARATOR       =  `>` ;
METRIC_SEPARATOR    =  `.` ;
AGG_NAME            =  <the name of the aggregation> ;
METRIC              =  <the name of the metric (in case of multi-value metrics aggregation)> ;
MULTIBUCKET_KEY     =  `[<KEY_NAME>]`
PATH                =  <AGG_NAME><MULTIBUCKET_KEY>? (<AGG_SEPARATOR>, <AGG_NAME> )* ( <METRIC_SEPARATOR>, <METRIC> ) ;

In practice a path might look like this:

multi_bucket["foo"]>single_bucket>multi_metric.avg
  • multi_bucket is a multi-bucket aggregation, and ["foo"] selects the bucket whose key is foo
    • inside that bucket there is a sub-aggregation named single_bucket
    • which in turn contains a multi-value metric aggregation, from which we take the avg value
  • When a sibling pipeline references an aggregation at its own level, the path is simply:
"buckets_path": "agg_name"
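To make the path syntax concrete, here is a hypothetical resolver that walks a response-shaped dict: `>` descends into sub-aggregations and `.` selects a named value from a multi-value metric (the `[key]` selector for multi-bucket aggregations is omitted for brevity; the sample data is invented):

```python
def resolve_buckets_path(aggs: dict, path: str):
    """Walk single-bucket sub-aggregations along `>`, then pick the
    value named after the final `.` (for multi-value metrics)."""
    *outer, last = path.split(">")
    metric = None
    if "." in last:
        last, metric = last.split(".", 1)
    node = aggs
    for name in outer:       # descend through sub-aggregations
        node = node[name]
    node = node[last]
    # Single-value metrics expose their result under "value".
    return node[metric] if metric else node["value"]

# A response-shaped dict: a single-bucket agg with a multi-value metric inside.
response_aggs = {
    "single_bucket": {
        "doc_count": 3,
        "multi_metric": {"avg": 12.5, "min": 3.0, "max": 20.0},
    }
}
print(resolve_buckets_path(response_aggs, "single_bucket>multi_metric.avg"))  # 12.5
print(resolve_buckets_path({"total_show": {"value": 5386}}, "total_show"))    # 5386
```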

Now let's walk through a few different pipeline aggregations~

Avg bucket aggregation

  • A sibling pipeline aggregation
  • Takes the values produced by sibling aggregations and computes their overall average

Let's briefly demonstrate how to use it:

  1. First, create a demo index
PUT /test_pipeline_aggregation
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      },
      "show_counts": { // daily site impressions
        "type": "integer"
      },
      "click_counts": { // daily site clicks
        "type": "integer"
      },
      "spend_time": {   // average seconds users stayed on the site that day
        "type": "integer"
      }
    }
  }
}
  2. Load the data with the bulk API
PUT /_bulk
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-01-01", "show_counts": 25, "click_counts": 3, "spend_time": 11}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-01-15", "show_counts": 42, "click_counts": 3, "spend_time": 10}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-02-01", "show_counts": 82, "click_counts": 0, "spend_time": 13}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-02-15", "show_counts": 89, "click_counts": 4, "spend_time": 15}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-03-01", "show_counts": 109, "click_counts": 2, "spend_time": 9}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-03-15", "show_counts": 121, "click_counts": 4, "spend_time": 15}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-04-01", "show_counts": 138, "click_counts": 6, "spend_time": 15}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-04-15", "show_counts": 141, "click_counts": 4, "spend_time": 16}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-05-01", "show_counts": 151, "click_counts": 3, "spend_time": 21}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-05-15", "show_counts": 201, "click_counts": 5, "spend_time": 19}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-06-01", "show_counts": 292, "click_counts": 7, "spend_time": 18}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-06-15", "show_counts": 320, "click_counts": 8, "spend_time": 21}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-07-01", "show_counts": 439, "click_counts": 9, "spend_time": 25}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-07-15", "show_counts": 401, "click_counts": 6, "spend_time": 22}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-08-01", "show_counts": 385, "click_counts": 3, "spend_time": 18}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-08-15", "show_counts": 408, "click_counts": 6, "spend_time": 23}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-09-01", "show_counts": 457, "click_counts": 8, "spend_time": 23}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-10-01", "show_counts": 605, "click_counts": 11, "spend_time": 31}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-11-01", "show_counts": 517, "click_counts": 15, "spend_time": 49}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-12-01", "show_counts": 463, "click_counts": 8, "spend_time": 60}
  3. Run the avg bucket aggregation
  • First bucket all documents by month with a date_histogram
  • Then run a sum metric aggregation inside each bucket
    • A sibling avg_bucket aggregation over total_show then gives the average monthly impressions across the whole year
GET /test_pipeline_aggregation/_search
{
  "size": 0,
  "aggs": {
    "show_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_show": {
          "sum": {
            "field": "show_counts"
          }
        }
      }
    },
    "avg_month_show": {
      "avg_bucket": {
        "buckets_path": "show_per_month>total_show"
      }
    }
  }
}

https://ithelp.ithome.com.tw/upload/images/20230923/20161866zpxcpz6RSh.png

  • The blue arrow is the initial bucket aggregation doing the grouping
  • The orange arrow is the metric aggregation inside each group
  • The red arrow is the final sibling avg bucket aggregation
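In plain Python, avg_bucket simply averages the per-bucket values of the metric it points at. Recomputing from the sample documents indexed above (a sketch of the arithmetic, not of how Elasticsearch executes it):

```python
from collections import defaultdict

# (date, show_counts) pairs copied from the bulk request above.
docs = [
    ("2023-01-01", 25), ("2023-01-15", 42), ("2023-02-01", 82), ("2023-02-15", 89),
    ("2023-03-01", 109), ("2023-03-15", 121), ("2023-04-01", 138), ("2023-04-15", 141),
    ("2023-05-01", 151), ("2023-05-15", 201), ("2023-06-01", 292), ("2023-06-15", 320),
    ("2023-07-01", 439), ("2023-07-15", 401), ("2023-08-01", 385), ("2023-08-15", 408),
    ("2023-09-01", 457), ("2023-10-01", 605), ("2023-11-01", 517), ("2023-12-01", 463),
]

# date_histogram (calendar_interval: month) + sum sub-aggregation.
total_show = defaultdict(int)
for date, shows in docs:
    total_show[date[:7]] += shows  # bucket key = year-month

# avg_bucket: the mean of the sibling aggregation's per-bucket sums.
avg_month_show = sum(total_show.values()) / len(total_show)
print(round(avg_month_show, 2))  # 448.83
```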

Derivative aggregation

  • A derivative is essentially a slope: the rate of change at each point
  • A parent pipeline aggregation

In the example below we compute the first derivative to see how clicks change from month to month.

GET /test_pipeline_aggregation/_search
{
  "size": 0,
  "aggs": {
    "show_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_click": {
          "sum": {
            "field": "click_counts"
          }
        },
        "click_deriv": {
          "derivative": {
            "buckets_path": "total_click"
          }
        }
      }
    }
  }
}

The result:


  "aggregations": {
    "show_per_month": {
      "buckets": [
        {
          "key_as_string": "2023-01-01",
          "key": 1672531200000,
          "doc_count": 2,
          "total_click": {
            "value": 6
          }
        },
        {
          "key_as_string": "2023-02-01",
          "key": 1675209600000,
          "doc_count": 2,
          "total_click": {
            "value": 4
          },
          "click_deriv": {
            "value": -2
          }
        },
        {
          "key_as_string": "2023-03-01",
          "key": 1677628800000,
          "doc_count": 2,
          "total_click": {
            "value": 6
          },
          "click_deriv": {
            "value": 2
          }
        },
        ...

  • Notice that click_deriv only appears from February onward: January has no earlier bucket to measure a change against
  • February's click_deriv is simply February's total clicks minus January's (4 − 6 = −2)
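A derivative over monthly buckets is just a first difference between consecutive bucket values. Recomputing from the sample data (monthly click totals as a plain list, in month order):

```python
# Monthly click totals summed from the sample documents (Jan..Dec).
total_click = [6, 4, 6, 10, 8, 15, 15, 9, 8, 11, 15, 8]

# derivative: each bucket minus the previous one; the first bucket has none.
click_deriv = [None] + [b - a for a, b in zip(total_click, total_click[1:])]
print(click_deriv[:3])  # [None, -2, 2] — matching the response above
```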

Cumulative sum aggregation

  • Useful for computing running totals
  • A parent pipeline aggregation
GET /test_pipeline_aggregation/_search
{
  "size": 0,
  "aggs": {
    "show_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_users_used": {
          "sum": {
            "field": "spend_time"
          }
        },
        "spend_time_cumulative_sum": {
          "cumulative_sum": {
            "buckets_path": "total_users_used"
          }
        }
      }
    }
  }
}

The result:


  "aggregations": {
    "show_per_month": {
      "buckets": [
        {
          "key_as_string": "2023-01-01",
          "key": 1672531200000,
          "doc_count": 2,
          "total_users_used": {
            "value": 21
          },
          "spend_time_cumulative_sum": {
            "value": 21
          }
        },
        {
          "key_as_string": "2023-02-01",
          "key": 1675209600000,
          "doc_count": 2,
          "total_users_used": {
            "value": 28
          },
          "spend_time_cumulative_sum": {
            "value": 49
          }
        },
        {
          "key_as_string": "2023-03-01",
          "key": 1677628800000,
          "doc_count": 2,
          "total_users_used": {
            "value": 24
          },
          "spend_time_cumulative_sum": {
            "value": 73
          }
        },
        ...

  • Each month's spend_time_cumulative_sum builds up the running total this way (21, 21 + 28 = 49, 49 + 24 = 73, ...)
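The running total can be reproduced with itertools.accumulate over the monthly sums from the sample data:

```python
from itertools import accumulate

# Monthly spend_time totals summed from the sample documents (Jan..Dec).
total_users_used = [21, 28, 24, 31, 40, 39, 47, 41, 23, 31, 49, 60]

# cumulative_sum: each bucket carries the running total so far.
cumulative = list(accumulate(total_users_used))
print(cumulative[:3])  # [21, 49, 73] — matching the response above
```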

For simple calculations we don't necessarily need a pipeline aggregation;
a combination of bucket and metric aggregations is enough.

But once the computation gets more complex, or intermediate values are needed to produce the final result,
pipeline aggregations let us do much more~

Tomorrow is the last day of the search part; the day after we move on to somewhat more advanced topics~

References
aggregation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
pipeline aggregation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html
https://blog.csdn.net/UbuntuTouch/article/details/103539437

