
2023 iThome 鐵人賽

DAY 21

Yesterday we covered metric aggregations and bucket aggregations.
Before moving on to pipeline aggregations, there are a few things worth adding first.

When using aggregations, we can:

  1. Use a query to narrow down the set of documents to aggregate over first
GET /index-name/_search
{
  "query": {
    ...
  },
  "aggs": {
    ...
  }
}
  2. Run multiple aggregations in one request; written as below, their results are computed independently of each other
GET /index_name/_search
{
  "aggs": {
    "first-aggregation-name": {
      "terms": {
        "field": "my-field"
      }
    },
    "second-aggregation-name": {
      "avg": {
        "field": "my-other-field"
      }
    }
  }
}
  3. Within a bucket aggregation, group first and then run another level of bucket or metric aggregation inside each group, with no limit on nesting depth
GET /orders/_search
{
  "size": 0, 
  "aggs": {
    "outside_agg_name": {
      "terms": {
        "field": "status"
      },
      "aggs": { // after grouping above, aggregate again inside each bucket
        "nested_agg_name": {
          "avg": {
            "field": "total_amount"
          }
        }
      }
    }
  }
}
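The bucket-then-metric nesting above behaves like a group-by followed by a per-group aggregate. As a rough plain-Python analogy (the order documents and values here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical order documents, mirroring the status/total_amount fields above.
orders = [
    {"status": "paid", "total_amount": 120},
    {"status": "paid", "total_amount": 80},
    {"status": "pending", "total_amount": 50},
]

# Outer terms aggregation: one bucket per distinct status.
buckets = defaultdict(list)
for doc in orders:
    buckets[doc["status"]].append(doc["total_amount"])

# Inner avg aggregation: a metric computed independently per bucket.
avg_per_status = {status: sum(v) / len(v) for status, v in buckets.items()}
print(avg_per_status)  # {'paid': 100.0, 'pending': 50.0}
```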

https://ithelp.ithome.com.tw/upload/images/20230923/20161866U5YZ2t0qyu.png
The red arrow shows the documents being split into different buckets first; a metric aggregation then runs inside each bucket

  4. Since we can already narrow the scope with a query or filter, runtime fields naturally work here too
GET /index_name/_search?size=0
{
  "runtime_mappings": {
    "message.length": {
      "type": "long",
      "script": ""
    }
  },
  "aggs": {
    "message_length": {
      "histogram": {
        "field": ""
      }
    }
  }
}
  5. For nested fields, we still have to specify the path
GET /index_name/_search?size=0
{
  "aggs": {
    "resellers": {
      "nested": {
        "path": "obj" // nested_field_name
      },
      "aggs": {
        "min_price": {
          "min": {
            "field": "obj.price" // use . to reach the sub-field
          }
        }
      }
    }
  }
}
  6. If the set of fields differs between documents, we can set missing to supply a default for the missing values
GET /index_name/_search
{
  "aggs": {
    "NAME": {
      "terms": {
        "field": "",
        "missing": "N/A" // missing is a parameter of the aggregation itself; documents lacking the field fall into this bucket ("N/A" is an arbitrary default)
      }
    }
  }
}
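The effect of missing can be mimicked by substituting the default before bucketing. In this hypothetical sketch, documents lacking a category field land in an "N/A" bucket:

```python
from collections import Counter

# Hypothetical documents; the third one lacks the field entirely.
docs = [{"category": "a"}, {"category": "b"}, {}, {"category": "a"}]

# Substitute the `missing` default before counting terms,
# so documents without the field still land in a bucket.
counts = Counter(doc.get("category", "N/A") for doc in docs)
print(counts)  # Counter({'a': 2, 'b': 1, 'N/A': 1})
```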

To sum up, these techniques might combine into a flow like this in practice:

  1. Narrow the existing data with a filter or query, or derive values with a scripted runtime field
  2. If documents have missing values, set missing to fill in a default
  3. Run several aggregations at once to get their independent outputs
  4. To examine the data in more detail, first narrow the sample with a bucket aggregation; then either bucket again for an even smaller scope, or compare statistics with a metric aggregation

With that, let's move on to the last kind: pipeline aggregations.

Pipeline aggregation

  • Operates on the output of other aggregations rather than on documents directly, adding the newly computed results to the response
  • Two main families:
    1. parent: computes new values from the output of its parent aggregation, and can add the results back into the existing buckets
    2. sibling: unlike parent, a sibling pipeline produces output at the same level as the aggregation it reads from
  • A pipeline aggregation cannot have sub-aggregations (since its result is added onto the output), but it can consume other aggregations via the buckets_path parameter
    • a pipeline aggregation needs another aggregation as input, and buckets_path specifies which one
    • buckets_path uses a relative-path syntax with the following grammar:
AGG_SEPARATOR       =  `>` ;
METRIC_SEPARATOR    =  `.` ;
AGG_NAME            =  <the name of the aggregation> ;
METRIC              =  <the name of the metric (in case of multi-value metrics aggregation)> ;
MULTIBUCKET_KEY     =  `[<KEY_NAME>]`
PATH                =  <AGG_NAME><MULTIBUCKET_KEY>? (<AGG_SEPARATOR>, <AGG_NAME> )* ( <METRIC_SEPARATOR>, <METRIC> ) ;

In practice a path might look like this:

multi_bucket["foo"]>single_bucket>multi_metric.avg
  • multi_bucket is a multi-bucket aggregation, and ["foo"] selects the bucket whose key is foo
    • inside that bucket there is a sub-aggregation named single_bucket
    • which in turn contains a multi-value metric aggregation, from which we take the avg value
  • When a sibling pipeline references an aggregation at its own level, the path is simply:
"buckets_path": "agg_name"
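To make the path syntax concrete, here is a hypothetical resolver that walks a response-shaped dict: `>` descends into sub-aggregations and `.` selects a named value from a multi-value metric (the `[key]` selector for multi-bucket aggregations is omitted for brevity; the sample data is invented):

```python
def resolve_buckets_path(aggs: dict, path: str):
    """Walk single-bucket sub-aggregations along `>`, then pick the
    value named after the final `.` (for multi-value metrics)."""
    *outer, last = path.split(">")
    metric = None
    if "." in last:
        last, metric = last.split(".", 1)
    node = aggs
    for name in outer:       # descend through sub-aggregations
        node = node[name]
    node = node[last]
    # Single-value metrics expose their result under "value".
    return node[metric] if metric else node["value"]

# A response-shaped dict: a single-bucket agg with a multi-value metric inside.
response_aggs = {
    "single_bucket": {
        "doc_count": 3,
        "multi_metric": {"avg": 12.5, "min": 3.0, "max": 20.0},
    }
}
print(resolve_buckets_path(response_aggs, "single_bucket>multi_metric.avg"))  # 12.5
print(resolve_buckets_path({"total_show": {"value": 5386}}, "total_show"))    # 5386
```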

Now let's walk through a few different pipeline aggregations~

Avg bucket aggregation

  • A sibling pipeline aggregation
  • Takes the values produced by sibling aggregations and computes their overall average

Let's briefly demonstrate how to use it:

  1. First, create a demo index
PUT /test_pipeline_aggregation
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      },
      "show_counts": { // daily site impressions
        "type": "integer"
      },
      "click_counts": { // daily site clicks
        "type": "integer"
      },
      "spend_time": {   // average seconds users stayed on the site that day
        "type": "integer"
      }
    }
  }
}
  2. Load the data with the bulk API
PUT /_bulk
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-01-01", "show_counts": 25, "click_counts": 3, "spend_time": 11}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-01-15", "show_counts": 42, "click_counts": 3, "spend_time": 10}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-02-01", "show_counts": 82, "click_counts": 0, "spend_time": 13}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-02-15", "show_counts": 89, "click_counts": 4, "spend_time": 15}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-03-01", "show_counts": 109, "click_counts": 2, "spend_time": 9}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-03-15", "show_counts": 121, "click_counts": 4, "spend_time": 15}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-04-01", "show_counts": 138, "click_counts": 6, "spend_time": 15}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-04-15", "show_counts": 141, "click_counts": 4, "spend_time": 16}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-05-01", "show_counts": 151, "click_counts": 3, "spend_time": 21}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-05-15", "show_counts": 201, "click_counts": 5, "spend_time": 19}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-06-01", "show_counts": 292, "click_counts": 7, "spend_time": 18}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-06-15", "show_counts": 320, "click_counts": 8, "spend_time": 21}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-07-01", "show_counts": 439, "click_counts": 9, "spend_time": 25}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-07-15", "show_counts": 401, "click_counts": 6, "spend_time": 22}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-08-01", "show_counts": 385, "click_counts": 3, "spend_time": 18}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-08-15", "show_counts": 408, "click_counts": 6, "spend_time": 23}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-09-01", "show_counts": 457, "click_counts": 8, "spend_time": 23}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-10-01", "show_counts": 605, "click_counts": 11, "spend_time": 31}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-11-01", "show_counts": 517, "click_counts": 15, "spend_time": 49}
{"index": {"_index": "test_pipeline_aggregation"}}
{"date": "2023-12-01", "show_counts": 463, "click_counts": 8, "spend_time": 60}
  3. Run the avg bucket aggregation
  • First bucket all documents by month with a date_histogram
  • Then run a sum metric aggregation inside each bucket
    • A sibling avg_bucket aggregation over total_show then gives the average monthly impressions across the whole year
GET /test_pipeline_aggregation/_search
{
  "size": 0,
  "aggs": {
    "show_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_show": {
          "sum": {
            "field": "show_counts"
          }
        }
      }
    },
    "avg_month_show": {
      "avg_bucket": {
        "buckets_path": "show_per_month>total_show"
      }
    }
  }
}

https://ithelp.ithome.com.tw/upload/images/20230923/20161866zpxcpz6RSh.png

  • The blue arrow is the initial bucket aggregation doing the grouping
  • The orange arrow is the metric aggregation inside each group
  • The red arrow is the final sibling avg bucket aggregation
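In plain Python, avg_bucket simply averages the per-bucket values of the metric it points at. Recomputing from the sample documents indexed above (a sketch of the arithmetic, not of how Elasticsearch executes it):

```python
from collections import defaultdict

# (date, show_counts) pairs copied from the bulk request above.
docs = [
    ("2023-01-01", 25), ("2023-01-15", 42), ("2023-02-01", 82), ("2023-02-15", 89),
    ("2023-03-01", 109), ("2023-03-15", 121), ("2023-04-01", 138), ("2023-04-15", 141),
    ("2023-05-01", 151), ("2023-05-15", 201), ("2023-06-01", 292), ("2023-06-15", 320),
    ("2023-07-01", 439), ("2023-07-15", 401), ("2023-08-01", 385), ("2023-08-15", 408),
    ("2023-09-01", 457), ("2023-10-01", 605), ("2023-11-01", 517), ("2023-12-01", 463),
]

# date_histogram (calendar_interval: month) + sum sub-aggregation.
total_show = defaultdict(int)
for date, shows in docs:
    total_show[date[:7]] += shows  # bucket key = year-month

# avg_bucket: the mean of the sibling aggregation's per-bucket sums.
avg_month_show = sum(total_show.values()) / len(total_show)
print(round(avg_month_show, 2))  # 448.83
```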

Derivative aggregation

  • A derivative is essentially a slope: the rate of change at each point
  • A parent pipeline aggregation

In the example below we compute the first derivative to see how clicks change from month to month.

GET /test_pipeline_aggregation/_search
{
  "size": 0,
  "aggs": {
    "show_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_click": {
          "sum": {
            "field": "click_counts"
          }
        },
        "click_deriv": {
          "derivative": {
            "buckets_path": "total_click"
          }
        }
      }
    }
  }
}

The result:


  "aggregations": {
    "show_per_month": {
      "buckets": [
        {
          "key_as_string": "2023-01-01",
          "key": 1672531200000,
          "doc_count": 2,
          "total_click": {
            "value": 6
          }
        },
        {
          "key_as_string": "2023-02-01",
          "key": 1675209600000,
          "doc_count": 2,
          "total_click": {
            "value": 4
          },
          "click_deriv": {
            "value": -2
          }
        },
        {
          "key_as_string": "2023-03-01",
          "key": 1677628800000,
          "doc_count": 2,
          "total_click": {
            "value": 6
          },
          "click_deriv": {
            "value": 2
          }
        },
        ...

  • Notice that click_deriv only appears from February onward: January has no earlier bucket to measure a change against
  • February's click_deriv is simply February's total clicks minus January's (4 − 6 = −2)
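A derivative over monthly buckets is just a first difference between consecutive bucket values. Recomputing from the sample data (monthly click totals as a plain list, in month order):

```python
# Monthly click totals summed from the sample documents (Jan..Dec).
total_click = [6, 4, 6, 10, 8, 15, 15, 9, 8, 11, 15, 8]

# derivative: each bucket minus the previous one; the first bucket has none.
click_deriv = [None] + [b - a for a, b in zip(total_click, total_click[1:])]
print(click_deriv[:3])  # [None, -2, 2] — matching the response above
```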

Cumulative sum aggregation

  • Useful for computing running totals
  • A parent pipeline aggregation
GET /test_pipeline_aggregation/_search
{
  "size": 0,
  "aggs": {
    "show_per_month": {
      "date_histogram": {
        "field": "date",
        "calendar_interval": "month"
      },
      "aggs": {
        "total_users_used": {
          "sum": {
            "field": "spend_time"
          }
        },
        "spend_time_cumulative_sum": {
          "cumulative_sum": {
            "buckets_path": "total_users_used"
          }
        }
      }
    }
  }
}

The result:


  "aggregations": {
    "show_per_month": {
      "buckets": [
        {
          "key_as_string": "2023-01-01",
          "key": 1672531200000,
          "doc_count": 2,
          "total_users_used": {
            "value": 21
          },
          "spend_time_cumulative_sum": {
            "value": 21
          }
        },
        {
          "key_as_string": "2023-02-01",
          "key": 1675209600000,
          "doc_count": 2,
          "total_users_used": {
            "value": 28
          },
          "spend_time_cumulative_sum": {
            "value": 49
          }
        },
        {
          "key_as_string": "2023-03-01",
          "key": 1677628800000,
          "doc_count": 2,
          "total_users_used": {
            "value": 24
          },
          "spend_time_cumulative_sum": {
            "value": 73
          }
        },
        ...

  • Each month's spend_time_cumulative_sum builds up the running total this way (21, 21 + 28 = 49, 49 + 24 = 73, ...)
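The running total can be reproduced with itertools.accumulate over the monthly sums from the sample data:

```python
from itertools import accumulate

# Monthly spend_time totals summed from the sample documents (Jan..Dec).
total_users_used = [21, 28, 24, 31, 40, 39, 47, 41, 23, 31, 49, 60]

# cumulative_sum: each bucket carries the running total so far.
cumulative = list(accumulate(total_users_used))
print(cumulative[:3])  # [21, 49, 73] — matching the response above
```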

For simple calculations we don't necessarily need a pipeline aggregation;
a combination of bucket and metric aggregations is enough.

But once the computation gets more complex, or intermediate values are needed to produce the final result,
pipeline aggregations let us do much more~

Tomorrow is the last day of the search part; the day after we move on to somewhat more advanced topics~

References
aggregation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
pipeline aggregation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html
https://blog.csdn.net/UbuntuTouch/article/details/103539437

