iT邦幫忙

2024 iThome 鐵人賽

DAY 25
0
AI/ ML & Data

粗暴的資料處理 DuckDB系列 第 25

Day25-- DuckDB Style SQL (19) ?

  • 分享至 

  • xImage
  •  

Apache Kafka 作為 Streaming 的王者,每天有數不清的 issuepull request. 今天我們就來用 DuckDB 分析最近 100 局 KafKa PR 中有沒有什麼特別之處

受益於 github 良好的 rest api,我們透過 DuckDB 建立一個最近 100 隻 PR 資料的 kafka table 就是這麼簡單

CREATE TABLE kafka AS 
FROM read_json_auto("https://api.github.com/repos/apache/kafka/pulls?state=closed&sort=updated&direction=desc&per_page=100");

最近 100 個 PR 中貢獻者有哪些

SELECT user.login as name,  count(*) as  pr
FROM kafka
GROUP BY name
ORDER BY pr DESC;

最近 100 個被 merge 的 PR 花了多長

SELECT
    number AS pr_number,
    created_at,
    merged_at,
    AGE(merged_at, created_at) AS merge_duration
FROM
    kafka
WHERE
    merged_at IS NOT NULL
ORDER BY
    merge_duration DESC;

最近 100 個 PR 有多少在
1 小時內
1 天內
1 週內
1 個月內
合併

SELECT 
  CASE 
    WHEN AGE(merged_at, created_at) < INTERVAL '1 hour' THEN 'Less than 1 hour'
    WHEN AGE(merged_at, created_at) < INTERVAL '1 day' THEN '1 hour to 1 day'
    WHEN AGE(merged_at, created_at) < INTERVAL '1 week' THEN '1 day to 1 week'
    WHEN AGE(merged_at, created_at) < INTERVAL '1 month' THEN '1 week to 1 month'
    ELSE 'More than 1 month'
  END as duration_range,
  COUNT(*) as pr_count
FROM 
  kafka
WHERE 
  merged_at IS NOT NULL
GROUP BY 
  duration_range
ORDER BY 
  pr_count DESC;

星期一到星期日平均有多少 PR 被 create,平均又花多久 merged
看得出來星期六和星期日 沒有人想寫 code 😅

 SELECT
      EXTRACT(DOW FROM created_at) AS day_of_week,
      AVG(EXTRACT(HOUR FROM AGE(merged_at, created_at)) +
          EXTRACT(DAY FROM AGE(merged_at, created_at)) * 24 +
          EXTRACT(MINUTE FROM AGE(merged_at, created_at)) / 60.0) AS avg_merge_duration_hours,
      COUNT(*) AS pr_count
  FROM
      kafka
  WHERE
      merged_at IS NOT NULL
  GROUP BY
      EXTRACT(DOW FROM created_at)
  ORDER BY
      day_of_week;

平均每的貢獻者的 PR 多久會被合併,看得出來誰的面子最大

 SELECT
      user->>'login' AS author,
      AVG(EXTRACT(EPOCH FROM AGE(merged_at, created_at)) / 3600.0) AS avg_merge_duration_hours,
      COUNT(*) AS pr_count
  FROM
      kafka
  WHERE
      merged_at IS NOT NULL
  GROUP BY
      user->>'login'
  ORDER BY
      avg_merge_duration_hours;

一些大大
mumrah │ 28.04715277777778 │ 8 │
mimaison │ 38.41796296296296 │ 6 │
fvaleri │ 20.85527777777778 │ 4 │

這個主題好有趣,明天繼續分析

BTW , 所有 query 都可以跑在 https://shell.duckdb.org/


上一篇
Day24-- DuckDB Style SQL (18) ?
下一篇
Day26 -- DuckDB Style SQL (20) ?
系列文
粗暴的資料處理 DuckDB30
圖片
  直播研討會
圖片
{{ item.channelVendor }} {{ item.webinarstarted }} |
{{ formatDate(item.duration) }}
直播中

尚未有邦友留言

立即登入留言