Apache Kafka is the king of streaming, with countless issues and pull requests landing every day. Today we'll use DuckDB to dig into the last 100 closed Kafka PRs and see whether anything interesting stands out.
Thanks to GitHub's well-designed REST API, building a kafka table in DuckDB with the data for the latest 100 PRs is this simple:
CREATE TABLE kafka AS
FROM read_json_auto('https://api.github.com/repos/apache/kafka/pulls?state=closed&sort=updated&direction=desc&per_page=100');
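The GitHub API returns quite a wide JSON document, and read_json_auto infers a schema for every field. To sanity-check what landed in the table, DESCRIBE shows the inferred columns:

-- Inspect the columns read_json_auto inferred from the API response
DESCRIBE kafka;

-- And confirm we actually got 100 rows back
SELECT COUNT(*) AS total_prs FROM kafka;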
Who contributed the last 100 PRs?
SELECT user.login AS name, COUNT(*) AS pr
FROM kafka
GROUP BY name
ORDER BY pr DESC;
How long did each of the last 100 merged PRs take?
SELECT
    number AS pr_number,
    created_at,
    merged_at,
    AGE(merged_at, created_at) AS merge_duration
FROM kafka
WHERE merged_at IS NOT NULL
ORDER BY merge_duration DESC;
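Sorting shows the extremes, but a handful of long-lived PRs can distort our sense of the typical case. A minimal sketch that summarizes the distribution using DuckDB's QUANTILE_CONT (the fastest_hours / median_hours / slowest_hours names are my own):

SELECT
    MIN(EXTRACT(EPOCH FROM (merged_at - created_at))) / 3600.0 AS fastest_hours,
    QUANTILE_CONT(EXTRACT(EPOCH FROM (merged_at - created_at)), 0.5) / 3600.0 AS median_hours,
    MAX(EXTRACT(EPOCH FROM (merged_at - created_at))) / 3600.0 AS slowest_hours
FROM kafka
WHERE merged_at IS NOT NULL;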
How many of the last 100 PRs were merged within 1 hour, within 1 day, within 1 week, and within 1 month?
SELECT
    CASE
        WHEN AGE(merged_at, created_at) < INTERVAL '1 hour' THEN 'Less than 1 hour'
        WHEN AGE(merged_at, created_at) < INTERVAL '1 day' THEN '1 hour to 1 day'
        WHEN AGE(merged_at, created_at) < INTERVAL '1 week' THEN '1 day to 1 week'
        WHEN AGE(merged_at, created_at) < INTERVAL '1 month' THEN '1 week to 1 month'
        ELSE 'More than 1 month'
    END AS duration_range,
    COUNT(*) AS pr_count
FROM kafka
WHERE merged_at IS NOT NULL
GROUP BY duration_range
ORDER BY pr_count DESC;
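With only 100 PRs the raw counts are readable, but a share-of-total column makes the buckets easier to compare. A sketch of the same query with a window function over the grouped result (the pct column is my own addition):

SELECT
    CASE
        WHEN AGE(merged_at, created_at) < INTERVAL '1 hour' THEN 'Less than 1 hour'
        WHEN AGE(merged_at, created_at) < INTERVAL '1 day' THEN '1 hour to 1 day'
        WHEN AGE(merged_at, created_at) < INTERVAL '1 week' THEN '1 day to 1 week'
        WHEN AGE(merged_at, created_at) < INTERVAL '1 month' THEN '1 week to 1 month'
        ELSE 'More than 1 month'
    END AS duration_range,
    COUNT(*) AS pr_count,
    -- share of all merged PRs, computed with a window over the grouped result
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM kafka
WHERE merged_at IS NOT NULL
GROUP BY duration_range
ORDER BY pr_count DESC;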
How many PRs get created on each day from Monday to Sunday, and how long do they take to merge on average? You can tell nobody wants to write code on Saturday and Sunday 😅
SELECT
    EXTRACT(DOW FROM created_at) AS day_of_week, -- 0 = Sunday ... 6 = Saturday
    -- EXTRACT(EPOCH ...) counts total seconds, so PRs that take longer than a
    -- month are measured correctly (summing only the HOUR/DAY/MINUTE parts of
    -- AGE() silently drops the month component)
    AVG(EXTRACT(EPOCH FROM (merged_at - created_at)) / 3600.0) AS avg_merge_duration_hours,
    COUNT(*) AS pr_count
FROM kafka
WHERE merged_at IS NOT NULL
GROUP BY day_of_week
ORDER BY day_of_week;
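Note that DOW is just a number (0 = Sunday through 6 = Saturday, the same convention as Postgres), which is easy to misread. A variant using DuckDB's built-in dayname() makes the weekend dip readable at a glance:

-- Same grouping, but with human-readable day names
SELECT
    dayname(created_at) AS day_name,
    COUNT(*) AS pr_count
FROM kafka
GROUP BY day_name
ORDER BY pr_count DESC;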
On average, how long does each contributor's PR take to get merged? This shows whose clout is biggest.
SELECT
    -- read_json_auto infers user as a STRUCT, so dot access matches the
    -- earlier query (->> is for JSON-typed columns)
    user.login AS author,
    AVG(EXTRACT(EPOCH FROM AGE(merged_at, created_at)) / 3600.0) AS avg_merge_duration_hours,
    COUNT(*) AS pr_count
FROM kafka
WHERE merged_at IS NOT NULL
GROUP BY author
ORDER BY avg_merge_duration_hours;
A few of the big names:

author   │ avg_merge_duration_hours │ pr_count
mumrah   │ 28.04715277777778        │ 8
mimaison │ 38.41796296296296        │ 6
fvaleri  │ 20.85527777777778        │ 4
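One caveat before crowning anyone: a contributor with a single fast-tracked PR can top this ranking by luck. A small tweak that only ranks contributors with at least 3 merged PRs in this window (the cutoff of 3 is my own arbitrary choice):

SELECT
    user.login AS author,
    AVG(EXTRACT(EPOCH FROM AGE(merged_at, created_at)) / 3600.0) AS avg_merge_duration_hours,
    COUNT(*) AS pr_count
FROM kafka
WHERE merged_at IS NOT NULL
GROUP BY author
HAVING COUNT(*) >= 3 -- arbitrary cutoff; one-off PRs skew the average
ORDER BY avg_merge_duration_hours;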
This topic is really interesting; the analysis continues tomorrow.
BTW, all of these queries can be run on https://shell.duckdb.org/.