在之前的一些分享中,有一些關於直轉橫的方式,也有固定時段與不固定時段的統計方式.
這篇會對直轉橫統計及統計後直轉橫做一些探討.
create table t231122a (
id int not null generated always as identity primary key
, gal text
, dt date
, qty int
);
insert into t231122a (gal, dt, qty)
select gal
, date '2023-10-31' + interval '1 day' * n
, floor(random() * 100)
from generate_series(1, 4) n
, (values ('水卜櫻'), ('石川澪'), ('七沢米亞')) a(gal);
-- 第二次
insert into t231122a (gal, dt, qty)
select gal
, date '2023-10-31' + interval '1 day' * n
, floor(random() * 100)
from generate_series(1, 4) n
, (values ('水卜櫻'), ('石川澪'), ('七沢米亞')) a(gal);
select *
from t231122a;
id | gal | dt | qty
----+----------+------------+-----
1 | 水卜櫻 | 2023-11-01 | 83
2 | 石川澪 | 2023-11-01 | 49
3 | 七沢米亞 | 2023-11-01 | 46
4 | 水卜櫻 | 2023-11-02 | 84
5 | 石川澪 | 2023-11-02 | 81
6 | 七沢米亞 | 2023-11-02 | 48
7 | 水卜櫻 | 2023-11-03 | 25
8 | 石川澪 | 2023-11-03 | 86
9 | 七沢米亞 | 2023-11-03 | 60
10 | 水卜櫻 | 2023-11-04 | 27
11 | 石川澪 | 2023-11-04 | 92
12 | 七沢米亞 | 2023-11-04 | 37
13 | 水卜櫻 | 2023-11-01 | 62
14 | 石川澪 | 2023-11-01 | 76
15 | 七沢米亞 | 2023-11-01 | 65
16 | 水卜櫻 | 2023-11-02 | 82
17 | 石川澪 | 2023-11-02 | 61
18 | 七沢米亞 | 2023-11-02 | 58
19 | 水卜櫻 | 2023-11-03 | 47
20 | 石川澪 | 2023-11-03 | 80
21 | 七沢米亞 | 2023-11-03 | 63
22 | 水卜櫻 | 2023-11-04 | 8
23 | 石川澪 | 2023-11-04 | 61
24 | 七沢米亞 | 2023-11-04 | 21
(24 rows)
select gal
, sum(case
when dt = date '2023-11-01' then qty
else 0
end) as d1
, sum(case
when dt = date '2023-11-02' then qty
else 0
end) as d2
, sum(case
when dt = date '2023-11-03' then qty
else 0
end) as d3
, sum(case
when dt = date '2023-11-04' then qty
else 0
end) as d4
from t231122a
group by gal;
gal | d1 | d2 | d3 | d4
----------+-----+-----+-----+-----
水卜櫻 | 145 | 166 | 72 | 35
石川澪 | 125 | 142 | 166 | 153
七沢米亞 | 111 | 106 | 123 | 58
(3 rows)
case方式,大多數主流資料庫都有支援.缺點是繁瑣,語法寫起來不方便.
時常會看到一些寫的很長的,典型的是12個月份的.
select gal
, sum(qty) filter (where dt = date '2023-11-01') as d1
, sum(qty) filter (where dt = date '2023-11-02') as d2
, sum(qty) filter (where dt = date '2023-11-03') as d3
, sum(qty) filter (where dt = date '2023-11-04') as d4
from t231122a
group by gal;
gal | d1 | d2 | d3 | d4
----------+-----+-----+-----+-----
水卜櫻 | 145 | 166 | 72 | 35
石川澪 | 125 | 142 | 166 | 153
七沢米亞 | 111 | 106 | 123 | 58
(3 rows)
filter 是PostgreSQL特有的,語法會比case 方便清晰.
不只是sum()之類的可以使用.
select gal
, array_agg(qty) filter (where dt = date '2023-11-01') as d1
, array_agg(qty) filter (where dt = date '2023-11-02') as d2
, array_agg(qty) filter (where dt = date '2023-11-03') as d3
, array_agg(qty) filter (where dt = date '2023-11-04') as d4
from t231122a
group by gal;
gal | d1 | d2 | d3 | d4
----------+---------+---------+---------+---------
水卜櫻 | {83,62} | {84,82} | {25,47} | {27,8}
石川澪 | {49,76} | {81,61} | {86,80} | {92,61}
七沢米亞 | {46,65} | {48,58} | {60,63} | {37,21}
這樣在做發展的過程中,可以方便對照,確認數字的正確性.
可以透過輔助表join之後取得結果,在我一些分享中有介紹過,在此就不贅述.
Oracle / SQL Server 有 pivot
PostgreSQL 使用 tablefunc
https://www.postgresql.org/docs/current/tablefunc.html
這兩種方式,我有分享,請參考傳送門
在常見的應用中,除了之前有提到的12個月分,還有一個月份,或是一段日期也許是20天或是45天等等.像是商品的每日銷售額,或是議題的點擊量.
在之前的分享中,是以tag為例,但是因為日期是數字開頭,這樣不能建立欄位名稱,需要變通一下.所以就用來舉例,順帶把方式變通方式一併介紹.
with t1 as (
select gal, dt, sum(qty) as qtys
from t231122a
group by gal, dt
)
select gal
, array_agg(dt order by dt) as dtarr
, array_agg(qtys order by dt) as qtyarr
from t1
group by gal;
gal | dtarr | qtyarr
----------+-----------------------------------------------+-------------------
水卜櫻 | {2023-11-01,2023-11-02,2023-11-03,2023-11-04} | {145,166,72,35}
石川澪 | {2023-11-01,2023-11-02,2023-11-03,2023-11-04} | {125,142,166,153}
七沢米亞 | {2023-11-01,2023-11-02,2023-11-03,2023-11-04} | {111,106,123,58}
(3 rows)
with t1 as (
select gal, dt, sum(qty) as qtys
from t231122a
group by gal, dt
), t2 as (
select gal
, array_agg(dt order by dt) as dtarr
, array_agg(qtys order by dt) as qtyarr
from t1
group by gal
)
select gal
, hstore(dtarr::text[], qtyarr::text[]) as hs1
from t2;
-[ RECORD 1 ]---------------------------------------------------------------------------
gal | 水卜櫻
hs1 | "2023-11-01"=>"145", "2023-11-02"=>"166", "2023-11-03"=>"72", "2023-11-04"=>"35"
-[ RECORD 2 ]---------------------------------------------------------------------------
gal | 石川澪
hs1 | "2023-11-01"=>"125", "2023-11-02"=>"142", "2023-11-03"=>"166", "2023-11-04"=>"153"
-[ RECORD 3 ]---------------------------------------------------------------------------
gal | 七沢米亞
hs1 | "2023-11-01"=>"111", "2023-11-02"=>"106", "2023-11-03"=>"123", "2023-11-04"=>"58"
with t1 as (
select gal, dt, sum(qty) as qtys
from t231122a
group by gal, dt
), t2 as (
select gal
, array_agg('d'|| to_char(dt, 'mmDD') order by dt) as dtarr
, array_agg(qtys::text order by dt) as qtyarr
from t1
group by gal
)
select gal
, hstore(dtarr, qtyarr) as hs1
from t2;
gal | hs1
----------+----------------------------------------------------------------
水卜櫻 | "d1101"=>"145", "d1102"=>"166", "d1103"=>"72", "d1104"=>"35"
石川澪 | "d1101"=>"125", "d1102"=>"142", "d1103"=>"166", "d1104"=>"153"
七沢米亞 | "d1101"=>"111", "d1102"=>"106", "d1103"=>"123", "d1104"=>"58"
(3 rows)
DO LANGUAGE plpgsql $$
DECLARE v_sqlstring TEXT = '';
BEGIN
v_sqlstring := concat('create temp table tmpxt231122b as select ',
(select string_agg(concat('NULL::int AS d', to_char(dt, 'mmDD')), ' ,'
order by dt)
from generate_series(date '2023-11-01'
, date '2023-11-04', interval '1 day') dt
));
EXECUTE(v_sqlstring);
END $$;
---
# \d tmpxt231122b
Table "pg_temp_3.tmpxt231122b"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
d1101 | integer | | |
d1102 | integer | | |
d1103 | integer | | |
d1104 | integer | | |
with t1 as (
select gal, dt, sum(qty) as qtys
from t231122a
group by gal, dt
), t2 as (
select gal
, array_agg('d'|| to_char(dt, 'mmDD') order by dt) as dtarr
, array_agg(qtys order by dt) as qtyarr
from t1
group by gal
), t3 as (
select gal
, hstore(dtarr, qtyarr::text[]) as hs1
from t2
)
select gal
, (populate_record(null::tmpxt231122b, hs1)).*
from t3;
gal | d1101 | d1102 | d1103 | d1104
----------+-------+-------+-------+-------
水卜櫻 | 145 | 166 | 72 | 35
石川澪 | 125 | 142 | 166 | 153
七沢米亞 | 111 | 106 | 123 | 58
(3 rows)
在前面的例子中,已經把日期的方式做了展示,array當基礎,可以轉為hstore,需要時再做展開.
因為統計需要消耗計算資源與時間,加以做統計時紀錄表可以正在做寫入等動作,所以我們可以先將統計資料存起來,避免多次存取.
create table t231122c (
id int not null generated always as identity primary key
, gal text
, qtys hstore
, qtyarr int[]
);
with t1 as (
select gal, dt, sum(qty) as qtys
from t231122a
group by gal, dt
), t2 as (
select gal
, array_agg('d'|| to_char(dt, 'mmDD') order by dt) as dtarr
, array_agg(qtys order by dt) as qtyarr
from t1
group by gal
)
insert into t231122c(gal, qtys, qtyarr)
select gal
, hstore(dtarr, qtyarr::text[])
, qtyarr
from t2;
select *
from t231122c;
-[ RECORD 1 ]----------------------------------------------------------
id | 1
gal | 水卜櫻
qtys | "d1101"=>"145", "d1102"=>"166", "d1103"=>"72", "d1104"=>"35"
qtyarr | {145,166,72,35}
-[ RECORD 2 ]----------------------------------------------------------
id | 2
gal | 石川澪
qtys | "d1101"=>"125", "d1102"=>"142", "d1103"=>"166", "d1104"=>"153"
qtyarr | {125,142,166,153}
-[ RECORD 3 ]----------------------------------------------------------
id | 3
gal | 七沢米亞
qtys | "d1101"=>"111", "d1102"=>"106", "d1103"=>"123", "d1104"=>"58"
qtyarr | {111,106,123,58}
資料庫適合做大量資料的處理,但是每次都要下指令.當我們要做資料分析時,需要資料庫把資料撈出來,這時候看到資料,才能再做後續的處理,例如挑選某些日期某些類別再做加總等等.所以才會有先把資料聚合後直轉橫,再傳給試算表或是給R,Python Panda 再來做處理,像是某個日期的加總或平均,較為方便直觀.
但是試算表之類的速度較慢,若要用資料庫,可能建立如31個日期欄位,再把直轉橫的結果存看來,但是這樣要做縱向的加總平均,SQL 指令會很繁瑣.
最近vector已經開始引進了.在PostgreSQL 有 pgvector 這個extension.傳送門
-- 將array 轉為 vector
select id
, array_to_vector(qtyarr,4,true)
, array_to_vector(qtyarr,4,false)
, qtyarr::vector
from t231122c;
id | array_to_vector | array_to_vector | qtyarr
----+-------------------+-------------------+-------------------
1 | [145,166,72,35] | [145,166,72,35] | [145,166,72,35]
2 | [125,142,166,153] | [125,142,166,153] | [125,142,166,153]
3 | [111,106,123,58] | [111,106,123,58] | [111,106,123,58]
(3 rows)
-- 縱向加總
with t1 as (
select qtyarr::vector as v1
from t231122c
)
select sum(v1)
from t1;
sum
-------------------
[381,414,361,246]
select qtyarr
from t231122c;
qtyarr
-------------------
{145,166,72,35}
{125,142,166,153}
{111,106,123,58}
(3 rows)
-- 驗算
select 145+125+111;
?column?
----------
381
-- 縱向平均
with t1 as (
select qtyarr::vector as v1
from t231122c
)
select sum(v1)
, avg(v1)
from t1;
sum | avg
-------------------+-------------------------
[381,414,361,246] | [127,138,120.333336,82]
(1 row)
select 381 / 3;
?column?
----------
127
可以觀察到,我們可以透過array轉為vector,可以很方便的做縱向的加總與平均.
我們可以做單日加總後,存到array,同時將array轉為hstore與vector存到分析工作用的table.
需要轉為單日欄位的,可以透過前面介紹的方式,轉出來.
也可以轉為json,方便傳給其他系統.
也可以利用vector,方便做大量的縱向加總平均.這樣當我們在做分析時,
例如品項也可以帶tag,而這些tag也是利用array存放,還記得前面分享的佐山愛嗎?這樣就能方便分類計算,試算表需要一行行拉,資料庫一下就30天一起算,更不用說強大的過濾條件.
至於array橫向的計算,之前有一篇做多欄位計算,就是利用array,所以array的橫向計算是不成問題的,另外也有extension提供了一些函數,這個會另外安排分享.
連續幾篇,介紹了一些統計的方法,希望這些分享對大家能有一些幫助.
感謝水卜櫻,石川澪,七沢米亞,當然還有愛醬.