情境:有許多 sensor, 一直傳送資料,我們想取得最新的資料.
create unlogged table it201011c (
id bigint not null primary key
, sensor_id int not null
, val int not null
);
insert into it201011c
select n
, ceil(random() * 1e5)
, random() * 100
from generate_series(1,1e7) n;
輸入一千萬筆資料,注意到這次使用了 unlogged table.
為了方便起見,id 越大,代表資料越新.
實務上可以使用id或是時間戳.
一般的作法,利用 window function 的 row_number() , rank()
select id, sensor_id, val
from (select *
, row_number() over(partition by sensor_id order by id desc) rn
from it201011c) a
where a.rn = 1;
(100000 rows)
Time: 14941.510 ms (00:14.942)
取得十萬筆資料,速度還不賴.
explain (analyze,timing,costs)
select id, sensor_id, val
from (select *
, row_number() over(partition by sensor_id order by id desc) rn
from it201011c) a
where a.rn = 1;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| Subquery Scan on a (cost=1658507.15..1983502.60 rows=49999 width=16) (actual time=10466.416..18693.056 rows=100000 loops=1) |
| Filter: (a.rn = 1) |
| Rows Removed by Filter: 9900000 |
| -> WindowAgg (cost=1658507.15..1858504.35 rows=9999860 width=24) (actual time=10466.414..17789.984 rows=10000000 loops=1) |
| -> Sort (cost=1658507.15..1683506.80 rows=9999860 width=16) (actual time=10466.401..12646.441 rows=10000000 loops=1) |
| Sort Key: it201011c.sensor_id, it201011c.id DESC |
| Sort Method: external merge Disk: 254472kB |
| -> Seq Scan on it201011c (cost=0.00..154053.60 rows=9999860 width=16) (actual time=2.679..1516.206 rows=10000000 loops=1) |
| Planning Time: 0.254 ms |
| Execution Time: 19242.739 ms |
+-------------------------------------------------------------------------------------------------------------------------------------------+
觀察到 Sort Key: it201011c.sensor_id, it201011c.id DESC
執行時間在 19~14 秒.
建立 index.
create index on it201011c(sensor_id, id desc);
commit;
analyze it201011c;
select id, sensor_id, val
from (select *
, row_number() over(partition by sensor_id order by id desc) rn
from it201011c) a
where a.rn = 1;
Time: 46248.666 ms (00:46.249)
explain (analyze,timing,costs)
select id, sensor_id, val
from (select *
, row_number() over(partition by sensor_id order by id desc) rn
from it201011c) a
where a.rn = 1;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Subquery Scan on a (cost=0.43..820237.17 rows=49999 width=16) (actual time=3.364..43062.106 rows=100000 loops=1) |
| Filter: (a.rn = 1) |
| Rows Removed by Filter: 9900000 |
| -> WindowAgg (cost=0.43..695238.92 rows=9999860 width=24) (actual time=3.360..41889.384 rows=10000000 loops=1) |
| -> Index Scan using it201011c_sensor_id_id_idx on it201011c (cost=0.43..520241.37 rows=9999860 width=16) (actual time=3.328..35215.809 rows=10000000 loops=1) |
| Planning Time: 0.172 ms |
| Execution Time: 43078.930 ms |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
雖然 index 建立的方式,也符合排序的需求,也正確的使用了 index, 但是實際存取時,反而變慢了....
因為我們建立的 index 是 id desc,與 id 生長方面相反,但要取得最新,就是要抓大的,而且還要多透過index,
沒index 時,還要利用 external merge Disk: 254472kB, 但反而速度較快.
所以資料庫的優化,並不是建立 index 就必然加快速度.
看過昨天的利用遞迴來加快,那我們來看看如何利用遞迴來加快這樣的查詢.
我們透過與昨天的類似的方式,來嘗試使用以下 SQL Command
with recursive t1 as (
select sensor_id, val
from it201011c
where id in (
select id
from it201011c
where sensor_id is not null
order by sensor_id, id desc
limit 1
)
union all
select (select sensor_id, val
from it201011c
where id in (
select id
from it201011c a
where a.sensor_id > t1.sensor_id
and a.sensor_id is not null
order by a.sensor_id, a.id desc
limit 1
)
)
from t1
where t1.sensor_id is not null
)
select sensor_id, val
from t1
where t1.sensor_id is not null;
ERROR: 42601: subquery must return only one column
LINE 12: select (select sensor_id, val
出現了錯誤訊息.....
不過方法是人想出來的,既然只能一個欄位,那我們就建立 type, 來包含兩個欄位,
type 本身只算一個欄位.
create type ty1011 as (sensor_id int, val int);
with recursive t1 as (
select (sensor_id, val)::ty1011 as sr
from it201011c
where id in (
select id
from it201011c
where sensor_id is not null
order by sensor_id, id desc
limit 1
)
union all
select (select (sensor_id, val)::ty1011 as sr
from it201011c
where id in (
select id
from it201011c a
where a.sensor_id > (t1.sr).sensor_id
and a.sensor_id is not null
order by a.sensor_id, a.id desc
limit 1
)
)
from t1
where (t1.sr).sensor_id is not null
)
select (t1.sr).sensor_id, (t1.sr).val
from t1
where t1.* is not null;
Time: 12371.356 ms (00:12.371)
Time: 5858.400 ms (00:05.858)
第一次執行會久一些 12秒,第二次就降到 5秒.
這樣我們就可以利用遞迴,來取得十萬個sensor的最新資料,而且速度也較快.
Execution Time: 3929.956 ms <-- 第三次, 查看執行計畫時,更快了!
執行計畫就不列出了,有興趣的,可以自己測試看看.