window function和 aggregate function類似,不過相較於aggregate function是將資料行以特定的欄位之數值進行分組,window function套用之後輸出的行數仍然是不變的,但是window function可以取得多個和當前資料行相關的資料來進行計算,並和當前的資料行做比較。
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
query的SELECT目標後面加上OVER clause,然後使用PARTITION BY來指定要做分組的欄位,像這條query的用途就是將每一位員工的部門名稱,員工編號,薪資後面和同部門員工的平均薪資一起輸出表格。
depname | empno | salary | avg
-----------+-------+--------+-----------------------
develop | 11 | 5200 | 5020.0000000000000000
develop | 7 | 4200 | 5020.0000000000000000
develop | 9 | 4500 | 5020.0000000000000000
develop | 8 | 6000 | 5020.0000000000000000
develop | 10 | 5200 | 5020.0000000000000000
personnel | 5 | 3500 | 3700.0000000000000000
personnel | 2 | 3900 | 3700.0000000000000000
sales | 3 | 4800 | 4866.6666666666666667
sales | 1 | 5000 | 4866.6666666666666667
sales | 4 | 4800 | 4866.6666666666666667
(10 rows)
OVER clause後面定義的分組稱為window frame,window frame除了可以在query裡面定義之外,也可以在query後面定義,透過WINDOW clause來定義多個不同的aggregate以及其別名來做使用。
forumdb=# select distinct category, count() over w1 ,count() over W2
from posts
WINDOW w1 as (partition by category),W2 as ()
order by category;
category | count | count
----------+-------+-------
10 | 1 | 5
11 | 3 | 5
12 | 1 | 5
(3 rows)
如果partition裡面是空的話,則代表該partition的範圍是整個table,如果裡面有order by則代表window frame的範圍是該partition的第一行到當前資料行(因為有一些window function只對window frame裡面的資料做處哩,而不是整個partition)。
用途:取得當前資料行在partition當中的順序作為ID
例子:
用途:在partition之內透過某個欄位的值進行排序
例子:
test=# SELECT country, year, production,
min(production) OVER (PARTITION BY country ORDER BY year)
FROM t_oil
WHERE year BETWEEN 1978 AND 1983
AND country IN ('Iran', 'Oman');
country | year | production | min
---------+-----+------------+------
Iran | 1978 | 5302 | 5302
Iran | 1979 | 3218 | 3218
Iran | 1980 | 1479 | 1479
Iran | 1981 | 1321 | 1321
Iran | 1982 | 2397 | 1321
Iran | 1983 | 2454 | 1321
Oman | 1978 | 314 | 314
Oman | 1979 | 295 | 295
Oman | 1980 | 285 | 285
Oman | 1981 | 330 | 285
可以從查詢結果看到,每個partition裡面的結果都用年份排序過了。並且由於在window frame裡面設定order by的緣故,min() function得到的結果是partition內第一筆紀錄到該筆紀錄的最小值。
用途:回傳partition中該欄位的第一行的值/最後一行的值,建議在partition當中搭配ORDER BY來使用,以確保每一次查詢出來的值都是一樣
test=# SELECT year, production,
first_value(production) OVER (ORDER BY year)
FROM t_oil
WHERE country = 'Canada'
LIMIT 4;
year | production | first_value
-------+------------+-------------
1965 | 920 | 920
1966 | 1012 | 920
1967 | 1106 | 920
1968 | 1194 | 920
(4 rows)
用途:回傳partition中該欄位的第n行的值
test=# SELECT year, production,
nth_value(production, 3) OVER (ORDER BY year)
FROM t_oil
WHERE country = 'Canada';
year | production | nth_value
-------+------------+-----------
1965 | 920 |
1966 | 1012 |
1967 | 1106 | 1106
1968 | 1194 | 1106
由於nth_value在前兩行因為有order by的關係,window frame大小<3,所以無法取得第三筆資料的內容。
用途:取得該資料行的特定欄位值在partition中的排名
語法為RANK() OVER (
[PARTITION BY partition_expression, ... ]
ORDER BY sort_expression [ASC | DESC], ...
)
rank()在官方文件中的說法是"Returns the rank of the current row, with gaps; that is, the row_number of the first row in its peer group",也就是說欄位值如果有重複的話,rank值會跟上一行的一樣,然後下一行的欄位值如果有變動,rank會直接跳到下一行的row number...
例子:
test=# SELECT year, production,
rank() OVER (ORDER BY production)
FROM t_oil
WHERE country = 'Other Middle East'
ORDER BY rank
LIMIT 7;
year | production | rank
-------+------------+------
2001 | 47 | 1
2004 | 48 | 2
2002 | 48 | 2
1999 | 48 | 2
2000 | 48 | 2
2003 | 48 | 2
1998 | 49 | 7
(7 rows)
也有另外一個稱為dense_rank()的function,和rank()的差別在於當rank數字增加的時候一次只會加一。
test=# SELECT year, production,
dense_rank() OVER (ORDER BY production)
FROM t_oil
WHERE country = 'Other Middle East'
ORDER BY dense_rank
LIMIT 7;
year | production | dense_rank
-------+------------+------------
2001 | 47 | 1
2004 | 48 | 2
...
2003 | 48 | 2
1998 | 49 | 3
(7 rows)
lead() function可以取得當前資料行後面一行的數值,lag()可以取得當前資料行前面一行的數值
以下的例子透過lag() window function,將前一行的production欄位資料加在每一行的後面。
test=# SELECT year, production,
lag(production, 1) OVER (ORDER BY year)
FROM t_oil
WHERE country = 'Mexico'
LIMIT 5;
year | production | lag
-------+------------+-----
1965 | 362 |
1966 | 370 | 362
1967 | 411 | 370
1968 | 439 | 411
1969 | 461 | 439
(5 rows)
用途:回傳欄位值不大於當前資料行(及相同數值的peer)的資料行在table中的比率
forumdb=# select x,cume_dist() over w from (select generate_series(1,5) as
x) V WINDOW w as (order by x) ;
x | cume_dist
---+-----------
1 | 0.2
2 | 0.4
3 | 0.6
4 | 0.8
5 | 1
(5 rows)