以Postgresql為主,再聊聊資料庫 PostgreSQL Array 的應用

postgresql rdbms sql

一級屠豬士 2020-10-13 13:17:08 ‧ 1678 瀏覽

分享至

PostgreSQL 有 Array, 讓我們在處理資料或是分析資料時,
十分地便利.

情境:某電商有10萬種商品,上個月有一千萬筆銷售紀錄,每筆銷售有1~10種商品.

我們使用 Array,可以很方便的建立以下的 table.

create table it201012a (
  id int generated always as identity
, items int[] not null
);

產生測試資料的函數

create or replace function ranitemarr()
returns int[]
as $$
  select array_agg(ceil(random() * 100000)::int)
    from generate_series(1, ceil(random() * 10)::int);
$$
language sql;

insert into it201012a(items)
select ranitemarr()
  from generate_series(1,1e7);

commit;

接下來我們可以開發一個array內元素所有可能組合的函數.
在 stackoverflow 中 已經有一個這樣的範例

https://stackoverflow.com/questions/30515990/postgresql-find-all-possible-combinations-permutations-in-recursive-query

遞迴啊~~~ 遞迴的方式,雖然已經有開發完成,但是頗佔篇幅,今天先不介紹.

Python 的 itertools 裡面也有 combinations 可以利用
https://docs.python.org/zh-cn/3/library/itertools.html#itertools.combinations

我們將利用 combinations 來開發三個PlPython函數.

create or replace function pyarrcombin (lst int[])
returns setof int[]
as $$
  from itertools import combinations
  return [list(t) for r in range(1, len(lst)+1) for t in combinations(lst,r)]
$$ language plpython3u;

create or replace function pyarrcombin (lst int[], n int)
returns setof int[]
as $$
  from itertools import combinations
  return [list(t) for t in combinations(lst,n)]
$$ language plpython3u;

create or replace function pyarrcombin (lst int[], n int, f int)
returns setof int[]
as $$
  from itertools import combinations
  return [list(t) for t in combinations(lst,n) if f in t]
$$ language plpython3u;

-------
看一下使用案例
產生全部的組合
select pyarrcombin(array[1,2,3]);
+-------------+
| pyarrcombin |
+-------------+
| {1}         |
| {2}         |
| {3}         |
| {1,2}       |
| {1,3}       |
| {2,3}       |
| {1,2,3}     |
+-------------+
(7 rows)

產生符合指定的數目元素
select pyarrcombin(array[1,2,3],2);
+-------------+
| pyarrcombin |
+-------------+
| {1,2}       |
| {1,3}       |
| {2,3}       |
+-------------+
(3 rows)

產生符合指定的數目元素,且包含指定元素
select pyarrcombin(array[1,2,3],2,3);
+-------------+
| pyarrcombin |
+-------------+
| {1,3}       |
| {2,3}       |
+-------------+
(2 rows)

--------
使用產生組合的函數,來看應用.

先來幫測試table 建立 index 吧.

create index on it201012a using gin(items);

analyze it201012a;

commit;

-------
select *
  from it201012a
 fetch first 3 rows only;

+----+---------------------------------------------------------------+
| id |                             items                             |
+----+---------------------------------------------------------------+
|  1 | {77367,65384,94905,83558,89606,66216,13380,89461,73248,65171} |
|  2 | {30193,61007,48154}                                           |
|  3 | {42786}                                                       |
+----+---------------------------------------------------------------+
(3 rows)

假設以 61007 ,查詢有哪些紀錄有包含.

select *
  from it201012a
 where items @> array[61007];

+---------+---------------------------------------------------------------+
|   id    |                             items                             |
+---------+---------------------------------------------------------------+
|       2 | {30193,61007,48154}                                           |
|    2706 | {61007,39231,7845,41394,22950,57433,41700,84929,90262}        |
....
| 9997914 | {61007}                                                       |
+---------+---------------------------------------------------------------+
(537 rows)

查詢與 61007 搭配的品項,出現次數依序降冪. 

select count(*)
     , pyarrcombin(items, 2, 61007)
  from it201012a
 where items @> array[61007]
 group by 2
 order by 1 desc;

+-------+---------------+
| count |  pyarrcombin  |
+-------+---------------+
|     2 | {61007,9629}  |
|     2 | {42904,61007} |
|     2 | {61007,80684} |
|     2 | {20941,61007} |
|     2 | {61007,10792} |
|     2 | {61007,70903} |
...
|     1 | {1439,61007}  |
+-------+---------------+
(3245 rows)
Time: 62.261 ms

可見得 61007 與 其他品項一起購買的情況是有不少,但沒有特別明顯的組合.

----
接下來找 61007 的哼哈二將

select count(*)
     , pyarrcombin(items, 3, 61007)
  from it201012a
 where items @> array[61007]
 group by 2
 order by 1 desc;

+-------+---------------------+
| count |     pyarrcombin     |
+-------+---------------------+
|     1 | {97,15451,61007}    |
|     1 | {97,16040,61007}    |
...
|     1 | {99987,61007,68303} |
+-------+---------------------+
(9911 rows)

Time: 77.938 ms
速度都還不錯.
---------
銷量前三名的品項
select a.item
     , count(*) 
  from (select unnest(items) as item
          from it201012a) a
 group by a.item
 order by 2 desc
 fetch first 3 rows only;

+-------+-------+
| item  | count |
+-------+-------+
| 78311 |   662 |
| 13034 |   652 |
| 42643 |   646 |
+-------+-------+
(3 rows)

以78311 來查詢
select count(*)
     , pyarrcombin(items, 2, 78311)
  from it201012a
 where items @> array[78311]
 group by 2
 order by 1 desc;
+-------+---------------+
| count |  pyarrcombin  |
+-------+---------------+
|     3 | {76373,78311} |
|     2 | {4643,78311}  |
|     2 | {65731,78311} |
|     2 | {49658,78311} |
|     2 | {87291,78311} |
...
|     1 | {1429,78311}  |
+-------+---------------+
(4082 rows)

Time: 32.002 ms

-----------
因為產生的測試資料為亂數產生,會與現實的可能有較高相關性的資料不同.

今天介紹的應用方式,不只可以用在電商,應用的範圍很廣,希望能帶給大家在資料分析
時有新的方式可以應用.