DAY 26 Semantic Layer: This Isn't What the Docs Said! Why We Don't Use the Semantic Layer

2024 iThome Ironman Contest


AI/ ML & Data

Part 26 of the series "This Isn't What the Docs Said! The Ups and Downs of Adopting dbt from 0 to 1"

16th Ironman Contest, dbt, metrics

阿晟

Team: 資料工程師甘苦談

2024-10-10 00:23:22


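Continuing the earlier discussion of features that tie dbt more closely to downstream consumers, there is one more: the Semantic Layer. It sits downstream of the data mart and lets a team define its key metrics and dimensions.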

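The Semantic Layer is not a dbt-specific feature. The term refers generally to a centralized layer dedicated to defining and managing the business semantics of data analysis (metrics, dimensions, variables, and so on). This abstraction sits between the data models and the analytics tools, and aims to ensure that different tools, teams, or departments use consistent semantics and metric definitions when they analyze data.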

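dbt, a tool whose specialty is the Transformation step of ELT, aims to provide an environment in which data analysis is easier to run. It therefore integrates this kind of tool as well, to keep data consistent between dbt's data models and the downstream analytics tools.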

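When we adopted dbt, our main goal was to make the database itself more uniform and easier for downstream use. Our organization is not large enough for every department to have its own data analysts running analysis projects or reports. Instead, a centralized data governance team is responsible for producing all reports.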

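So for us, everything from the transformations in BigQuery and dbt to the downstream visualization and reporting service is managed by the same group of people. The policies and conventions we set upstream carry all the way through to the downstream services. Even in the visualization reports, we closely follow the data governance principles we established when adopting dbt.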

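Metrics and dimensions that are important and long-lived are defined in our dbt data models. More specific ones are handled directly in our visualization tool, Metabase. Every request is handled by the centralized data team and no one else, and the team's members already share conventions for developing in both tools, so adding yet another tool to manage them would be somewhat redundant.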

Why do we say that? dbt's Semantic Layer is essentially built on MetricFlow. Here is a usage example (from the documentation):

Using a regular query:

select
    date_trunc('day',orders.ordered_at) as day, 
    case when customers.first_ordered_at is not null then true else false end as is_new_customer,
    sum(orders.order_total) as order_total
from
  orders
left join
  customers
on
  orders.customer_id = customers.customer_id
group by 1, 2

Using MetricFlow to define the metric:

semantic_models:
  - name: orders
    description: |
      A model containing order data. The grain of the table is the order id.
    model: ref('orders')  #The name of the dbt model and schema
    defaults:
      agg_time_dimension: metric_time
    entities: # Entities, which usually correspond to keys in the table
      - name: order_id
        type: primary
      - name: customer
        type: foreign
        expr: customer_id
    measures: # Measures, which are the aggregations on the columns in the table.
      - name: order_total
        agg: sum
    dimensions: # Dimensions are either categorical or time. They add additional context to metrics and the typical querying pattern is Metric by Dimension.
      - name: metric_time
        expr: cast(ordered_at as date)
        type: time
        type_params:
          time_granularity: day
      - name: is_food_order
        type: categorical

For an engineer, writing the SQL looks far more intuitive.

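We discussed how Semantic Models might be used and concluded that they are meant to be combined with other tools, perhaps ones with a visual interface where data analysis can be done by drag and drop. The advantage is that a YAML file has a comparatively simple data structure, so integrating a tool against it should be simpler than integrating against SQL. But that also means more tools are needed, most of which charge extra, so we had to say thanks but no thanks (it's not you, it's me: we aren't big enough to support such a service XD).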
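To make the "simpler data structure" point concrete, here is a minimal Python sketch of how a hypothetical drag-and-drop tool might turn a declarative definition into SQL. This is not MetricFlow's implementation or API. The dict `SEMANTIC_MODEL` and the function `build_query` are names invented for this illustration, and the dict copies only a few fields from the semantic model shown earlier. The `expr` values for `order_total` and `is_food_order` are assumed to be the column names themselves.

```python
# Hypothetical sketch, NOT MetricFlow's code: shows why a declarative
# structure is easy for a UI tool to consume.

# Hand-written dict mirroring part of the `orders` semantic model above.
# Assumption: a measure or dimension without an explicit expr maps to a
# column of the same name.
SEMANTIC_MODEL = {
    "table": "orders",
    "measures": {
        "order_total": {"agg": "sum", "expr": "order_total"},
    },
    "dimensions": {
        "metric_time": {"expr": "cast(ordered_at as date)"},
        "is_food_order": {"expr": "is_food_order"},
    },
}


def build_query(model, measure_name, group_by):
    """Render a "metric by dimension" request as a SQL string.

    Raises KeyError if the measure or a dimension is not defined in the
    model, so a UI can only offer choices that actually exist.
    """
    measure = model["measures"][measure_name]

    # One select item per chosen dimension, then the aggregated measure.
    select_parts = [
        f"{model['dimensions'][dim]['expr']} as {dim}" for dim in group_by
    ]
    select_parts.append(
        f"{measure['agg']}({measure['expr']}) as {measure_name}"
    )

    sql = "select\n    " + ",\n    ".join(select_parts)
    sql += f"\nfrom {model['table']}"

    # Group by ordinal position, as in the hand-written query above.
    if group_by:
        positions = ", ".join(str(i + 1) for i in range(len(group_by)))
        sql += f"\ngroup by {positions}"
    return sql


if __name__ == "__main__":
    print(build_query(SEMANTIC_MODEL, "order_total",
                      ["metric_time", "is_food_order"]))
```

In this sketch, a user picking a metric and dimensions from a menu never touches SQL; the tool only does dictionary lookups. That convenience is what the extra tooling would be paying for, and it matters little to a team where everyone already writes SQL.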