How do you customize BigQuery dataset names with dbt? How dbt custom schema works and how to use it

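2023 iThome Ironman Contest

DAY 8

AI & Data

Using dbt to optimize the modern data warehouse, and a data engineer's road through the muck: part 8 of the series

Tags: 15th Ironman Contest, dbt core, dbt custom schema

brucehau

Team: dbt and fun things beyond dbt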

2023-09-23 01:57:34


This post is slightly more advanced, but it helps a lot with managing how your dbt models are named in the data warehouse.

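First, let's pin down the terminology, using dbt with BigQuery as the example.

Definition: how dbt terms map to BigQuery terms

database = project

schema = dataset

Keep in mind that in every dbt setting for BigQuery, schema refers to the dataset level and database refers to the project level. Everything below uses BigQuery as the example.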
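What is a custom schema?
We know that a schema is simply the name of a dataset. A custom schema, as the name suggests, is a dataset name that you define yourself.
Why do you need a custom schema?
Because it lets you name the schema yourself, you can flexibly control which BigQuery dataset each dbt model is built in. If you do not set a schema, the default dataset is the one named in your profiles.yml.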

# example profiles.yml file
migo-dbt:
  target: dev
  outputs:
    dev:
      dataset: migo_test
      job_execution_timeout_seconds: 1800
      job_retries: 2
      location: US
      method: oauth
      priority: interactive
      project: migo-1606
      threads: 8
      type: bigquery

With the settings above, if you do not set a custom schema in dbt_project.yml, every model built by dbt run lands in the BigQuery dataset migo_test.

# dbt_project.yml
name: migo-dbt

models:
  migo-dbt:
    events:
      +tags: migo
      +materialized: table
      +database: migo-1606
      base:
        +materialized: view
        +schema: datamart

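If you do set a custom schema, for example in the dbt_project.yml from the previous post, the models it applies to (here, those under events/base) are built at migo-1606.migo_test_datamart.{model}.
Why is the dataset name the {default} and the {custom schema} joined together?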

-- get_custom_schema.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}

    {{ default_schema }}

{%- else -%}

    {{ default_schema }}_{{ custom_schema_name | trim }}

{%- endif -%}
{%- endmacro %}

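Because, according to the official documentation, dbt decides the schema name with the generate_schema_name macro. The code above is the default implementation. You can override it by defining a macro with the same name in your dbt project's macros folder, for example in macros/get_custom_schema.sql.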
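To make the default rule concrete, here is a small Python model of what the macro above computes. It is only a sketch for reasoning about names, not dbt code: the function names are chosen for illustration, `my_model` is a hypothetical model name, and the other values come from the profiles.yml and dbt_project.yml shown earlier.

```python
from typing import Optional


def generate_schema_name(default_schema: str, custom_schema_name: Optional[str]) -> str:
    """Python model of the default generate_schema_name macro (illustration only)."""
    if custom_schema_name is None:
        # No custom schema configured: fall back to the dataset from profiles.yml.
        return default_schema
    # Mirrors `{{ default_schema }}_{{ custom_schema_name | trim }}`.
    return f"{default_schema}_{custom_schema_name.strip()}"


def relation_name(project: str, dataset: str, model: str) -> str:
    """BigQuery-style fully qualified name: project.dataset.model."""
    return f"{project}.{dataset}.{model}"


# `my_model` is a hypothetical model name used only for this example.
no_custom = generate_schema_name("migo_test", None)
with_custom = generate_schema_name("migo_test", "datamart")

print(relation_name("migo-1606", no_custom, "my_model"))    # migo-1606.migo_test.my_model
print(relation_name("migo-1606", with_custom, "my_model"))  # migo-1606.migo_test_datamart.my_model
```

The second line shows where the combined name migo_test_datamart comes from: the default dataset is always used as a prefix whenever a custom schema is set.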

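At this point you might say: I don't want the dataset name to be the default and the custom name joined together! I want my own naming rule, for example using only the custom schema as the dataset name.
How do you do that? It's simple: you only need to change a small part of get_custom_schema.sql.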

-- get_custom_schema.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}

    {{ default_schema }}

{%- else -%}

    {{ custom_schema_name | trim }}

{%- endif -%}
{%- endmacro %}

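Just remove the original {{ default_schema }}_ prefix, and from then on your datasets are named with only the custom schema you chose. If you are comfortable writing macros, you can also define all kinds of naming rules of your own.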
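As one example of a rule of your own, the sketch below models in Python a rule that uses the bare custom schema only for a production target and keeps the prefixed name everywhere else. This is an assumption for illustration: the target name `prod` does not appear in the profiles.yml above, and the function name is hypothetical. In a real macro the same branch would test `target.name` in Jinja.

```python
from typing import Optional


def generate_schema_name_by_target(
    target_name: str, default_schema: str, custom_schema_name: Optional[str]
) -> str:
    """Python model of a hypothetical target-aware naming rule (illustration only)."""
    if custom_schema_name is None:
        # No custom schema configured: always use the default dataset.
        return default_schema
    custom = custom_schema_name.strip()
    if target_name == "prod":
        # Assumed production target: use only the custom schema as the dataset name.
        return custom
    # Any other target: keep the default prefix so development datasets stay separate.
    return f"{default_schema}_{custom}"


print(generate_schema_name_by_target("dev", "migo_test", "datamart"))   # migo_test_datamart
print(generate_schema_name_by_target("prod", "migo_test", "datamart"))  # datamart
print(generate_schema_name_by_target("prod", "migo_test", None))        # migo_test
```

Keeping the prefix outside production is a design choice: it stops a development run from writing into the same dataset that production models use.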