[Day 16] Encoding and Evolution (4) - Avro

12th iThome Ironman Contest

DAY 16

AI & Data

The Data Engineer's Path of Training series, part 16


12th Ironman Contest, data engineering, data engineer

tshine73

2020-10-01 17:53:50


Continuing from Day 15.

Avro

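The last binary encoding format to discuss is Apache Avro, which started as a subproject of Hadoop. It differs noticeably from Thrift and Protocol Buffers. Avro also needs a schema to define the fields, and it offers two schema languages: Avro IDL (Interface Description Language), which is easier for humans to read, and a JSON-based version, which is easier for machines to read. Let's take the same JSON data as before and see what Avro's two kinds of schema look like: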

{
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

First, the Avro IDL:

record Person {
	string userName;
	union { null, long } favoriteNumber = null; 
	array<string> interests;
}

Next, the JSON version of the schema:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {
      "name": "userName",
      "type": "string"
    },
    {
      "name": "favoriteNumber",
      "type": [
        "null",
        "long"
      ],
      "default": null
    },
    {
      "name": "interests",
      "type": {
        "type": "array",
        "items": "string"
      }
    }
  ]
}
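
To illustrate why the JSON form is the machine-friendly one, here is a minimal Python sketch that loads the schema above with the standard `json` module and walks its fields. The helper `describe_type` and the variable names are made up for this illustration and are not part of any Avro library.

```python
import json

# The JSON schema from the article, embedded as a string.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}
"""

def describe_type(avro_type):
    """Hypothetical helper: render an Avro type declaration as readable text."""
    if isinstance(avro_type, list):
        # A JSON list is a union of the listed types.
        return "union of " + ", ".join(describe_type(t) for t in avro_type)
    if isinstance(avro_type, dict):
        if avro_type["type"] == "array":
            return "array of " + describe_type(avro_type["items"])
        return avro_type["type"]
    return avro_type  # primitive type name such as "string" or "long"

schema = json.loads(SCHEMA_JSON)
field_summary = [(f["name"], describe_type(f["type"])) for f in schema["fields"]]

for name, type_text in field_summary:
    print(f"{name}: {type_text}")
```

Note that each field carries only a name and a type; there is no numeric tag anywhere in the schema.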

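As you can see, Avro does not use field tags in its schema. The encoded data is only 32 bytes, the most compact of all the encodings we have looked at. The details of the encoded result are shown in the figure below: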

figure_4-5
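
The 32-byte result can be reproduced by hand. The following is a minimal sketch in plain Python, not the official Avro library: it assumes Avro's binary rules for this record (integers and lengths as zigzag varints, strings as a length followed by UTF-8 bytes, a union as a branch index followed by the value, an array as a block count, the items, and a terminating zero). The function names are hypothetical and chosen only for this illustration.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zigzag, variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(text: str) -> bytes:
    """Length prefix (in bytes) followed by the UTF-8 data."""
    data = text.encode("utf-8")
    return zigzag_varint(len(data)) + data

def encode_person(user_name, favorite_number, interests) -> bytes:
    """Concatenate the field values in schema order; no tags or names are written."""
    out = bytearray()
    out += encode_string(user_name)
    if favorite_number is None:
        out += zigzag_varint(0)  # union branch 0: null, no payload
    else:
        out += zigzag_varint(1)  # union branch 1: long
        out += zigzag_varint(favorite_number)
    if interests:
        out += zigzag_varint(len(interests))  # one block holding all items
        for item in interests:
            out += encode_string(item)
    out += zigzag_varint(0)  # a zero count ends the array
    return bytes(out)

encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded))   # 32
print(encoded.hex(" "))
```

Because the bytes contain nothing that identifies a field or its type, a reader can only decode them by walking the same schema in the same field order.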