2023 iThome Ironman Contest

Learning Elasticsearch in 30 Days, article 21 in the series

Day 21 Elasticsearch - ReIndex & Alias API

Robert Chang

2023-10-06 18:19:26

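At the end of the previous article, I mentioned that the mapping of an existing field (and certain index settings) cannot be changed once the index has been created.

But just as with a relational database, changing a field's type or renaming a field is a very common need. So how is it done in Elasticsearch (ES for short)?

Let's assume a scenario. A field called user_id was mapped as integer when the index was first created. As the product iterated and grew, user_id came to be stored as a UUID, so it now has to hold strings. After evaluating the options, we decided to change user_id to the keyword type, which works well for aggregations and filtering.

As before, we create the index with Kibana's Dev Tools: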

// PUT /reservations
{
  "mappings": {
    "properties": {
      "user_id": { "type": "integer" }
    }
  }
}

Next, read back the current mapping to confirm that user_id is an integer:

// GET /reservations/_mapping
{
  "reservations": {
    "mappings": {
      "properties": {
        "user_id": {
          "type": "integer"
        }
      }
    }
  }
}

Now let's try to change the type of user_id directly, using the same request we would use to add a new field:

// PUT /reservations/_mapping
{
  "properties": {
    "user_id": { "type": "keyword" }
  }
}

We get the following error:

mapper [user_id] cannot be changed from type [integer] to [keyword]

Because the two types are stored differently, ES rejects the change and returns an error.

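5 reasons a field cannot be updated in place

1. Inverted indices: as mentioned many times in this series, the inverted index is the main data structure in ES. When a document is indexed, a specific index structure is built for each field based on its data type in the mapping. For example, text and keyword fields produce different inverted index structures. If data types could be changed freely, those structures would have to be rebuilt for every existing document, which is technically complex and inefficient.
2. Data consistency: if the data type could be changed while documents already exist, the data could become inconsistent. Old documents might use the old type while new documents use the new one.
3. Performance: changing the data type of an existing mapping could require a large amount of computation to update and reindex every affected document, which could hurt the performance of the whole cluster.
4. Preventing mapping explosion: a mapping explosion is an index that ends up with a very large number of distinct fields, whether by accident or through dynamic mapping. Restricting mapping changes helps reduce this kind of problem.
5. Simpler operation and design: restricting mapping changes keeps ES's internal design and maintenance simpler and easier to manage.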
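
To make the first reason a little more concrete, here is a minimal sketch in plain Python (it is not Elasticsearch code, and the sample IDs are made up for illustration). It shows that the same values order and range-match differently when they are compared as numbers, as an integer field would, versus as strings compared character by character, which is roughly how keyword values behave.

```python
# Toy illustration in plain Python; this is not how Elasticsearch is implemented.
# The sample IDs below are hypothetical.
user_ids = [9, 10, 100, 2]

# integer-like behaviour: values are compared as numbers
numeric_sorted = sorted(user_ids)
numeric_range = [v for v in user_ids if 1 <= v <= 20]

# keyword-like behaviour: values are compared as strings, character by character
keyword_ids = [str(v) for v in user_ids]
keyword_sorted = sorted(keyword_ids)
keyword_range = [v for v in keyword_ids if "1" <= v <= "20"]

print(numeric_sorted)   # [2, 9, 10, 100]
print(numeric_range)    # [9, 10, 2]
print(keyword_sorted)   # ['10', '100', '2', '9']
print(keyword_range)    # ['10', '100', '2']
```

Since the two types answer the same sort or range question differently, data indexed under one type cannot simply be relabelled as the other.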

ReIndex

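So the only way to change the type of a mapped field is to rebuild the data in a new index, and ES provides the reindex API for exactly that.

The whole reindex process amounts to creating a new index whose mapping has the new format, then asking ES to copy the documents over for us.

First, confirm that the original index contains at least one document: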

// GET /reservations/_doc/1
{
  "_index": "reservations",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "user_id": 1 // stored as an integer
  }
}

Next, based on the original mapping, we manually create a new index in which user_id is a keyword:

// PUT /reservations_new
{
  "mappings": {
    "properties": {
      "user_id": { "type": "keyword" } // changed to keyword
    }
  } 
}

Once that succeeds, we have two indices, reservations and reservations_new, and the latter uses the new data type. Now we can use the reindex API to have ES copy the documents across:

// POST /_reindex
{
  "source": {
    "index": "reservations"
  },
  "dest": {
    "index": "reservations_new"
  }
}

source is the original index, and dest (short for destination) is the index the documents are copied into.

ES then returns this response:

{
  "took": 10,
  "timed_out": false,
  "total": 1, // only one document
  "updated": 0,
  "created": 1,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": []
}
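
If you run reindex from a script rather than from Dev Tools, it can help to check the response programmatically. The following is a minimal Python sketch, assuming the response body has already been parsed into a dict. The helper name summarize_reindex and the notion of "ok" it uses (no timeout, no failures, no version conflicts) are assumptions for illustration, not an official success criterion.

```python
def summarize_reindex(response: dict) -> dict:
    """Summarize a parsed _reindex response (hypothetical helper)."""
    ok = (
        not response.get("timed_out", False)           # the request did not time out
        and not response.get("failures")               # the failures list is empty
        and response.get("version_conflicts", 0) == 0  # no conflicting documents
    )
    return {
        "ok": ok,
        "created": response.get("created", 0),
        "updated": response.get("updated", 0),
    }


# The response from this article, written as a Python dict.
response = {
    "took": 10,
    "timed_out": False,
    "total": 1,
    "updated": 0,
    "created": 1,
    "deleted": 0,
    "batches": 1,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {"bulk": 0, "search": 0},
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until_millis": 0,
    "failures": [],
}

print(summarize_reindex(response))  # {'ok': True, 'created': 1, 'updated': 0}
```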

Let's check whether the document in the new index now has the keyword type we wanted:

// GET /reservations_new/_doc/1
{
  "_index": "reservations_new",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "user_id": 1
  }
}

You'll notice, argh, that user_id is still an integer and not a string!

We covered this earlier: _source holds exactly what was sent at index time, not what is actually stored in the index. Inside Lucene, user_id is now indexed as a keyword.
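
The split between _source and the indexed value can be pictured with a small toy model. This is a minimal Python sketch and a deliberate simplification: the function index_document and the "indexed" dict are invented for illustration and do not reflect how ES or Lucene store data internally.

```python
def index_document(source: dict, mapping: dict) -> dict:
    """Toy model: keep _source untouched, derive indexed values from the mapping."""
    indexed = {}
    for field, value in source.items():
        field_type = mapping.get(field)
        if field_type == "keyword":
            indexed[field] = str(value)   # indexed as a string term
        elif field_type == "integer":
            indexed[field] = int(value)   # indexed as a number
        else:
            indexed[field] = value        # other types are ignored in this toy model
    return {"_source": source, "indexed": indexed}


doc = index_document({"user_id": 1}, {"user_id": "keyword"})
print(doc["_source"])  # {'user_id': 1}
print(doc["indexed"])  # {'user_id': '1'}
```

In this model, as in the response above, reading the document back returns the original _source, so the integer 1 survives even though the field is searched as a keyword.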

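However, applications very often read values from _source to run their business logic. We could convert the types in the application layer, but it is better for the stored values themselves to have the type we expect.

So let's first delete the documents that were already copied over, then reindex again: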

// POST /reservations_new/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

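So how do we do it? This is where scripts come in again. If you haven't read the earlier article on scripts, it's worth reviewing first.

Straight to the code: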

// POST /_reindex
{
  "source": {
    "index": "reservations"
  },
  "dest": {
    "index": "reservations_new"
  },
  "script": {
    "source": """
      if (ctx._source.user_id != null) {
        ctx._source.user_id = ctx._source.user_id.toString();
      }
    """
  }
}

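The script checks whether user_id exists and, if it does, converts it to a string with Painless's toString() method.

Painless is the scripting language built into ES. For details, see the introduction to Painless in the official documentation.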
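
Before running the script against real data, it can be handy to sanity-check the expected result. Below is a minimal Python sketch that mirrors the script's logic on a plain dict. It is only a model of the transformation, not Painless, and the function name convert_user_id is made up for illustration.

```python
def convert_user_id(source: dict) -> dict:
    """Mirror of the reindex script: stringify user_id when it is present and not null."""
    converted = dict(source)  # copy so the input is left unchanged
    if converted.get("user_id") is not None:
        converted["user_id"] = str(converted["user_id"])
    return converted


print(convert_user_id({"user_id": 1}))      # {'user_id': '1'}
print(convert_user_id({"name": "no id"}))   # {'name': 'no id'}
```

Documents without a user_id pass through unchanged, which matches the null check in the script.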

Now read the document from the new index again:

// GET /reservations_new/_doc/1
{
  "_index": "reservations_new",
  "_id": "1",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "user_id": "1"
  } 
}

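It worked! So what next? We now have two nearly identical indices. Should we delete the old one and rename the new one to the old name? That's a nice idea, but what if the switch has to happen without any downtime?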

Alias

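This is where the ES alias API comes in. The request below assumes the alias reservations_index already points to reservations, and the actions in a single _aliases request are applied atomically: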

// POST /_aliases
{
  "actions": [
    {
      "remove": {
        "index": "reservations",
        "alias": "reservations_index"
      }
    },
    {
      "add": {
        "index": "reservations_new",
        "alias": "reservations_index"
      }
    }
  ]
}

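This keeps the application flexible: it always reads through the name reservations_index, while we swap the real index behind the alias in the background, so the switch happens with no downtime.

Aliases can also be used to manage index versions. For example, when testing the performance of a new index, we can point the alias at a new index such as reservations_v2, and if that approach turns out not to work, quickly switch back to reservations before it breaks the live service.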
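
Because a cutover and a rollback are the same request with the two indices exchanged, the request body is easy to generate. Here is a minimal Python sketch that only builds the JSON body for POST /_aliases and does not send anything to ES. The helper name build_alias_swap is an assumption for illustration.

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for POST /_aliases that moves an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Cut over to the new index (same body as the request in this article).
cutover = build_alias_swap("reservations_index", "reservations", "reservations_new")
# Roll back by exchanging the two index names.
rollback = build_alias_swap("reservations_index", "reservations_new", "reservations")

print(json.dumps(cutover, indent=2))
```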

Finally, we delete the original index: