Reviewing Backup / Restore Mechanisms in Light of the 2017/02/01 GitLab Database Incident

Backup

海綿寶寶 2017-02-02 09:47:11 ‧ 26766 views

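Background: GitLab.com, a paid hosting site for Git version control, went offline because of a database failure. After it came back online, six hours of data could not be recovered at all.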

Below are excerpts from the official statement that concern backup and restore.

Problems Encountered

LVM snapshots are by default only taken once every 24 hours. Team-member-1 happened to run one manually about 6 hours prior to the outage because he was working in load balancing for the database.

Regular backups seem to also only be taken once per 24 hours, though team-member-1 has not yet been able to figure out where they are stored. According to team-member-2 these don’t appear to be working, producing files only a few bytes in size.

Team-member-3: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.

Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.

The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost

The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented

Our backups to S3 apparently don’t work either: the bucket is empty

So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place. We ended up restoring a 6 hours old backup.
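
Two of the problems above (dumps that were only a few bytes in size, and backups nobody had looked at in a while) are the kind of thing a small scheduled check could flag. The following is only a minimal sketch, not anything GitLab ran: the function name `check_backup` and the thresholds `min_bytes` and `max_age_hours` are assumptions chosen for illustration, and sensible values depend on your own data.

```python
import time
from pathlib import Path


def check_backup(path, min_bytes=1024 * 1024, max_age_hours=24.0, now=None):
    """Return a list of human-readable problems found with one backup file.

    An empty list means the file exists, is at least min_bytes large,
    and was modified within the last max_age_hours.
    """
    p = Path(path)
    if not p.is_file():
        return [f"{p}: backup file is missing"]

    problems = []
    stat = p.stat()

    # A dump of only a few bytes usually means the dump tool failed.
    if stat.st_size < min_bytes:
        problems.append(
            f"{p}: only {stat.st_size} bytes (expected at least {min_bytes})"
        )

    # A stale file means the scheduled job has stopped producing output.
    now = time.time() if now is None else now
    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(
            f"{p}: last modified {age_hours:.1f} hours ago (limit {max_age_hours})"
        )
    return problems
```

A check like this only says the file looks plausible. It does not prove the backup can be restored, which still takes an actual restore test.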

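Even a site of GitLab's caliber,
with 5 backup mechanisms in place,
found that when trouble hit
none of the backups could be relied on,
and 6 hours of data were lost.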
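
The pg_dump item in the excerpt (9.2 binaries run against a 9.6 database, failing silently) suggests two cheap guards: compare the client version with what you expect before dumping, and treat a non-zero exit status as an error. Here is a rough sketch under those assumptions. The helper names `parse_pg_version`, `version_matches` and `dump_database` are made up for this post. `pg_dump --version` and `pg_dump --file` are standard pg_dump options, but connection settings and dump format are left out and would need to match your setup.

```python
import re
import subprocess


def parse_pg_version(version_output):
    """Extract the version string from `pg_dump --version` style output."""
    match = re.search(r"\(PostgreSQL\)\s+(\d+(?:\.\d+)*)", version_output)
    if match is None:
        raise ValueError(f"unrecognised version output: {version_output!r}")
    return match.group(1)


def version_matches(actual, expected_major):
    """True if actual (e.g. '9.6.1') belongs to expected_major (e.g. '9.6')."""
    return actual == expected_major or actual.startswith(expected_major + ".")


def dump_database(dbname, out_path, expected_major, pg_dump="pg_dump"):
    """Run pg_dump, raising an exception instead of failing silently."""
    result = subprocess.run(
        [pg_dump, "--version"], capture_output=True, text=True, check=True
    )
    actual = parse_pg_version(result.stdout)
    if not version_matches(actual, expected_major):
        raise RuntimeError(
            f"{pg_dump} is version {actual}, expected {expected_major}.x"
        )
    # check=True raises CalledProcessError when pg_dump exits non-zero.
    subprocess.run([pg_dump, "--file", str(out_path), dbname], check=True)
```

For example, `dump_database("mydb", "/backups/mydb.sql", "9.6")` would refuse to run if the `pg_dump` found on the PATH reports 9.2.x (the database name and path here are placeholders).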

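Take a look back at your own company's backup mechanisms
and ask whether they need some adjustment.
Otherwise,
as I have suggested before:
skip the drills in normal times, and it turns into a farce when something breaks.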