
2024 iThome 鐵人賽

DAY 26

Questions

Q23

A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class. A data engineer examined data access patterns to identify trends. During the first 6 months, most data files are accessed several times each day. Between 6 months and 2 years, most data files are accessed once or twice each month. After 2 years, data files are accessed only once or twice each year. The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new storage solution must continue to provide high availability. Which solution will meet these requirements in the MOST cost-effective way?

  • [ ] A. Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
  • [ ] B. Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
  • [x] C. Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.
  • [ ] D. Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.

Description

  • A company stores all of its data in S3 with the S3 Standard storage class
  • Most files are read several times a day only during the first six months after they land
  • Between month six and year two they are read at most once or twice a month
  • After two years they are rarely touched, maybe once or twice a year
  • How do we save money?

Analysis

  • With S3 One Zone-IA, customers can store infrequently accessed data in a single Availability Zone at a cost about 20% lower than S3 Standard-IA
  • But precisely because the data sits in a single Availability Zone, One Zone-IA cannot keep providing the high availability the question demands, so A and D are out
  • S3 Glacier Deep Archive: restores take 12-48 hours
  • S3 Glacier Flexible Retrieval: restores in minutes, or free bulk retrievals in 5-12 hours
  • After two years the data is read only once or twice a year, so the slower but cheaper Deep Archive is good enough; pick the cheapest plan that still fits, so the answer is C (a lifecycle sketch follows below)
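
To make it concrete, here is a minimal boto3 sketch of what the answer C lifecycle rule could look like; the bucket name and rule ID are placeholders, and 180/730 days stand in for 6 months / 2 years.

```python
import boto3

s3 = boto3.client("s3")

# Transition every object to Standard-IA after ~6 months (180 days)
# and to Glacier Deep Archive after ~2 years (730 days).
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tiering-rule",      # placeholder rule ID
                "Filter": {"Prefix": ""},  # apply to all objects
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 180, "StorageClass": "STANDARD_IA"},
                    {"Days": 730, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```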

Q24

A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks. The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team's BI cluster. The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster. Which solution will meet these requirements?

  • [x] A. Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.
  • [ ] B. Create materialized views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
  • [ ] C. Create database views based on the sales team's requirements. Grant the sales team direct access to the ETL cluster.
  • [ ] D. Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.

Description

  • A company runs a provisioned Redshift cluster for ETL that supports critical analysis tasks
  • The sales team maintains its own Redshift cluster for BI work
  • The sales team now wants to run weekly summary analysis that joins the ETL cluster's data with its own, so the ETL data has to be shared with them without interrupting the critical tasks
  • Pick the option that uses the least of the ETL cluster's compute

Analysis

  • A is clearly the simplest approach: use Redshift's built-in data sharing so the sales team's BI cluster becomes a consumer of the ETL cluster and queries the live data with its own compute, without copying anything
  • The other options are gilding the lily: B and C grant direct access to the ETL cluster and eat into its compute, and D copies the data out to S3 every week (see the sketch below)
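
A rough sketch of the data-sharing setup driven through the Redshift Data API; the cluster identifiers, database names, users, and namespace IDs are all placeholders.

```python
import boto3

rsd = boto3.client("redshift-data")

# On the producer (ETL) cluster: create a datashare, expose a schema,
# and grant it to the consumer (BI) cluster's namespace.
producer_sql = [
    "CREATE DATASHARE etl_share;",
    "ALTER DATASHARE etl_share ADD SCHEMA public;",
    "ALTER DATASHARE etl_share ADD ALL TABLES IN SCHEMA public;",
    "GRANT USAGE ON DATASHARE etl_share TO NAMESPACE 'bbbb-bbbb-bbbb';",  # BI namespace (placeholder)
]
for sql in producer_sql:
    rsd.execute_statement(
        ClusterIdentifier="etl-cluster",  # placeholder
        Database="analytics",             # placeholder
        DbUser="awsuser",                 # placeholder
        Sql=sql,
    )

# On the consumer (BI) cluster: mount the datashare as a local database.
# Queries against it run on the BI cluster's own compute.
rsd.execute_statement(
    ClusterIdentifier="sales-bi-cluster",  # placeholder
    Database="dev",                        # placeholder
    DbUser="awsuser",                      # placeholder
    Sql="CREATE DATABASE etl_data FROM DATASHARE etl_share "
        "OF NAMESPACE 'aaaa-aaaa-aaaa';",  # ETL namespace (placeholder)
)
```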

Q25

A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3. Which solution will meet this requirement MOST cost-effectively?

  • [ ] A. Use an Amazon EMR provisioned cluster to read from all sources. Use Apache Spark to join the data and perform the analysis.
  • [ ] B. Copy the data from DynamoDB, Amazon RDS, and Amazon Redshift into Amazon S3. Run Amazon Athena queries directly on the S3 files.
  • [x] C. Use Amazon Athena Federated Query to join the data from all data sources.
  • [ ] D. Use Redshift Spectrum to query data from DynamoDB, Amazon RDS, and Amazon S3 directly from Redshift.

Description

  • A data engineer needs to join data from multiple sources for a one-time analysis job
  • The sources are Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3
  • Which option is the cheapest?

Analysis

  • The cheapest option is definitely not A, which spins up a separate EMR cluster, nor D, which routes everything through Redshift (and Redshift Spectrum only reads from S3 anyway)
  • B pays extra to copy everything into S3 and store it there first
  • C makes sense: the data stays right where it already lives, Athena queries S3 directly and reaches the other sources through Federated Query connectors, and you only pay per query, which is ideal for a one-off job (see the sketch below)
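
A loose sketch of what the federated join could look like through the Athena API; it assumes the DynamoDB, RDS, and Redshift connectors are already deployed and registered as catalogs named ddb, rds, and rs, and every table, column, and bucket name below is a placeholder.

```python
import boto3

athena = boto3.client("athena")

# Join S3 data (via the Glue catalog) with the three federated sources.
# Catalog names ddb / rds / rs are whatever the connectors were registered as.
query = """
SELECT o.order_id, c.customer_name, p.product_name, f.forecast
FROM awsdatacatalog.sales_db.orders AS o              -- S3 table via Glue
JOIN ddb.default.customers  AS c ON o.customer_id = c.id
JOIN rds.public.products    AS p ON o.product_id = p.id
JOIN rs.analytics.forecasts AS f ON o.product_id = f.product_id
"""

resp = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder
)
print(resp["QueryExecutionId"])
```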

Q26

A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance. Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)

  • [ ] A. Use Hadoop Distributed File System (HDFS) as a persistent data store.
  • [x] B. Use Amazon S3 as a persistent data store.
  • [ ] C. Use x86-based instances for core nodes and task nodes.
  • [x] D. Use Graviton instances for core nodes and task nodes.
  • [ ] E. Use Spot Instances for all primary nodes.

Description

  • A company plans to spin up a provisioned EMR cluster that runs Apache Spark jobs for big data analysis
  • It needs high reliability
  • The big data team has to follow best practices for cost-optimized, long-running EMR workloads while keeping the current level of performance
  • Which two choices are the most cost-effective?

Analysis

  • Compare storage first: keeping the persistent copy in HDFS costs more than keeping it in S3, because HDFS lives on the cluster's own disks, so you pay for node storage around the clock and the data disappears if the cluster does; use S3 as the persistent store, so B
  • Then compare the hardware: for the same workload, x86 instances cost more than ARM-based Graviton instances, and Graviton's better price-performance is the EMR cost best practice, so D
  • Spot Instances are off the table for the primary nodes. With that kind of lease there is no guarantee how long you can stay, and if the landlord evicts the tenant mid-computation the whole cluster goes down, which fails the high reliability requirement (a launch sketch follows below)
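
A minimal boto3 sketch of launching a cluster along the lines of B + D, with Graviton core nodes, On-Demand capacity, and S3 for logs and persistent data; the release label, instance types, counts, roles, and bucket name are assumptions.

```python
import boto3

emr = boto3.client("emr")

# Long-running cluster: Graviton (ARM) instances for the core nodes,
# S3 for logs and persistent data, On-Demand (not Spot) for reliability.
cluster = emr.run_job_flow(
    Name="spark-analysis",            # placeholder
    ReleaseLabel="emr-7.1.0",         # assumed release
    Applications=[{"Name": "Spark"}],
    LogUri="s3://example-emr-logs/",  # placeholder bucket
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "KeepJobFlowAliveWhenNoSteps": True,  # long-running workload
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",
             "InstanceType": "m7g.xlarge", "InstanceCount": 1,
             "Market": "ON_DEMAND"},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m7g.xlarge", "InstanceCount": 2,
             "Market": "ON_DEMAND"},
        ],
    },
)
print(cluster["JobFlowId"])
```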
