
2024 iThome 鐵人賽

DAY 25

Questions

Q19

A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each query can run for more than 15 minutes. Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)

  • [x] A. Use an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
  • [x] B. Create an AWS Step Functions workflow and add two states. Add the first state before the Lambda function. Configure the second state as a Wait state to periodically check whether the Athena query has finished using the Athena Boto3 get_query_execution API call. Configure the workflow to invoke the next query when the current query has finished running.
  • [ ] C. Use an AWS Glue Python shell job and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
  • [ ] D. Use an AWS Glue Python shell script to run a sleep timer that checks every 5 minutes to determine whether the current Athena query has finished running successfully. Configure the Python shell script to invoke the next query when the current query has finished running.
  • [ ] E. Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the Athena queries in AWS Batch.

Description

  • An engineer must run a series of Athena queries every day
  • Each query can run for more than 15 minutes
  • Pick the most cost-effective approach

Analysis

  • As mentioned in an earlier article, Lambda has a 15-minute execution limit, so we would not use it to sit and wait on a query.
  • However, option A only uses Lambda to call the Athena API: start_query_execution is asynchronous, so Lambda merely triggers the query and then goes back to sleep.
  • Paired with Step Functions to track completion, the two options do fit together.

A: Lambda's 15-minute limit means it cannot run a query end to end; but because Athena's start_query_execution call is asynchronous, the work can be split in two: Lambda starts the query, and something else waits for it
B: Step Functions: https://docs.aws.amazon.com/zh_tw/step-functions/latest/dg/welcome.html

  • Drawbacks of the other options, as I see them:

C: A Glue Python shell job is billed for the whole time it runs, so simply calling the queries from Glue is clearly more expensive than B, where Step Functions does the waiting and Lambda only fires the query
D: Sleeping inside a Glue job does not save money; the job keeps being billed while it sleeps
E: Looks workable, but an always-on managed Airflow (MWAA) environment does not save money
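
The A + B combination can be sketched in Python with Boto3. This is my own illustration, not code from the exam: the output location is a placeholder, and in option B the polling loop would actually live in a Step Functions Wait state rather than in code, but the two API calls are the same.

```python
# Sketch of the A + B pattern: start_query_execution kicks off the query
# asynchronously (the Lambda part), and get_query_execution is polled until
# a terminal state (the Step Functions Wait-state part). The client is
# passed in as a parameter so the logic can be exercised without AWS
# credentials; in real use it would be boto3.client("athena").
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def start_query(client, sql, output_s3):
    """Fire-and-forget start of an Athena query (what the Lambda in A does)."""
    resp = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},  # placeholder S3 path
    )
    return resp["QueryExecutionId"]


def wait_for_query(client, query_id, poll_seconds=30):
    """What the Wait state in B does: check periodically until the query ends."""
    while True:
        resp = client.get_query_execution(QueryExecutionId=query_id)
        state = resp["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
```

Because the waiting happens in Step Functions rather than inside a running Lambda, you never hit the 15-minute limit and pay nothing while the query runs.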

Q20

A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options. The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache Hbase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS. Which extract, transform, and load (ETL) service will meet these requirements?

  • [ ] A. AWS Glue
  • [x] B. Amazon EMR
  • [ ] C. AWS Lambda
  • [ ] D. Amazon Redshift

Description

  • A company is migrating on-premises workloads to AWS
  • It wants to reduce operational overhead and is exploring serverless options
  • The workloads use Apache Pig, Apache Oozie, Apache Spark, Apache HBase, and Apache Flink, process petabytes of data in seconds, and must keep the same or better performance

Analysis

  • These zoo-animal projects are all part of the Hadoop ecosystem, so pick EMR

This question is testing the AWS counterpart of Hadoop.

Q21

A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII. Which solution will meet this requirement with the LEAST operational effort?

  • [ ] A. Use an Amazon Kinesis Data Firehose delivery stream to process the dataset. Create an AWS Lambda transform function to identify the PII. Use an AWS SDK to obfuscate the PII. Set the S3 data lake as the target for the delivery stream.
  • [ ] B. Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
  • [x] C. Use the Detect PII transform in AWS Glue Studio to identify the PII. Create a rule in AWS Glue Data Quality to obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
  • [ ] D. Ingest the dataset into Amazon DynamoDB. Create an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data. Use the same Lambda function to ingest the data into the S3 data lake.

Description

  • An engineer ingests a dataset into an S3 data lake
  • The dataset contains personally identifiable information (PII)
  • The PII must be obfuscated / masked / de-identified
  • Pick the solution with the least operational effort

Analysis

  • AWS is touting its own wares here: Glue Studio has a Detect PII transform, so the answer is between B and C

B: Looks the simplest to implement
C: Its supporters argue that B does not actually work: we cannot directly obfuscate PII with Glue Studio alone, and Glue Data Quality can be used to handle the obfuscation.
The Detect PII transform in AWS Glue Studio specifically identifies personally identifiable information within the data. It can detect and flag that information, but on its own it does not obfuscate or remove it. To actually obfuscate or alter the identified PII, an additional transformation is necessary. This could be accomplished, for example, by writing a custom script within the same AWS Glue job (in Python or Scala) to modify the PII data as needed, or by using AWS Glue Data Quality to create rules that automatically obfuscate or modify the data identified as PII. Glue Data Quality is a newer tool that improves data quality through rules and transformations; whether it is needed depends on the feature's availability and how specific the obfuscation requirements are.

Choosing what to do with detected PII data

  • If you choose to detect PII across the entire data source, you can select a global action to apply:
  • Enrich data with detection results: if you chose to detect PII in each cell, the detected entities can be stored in a new column.
  • Redact detected text: detected PII values are replaced with the string given in the optional replacement-text field; if no string is specified, detected PII entities are replaced with '*******'.
  • Partially redact detected text: part of the detected PII value is replaced with a string of your choice, with two options: keep the trailing characters unmasked, or mask via an explicit regex pattern. This feature is not yet available in AWS Glue 2.0.
  • Apply cryptographic hash: the detected PII value is passed to the SHA-256 cryptographic hash function and replaced with the function's output.
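
The last two actions are easy to reason about in plain Python. This is my own sketch of what they do conceptually, not Glue's implementation: "Apply cryptographic hash" maps to SHA-256, and "partially redact" to keeping only the trailing characters unmasked.

```python
# Conceptual sketch of two Glue PII actions; the function names and the
# keep_last default are my own assumptions for illustration.
import hashlib


def hash_pii(value: str) -> str:
    """Like 'Apply cryptographic hash': replace the value with its SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def partial_mask(value: str, keep_last: int = 4, mask_char: str = "*") -> str:
    """Like 'partially redact' with the tail left unmasked."""
    if len(value) <= keep_last:
        return mask_char * len(value)
    return mask_char * (len(value) - keep_last) + value[-keep_last:]
```

For example, `partial_mask("0912345678")` yields `"******5678"`: the column stays recognizable enough for joins or support lookups while the identifying prefix is gone, whereas `hash_pii` is irreversible but still lets equal values match.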

Q22

A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the company's operational databases into an Amazon S3 based data lake. The ETL workflows use AWS Glue and Amazon EMR to process data. The company wants to improve the existing architecture to provide automated orchestration and to require minimal manual effort. Which solution will meet these requirements with the LEAST operational overhead?

  • [ ] A. AWS Glue workflows
  • [x] B. AWS Step Functions tasks
  • [ ] C. AWS Lambda functions
  • [ ] D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows

Description

  • A company maintains multiple ETL workflows that ingest data from its operational databases into an S3-based data lake
  • The ETL workflows use Glue and EMR to process data
  • The company wants automated orchestration
  • Pick the option with the least operational overhead

Analysis

  • There are multiple ETL workflows, and EMR has to be wired in as well
  • Glue Workflows can only orchestrate crawlers and Glue jobs (ETL), not EMR

When you see "orchestration" across multiple services, go straight to Step Functions tasks.
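
Step Functions has direct service integrations for both Glue and EMR, which is why it fits here. A minimal state machine sketch in Amazon States Language, built as a Python dict (the Glue job name, cluster ID, and script path are placeholder assumptions): each task uses the `.sync` resource variant, so Step Functions itself waits for the job to finish before moving on.

```python
# ASL sketch: run a Glue job, then an EMR step, each synchronously.
# "daily-etl-job", the ClusterId, and the script S3 path are placeholders.
import json

definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # .sync = Step Functions polls Glue until the job run completes
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "daily-etl-job"},
            "Next": "RunEmrStep",
        },
        "RunEmrStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId": "j-XXXXXXXXXXXXX",
                "Step": {
                    "Name": "spark-transform",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "s3://bucket/etl.py"],
                    },
                },
            },
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```

No polling code to maintain and no idle compute: that is the "least operational overhead" the question is after.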


Previous
【Day 24】 Amazon MQ / Amazon Managed Streaming for Apache Kafka (MSK)
Next
【Day 26】 Question Bank Practice - 7
Series
老闆,外帶一份 AWS Certified Data Engineer (30 articles)