
2024 iThome 鐵人賽

DAY 20

Questions

Q11

A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column. Which solution will MOST speed up the Athena query performance?

  • [ ] A. Change the data format from .csv to JSON format. Apply Snappy compression.
  • [ ] B. Compress the .csv files by using Snappy compression.
  • [x] C. Change the data format from .csv to Apache Parquet. Apply Snappy compression.
  • [ ] D. Compress the .csv files by using gzip compression.

Description

  • An engineer queries data through Athena and wants the queries to run faster
  • The data currently sits in uncompressed .csv files
  • Most queries select specific columns

Analysis

  1. We already covered Apache Parquet in the previous practice round: it is a columnar storage format, so queries that select only specific columns read far less data. Converting from .csv to Parquet and applying Snappy compression gives the biggest speedup; a minimal conversion sketch follows.
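
A minimal sketch of the conversion, assuming an existing CSV table is rewritten as Snappy-compressed Parquet with an Athena CTAS statement issued through boto3. The database, table, and bucket names are placeholders.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# CTAS: rewrite the CSV-backed table as Parquet files compressed with Snappy.
# Table and bucket names are hypothetical.
ctas = """
CREATE TABLE sales_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://example-bucket/sales_parquet/'
) AS
SELECT * FROM sales_csv;
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```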

Q12

A manufacturing company collects sensor data from its factory floor to monitor and enhance operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3 bucket. The company needs to display a real-time view of operational efficiency on a large screen in the manufacturing facility. Which solution will meet these requirements with the LOWEST latency?

  • [x] A. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Use a connector for Apache Flink to write data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.
  • [ ] B. Configure the S3 bucket to send a notification to an AWS Lambda function when any new object is created. Use the Lambda function to publish the data to Amazon Aurora. Use Aurora as a source to create an Amazon QuickSight dashboard.
  • [ ] C. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to process the sensor data. Create a new Data Firehose delivery stream to publish data directly to an Amazon Timestream database. Use the Timestream database as a source to create an Amazon QuickSight dashboard.
  • [ ] D. Use AWS Glue bookmarks to read sensor data from the S3 bucket in real time. Publish the data to an Amazon Timestream database. Use the Timestream database as a source to create a Grafana dashboard.

Description

  • A manufacturer collects sensor readings from its factory floor
  • Kinesis Data Streams publishes the sensor data to a data stream
  • Amazon Kinesis Data Firehose writes the data to S3
  • They want to show a real-time view of operational efficiency on a large screen (a war-room style dashboard)
  • Choose the solution with the lowest latency

Analysis

  1. You need to understand what Apache Flink is for: it processes the stream as records arrive, which fits this real-time requirement.
  2. Having Lambda fetch data from S3 (after Firehose has buffered and landed it) is slower than processing the stream directly with Apache Flink.
  3. Grafana is faster than QuickSight for this kind of live dashboard, so A wins. A sketch of writing processed readings into Timestream follows this list.
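
For reference, this is roughly what one processed sensor reading looks like when it lands in Timestream. In option A the Apache Flink Timestream connector performs these writes; the boto3 call below is only a hand-rolled illustration, and the database, table, and measure names are placeholders.

```python
import time
import boto3

tsw = boto3.client("timestream-write", region_name="us-east-1")

tsw.write_records(
    DatabaseName="factory_metrics",            # hypothetical Timestream database
    TableName="operational_efficiency",        # hypothetical table
    Records=[
        {
            "Dimensions": [{"Name": "machine_id", "Value": "press-07"}],
            "MeasureName": "throughput_units_per_min",
            "MeasureValue": "42.5",
            "MeasureValueType": "DOUBLE",
            "Time": str(int(time.time() * 1000)),
            "TimeUnit": "MILLISECONDS",
        }
    ],
)
```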

Q13

A company stores daily records of the financial performance of investment portfolios in .csv format in an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data. The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog. Which solution will meet these requirements?

  • [ ] A. Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Configure the output destination to a new path in the existing S3 bucket.
  • [x] B. Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Create a daily schedule to run the crawler. Specify a database name for the output.
  • [ ] C. Create an IAM role that includes the AmazonS3FullAccess policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Specify a database name for the output.
  • [ ] D. Create an IAM role that includes the AWSGlueServiceRole policy. Associate the role with the crawler. Specify the S3 bucket path of the source data as the crawler's data store. Allocate data processing units (DPUs) to run the crawler every day. Configure the output destination to a new path in the existing S3 bucket.

Description

  • The company stores daily investment-performance records as .csv files in S3
  • A data engineer uses AWS Glue crawlers on the data and must make it accessible daily in the AWS Glue Data Catalog

Analysis

  • Start by elimination: any option that grants the overly broad AmazonS3FullAccess policy should be ruled out
  • That leaves B and D, which attach only the AWSGlueServiceRole policy to the IAM role used by the AWS Glue crawler, which is the sensible setup
  • The crawler's output belongs in a Data Catalog database, not in another path of the same S3 bucket
  • DPUs only measure how much compute the crawler consumes (useful for scaling or controlling cost); allocating DPUs is not how you run a crawler on a daily schedule, so B is the answer. A boto3 sketch of this setup follows.
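
A minimal boto3 sketch of option B: a crawler that assumes a role with AWSGlueServiceRole attached, reads the S3 path, runs on a daily schedule, and writes its output to a Data Catalog database. The role ARN, bucket path, and names are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="daily-portfolio-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # role with AWSGlueServiceRole attached
    DatabaseName="portfolio_db",                            # Data Catalog database for the output
    Targets={"S3Targets": [{"Path": "s3://example-bucket/portfolio-records/"}]},
    Schedule="cron(0 1 * * ? *)",                           # run once a day
)
```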

Q14

A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded. A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB. How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?

  • [ ] A. Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.
  • [x] B. Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.
  • [ ] C. Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.
  • [ ] D. Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.

Description

  • A company loads daily transaction data into Amazon Redshift at the end of each day and wants to track which tables have been loaded
  • A data engineer creates a Lambda function that writes the load statuses to DynamoDB; the question is how to invoke that Lambda function

Analysis

  • Since Amazon Redshift is already in use, the Amazon Redshift Data API is the natural way to surface the load events
  • Because this is a batch job, EventBridge is the right place to react to those events
  • EventBridge can schedule Lambda tasks and can also invoke a Lambda function directly from an event rule
  • There is no high-throughput or low-latency requirement, just a batch job, so there is no need to push the messages through a queue (SQS). A sketch of the Lambda side follows.
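
A minimal sketch of the Lambda handler the EventBridge rule would invoke. The DynamoDB table name and the event fields it reads are assumptions about what the Redshift Data API event carries, not the exact payload.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
status_table = dynamodb.Table("redshift_load_status")  # hypothetical table name

def handler(event, context):
    # Fields below are placeholders for whatever the EventBridge event detail contains.
    detail = event.get("detail", {})
    status_table.put_item(
        Item={
            "table_name": detail.get("tableName", "unknown"),
            "loaded_at": event.get("time", ""),
            "status": detail.get("state", "FINISHED"),
        }
    )
    return {"recorded": detail.get("tableName")}
```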

Conclusion

  • Most of these topics have come up before; Kinesis and SQS still need a bit more review
