SAA - C03 Certification: Data & Analytics

Athena

Serverless query service to analyze data stored in S3
Use SQL language to query the files (built on Presto)
Supports CSV, JSON, ORC, Avro, and Parquet
Pricing: 5$ per TB of data scanned
Commonly used with Amazon Quicksight for reporting/dashboards

Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails,…

Exam tips: analyze data in S3 using serverless SQL, use Athena

Federated Query

Allows you to run SQL queries across data stored in SQL, NoSQL, object,…
Uses Data Source Connectors that run on AWS Lambda to run Federated Queries
Store the result back in the S3 bucket

Redshift

It is based on PostgreSQL, but it’s not used for OLTP
It’s OLAP (Online Analytical Processing)
10x better performance than other data warehouses, scale to PBs of data
Columnar storage of data & parallel query engine
Two modes: Serverless Cluster & Provisioned Cluster
Has a SQL interface for performing the queries
BI Tools such as Amazon Quicksight or Tableau
vs Athena: faster queries/joins/aggregations thanks to indexes

Redshift CLuster

The architecture:
- Leader node: for query planning, results aggregation
- Compute node: for performing the queries, send results to the leader node
- Provisioned mode:
  - Choose instance types in advance
  - Can reserve instances for cost savings

Snapshots & DR

Snapshots are point-in-time backups of a cluster, stored internally in S3
Snapshots are incremental
You can restore a snapshot into a new cluster
Automated: every 8 hours, every 5 GB, or a schedule
Manual: snapshot is retained until you delete it
You can configure Redshift to copy snapshots of a cluster to another region automatically

Redshift Spectrum

Query data that is already in S3 without loading it
Must have a Redshift cluster available to start the query
The query is then submitted to thousands of Redshift Spectrum nodes

OpenSearch

Two modes:
- managed cluster
- serverless cluster
Does not natively support SQL (can be enabled via a plugin)
Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs,…
Comes with OpenSearch Dashboards

EMR

EMR stands for “Elastic MapReduce”
EMR helps create Hadoop clusters to analyze and process vast amounts of data
The cluster can be made of hundreds of EC2 instances
EMR comes bundled with Spark, HBase, Presto, Flink,…
EMR takes care of all the provisioning and configuration
Auto-scaling and integrated with Spot instances

Use cases: data processing, machine learning, web indexing, big data,…

Node types & Purchasing

Master Node
Core Node
Task Node (optional)
Purchasing options:
- On-demand
- Reserved (min 1 year): cost savings
- Spot instances: cheaper
Can have a long-running cluster, or transient (temporary) cluster

Quicksight

Serverless machine learning-powered business intelligence service to create interactive dashboards
Fast, automatically scalable, embeddable, with per-session pricing
Use cases:
- Business Analytics
- Building visualizations
- Perform ad-hoc analysis
Integrated with RDS, Aurora, Athena, Redshift, S3,…
Im-memory computation using the SPICE engine if data is imported into QuickSight
Enterprise edition: Column-level Security

Glue

Managed extract, transform, and load (ETL) service
Useful to prepare and transform data for analytics
Full serverless services
Use cases: convert data into Parquet format

Things to know at a high level

Glue Job Bookmarks: prevent re-processing old data
Glue Elastic Views:
- Combine and replicate data across multiple data stores using SQL
- No custom code
- Leverages a “virtual table”
Glue DataBrew: clean and normalize data using pre-built transformation
Glue Studio: new GUI to create, run, and monitor ETL jobs in Glue
Glue Streaming ETL (built on Spark): compatible with Kinesis Data Streaming, Kafka, MSK (managed Kafka)

AWS Lake Formation

Data lake = central place to have all data for analytics purposes
Fully managed service that makes it easy to set up a data lake in days
Discover, cleanse, transform, and ingest data into Data Lake
It automates many complex manual steps (collecting, cleansing, moving, cataloging data,…) and de-duplicate (using ML Transforms)
Combine structured and unstructured data in the data lake
Out-of-the-box source blueprints: S3, RDS, Relational & NoSQL DB,…
Fine-grained Access Control for applications (row and column-level)
Built on top of AWS Glue

Kinesis Data Analytics

For SQL

Real-time analytics on Kinesis Data Stream & Firehose using SQL
Add reference data from S3 to enrich streaming data
It is fully managed, with no servers to provision
Automatic scaling
Pay for actual consumption rate
Output:
- Kinesis Data Streams
- Kinesis Data Firehose

Use cases: Time series analytics, Real-time dashboards, Real-time metrics

For Apache Flink

Use Flink (Java, Scala, or SQL) to process and analyze streaming data
Run any Apache Flink application on a managed cluster on AWS
- provisioning compute resources, parallel computation, automatic scaling
- application backups
- Use any Apache Flink programming features
- Flink does not read from Firehose (use Kinesis Analytics for SQL instead)

MSK (Managed Streaming for Apache Kafka)

Alternative to Amazon Kinesis
Fully managed Kafka on AWS
- Allow to create, update, and delete clusters
- MSK creates & manages Kafka brokers nodes & Zookeeper nodes
- Deploy the MSK cluster in VPC, multi-AZ (up to 3 for HA)
- Automatic recovery from common Kafka failures
- Data is stored on EBS volumes for as long as you want
MSK Serverless
- Run Kafka on AWS on MSK without managing the capacity
- MSK automatically provisions resources and scales computing & storage

The difference between Kinesis Data Streams vs MSK

Kinesis Data Stream	MSK
1 MB message size limit	1 MB default, configured for higher
Data Streams with Shards	Kafka Topic with Partitions
TLS in-flight encryption	PLAINTEXT or TLS In-flight encryption
KMS at-rest encryption	KMS at-rest encryption

SAA - C03 Certification: Data & Analytics

Athena

Federated Query

Redshift

Redshift CLuster

Snapshots & DR

Redshift Spectrum

OpenSearch

EMR

Node types & Purchasing

Quicksight

Glue

Things to know at a high level

AWS Lake Formation

Kinesis Data Analytics

MSK (Managed Streaming for Apache Kafka)

The difference between Kinesis Data Streams vs MSK

Comments

SAA-C03 Certification

SAA - C03 Certification: Database in AWS

More from this blog

How to daily back up PostgreSQL into S3

How to install Redis Sentinel using Helm in K8S

Adding a new hostname or IP Address to K8S API Server

SAA - C03 Certification: Other Services

SAA - C03 Certification: More Solution Architectures

Command Palette

Athena

Federated Query

Redshift

Redshift CLuster

Snapshots & DR

Redshift Spectrum

OpenSearch

EMR

Node types & Purchasing

Quicksight

Glue

Things to know at a high level

AWS Lake Formation

Kinesis Data Analytics

MSK (Managed Streaming for Apache Kafka)

The difference between Kinesis Data Streams vs MSK

Comments

SAA-C03 Certification

SAA - C03 Certification: Database in AWS

More from this blog