SAA - C03 Certification: Data & Analytics


Athena

  • Serverless query service to analyze data stored in S3

  • Uses standard SQL to query the files (built on Presto)

  • Supports CSV, JSON, ORC, Avro, and Parquet

  • Pricing: $5 per TB of data scanned

  • Commonly used with Amazon QuickSight for reporting/dashboards

Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails,…

Exam tip: to analyze data in S3 using serverless SQL, use Athena
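Because Athena bills by the amount of data scanned, the file format and the columns a query reads drive the cost. A minimal sketch of the arithmetic, assuming the $5 per TB price quoted above and counting 1 TB as 2**40 bytes; the function name and the example sizes are hypothetical:

```python
TB = 1024 ** 4  # assumption: 1 TB counted as 2**40 bytes


def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimate the cost in USD of one query from the bytes it scans."""
    return bytes_scanned / TB * price_per_tb


# Hypothetical example: a query that scans 2 TB of CSV, versus the same
# query on a columnar copy (e.g. Parquet) where it only scans 0.25 TB.
csv_cost = athena_query_cost(2 * TB)
columnar_cost = athena_query_cost(TB // 4)
```

This is why converting data to a columnar format such as Parquet or ORC is a common way to reduce Athena costs.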

Federated Query

  • Allows you to run SQL queries across data stored in relational, NoSQL, object, and custom data sources,…

  • Uses Data Source Connectors that run on AWS Lambda to run Federated Queries

  • Stores the results back in an S3 bucket

Redshift

  • It is based on PostgreSQL, but it’s not used for OLTP

  • It’s OLAP (Online Analytical Processing)

  • AWS claims up to 10x better performance than other data warehouses; scales to PBs of data

  • Columnar storage of data & parallel query engine

  • Two modes: Provisioned cluster & Serverless

  • Has a SQL interface for performing the queries

  • Integrates with BI tools such as Amazon QuickSight or Tableau

  • vs Athena: faster queries/joins/aggregations thanks to sort keys and distribution keys (Redshift's counterpart to indexes)
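The benefit of columnar storage for analytics can be illustrated with a toy example. This is only a sketch of the idea, not Redshift's actual storage format; the table and column names are hypothetical:

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "b", "amount": 7.5},
    {"order_id": 3, "customer": "a", "amount": 2.5},
]

# Column-oriented layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row layout: every full record is visited to sum a single field.
row_total = sum(row["amount"] for row in rows)

# Columnar layout: only the "amount" column is read.
column_total = sum(columns["amount"])
```

An aggregation over one column of a wide table touches far less data in the columnar layout, which suits OLAP workloads.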

Redshift Cluster

  • The architecture:

    • Leader node: for query planning, results aggregation

    • Compute node: performs the queries and sends results to the leader node

    • Provisioned mode:

      • Choose instance types in advance

      • Can reserve instances for cost savings
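The leader/compute split described above follows a scatter-gather pattern. A minimal sketch, assuming each compute node holds one slice of the data; the function names and sample rows are hypothetical and this is not how Redshift is implemented internally:

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node(rows: list[dict]) -> dict[str, float]:
    """Each compute node aggregates only its own slice of the data."""
    partial: dict[str, float] = {}
    for row in rows:
        partial[row["region"]] = partial.get(row["region"], 0.0) + row["amount"]
    return partial


def leader_node(slices: list[list[dict]]) -> dict[str, float]:
    """The leader fans the work out, then merges the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(compute_node, slices))
    total: dict[str, float] = {}
    for partial in partials:
        for region, amount in partial.items():
            total[region] = total.get(region, 0.0) + amount
    return total


slices = [
    [{"region": "eu", "amount": 10.0}, {"region": "us", "amount": 5.0}],
    [{"region": "eu", "amount": 2.5}],
]
totals = leader_node(slices)
```

Here `totals` holds the sum per region across both slices, the equivalent of a `GROUP BY region` planned by the leader and executed by the compute nodes.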

Snapshots & DR

  • Snapshots are point-in-time backups of a cluster, stored internally in S3

  • Snapshots are incremental

  • You can restore a snapshot into a new cluster

  • Automated: every 8 hours, every 5 GB of data changes, or on a schedule

  • Manual: snapshot is retained until you delete it

  • You can configure Redshift to automatically copy snapshots of a cluster to another AWS Region
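"Incremental" means each snapshot stores only what changed since the previous one. A toy model of that idea, assuming data is addressed by block ID; the function names are hypothetical, deletions are ignored for brevity, and this is not Redshift's internal format:

```python
def take_snapshot(blocks: dict[int, bytes], previous: dict[int, bytes]) -> dict[int, bytes]:
    """Return only the blocks that are new or changed since `previous`."""
    return {bid: data for bid, data in blocks.items() if previous.get(bid) != data}


def restore(snapshots: list[dict[int, bytes]]) -> dict[int, bytes]:
    """Rebuild the full state by replaying snapshots from oldest to newest."""
    state: dict[int, bytes] = {}
    for snapshot in snapshots:
        state.update(snapshot)
    return state


state_v1 = {0: b"a", 1: b"b"}
snap1 = take_snapshot(state_v1, {})        # first snapshot: every block

state_v2 = {0: b"a", 1: b"B", 2: b"c"}
snap2 = take_snapshot(state_v2, state_v1)  # only blocks 1 and 2 changed

restored = restore([snap1, snap2])
```

The second snapshot is smaller than the first, yet replaying both reproduces the latest state, which is what restoring into a new cluster relies on.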

Redshift Spectrum

  • Query data that is already in S3 without loading it

  • Must have a Redshift cluster available to start the query

  • The query is then submitted to thousands of Redshift Spectrum nodes

OpenSearch

  • Two modes:

    • managed cluster

    • serverless cluster

  • Does not natively support SQL (can be enabled via a plugin)

  • Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs,…

  • Comes with OpenSearch Dashboards

EMR

  • EMR stands for “Elastic MapReduce”

  • EMR helps create Hadoop clusters to analyze and process vast amounts of data

  • The cluster can be made of hundreds of EC2 instances

  • EMR comes bundled with Spark, HBase, Presto, Flink,…

  • EMR takes care of all the provisioning and configuration

  • Auto-scaling and integrated with Spot instances

Use cases: data processing, machine learning, web indexing, big data,…

Node types & Purchasing

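  • Master Node: manages the cluster, coordinates the other nodes, and monitors their health

  • Core Node: runs tasks and stores data

  • Task Node (optional): only runs tasks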

  • Purchasing options:

    • On-demand

    • Reserved (min 1 year): cost savings

    • Spot instances: cheaper, but can be terminated at any time

  • Can have a long-running cluster, or transient (temporary) cluster

QuickSight

  • Serverless machine learning-powered business intelligence service to create interactive dashboards

  • Fast, automatically scalable, embeddable, with per-session pricing

  • Use cases:

    • Business Analytics

    • Building visualizations

    • Perform ad-hoc analysis

  • Integrated with RDS, Aurora, Athena, Redshift, S3,…

  • In-memory computation using the SPICE engine if data is imported into QuickSight

  • Enterprise edition: Column-level Security

Glue

  • Managed extract, transform, and load (ETL) service

  • Useful to prepare and transform data for analytics

  • Fully serverless service

  • Use cases: convert data into Parquet format

Things to know at a high level

  • Glue Job Bookmarks: prevent re-processing old data

  • Glue Elastic Views:

    • Combine and replicate data across multiple data stores using SQL

    • No custom code

    • Leverages a “virtual table”

  • Glue DataBrew: clean and normalize data using pre-built transformations

  • Glue Studio: new GUI to create, run, and monitor ETL jobs in Glue

  • Glue Streaming ETL (built on Spark): compatible with Kinesis Data Streams, Kafka, MSK (managed Kafka)
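The idea behind Job Bookmarks, listed above, is to persist a marker between runs so that already-processed data is skipped. A toy model, assuming the marker is a last-modified timestamp; the function name, file names, and timestamps are hypothetical, and real Glue bookmarks are managed by the service itself:

```python
def run_job(files: dict[str, int], bookmark: int) -> tuple[list[str], int]:
    """Process only files modified after `bookmark`.

    `files` maps a file name to its last-modified timestamp.
    Returns the processed file names and the new bookmark value.
    """
    new_files = sorted(name for name, ts in files.items() if ts > bookmark)
    new_bookmark = max([bookmark] + [files[name] for name in new_files])
    return new_files, new_bookmark


files = {"day1.csv": 100, "day2.csv": 200}
first_run, bookmark = run_job(files, bookmark=0)

files["day3.csv"] = 300                      # new data arrives
second_run, bookmark = run_job(files, bookmark)
```

The second run picks up only `day3.csv`, because the bookmark saved by the first run already covers the older files.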

AWS Lake Formation

  • Data lake = central place to have all data for analytics purposes

  • Fully managed service that makes it easy to set up a data lake in days

  • Discover, cleanse, transform, and ingest data into the data lake

  • It automates many complex manual steps (collecting, cleansing, moving, cataloging data,…) and de-duplicates data (using ML Transforms)

  • Combine structured and unstructured data in the data lake

  • Out-of-the-box source blueprints: S3, RDS, Relational & NoSQL DB,…

  • Fine-grained Access Control for applications (row and column-level)

  • Built on top of AWS Glue
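Row- and column-level access control boils down to filtering rows and projecting columns before data reaches the caller. A minimal sketch of that concept only; the function, table, and column names are hypothetical and this is not the Lake Formation API:

```python
from typing import Callable

Row = dict[str, object]


def read_table(rows: list[Row], allowed_columns: list[str],
               row_filter: Callable[[Row], bool]) -> list[Row]:
    """Return only permitted rows, projected onto permitted columns."""
    return [
        {col: row[col] for col in allowed_columns}
        for row in rows
        if row_filter(row)
    ]


sales = [
    {"country": "FR", "customer_email": "a@example.com", "amount": 10},
    {"country": "US", "customer_email": "b@example.com", "amount": 20},
]

# Hypothetical analyst permission: French rows only, no e-mail column.
visible = read_table(sales, ["country", "amount"], lambda row: row["country"] == "FR")
```

Defining such rules once in the data lake means every consuming service sees the same restricted view.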

Kinesis Data Analytics

For SQL

  • Real-time analytics on Kinesis Data Streams & Firehose using SQL

  • Add reference data from S3 to enrich streaming data

  • It is fully managed, with no servers to provision

  • Automatic scaling

  • Pay for actual consumption rate

  • Output:

    • Kinesis Data Streams

    • Kinesis Data Firehose

Use cases: Time series analytics, Real-time dashboards, Real-time metrics
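Real-time metrics of this kind are typically computed over time windows, optionally enriched with reference data. A minimal sketch of a tumbling-window count with a reference lookup, written in plain Python rather than streaming SQL; the event shape, ticker symbols, and function name are hypothetical:

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds, reference):
    """Count events per (window start, enriched name).

    `events` is an iterable of (timestamp_seconds, ticker) tuples and
    `reference` plays the role of the reference data loaded from S3.
    """
    counts = Counter()
    for ts, ticker in events:
        window_start = ts - ts % window_seconds   # start of the fixed window
        company = reference.get(ticker, "unknown")
        counts[(window_start, company)] += 1
    return dict(counts)


reference = {"AMZN": "Amazon"}
events = [(0, "AMZN"), (30, "AMZN"), (61, "AMZN"), (65, "XYZ")]
result = tumbling_window_counts(events, 60, reference)
```

Each event lands in exactly one 60-second window, and the lookup replaces the raw ticker with a readable name before counting.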

For Apache Flink

  • Use Flink (Java, Scala, or SQL) to process and analyze streaming data

  • Run any Apache Flink application on a managed cluster on AWS

    • provisioning compute resources, parallel computation, automatic scaling

    • application backups

    • Use any Apache Flink programming features

    • Flink does not read from Firehose (use Kinesis Data Analytics for SQL instead)

MSK (Managed Streaming for Apache Kafka)

  • Alternative to Amazon Kinesis

  • Fully managed Kafka on AWS

    • Lets you create, update, and delete clusters

    • MSK creates & manages Kafka broker nodes & Zookeeper nodes

    • Deploy the MSK cluster in VPC, multi-AZ (up to 3 for HA)

    • Automatic recovery from common Kafka failures

    • Data is stored on EBS volumes for as long as you want

  • MSK Serverless

    • Run Kafka on AWS on MSK without managing the capacity

    • MSK automatically provisions resources and scales computing & storage

The differences between Kinesis Data Streams and MSK

Kinesis Data Streams       | MSK
1 MB message size limit    | 1 MB default, configurable for higher
Data Streams with Shards   | Kafka Topics with Partitions
TLS in-flight encryption   | PLAINTEXT or TLS in-flight encryption
KMS at-rest encryption     | KMS at-rest encryption
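Shards and partitions both spread records by key. Kinesis Data Streams maps a record's partition key to a 128-bit value with MD5 and routes it to the shard whose hash key range contains that value. A minimal sketch, assuming the hash space is split evenly across shards (real hash key ranges can be uneven after resharding); the function name is hypothetical:

```python
import hashlib


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index.

    Assumes the 128-bit MD5 hash space is divided evenly between shards.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = 2 ** 128 // shard_count
    # min() guards the top of the range when shard_count does not divide 2**128.
    return min(hash_value // range_size, shard_count - 1)


shard_a = shard_for_key("user-42", 4)
shard_b = shard_for_key("user-42", 4)
```

The same key always lands on the same shard, which preserves per-key ordering. Kafka applies the same principle with its own partitioner to pick a topic partition.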

Tuan Do's Blog
