SAA - C03 Certification: Data & Analytics
I am a dedicated software engineer with a deep passion for security and a commitment to developing robust and scalable solutions. With over three years of hands-on experience in the .NET ecosystem, I have built, maintained, and optimized various software applications, demonstrating my ability to adapt to diverse project needs. In addition to my expertise in .NET, I have six months of specialized experience working with Spring Boot and ReactJS, further broadening my skill set to include full-stack development and modern web technologies. My professional journey includes deploying small to medium-sized systems to cloud platforms and on-premises environments, where I have ensured reliability, scalability, and efficient resource utilization. This combination of skills and experience reflects my versatility and commitment to staying at the forefront of the ever-evolving tech landscape.
Athena
Serverless query service to analyze data stored in S3
Use SQL language to query the files (built on Presto)
Supports CSV, JSON, ORC, Avro, and Parquet
Pricing: 5$ per TB of data scanned
Commonly used with Amazon Quicksight for reporting/dashboards
Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails,…
Exam tips: analyze data in S3 using serverless SQL, use Athena
Federated Query
Allows you to run SQL queries across data stored in SQL, NoSQL, object,…
Uses Data Source Connectors that run on AWS Lambda to run Federated Queries
Store the result back in the S3 bucket
Redshift
It is based on PostgreSQL, but it’s not used for OLTP
It’s OLAP (Online Analytical Processing)
10x better performance than other data warehouses, scale to PBs of data
Columnar storage of data & parallel query engine
Two modes: Serverless Cluster & Provisioned Cluster
Has a SQL interface for performing the queries
BI Tools such as Amazon Quicksight or Tableau
vs Athena: faster queries/joins/aggregations thanks to indexes
Redshift CLuster
The architecture:
Leader node: for query planning, results aggregation
Compute node: for performing the queries, send results to the leader node
Provisioned mode:
Choose instance types in advance
Can reserve instances for cost savings
Snapshots & DR
Snapshots are point-in-time backups of a cluster, stored internally in S3
Snapshots are incremental
You can restore a snapshot into a new cluster
Automated: every 8 hours, every 5 GB, or a schedule
Manual: snapshot is retained until you delete it
You can configure Redshift to copy snapshots of a cluster to another region automatically
Redshift Spectrum
Query data that is already in S3 without loading it
Must have a Redshift cluster available to start the query
The query is then submitted to thousands of Redshift Spectrum nodes
OpenSearch
Two modes:
managed cluster
serverless cluster
Does not natively support SQL (can be enabled via a plugin)
Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs,…
Comes with OpenSearch Dashboards
EMR
EMR stands for “Elastic MapReduce”
EMR helps create Hadoop clusters to analyze and process vast amounts of data
The cluster can be made of hundreds of EC2 instances
EMR comes bundled with Spark, HBase, Presto, Flink,…
EMR takes care of all the provisioning and configuration
Auto-scaling and integrated with Spot instances
Use cases: data processing, machine learning, web indexing, big data,…
Node types & Purchasing
Master Node
Core Node
Task Node (optional)
Purchasing options:
On-demand
Reserved (min 1 year): cost savings
Spot instances: cheaper
Can have a long-running cluster, or transient (temporary) cluster
Quicksight
Serverless machine learning-powered business intelligence service to create interactive dashboards
Fast, automatically scalable, embeddable, with per-session pricing
Use cases:
Business Analytics
Building visualizations
Perform ad-hoc analysis
Integrated with RDS, Aurora, Athena, Redshift, S3,…
Im-memory computation using the SPICE engine if data is imported into QuickSight
Enterprise edition: Column-level Security
Glue
Managed extract, transform, and load (ETL) service
Useful to prepare and transform data for analytics
Full serverless services
Use cases: convert data into Parquet format
Things to know at a high level
Glue Job Bookmarks: prevent re-processing old data
Glue Elastic Views:
Combine and replicate data across multiple data stores using SQL
No custom code
Leverages a “virtual table”
Glue DataBrew: clean and normalize data using pre-built transformation
Glue Studio: new GUI to create, run, and monitor ETL jobs in Glue
Glue Streaming ETL (built on Spark): compatible with Kinesis Data Streaming, Kafka, MSK (managed Kafka)
AWS Lake Formation
Data lake = central place to have all data for analytics purposes
Fully managed service that makes it easy to set up a data lake in days
Discover, cleanse, transform, and ingest data into Data Lake
It automates many complex manual steps (collecting, cleansing, moving, cataloging data,…) and de-duplicate (using ML Transforms)
Combine structured and unstructured data in the data lake
Out-of-the-box source blueprints: S3, RDS, Relational & NoSQL DB,…
Fine-grained Access Control for applications (row and column-level)
Built on top of AWS Glue
Kinesis Data Analytics
For SQL
Real-time analytics on Kinesis Data Stream & Firehose using SQL
Add reference data from S3 to enrich streaming data
It is fully managed, with no servers to provision
Automatic scaling
Pay for actual consumption rate
Output:
Kinesis Data Streams
Kinesis Data Firehose
Use cases: Time series analytics, Real-time dashboards, Real-time metrics
For Apache Flink
Use Flink (Java, Scala, or SQL) to process and analyze streaming data
Run any Apache Flink application on a managed cluster on AWS
provisioning compute resources, parallel computation, automatic scaling
application backups
Use any Apache Flink programming features
Flink does not read from Firehose (use Kinesis Analytics for SQL instead)
MSK (Managed Streaming for Apache Kafka)
Alternative to Amazon Kinesis
Fully managed Kafka on AWS
Allow to create, update, and delete clusters
MSK creates & manages Kafka brokers nodes & Zookeeper nodes
Deploy the MSK cluster in VPC, multi-AZ (up to 3 for HA)
Automatic recovery from common Kafka failures
Data is stored on EBS volumes for as long as you want
MSK Serverless
Run Kafka on AWS on MSK without managing the capacity
MSK automatically provisions resources and scales computing & storage
The difference between Kinesis Data Streams vs MSK
| Kinesis Data Stream | MSK |
| 1 MB message size limit | 1 MB default, configured for higher |
| Data Streams with Shards | Kafka Topic with Partitions |
| TLS in-flight encryption | PLAINTEXT or TLS In-flight encryption |
| KMS at-rest encryption | KMS at-rest encryption |