AWS Cloud-Native Platform

Scalable Observability Stack on AWS

A production-grade platform for collecting, storing, visualizing, and alerting on infrastructure, container, and application metrics using Prometheus, Grafana, Amazon EKS, and AWS managed services.

Monitoring Layers: 4
AWS Services: 7
Alert Rules: 5
Deploy Modes: 2
Section 01

Architecture Explorer

[Figure: architecture diagram, AWS Cloud-Native Topology]

Data Flow
App → Prometheus (exposes /metrics)
Prometheus → AMP (remote_write)
Prometheus → Alertmanager (fires alerts)
Alertmanager → SNS/Slack (routes alerts)
Fluentd → CloudWatch (ships logs)
OTel → X-Ray (sends traces)
Section 02

Platform Components

Six functional categories covering the full observability spectrum — from raw metric collection to distributed tracing and centralized logging.

Metrics Collection

Pull-based metrics collection from all layers of the stack. Prometheus scrapes annotated pods every 15 seconds and evaluates alerting rules continuously.

Prometheus, node_exporter, kube-state-metrics
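To make the pull model concrete, here is a minimal sketch of what an application might return from its /metrics endpoint. It uses only the Python standard library and renders counters in the Prometheus text exposition format; the metric name `app_requests_total` and the helper `render_metrics` are hypothetical, and a real service would more likely use an official client library.

```python
def render_metrics(counters):
    """Render a dict of counter values in the Prometheus text exposition format.

    `counters` maps a metric name to its current value. Labels, HELP lines,
    and other metric types are left out to keep the sketch short.
    """
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter an app might expose for Prometheus to scrape.
print(render_metrics({"app_requests_total": 42}), end="")
```

Prometheus would fetch this text on every scrape and store each line as a sample for the matching time series.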

Visualization

Rich, interactive dashboards for infrastructure, Kubernetes, and application metrics. AMG provides enterprise SSO and team collaboration features.

Grafana, Amazon Managed Grafana

Alerting

Multi-channel alert routing with deduplication, grouping, and silencing. Alerts flow from Prometheus → Alertmanager → SNS → Slack/Email.

Alertmanager, Amazon SNS, Slack Webhooks
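The deduplication and grouping described above can be sketched in a few lines of Python. This is a simplified model, not Alertmanager's actual implementation: it treats alerts with identical label sets as duplicates and buckets the rest by a list of `group_by` labels. The alert names and the `group_alerts` helper are assumptions made for illustration, and timing behaviour such as group_wait is ignored.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with identical label sets, then bucket them by `group_by` labels."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        # An alert's identity is its full label set.
        fingerprint = tuple(sorted(alert["labels"].items()))
        if fingerprint in seen:
            continue  # duplicate of an alert already received
        seen.add(fingerprint)
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: the first two are exact duplicates.
alerts = [
    {"labels": {"alertname": "HighCPU", "instance": "node-a"}},
    {"labels": {"alertname": "HighCPU", "instance": "node-a"}},
    {"labels": {"alertname": "HighCPU", "instance": "node-b"}},
    {"labels": {"alertname": "DiskFull", "instance": "node-a"}},
]
for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

Grouping by `alertname` means one notification can cover every node affected by the same problem instead of paging once per node.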

Managed Storage

Serverless, scalable metric and log storage. AMP offers PromQL-compatible storage and retains metrics for 150 days by default. CloudWatch handles logs and RDS metrics.

Amazon Managed Prometheus, Amazon CloudWatch

Tracing

Distributed tracing for end-to-end request visibility. The OTel Collector receives OTLP traces and exports them to X-Ray for service map generation.

OpenTelemetry Collector, AWS X-Ray

Logging

Centralized log aggregation from all Kubernetes pods. Fluentd enriches logs with pod metadata and ships them to CloudWatch for search and analysis.

Fluentd, Amazon CloudWatch Logs
Prometheus Scrape Interval: 15s
AMP SLA Uptime: 99.9%
AMP Metric Retention
Section 03

Monitoring Strategy

Four distinct monitoring layers provide complete observability coverage, from node-level hardware and OS metrics up through application-level business KPIs.

[Figure: monitoring dashboards]

Infrastructure Metrics

Node-level hardware and OS metrics

node_exporter, kube-state-metrics
CPU Usage (PromQL)
100 - (avg by(instance)(irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)

Percentage of CPU time not spent idle, averaged across all cores per node.
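As a worked example of the arithmetic behind that query, the sketch below reproduces it in plain Python for a single node. It assumes, as irate does, that only the last two samples of each core's idle counter matter; the sample values and the `cpu_usage_percent` helper are made up for illustration.

```python
def cpu_usage_percent(idle_samples):
    """Approximate 100 - avg(irate(idle)) * 100 for one node.

    `idle_samples` maps a CPU core label to its last two samples of
    node_cpu_seconds_total{mode="idle"}, each as (timestamp_seconds, value).
    """
    idle_rates = []
    for (t0, v0), (t1, v1) in idle_samples.values():
        # Idle seconds gained per wall-clock second: a fraction between 0 and 1.
        idle_rates.append((v1 - v0) / (t1 - t0))
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores scraped 15 seconds apart (hypothetical values).
samples = {
    "0": ((0.0, 5000.0), (15.0, 5012.0)),  # idle for 12 of 15 seconds
    "1": ((0.0, 4800.0), (15.0, 4809.0)),  # idle for 9 of 15 seconds
}
print(f"{cpu_usage_percent(samples):.1f}%")  # prints 30.0%
```

The cores were idle 80% and 60% of the time, so the average idle fraction is 0.7 and the node reports roughly 30% CPU usage.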

Memory Usage (PromQL)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

Percentage of total memory currently in use on each node.

Disk I/O (PromQL)
rate(node_disk_io_time_seconds_total[5m])

Fraction of each second the disk spent busy with I/O, averaged over the last 5 minutes.

Network Traffic (PromQL)
rate(node_network_receive_bytes_total[5m])

Per-second rate of bytes received on each network interface, per node.
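The disk and network queries both rely on rate() over a counter. A rough Python sketch of that idea follows: it sums the increases between consecutive samples, treats any drop in value as a counter reset, and divides by the elapsed time. It deliberately ignores the extrapolation Prometheus applies at the window edges, so read it as an approximation; `simple_rate` and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window of samples.

    `samples` is a list of (timestamp_seconds, counter_value), oldest first.
    A value lower than its predecessor is treated as a counter reset.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so `cur` is the increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Received-bytes counter scraped every 15 seconds (hypothetical values).
print(simple_rate([(0, 0), (15, 3000), (30, 6000)]))  # prints 200.0
```

Handling resets this way is what lets a counter-based graph stay continuous when node_exporter or the node itself restarts.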

Section 04

Alerting Rules

Production-grade Prometheus alerting rules with severity classification, PromQL expressions, and multi-channel routing via Alertmanager.

Pipeline: Prometheus Rules → Alertmanager → SNS Topic → Slack / Email
Section 05

Security Model

A multi-layered security approach following AWS Well-Architected Framework principles — least privilege, defense in depth, and encryption everywhere.

01 Identity & Access

IAM Roles for Service Accounts (IRSA)

IRSA binds a Kubernetes ServiceAccount to an AWS IAM Role via the EKS OIDC provider. Each workload — Prometheus, Fluentd, OTel Collector — receives only the minimum IAM permissions it needs. This eliminates the need for long-lived credentials and enforces the principle of least privilege at the pod level.

Prometheus: AmazonPrometheusRemoteWriteAccess
Fluentd: CloudWatchLogsFullAccess (scoped to log group)
OTel Collector: AWSXRayDaemonWriteAccess
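The binding between a ServiceAccount and an IAM role lives in the role's trust policy. The sketch below builds such a policy document in Python, assuming the usual IRSA shape: a federated OIDC principal, the `sts:AssumeRoleWithWebIdentity` action, and conditions on the token's `sub` and `aud` claims. The account ID, OIDC provider ID, namespace, and ServiceAccount name are placeholders, and `irsa_trust_policy` is a hypothetical helper.

```python
import json


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build an IAM trust policy that only one Kubernetes ServiceAccount can assume.

    `oidc_provider` is the cluster's OIDC issuer without the https:// prefix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Restrict the role to a single namespace/ServiceAccount pair.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


# Placeholder values; substitute your own account and cluster OIDC issuer.
policy = irsa_trust_policy(
    account_id="111122223333",
    oidc_provider="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE",
    namespace="monitoring",
    service_account="prometheus",
)
print(json.dumps(policy, indent=2))
```

Pinning the `sub` claim to one ServiceAccount is what keeps, say, the Fluentd pod from assuming the Prometheus role.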
02 Network Isolation

Private Subnet Architecture

All critical workloads — EKS nodes, RDS, Prometheus, Grafana — are deployed exclusively in private subnets with no direct internet-facing exposure. Outbound internet access is provided through NAT Gateways in the public subnets. The Application Load Balancer is the only public-facing entry point.

EKS nodes: private subnets only
RDS: private subnets with dedicated subnet group
NAT Gateway: outbound-only internet access
ALB: public subnets, HTTPS termination
03 Network Access Control

Security Group Restrictions

Security groups enforce strict ingress and egress rules at the resource level. The RDS security group only allows traffic on port 3306 from the EKS node security group. The ALB security group only allows HTTPS (443) from the internet. Internal service communication is restricted to known port ranges.

RDS SG: port 3306 from EKS nodes only
ALB SG: port 443 from 0.0.0.0/0
EKS nodes: no direct inbound from internet
Prometheus: port 9090 from monitoring namespace only
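Security groups are allow-lists: inbound traffic is dropped unless a rule matches. The toy model below captures that default-deny behaviour for the RDS and ALB rules listed above. It is a simplification written for illustration, with hypothetical group names (`rds-sg`, `alb-sg`, `eks-nodes-sg`) and no real CIDR matching beyond the 0.0.0.0/0 wildcard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    port: int
    source: str  # a security group name, or "0.0.0.0/0" for any source


# Hypothetical names mirroring the RDS and ALB rules described in the text.
INGRESS_RULES = {
    "rds-sg": [IngressRule(3306, "eks-nodes-sg")],
    "alb-sg": [IngressRule(443, "0.0.0.0/0")],
}


def is_allowed(target_sg, port, source):
    """Return True only if some ingress rule on `target_sg` matches (default deny)."""
    for rule in INGRESS_RULES.get(target_sg, []):
        if rule.port == port and rule.source in ("0.0.0.0/0", source):
            return True
    return False


print(is_allowed("rds-sg", 3306, "eks-nodes-sg"))  # prints True
print(is_allowed("rds-sg", 3306, "0.0.0.0/0"))     # prints False
```

Because nothing is permitted implicitly, exposing a new port always requires adding a rule that can be reviewed.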
04 Data Protection

TLS Encryption in Transit

All data in transit between components is encrypted using TLS. The ALB terminates TLS for external traffic. Internal communication between Prometheus and Amazon Managed Prometheus uses SigV4 signing over HTTPS. Grafana connects to Prometheus via HTTPS with certificate validation enabled.

ALB: TLS 1.2+ termination with ACM certificate
Prometheus → AMP: HTTPS + SigV4 signing
Grafana → Prometheus: HTTPS with cert validation
Fluentd → CloudWatch: HTTPS API calls
Security Compliance Checklist
Least Privilege IAM
Private Subnets
TLS in Transit
Security Groups
No Hardcoded Secrets
IRSA Enabled
CloudTrail Logging
VPC Flow Logs
Section 06

Cost-Aware Design

Two deployment configurations — a lean lab setup for learning and a production-grade HA architecture — with estimated monthly AWS costs.

Budget Lab Version

Self-managed on minimal EKS — ideal for learning and portfolio demos

$120–$180 per month (est.)
Component        Configuration                   Est. cost
EKS Cluster      1 cluster (control plane)       $73/mo
EC2 Node Group   2× t3.medium nodes              $30/mo
Prometheus       Self-managed pod on EKS         $0
Grafana          Self-hosted on EC2 t3.micro     $8/mo
RDS MySQL        db.t3.micro, 20 GB, single-AZ   $15/mo
ALB              1 load balancer + LCU hours     $18/mo
CloudWatch       Basic metrics + 5 GB logs       $5/mo
NAT Gateway      1 NAT + data transfer           $35/mo
Estimated Monthly Total: $120–$180
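As a quick sanity check on the estimate, the line items can be added up directly. The short Python sketch below uses only the per-component figures listed for the lab setup; the `LAB_COSTS` name is arbitrary.

```python
# Monthly estimates in USD, copied from the lab line items.
LAB_COSTS = {
    "EKS Cluster": 73,
    "EC2 Node Group": 30,
    "Prometheus": 0,
    "Grafana": 8,
    "RDS MySQL": 15,
    "ALB": 18,
    "CloudWatch": 5,
    "NAT Gateway": 35,
}

total = sum(LAB_COSTS.values())
print(f"${total}/mo")  # prints $184/mo
```

The items sum to $184, a few dollars above the top of the quoted range, so the range is best read as a rough guide that depends on actual usage and data transfer.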
Advantages
Lowest cost — suitable for personal projects
Full control over Prometheus configuration
Demonstrates hands-on Kubernetes expertise
Easy to spin up and tear down
Limitations
Manual Prometheus upgrades and maintenance
No built-in HA for monitoring components
Limited long-term metric retention
Cost Optimization Tips

Use Spot Instances for EKS node groups to reduce EC2 costs by up to 70%.

Drop unneeded series with write_relabel_configs (or lengthen the scrape interval) before remote_write to reduce AMP ingestion costs.

Set CloudWatch log retention to 30 days to avoid unbounded storage costs.

Section 07

Deployment Guide

Step-by-step instructions for deploying the full observability stack on AWS from scratch using Terraform and kubectl.

Step 1 of 7

01 Prerequisites

Ensure the following tools are installed and configured on your local machine before proceeding with the deployment.

Commands
# Verify required tools
aws --version         # AWS CLI v2+
terraform --version   # Terraform v1.5+
kubectl version --client   # kubectl v1.27+ (client only; no cluster exists yet)
helm version          # Helm v3.12+
docker --version      # Docker 24+

# Configure AWS credentials
aws configure
# AWS Access Key ID: <your-key>
# AWS Secret Access Key: <your-secret>
# Default region name: us-east-1

Ensure your AWS IAM user has AdministratorAccess or the specific permissions required for EKS, VPC, RDS, AMP, and AMG.
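If you would rather check all the prerequisites in one pass, a small script can report which command-line tools are missing from your PATH. The sketch below uses Python's standard `shutil.which`; it only checks that each binary exists, not that its version meets the minimums above, and the `missing_tools` helper is a hypothetical name.

```python
import shutil

# Binaries the deployment steps rely on.
REQUIRED_TOOLS = ["aws", "terraform", "kubectl", "helm", "docker"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the lookup can be stubbed out in tests.
    """
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found on PATH.")
```

Run it before the first deployment step; an empty list means every required binary was found.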