AWS Cloud-Native Platform

Scalable Observability Stack on AWS

A production-grade platform for collecting, storing, visualizing, and alerting on infrastructure, container, and application metrics using Prometheus, Grafana, Amazon EKS, and AWS managed services.

Section 01

Architecture Explorer

AWS Cloud-Native Topology

Data Flow

App → Prometheus (exposes /metrics)

Prometheus → AMP (remote_write)

Prometheus → Alertmanager (fires alerts)

Alertmanager → SNS/Slack (routes alerts)

Fluentd → CloudWatch (ships logs)

OTel → X-Ray (sends traces)
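
As a rough illustration of the first hop in the flow above, the sketch below renders counters in the Prometheus text exposition format, which is what an app serves at /metrics for Prometheus to pull. The metric name `http_requests_total`, the `COUNTERS` dict, and the helper `render_metrics` are hypothetical and not part of this platform.

```python
from http.server import BaseHTTPRequestHandler


def render_metrics(counters: dict[str, tuple[str, float]]) -> str:
    """Render counters in the Prometheus text exposition format.

    `counters` maps a metric name to (help text, current value).
    """
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical in-process counters an application might keep.
COUNTERS = {"http_requests_total": ("Total HTTP requests served.", 5)}


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves /metrics so that Prometheus can scrape it (pull model)."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, the handler can be passed to `http.server.HTTPServer(("", 8000), MetricsHandler)`; port 8000 is an arbitrary choice. A real service would more likely use an official Prometheus client library than hand-rendered text.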

Section 02

Platform Components

Six functional categories covering the full observability spectrum — from raw metric collection to distributed tracing and centralized logging.

Metrics Collection

Pull-based metrics collection from all layers of the stack. Prometheus scrapes annotated pods every 15 seconds and re-evaluates alerting rules at a regular evaluation interval.

Prometheus, node_exporter, kube-state-metrics
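
To get a feel for what a 15-second scrape interval means for storage and ingestion volume, here is a small worked calculation. The 10,000 active series figure is purely an assumed example, not a measurement from this platform.

```python
SCRAPE_INTERVAL_SECONDS = 15
SECONDS_PER_DAY = 86_400


def samples_per_day(active_series, interval_seconds=SCRAPE_INTERVAL_SECONDS):
    """Samples collected per day: one sample per series per scrape."""
    return active_series * (SECONDS_PER_DAY // interval_seconds)


print(samples_per_day(1))       # 5760 samples per day for a single series
print(samples_per_day(10_000))  # 57600000 for an assumed 10,000 active series
```

Doubling the interval to 30 seconds would halve these numbers, at the cost of coarser resolution.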

Visualization

Rich, interactive dashboards for infrastructure, Kubernetes, and application metrics. Amazon Managed Grafana (AMG) adds enterprise SSO and team collaboration features.

Grafana, Amazon Managed Grafana

Alerting

Multi-channel alert routing with deduplication, grouping, and silencing. Alerts flow from Prometheus → Alertmanager → SNS → Slack/Email.

Alertmanager, Amazon SNS, Slack Webhooks
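
The deduplication and grouping steps can be pictured with a toy model. This is a simplified sketch of the idea, not Alertmanager's actual implementation; the alert names, labels, and the `group_alerts` helper are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "severity")):
    """Drop alerts with an identical label set, then bucket by `group_by` labels."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:  # same label set already seen: a duplicate
            continue
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighCPU", "severity": "warning", "instance": "node-1"},
    {"alertname": "HighCPU", "severity": "warning", "instance": "node-1"},  # duplicate
    {"alertname": "HighCPU", "severity": "warning", "instance": "node-2"},
    {"alertname": "DiskFull", "severity": "critical", "instance": "node-1"},
]
grouped = group_alerts(alerts)
for key, members in grouped.items():
    print(key, len(members))
```

Each resulting group would become a single notification, so four incoming alerts collapse into two messages here.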

Managed Storage

Serverless, scalable metric and log storage. Amazon Managed Service for Prometheus (AMP) offers PromQL-compatible storage and retains metrics for 150 days by default. CloudWatch handles logs and RDS metrics.

Amazon Managed Prometheus, Amazon CloudWatch

Tracing

Distributed tracing for end-to-end request visibility. The OTel Collector receives OTLP traces and exports them to X-Ray for service map generation.

OpenTelemetry Collector, AWS X-Ray

Logging

Centralized log aggregation from all Kubernetes pods. Fluentd enriches logs with pod metadata and ships them to CloudWatch for search and analysis.

Fluentd, Amazon CloudWatch Logs

Prometheus Scrape Interval: 15s

AMP SLA Uptime: 99.9%

AMP Metric Retention: 150 days (default)

Section 03

Monitoring Strategy

Four distinct monitoring layers provide complete observability coverage from bare-metal node metrics up through application-level business KPIs.

Infrastructure Metrics

Node-level hardware and OS metrics

node_exporter, kube-state-metrics

CPU Usage (PromQL)

100 - (avg by(instance)(irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)

Percentage of CPU time not spent idle, averaged across all cores per node.
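
The arithmetic behind the CPU expression can be traced by hand. The sketch below mimics it for one node using made-up counter readings; `cpu_usage_percent` is a hypothetical helper, and the real `irate` works on the last two samples inside the range window.

```python
def cpu_usage_percent(idle_samples, interval_seconds):
    """Estimate CPU usage from per-core idle counters.

    idle_samples: one (previous, latest) pair of idle-seconds readings per core.
    """
    # Per-core idle rate: idle seconds accumulated per second of wall time.
    idle_rates = [(latest - prev) / interval_seconds for prev, latest in idle_samples]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores scraped 15 s apart: core 0 idled for 12 s, core 1 for 9 s.
usage = cpu_usage_percent([(1000.0, 1012.0), (2000.0, 2009.0)], 15)
print(round(usage, 1))  # 30.0
```

The idle fractions are 0.8 and 0.6, their average is 0.7, and so about 30% of CPU time was spent doing work.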

Memory Usage (PromQL)

(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

Percentage of total memory currently in use on each node.
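
As a quick worked example of the memory expression, assume a node with 4 GiB of total memory (the size of a t3.medium) of which 1 GiB is reported as available. The helper name is hypothetical.

```python
GIB = 1024 ** 3


def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Mirror of (1 - MemAvailable / MemTotal) * 100."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


usage = memory_usage_percent(1 * GIB, 4 * GIB)
print(usage)  # 75.0
```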

Disk I/O (PromQL)

rate(node_disk_io_time_seconds_total[5m])

Fraction of each second the disk spent busy performing I/O operations; a value near 1.0 means the device is fully utilized.

Network Traffic (PromQL)

rate(node_network_receive_bytes_total[5m])

Bytes received per second on each network interface of each node.
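
Both the disk and network expressions rely on `rate()` over a counter. The following is a simplified sketch of that idea, assuming evenly spaced samples; Prometheus's real `rate()` additionally extrapolates to the window boundaries, which this hypothetical `simple_rate` helper does not attempt.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset; count from zero after the reset.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Received-bytes counter sampled every 15 s, with a reset before t=45.
samples = [(0, 1_000), (15, 4_000), (30, 7_000), (45, 3_000), (60, 6_000)]
print(simple_rate(samples))  # 200.0
```

The increases are 3000, 3000, 3000 (after the reset), and 3000 bytes over 60 seconds, giving 200 bytes per second.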

Section 04

Alerting Rules

Production-grade Prometheus alerting rules with severity classification, PromQL expressions, and multi-channel routing via Alertmanager.

Pipeline: Prometheus Rules → Alertmanager → SNS Topic → Slack / Email

Section 05

Security Model

A multi-layered security approach following AWS Well-Architected Framework principles — least privilege, defense in depth, and encryption everywhere.

01 Identity & Access

IAM Roles for Service Accounts (IRSA)

IRSA binds a Kubernetes ServiceAccount to an AWS IAM Role via the EKS OIDC provider. Each workload — Prometheus, Fluentd, OTel Collector — receives only the minimum IAM permissions it needs. This eliminates the need for long-lived credentials and enforces the principle of least privilege at the pod level.

Prometheus: AmazonPrometheusRemoteWriteAccess

Fluentd: CloudWatch Logs write permissions, scoped to the log group

OTel Collector: AWSXRayDaemonWriteAccess
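
The ServiceAccount-to-role binding lives in the IAM role's trust policy. Below is a minimal sketch of how such a policy is typically shaped; the account ID, OIDC provider ID, namespace, and ServiceAccount name are placeholder assumptions, and `irsa_trust_policy` is a hypothetical helper.

```python
import json


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build a trust policy that lets exactly one ServiceAccount assume the role.

    oidc_provider is the cluster's OIDC issuer without the https:// prefix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Restrict the role to a single namespace/ServiceAccount pair.
                    f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                    f"{oidc_provider}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }


# All values below are placeholders for illustration only.
policy = irsa_trust_policy(
    account_id="111122223333",
    oidc_provider="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
    namespace="monitoring",
    service_account="prometheus",
)
print(json.dumps(policy, indent=2))
```

The ServiceAccount itself is then annotated with `eks.amazonaws.com/role-arn` pointing at the role, so pods using it receive short-lived credentials instead of static keys.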

02 Network Isolation

Private Subnet Architecture

All critical workloads — EKS nodes, RDS, Prometheus, Grafana — are deployed exclusively in private subnets with no direct internet-facing exposure. Outbound internet access is provided through NAT Gateways in the public subnets. The Application Load Balancer is the only public-facing entry point.

EKS nodes: private subnets only

RDS: private subnets with dedicated subnet group

NAT Gateway: outbound-only internet access

ALB: public subnets, HTTPS termination

03 Network Access Control

Security Group Restrictions

Security groups enforce strict ingress and egress rules at the resource level. The RDS security group only allows traffic on port 3306 from the EKS node security group. The ALB security group only allows HTTPS (443) from the internet. Internal service communication is restricted to known port ranges.

RDS SG: port 3306 from EKS nodes only

ALB SG: port 443 from 0.0.0.0/0

EKS nodes: no direct inbound from internet

Prometheus: port 9090 from monitoring namespace only
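
Security groups are default-deny: traffic passes only when an ingress rule matches. The toy model below encodes the RDS and ALB rules listed above to make that behavior concrete; the `sg:eks-nodes` source label and the `is_allowed` helper are hypothetical, and real security groups are evaluated by AWS, not by application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    group: str   # security group the rule is attached to
    port: int
    source: str  # a CIDR block or a referenced security group


RULES = [
    IngressRule("rds", 3306, "sg:eks-nodes"),  # MySQL from EKS nodes only
    IngressRule("alb", 443, "0.0.0.0/0"),      # HTTPS from anywhere
]


def is_allowed(group, port, source):
    """Default deny: allow only if some rule matches group, port, and source."""
    return any(
        rule.group == group
        and rule.port == port
        and rule.source in (source, "0.0.0.0/0")
        for rule in RULES
    )


print(is_allowed("rds", 3306, "sg:eks-nodes"))  # True
print(is_allowed("rds", 3306, "203.0.113.7"))   # False
print(is_allowed("alb", 443, "203.0.113.7"))    # True
print(is_allowed("alb", 80, "203.0.113.7"))     # False
```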

04 Data Protection

TLS Encryption in Transit

All data in transit between components is encrypted using TLS. The ALB terminates TLS for external traffic. Internal communication between Prometheus and Amazon Managed Prometheus uses SigV4 signing over HTTPS. Grafana connects to Prometheus via HTTPS with certificate validation enabled.

ALB: TLS 1.2+ termination with ACM certificate

Prometheus → AMP: HTTPS + SigV4 signing

Grafana → Prometheus: HTTPS with cert validation

Fluentd → CloudWatch: HTTPS API calls

Security Compliance Checklist

✓ Least Privilege IAM

✓ Private Subnets

✓ TLS in Transit

✓ Security Groups

✓ No Hardcoded Secrets

✓ IRSA Enabled

✓ CloudTrail Logging

✓ VPC Flow Logs

Section 06

Cost-Aware Design

Two deployment configurations — a lean lab setup for learning and a production-grade HA architecture — with estimated monthly AWS costs.

Budget Lab Version

Self-managed on minimal EKS — ideal for learning and portfolio demos

$120–$180 per month (est.)

Component        Configuration                   Est. cost
EKS Cluster      1 cluster (control plane)       $73/mo
EC2 Node Group   2× t3.medium nodes              $30/mo
Prometheus       Self-managed pod on EKS
Grafana          Self-hosted on EC2 t3.micro     $8/mo
RDS MySQL        db.t3.micro, 20GB, single-AZ    $15/mo
ALB              1 load balancer + LCU hours     $18/mo
CloudWatch       Basic metrics + 5GB logs        $5/mo
NAT Gateway      1 NAT + data transfer           $35/mo

Estimated Monthly Total: $120–$180
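
For a sanity check on the estimate, the listed line items can simply be added up. Prometheus is left out because no separate figure is given for it. Note that the items as listed sum to $184, slightly above the top of the quoted range, so the numbers are best read as rough estimates.

```python
# Monthly estimates in USD, copied from the line items above.
LINE_ITEMS = {
    "EKS Cluster": 73,
    "EC2 Node Group": 30,
    "Grafana": 8,
    "RDS MySQL": 15,
    "ALB": 18,
    "CloudWatch": 5,
    "NAT Gateway": 35,
}

total = sum(LINE_ITEMS.values())
print(total)  # 184

# The two largest contributors are the EKS control plane and the NAT Gateway.
largest = sorted(LINE_ITEMS, key=LINE_ITEMS.get, reverse=True)[:2]
print(largest)  # ['EKS Cluster', 'NAT Gateway']
```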

Advantages

✓ Lowest cost, suitable for personal projects

✓ Full control over Prometheus configuration

✓ Demonstrates hands-on Kubernetes expertise

✓ Easy to spin up and tear down

Limitations

✗ Manual Prometheus upgrades and maintenance

✗ No built-in HA for monitoring components

✗ Limited long-term metric retention

Cost Optimization Tips

Use Spot Instances for EKS node groups to reduce EC2 costs by up to 70%.

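Drop unneeded series and labels before they are sent (for example with write_relabel_configs on remote_write) to reduce AMP ingestion costs; Prometheus itself does not downsample.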

Set CloudWatch log retention to 30 days to avoid unbounded storage costs.
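
To put the Spot tip in numbers, here is a small calculation against the $30/mo node group estimate from the budget lab. The 70% figure is the stated ceiling rather than a guarantee, and the 50% case is an assumed, more conservative discount chosen only for comparison.

```python
def with_spot_discount(on_demand_monthly, discount=0.70):
    """Monthly cost after a Spot discount, rounded to cents."""
    return round(on_demand_monthly * (1 - discount), 2)


best_case = with_spot_discount(30)            # at the 70% ceiling
conservative = with_spot_discount(30, 0.50)   # assumed 50% discount
print(best_case, conservative)  # 9.0 15.0
```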

Section 07

Deployment Guide

Step-by-step instructions for deploying the full observability stack on AWS from scratch using Terraform and kubectl.

Step 1 of 7: Prerequisites

Ensure the following tools are installed and configured on your local machine before proceeding with the deployment.

Commands

# Verify required tools
aws --version         # AWS CLI v2+
terraform --version   # Terraform v1.5+
kubectl version --client  # kubectl v1.27+ (client only; no cluster exists yet)
helm version          # Helm v3.12+
docker --version      # Docker 24+

# Configure AWS credentials
aws configure
# AWS Access Key ID: <your-key>
# AWS Secret Access Key: <your-secret>
# Default region name: us-east-1

⚠ Ensure your AWS IAM user has AdministratorAccess or the specific permissions required for EKS, VPC, RDS, AMP, and AMG.
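
If you would rather check all prerequisites in one go, a short script along these lines could help. It is only a sketch: it confirms that each tool is on PATH and does not verify the minimum versions listed above. The `missing_tools` helper is hypothetical.

```python
import shutil

# Tools named in the prerequisites above.
REQUIRED_TOOLS = ["aws", "terraform", "kubectl", "helm", "docker"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print(f"missing: {', '.join(missing)}")
    else:
        print("all required tools found")
```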
