CloudWatch is a collection of services that monitor/observe cloud resources – via metrics, logs, and events – and provide actionable insights.
- CloudWatch monitors the performance of AWS services and acts as a repository for metric data.
- CloudWatch logs the events and visualizes them.
Amazon CloudWatch
Features
- Collect metrics on AWS and on premises
- A metric is a set of data points over time (e.g., CPU utilization of EC2 instances).
- Enhance operational visibility and insights
- Metrics can be configured with alarms that can take actions.
- Improve resource optimization
- For example, Auto Scaling depends on CloudWatch alarms to trigger the addition or removal of instances.
Data Retention
- one-hour metrics (for 455 days)
- five-minute metrics (for 63 days)
- one-minute metrics (for 15 days)
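The retention tiers above can be read as a lookup from data age to the finest period that is still available. The following is a minimal sketch of that lookup; the function name `finest_period_seconds` is hypothetical, and it covers only the three tiers listed in this section.

```python
from typing import Optional

# (maximum age in days, period in seconds), finest tier first,
# copied from the retention list in this section.
RETENTION_TIERS = [(15, 60), (63, 300), (455, 3600)]


def finest_period_seconds(age_days: float) -> Optional[int]:
    """Return the finest period still retained for data of the given age."""
    for max_age_days, period in RETENTION_TIERS:
        if age_days <= max_age_days:
            return period
    return None  # older than 455 days: the data has been dropped


print(finest_period_seconds(10))   # 60
print(finest_period_seconds(30))   # 300
print(finest_period_seconds(500))  # None
```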
Monitoring Plans
- Standard (Basic) monitoring: Data is available in 5-minute periods at no charge.
- Detailed monitoring: Data is available in 1-minute periods for an additional charge.
CloudWatch Services
CloudWatch is a collection of services: Alarms, Logs, Metrics, and Events.
Alarms
An alarm watches a metric over a specified time period and performs one or more specified actions, based on the value of the metric relative to a threshold over time.
- Alarm State
- INSUFFICIENT_DATA: not enough data is available to determine the state.
- ALARM: the metric has breached the threshold.
- OK: the metric is within the defined threshold.
- The components of alarms are:
- Metric: The data points being measured
- Threshold: the criteria to check it is normal or abnormal
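- Period: the length of time over which the metric is evaluated; together with the number of evaluation periods, it determines how long the threshold must be breached before the alarm fires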
- Action (Target): What needs to be done when an alarm is triggered
- SNS Notification
- Lambda actions
- EC2 actions: Stop, terminate, recover, or reboot an EC2 instance
- Auto Scaling actions: Execute an Auto Scaling policy
- Composite Alarms
- An alarm is on a single metric but you can combine alarms.
- monitoring the status of multiple alarms
- AND / OR conditions
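The alarm components above (metric, threshold, period, action) map onto the parameters of the `PutMetricAlarm` API, and a composite alarm combines existing alarms with a rule expression. The sketch below only builds the parameter dictionaries; the alarm names, the 80% threshold, and the helper names are assumptions chosen for illustration. With boto3 installed and credentials configured, the dictionaries could be passed to `put_metric_alarm(**params)` and `put_composite_alarm(AlarmName=..., AlarmRule=rule)`.

```python
def metric_alarm_params(instance_id: str, topic_arn: str) -> dict:
    """Alarm when average CPU stays above 80% for two 5-minute periods."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,               # seconds per evaluation period
        "EvaluationPeriods": 2,      # periods that must breach the threshold
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic to notify
    }


def composite_alarm_rule(alarm_names: list, operator: str = "AND") -> str:
    """Combine alarms into a composite rule such as ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


params = metric_alarm_params("i-0123456789abcdef0", "arn:aws:sns:us-east-1:111122223333:ops")
rule = composite_alarm_rule([params["AlarmName"], "high-memory"], "AND")
print(rule)
```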
Metrics
- A time-ordered (timestamp) set of data points
- Exist only in a region where they are created
- Cannot be deleted, but they expire: older data is aggregated into coarser periods, and data older than 15 months is dropped.
| Services/Resources | Metric |
|---|---|
| EC2 instances | CPUUtilization, StatusCheckFailed_System, StatusCheckFailed_Instance |
| ELB | HealthyHostCount, UnHealthyHostCount, RequestCount, HTTPCode_ELB_5XX, HTTPCode_ELB_4XX, SurgeQueueLength, SpilloverCount |
| Network (instance) | NetworkIn, NetworkOut |
| I/O (instance) | DiskReadOps, DiskReadBytes, DiskWriteOps, DiskWriteBytes |
| ASG | Min/max group size; desired capacity; instance states (InService, Pending, Standby, Terminating); 400 and 500 errors |
| Billing | by Service (such as AmazonS3, AmazonEC2, …) |
| Custom | Application Specific |
Metric Streams
- You can stream CloudWatch metrics to specified destinations in near-realtime.
- A metric stream sends metrics to Kinesis Data Firehose, which delivers them to other destinations (S3, Redshift, OpenSearch).
- You can stream only a subset of metrics by filtering.
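As an illustration of the filtering described above, the parameters for the `PutMetricStream` API might look like the following. This is a sketch under assumptions: the stream name and ARNs are placeholders, and the helper name `metric_stream_params` is hypothetical. With boto3, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_stream(**params)`.

```python
def metric_stream_params(name, firehose_arn, role_arn, namespaces):
    """Build parameters for a metric stream limited to the given namespaces."""
    return {
        "Name": name,
        "FirehoseArn": firehose_arn,  # delivery stream that receives the metrics
        "RoleArn": role_arn,          # role CloudWatch assumes to write to Firehose
        "OutputFormat": "json",
        # Only metrics in these namespaces are streamed.
        "IncludeFilters": [{"Namespace": ns} for ns in namespaces],
    }


params = metric_stream_params(
    "ec2-and-elb-metrics",  # placeholder name
    "arn:aws:firehose:us-east-1:111122223333:deliverystream/metrics",
    "arn:aws:iam::111122223333:role/metric-stream-role",
    ["AWS/EC2", "AWS/ELB"],
)
print(params["IncludeFilters"])
```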
CloudWatch Metrics Components
Namespaces
- A container for CloudWatch metrics
- The naming convention for AWS services: AWS/service
- AWS/EC2, AWS/S3
Dimensions
- A name/value pair (attribute) that uniquely identifies a metric.
- instance ID, environment
Statistics
- Aggregated metric data over specified periods of time
- Minimum, Maximum, Average, Sum, SampleCount …
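To make the idea of statistics concrete, the sketch below groups raw data points into fixed periods and computes the statistics named above. It is a local illustration of the aggregation only, not how CloudWatch is implemented; the function name `aggregate` and the sample values are assumptions.

```python
from collections import defaultdict


def aggregate(datapoints, period_seconds=300):
    """Group (epoch_seconds, value) pairs into periods and compute statistics."""
    buckets = defaultdict(list)
    for timestamp, value in datapoints:
        # Align each data point to the start of its period.
        buckets[timestamp - timestamp % period_seconds].append(value)
    return {
        start: {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Minimum": min(values),
            "Maximum": max(values),
            "Average": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


points = [(0, 10.0), (60, 30.0), (120, 20.0), (300, 50.0)]
for start, stats in aggregate(points).items():
    print(start, stats)
```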
CloudWatch Anomaly Detection
- Once it’s enabled, CloudWatch analyzes metrics to:
- determine the normal baseline
- check anomalies using Machine Learning
- display the values in a graph
- You can create an alarm based on the expected (normal) values
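An alarm on the expected range can be expressed with the `ANOMALY_DETECTION_BAND` metric math function. The sketch below builds `PutMetricAlarm` parameters for such an alarm; the alarm name, the band width of 2, and the helper name are assumptions, and the dictionary would be passed to `put_metric_alarm(**params)` with boto3.

```python
def anomaly_alarm_params(instance_id: str, band_width: int = 2) -> dict:
    """Alarm when CPU utilization leaves the band of expected values."""
    return {
        "AlarmName": f"cpu-anomaly-{instance_id}",  # hypothetical name
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",  # compare m1 against the band below
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                # A wider band tolerates larger deviations from the baseline.
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }


print(anomaly_alarm_params("i-0123456789abcdef0")["Metrics"][1]["Expression"])
```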
CloudWatch and EC2 Instances
CloudWatch does not collect some metrics for EC2 instances by default. To collect them, install the CloudWatch agent on the instances.
Default Metrics:
- Host Level metrics:
- CPU Utilization, Disk Reads/Writes, and Network Utilization (Network In/Out)
Custom Metrics with CloudWatch agents:
- EC2 does not send OS-level metrics to CloudWatch
- Memory utilization, processes, and disk space and swap usage
- The agent includes metrics such as 'mem_active', 'mem_available', and 'mem_free'.
Metric Resolutions
- Custom metrics can be one of the following resolutions:
- Standard resolution, with data having a one-minute granularity.
- High resolution, with data at a granularity of one second.
- Standard metrics use standard resolution.
- When you publish a high-resolution metric, CloudWatch stores it with a resolution of 1 second, and you can read and retrieve it with a period of 1 second, 5 seconds, 10 seconds, 30 seconds, or any multiple of 60 seconds.
Use API: PutMetricData
- Metric Resolution (--storage-resolution parameter)
- Standard: 60 seconds
- High resolution: 1 second, at higher cost (the data can then be read at periods of 1, 5, 10, or 30 seconds)
- Timestamp (--timestamp)
- You can push a metric data point with a specified timestamp (from 2 weeks in the past to 2 hours in the future).
- Make sure the clock of your EC2 instances is set correctly to avoid errors.
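The two parameters above can be sketched as a small helper that builds one entry of the `MetricData` list for `PutMetricData` and rejects timestamps outside the accepted window. The helper name `metric_datum`, the metric name `QueueDepth`, and the namespace `MyApp` are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def metric_datum(name, value, timestamp=None, high_resolution=False, now=None):
    """Build one MetricData entry, validating the timestamp window."""
    now = now or datetime.now(timezone.utc)
    timestamp = timestamp or now
    # Accepted window: 2 weeks in the past to 2 hours in the future.
    if not (now - timedelta(weeks=2) <= timestamp <= now + timedelta(hours=2)):
        raise ValueError("timestamp must be within 2 weeks past and 2 hours future")
    return {
        "MetricName": name,
        "Value": value,
        "Timestamp": timestamp,
        "StorageResolution": 1 if high_resolution else 60,
    }


datum = metric_datum("QueueDepth", 12, high_resolution=True)
print(datum["StorageResolution"])
# With boto3 and credentials configured, this could be published with:
#   boto3.client("cloudwatch").put_metric_data(Namespace="MyApp", MetricData=[datum])
```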
Custom Metric Aggregation
You can aggregate your data before you publish to CloudWatch. When you have multiple data points per minute, aggregating data minimizes the number of calls to put-metric-data.
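One way to pre-aggregate is to publish a statistic set instead of individual values: `PutMetricData` accepts a `StatisticValues` structure with `SampleCount`, `Sum`, `Minimum`, and `Maximum`. The sketch below builds such an entry from one minute of samples; the helper name `statistic_set` and the metric name are hypothetical.

```python
def statistic_set(name, values, unit="None"):
    """Collapse many samples into one MetricData entry with StatisticValues."""
    if not values:
        raise ValueError("at least one value is required")
    return {
        "MetricName": name,
        "StatisticValues": {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Minimum": min(values),
            "Maximum": max(values),
        },
        "Unit": unit,
    }


# Four latency samples collected during one minute become a single data point,
# so one put-metric-data call replaces four.
latencies_ms = [120, 95, 240, 110]
print(statistic_set("RequestLatency", latencies_ms, unit="Milliseconds"))
```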
CloudWatch Unified Agents
- The unified CloudWatch agent replaces the older CloudWatch agent and the CloudWatch Logs agent.
- The CloudWatch unified agent enables you to collect system logs, application logs, and metrics from your EC2 instances or on-premises servers.
- The logs can be sent to Amazon CloudWatch Logs.
- CloudWatch Logs Insights can be used to query logs.
- Systems Manager Parameter Store can be used to save the centralized configuration.
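The agent is driven by a JSON configuration file, which is the document that could be stored centrally in Parameter Store. The sketch below generates a small configuration that collects memory, disk, and swap metrics plus one log file; the file path, log group name, and chosen measurements are assumptions for illustration.

```python
import json

agent_config = {
    "metrics": {
        "metrics_collected": {
            # OS-level metrics that EC2 does not report on its own.
            "mem": {"measurement": ["mem_used_percent", "mem_available"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            "swap": {"measurement": ["swap_used_percent"]},
        }
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/my-app/app.log",  # hypothetical path
                        "log_group_name": "my-app",              # hypothetical group
                        "log_stream_name": "{instance_id}",      # expanded by the agent
                    }
                ]
            }
        }
    },
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```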
CloudWatch Logs
CloudWatch Logs is a service for centralizing logs. It lets you store, monitor, and access logs from AWS services and applications.
- CloudWatch Logs accepts log data from various sources as a flow of time-ordered events.
- Structure
- A log group is a container for log streams. It usually represents an application. It controls retention, monitoring, and access. You can set filters in a group.
- A log stream is a sequence of log events with the same source within a log group.
- A log event is a timestamp and a raw message.
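The group/stream/event structure is visible in the parameters of the `PutLogEvents` API: a log group name, a log stream name, and a list of events that each carry a millisecond timestamp and a message. The sketch below builds those parameters and sorts the events chronologically; the helper name and the group and stream names are assumptions.

```python
import time


def build_put_log_events(log_group, log_stream, messages):
    """messages: iterable of (epoch_milliseconds, text) pairs."""
    events = [
        {"timestamp": ts, "message": text}
        for ts, text in sorted(messages, key=lambda pair: pair[0])
    ]
    return {
        "logGroupName": log_group,    # container, usually one per application
        "logStreamName": log_stream,  # one source, e.g. one instance
        "logEvents": events,          # timestamp plus raw message
    }


now_ms = int(time.time() * 1000)
params = build_put_log_events(
    "my-app",
    "i-0123456789abcdef0",
    [(now_ms, "request finished"), (now_ms - 500, "request started")],
)
print([event["message"] for event in params["logEvents"]])
# With boto3: boto3.client("logs").put_log_events(**params)
```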
Sources
- AWS Services
- EC2, CloudTrail, …
- Elastic Beanstalk: a collection of application logs
- ECS: a collection of container logs
- Lambda: a collection of function logs
- VPC Flow Logs
- WAF Logs
- API Streams
- Custom applications
- CloudWatch Unified agents
- “CloudWatch Logs agents” are deprecated.
Export to S3
- To back up your logs, you can export them to S3
- S3 buckets must be encrypted with AES-256 (SSE-S3) or SSE-KMS.
- CreateExportTask API
- Export is not real-time processing.
- For real-time processing, use CloudWatch Logs subscriptions.
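A call to `CreateExportTask` takes the log group, a time range in epoch milliseconds, and the destination bucket. The sketch below only assembles those parameters; the bucket name, prefix, task name, and helper name are assumptions. With boto3, the result could be passed to `boto3.client("logs").create_export_task(**params)`.

```python
from datetime import datetime, timezone


def export_task_params(log_group, bucket, start, end, prefix="exported-logs"):
    """Build CreateExportTask parameters for the [start, end] time range."""

    def to_millis(moment: datetime) -> int:
        return int(moment.timestamp() * 1000)

    return {
        "taskName": f"export-{log_group}",  # hypothetical naming scheme
        "logGroupName": log_group,
        "fromTime": to_millis(start),
        "to": to_millis(end),
        "destination": bucket,              # S3 bucket name
        "destinationPrefix": prefix,
    }


params = export_task_params(
    "my-app",
    "my-log-archive-bucket",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
)
print(params["fromTime"], params["to"])
```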
CloudWatch Logs – Metric Filters
You can search and filter the log data coming into CloudWatch Logs by creating one or more metric filters.
- A metric filter uses pattern matches to analyze logs and create metrics.
- You can optionally assign dimensions and a unit to the metric.
- You can then create alarms on the resulting metric.
- CloudWatch Logs uses these metric filters to turn log data into numerical CloudWatch metrics that you can graph or set an alarm on.
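As a sketch of how a metric filter turns log lines into a number, the block below builds `PutMetricFilter` parameters that count events containing the term `ERROR`, and includes a local stand-in that mimics the counting on sample lines. The filter name, metric name, and namespace are assumptions; the local function is an illustration, not the service's matching engine.

```python
def error_metric_filter_params(log_group: str) -> dict:
    """Emit 1 to the ErrorCount metric for every log event containing ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",      # hypothetical name
        "filterPattern": "ERROR",         # match events containing this term
        "metricTransformations": [
            {
                "metricName": "ErrorCount",
                "metricNamespace": "MyApp",  # hypothetical custom namespace
                "metricValue": "1",          # value published per matching event
                "defaultValue": 0.0,         # value published when nothing matches
                "unit": "Count",
            }
        ],
    }


def count_matches(lines, term="ERROR"):
    """Local stand-in for the filter: count lines containing the term."""
    return sum(1 for line in lines if term in line)


sample = ["INFO started", "ERROR disk full", "ERROR timeout", "INFO done"]
print(count_matches(sample))
# With boto3: boto3.client("logs").put_metric_filter(**error_metric_filter_params("my-app"))
```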
CloudWatch Logs Insights
CloudWatch Logs Insights is an enhanced service for monitoring some AWS services and your applications. It collects, aggregates, and summarizes logs and metrics.
You can query logs to help respond to operational issues more efficiently.
- Container Insights
- for containerized applications and services
- Lambda Insights
- for detailed performance metrics and logs of your lambda functions
- Application Insights
- for resources and work-load specific metrics of your application
# Find the 25 most recently added log events.
fields @timestamp, @message
| sort @timestamp desc
| limit 25

# Get the number of exceptions per hour.
filter @message like /Exception/
| stats count(*) as exceptionCount by bin(1h)
| sort exceptionCount desc

# API Gateway: find the last 10 4XX errors.
fields @timestamp, status, ip, path, httpMethod
| filter status>=400 and status<=499
| sort @timestamp desc
| limit 10
- With CloudWatch Logs Insights, you can only query data from the past. It is not real-time analysis.
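Because a query runs over a past time range, running one programmatically means starting it with `StartQuery` and then polling `GetQueryResults` until it finishes. The sketch below shows that flow; `logs_client` is assumed to be a boto3 CloudWatch Logs client (for example `boto3.client("logs")`), and the helper names and the log group are hypothetical.

```python
import time
from datetime import datetime, timedelta, timezone

EXCEPTIONS_PER_HOUR = (
    "filter @message like /Exception/ "
    "| stats count(*) as exceptionCount by bin(1h) "
    "| sort exceptionCount desc"
)


def insights_query_params(log_group, start, end, query=EXCEPTIONS_PER_HOUR):
    """Build StartQuery parameters; the time range is in epoch seconds."""
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),
        "endTime": int(end.timestamp()),
        "queryString": query,
    }


def run_query(logs_client, params, poll_seconds=1.0):
    """Start a query and poll until it is no longer scheduled or running."""
    query_id = logs_client.start_query(**params)["queryId"]
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            return response
        time.sleep(poll_seconds)


end = datetime.now(timezone.utc)
params = insights_query_params("my-app", end - timedelta(hours=24), end)
print(params["queryString"])
```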
Real-time Log Events Subscriptions
- To receive log events in real time, use a CloudWatch Logs subscription.
- You can send events to:
- Kinesis Firehose -> S3, Redshift, OpenSearch
- Kinesis Data Streams
- Lambda
- You can set up “Subscription Filters” to control which logs are delivered.
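When the subscription target is Lambda, the function receives the log events as a base64-encoded, gzip-compressed JSON document under `event["awslogs"]["data"]`. The following is a minimal sketch of a handler that decodes it; what the handler does with each event (here, printing it) is an assumption for illustration.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> dict:
    """Decode the payload CloudWatch Logs delivers to a Lambda subscriber."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def handler(event, context):
    payload = decode_subscription_event(event)
    for log_event in payload["logEvents"]:
        print(payload["logGroup"], log_event["timestamp"], log_event["message"])
    return len(payload["logEvents"])


# Local round trip with a hand-built payload in the same shape.
sample = {
    "logGroup": "my-app",
    "logStream": "i-0123456789abcdef0",
    "logEvents": [{"id": "1", "timestamp": 1704067200000, "message": "ERROR timeout"}],
}
encoded = base64.b64encode(gzip.compress(json.dumps(sample).encode("utf-8"))).decode("ascii")
print(handler({"awslogs": {"data": encoded}}, None))
```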
CloudWatch Events
CloudWatch Events has been replaced with EventBridge.
CloudWatch Synthetics Canary
- Provides configurable scripts to monitor APIs or Web endpoints
- Node.js or Python
- Reproduce what customers do
- Find issues before customers are impacted
- Blueprints
- Heartbeat Monitor
- loads URL and stores screenshots
- Visual Monitoring
- Compares the screenshots taken during the Canary with the baseline screenshots
- API Canary
- tests read/write functions of REST APIs
- Broken Link Checker
- checks all link connections
- Canary Recorder
- records the actions
