5 CloudWatch metrics worth monitoring when using AWS Kinesis Streams
Good health monitoring is key when using Kinesis Streams for real-time (or close to real-time) processing. Tracking key metrics in CloudWatch during development can illuminate bottlenecks before they become a problem in production. Tracking these metrics in production can give you an early warning before your system is slowed down by a bottleneck.
#1 - GetRecords.IteratorAgeMilliseconds
This metric tracks the read position across all shards and consumers in the stream. If it passes 50% of your retention period, you risk losing data, because records can expire before your consumers read them. Refer to the official AWS documentation on how to troubleshoot consumers falling behind in processing records.
Enable Enhanced Shard-level Metrics if you need this metric on a per-shard basis. If you use the Kinesis Client Library (KCL), you can get access to these metrics for a specific application.
Alarm on the Maximum statistic of this metric so that you are warned before data loss becomes a risk.
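As a rough illustration of the 50% rule above, the sketch below converts a retention period into a millisecond threshold and builds the arguments for a CloudWatch alarm on the Maximum statistic. The stream name, alarm name, period, and evaluation count are hypothetical choices, and the 24-hour retention is an assumption you should replace with your stream's actual setting.

```python
def iterator_age_alarm(stream_name, retention_hours=24, fraction=0.5):
    """Build keyword arguments for a CloudWatch alarm on iterator age."""
    # Retention period in milliseconds, scaled by the chosen fraction.
    threshold_ms = retention_hours * 60 * 60 * 1000 * fraction
    return {
        "AlarmName": f"{stream_name}-iterator-age",  # hypothetical naming scheme
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,            # assumed: evaluate one-minute datapoints
        "EvaluationPeriods": 5,  # assumed: five consecutive breaches
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


# "orders-stream" is a made-up stream name used only for illustration.
params = iterator_age_alarm("orders-stream", retention_hours=24)
print(params["Threshold"])  # 43200000.0 (12 hours in milliseconds)
```

If you use boto3, a dictionary like this could be passed as `put_metric_alarm(**params)` on a CloudWatch client. To be warned well before the risk zone, pass a smaller `fraction`.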
#2 - ReadProvisionedThroughputExceeded
The GetRecords API has limits you should be aware of. Each shard supports up to 5 read transactions per second, up to a maximum total read rate of 2 MB per second. If you exceed these limits, you will receive a ProvisionedThroughputExceededException. Track these exceptions, because they can cause your consumer to fall behind.
Use this metric to track the number of throttled GetRecords calls. A Minimum statistic of 1 will indicate that all your requests were being throttled. A Maximum statistic of 0 will indicate that no requests were being throttled. Check the Average statistic to see when your requests are throttled.
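The reading of the three statistics described above can be written down as a small helper. This is only a sketch of that interpretation; the function name and the wording of the returned messages are made up for illustration.

```python
def describe_throttling(minimum, maximum, average):
    """Interpret the Minimum, Maximum and Average of a throttling metric."""
    if maximum == 0:
        # No datapoint in the period recorded a throttled request.
        return "no requests throttled"
    if minimum >= 1:
        # Every datapoint in the period recorded throttling.
        return "all requests throttled"
    # Mixed picture: the Average shows how much throttling occurred.
    return f"some requests throttled (average {average:.2f})"


print(describe_throttling(0, 0, 0.0))   # no requests throttled
print(describe_throttling(1, 1, 1.0))   # all requests throttled
print(describe_throttling(0, 1, 0.25))  # some requests throttled (average 0.25)
```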
#3 - WriteProvisionedThroughputExceeded
Like reads, writes can be throttled. Each shard supports up to 1,000 records per second for writes, up to a maximum of 1 MB per second (including your partition keys). If your call to PutRecord or PutRecords exceeds these limits, a ProvisionedThroughputExceededException is returned.
When the Minimum statistic is non-zero, records were being throttled. When the Maximum statistic has a value of 0, no records were being throttled.
Most commonly you would check the Average statistic for throttled writes.
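The per-shard write limits in this section can be turned into a rough estimate of how many shards a workload needs. The sketch below assumes that records are spread evenly across shards by their partition keys and that 1 MB means 1,048,576 bytes. The function name and the sample numbers are illustrative only.

```python
import math

RECORDS_PER_SHARD_PER_SEC = 1000
BYTES_PER_SHARD_PER_SEC = 1024 * 1024  # assumption: 1 MB taken as 1 MiB


def shards_needed_for_writes(records_per_sec, avg_record_bytes):
    """Estimate the shard count that stays under both write limits."""
    # Shards needed to satisfy the records-per-second limit.
    by_count = math.ceil(records_per_sec / RECORDS_PER_SHARD_PER_SEC)
    # Shards needed to satisfy the bytes-per-second limit
    # (avg_record_bytes should include the partition key).
    by_bytes = math.ceil(
        records_per_sec * avg_record_bytes / BYTES_PER_SHARD_PER_SEC
    )
    # The stricter of the two limits wins; a stream has at least one shard.
    return max(1, by_count, by_bytes)


# 2,500 records/s at 2 KiB each: 3 shards by count, 5 shards by bytes.
print(shards_needed_for_writes(2500, 2048))  # 5
```

If the estimate is close to your current shard count, uneven partition keys can still throttle individual shards, so treat the result as a lower bound.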
#4 - PutRecord.Success, PutRecords.Success
These metrics measure the number of successful PutRecord and PutRecords requests. The official AWS documentation describes the reasons why these requests can fail.
Check the Average statistic for any drops in successful writes to your stream.
#5 - GetRecords.Success
This metric returns the number of successful GetRecords operations. Any possible errors are described in the official documentation.
Check this metric to see whether your consumers are failing to retrieve records from your stream. The Average statistic gives you a good indication of when this is happening.
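One way to spot drops in the Average statistic of a success metric is to scan its datapoints for values below a floor. The sketch below assumes datapoints shaped like the `Timestamp` and `Average` fields that CloudWatch returns for a statistics query. The sample values are invented, and the floor of 1.0 (every request succeeded) is an assumption you may want to relax.

```python
def find_success_drops(datapoints, floor=1.0):
    """Return timestamps of datapoints whose Average fell below `floor`."""
    # CloudWatch does not guarantee ordering, so sort by time first.
    ordered = sorted(datapoints, key=lambda point: point["Timestamp"])
    return [p["Timestamp"] for p in ordered if p["Average"] < floor]


# Made-up sample data; real timestamps would typically be datetime objects.
sample = [
    {"Timestamp": "2024-01-01T00:02:00Z", "Average": 0.4},
    {"Timestamp": "2024-01-01T00:00:00Z", "Average": 1.0},
    {"Timestamp": "2024-01-01T00:01:00Z", "Average": 1.0},
]
print(find_success_drops(sample))  # ['2024-01-01T00:02:00Z']
```

The same helper can be applied to PutRecord.Success and PutRecords.Success datapoints, since it depends only on the shape of the data.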