[AWS] Kinesis Data Streams

Kinesis Data Streams (KDS) collects and processes a large amount of incoming data from an unlimited number of producers.

https://aws.amazon.com/kinesis

Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service.

KDS Features

KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events.
The data collected is available in milliseconds to enable real-time analytics.

Use Cases

Producers supply data to Kinesis, e.g., any IoT (Internet of Things) devices.
Consumers are any entity that can consume the data.
“Kinesis Data Streams” is used:
- Real-time analytics or feed data into other services in real-time with data retention.
- e.g.) analyze logs continuously or run real-time analytics on click system data
“Kinesis Data Streams” keeps the item for a while (retention period – 24 hours by default)
- You can invoke the Lambda function to process each item.
- “Kinesis Firehose” is just a connector and does not keep the item so you cannot invoke a function per item.

How “Kinesis Data Streams” Works

By default, a Lambda function is invoked as soon as records are available in the stream.
- Lambda service can process up to 10 batches in each shard simultaneously.
- If you increase the number of concurrent batches per Shard, Lambda still ensures in-order processing at the partition-key level.
Transient Data Store:
- Streams are deleted based on their rolling retention window.
- Once data is inserted in Kinesis, it can’t be deleted.
- Retention Period (1~365 days)
  - 24-hour default
  - ~ 7 days: resolve potential downstream data losses
  - ~ 365 days: reprocess data, back-fill data stores, and auditing
Kinesis Data Stream provides an ordering of records.
- Data that shares the same partition goes to the same shard.

Kinesis Shards

Shards

Shards are the capacity of a Kinesis Stream.
Each shard has the same partition key, but the data are divided by the sequence number.
Kinesis Data Record
- Data record is the basic entity.
- Each shard consists of a sequence of data records.
- Data records are composed of a sequence number, a partition key, and a data blob. Data blob can be up to 1MB.

Resharding

Resharding enables you to increase or decrease the number of shards in a stream in order to adapt to changes in the rate of data flowing through the stream.
Typically, each shard is processed by a single worker and has one corresponding record processor. All record processors run in parallel within a process.
When you do the resharding, you can use Auto Scaling to automatically scale your instances.

Capacity Modes

Provisioned Mode
- Choose the number of shards provisioned
  - Read Capacity: 1 MB/s per shard
  - Write Capacity: 2 MB/s per shard
- Pay per shard provisioned
On-demand Mode
- Capacity is adjusted on demand
- Scales automatically based on the observed throughput
- Pay per throughput

Security

IAM Policies
Encryption in transit via HTTPS
Encryption at rest using KMS

Error Handling

The “PutRecords” API call to write multiple data records into a “Kinesis data streams” in a single call.

An unsuccessfully processed record includes ErrorCode and ErrorMessage values. The “ProvisionedThroughputExceededException” indicates that the request rate for the stream is too high, or the requested data is too large for the available throughput.

Reduce the frequency or size of your requests
Use an error retry and exponential backoff mechanism
Distribute read and write operations as evenly as possible across all of the shards in Data Streams
Finally, you can reshard your stream to increase the number of shards in the stream

Interacting with “Kinesis Data Streams”

Kinesis Producer Library (KPL)
- KPL passes data to Kinesis Data Stream.
- KPL provides the efficient abstraction layer for ingesting data with automatic retry and better performance
Kinesis Client Library (KCL)
- KCL delivers all records for a given partition key to the same record processor, making it easier to build multiple applications reading from the same “Kinesis Data Streams”.
Kinesis API (AWS SDK)
- “Kinesis API” is used to interact with “Kinesis Data Streams” through low level API operations such as (PutRecord or GetRecords).

Scaling Consumers

You can use the CloudWatch metric and alarm to scale consumers.
GetRecords.IteratorAgeMilliseconds
- The CloudWatch metric can be used to track the progress of consumers.
- If the value is bigger than 0, there are some unprocessed records.
Create an alarm and link it to the Auto Scaling

Working with Lambda

Kinesis Data Stream can be used as an event source for Lambda.

Kinesis Data Stream

The number of Shards

Lambda

Function batch size
Concurrent batches per shard
Batch window

[AWS] Kinesis Data Streams

KDS Features

Use Cases

How “Kinesis Data Streams” Works

Kinesis Shards

Capacity Modes

Security

Error Handling

Interacting with “Kinesis Data Streams”

Scaling Consumers

Working with Lambda

Published by P. L.

Leave a Comment Cancel reply

KDS Features

Use Cases

How “Kinesis Data Streams” Works

Kinesis Shards

Capacity Modes

Security

Error Handling

Interacting with “Kinesis Data Streams”

Scaling Consumers

Working with Lambda

Share this:

Related

Published by P. L.

Leave a Comment Cancel reply