Wednesday, June 3, 2015

Amazon Kinesis Update – Simplified Capture of Streaming Data

Amazon Kinesis is a managed service designed to handle real-time streaming of big data. It can accept any amount of data, from any number of sources, scaling up and down as needed (see my introductory post for more information on Kinesis). Developers can use the Kinesis Client Library (KCL) to simplify the implementation of apps that consume and process streamed data.

Today we are making the capture of streaming data with Kinesis even easier, with a powerful new Kinesis Producer Library, a big step up in the maximum record size, and a price reduction that makes capture of small-sized records even more cost-effective.

Let’s take a closer look!

Increased Record Size
A Kinesis record is simply a blob of data, also known as the payload. Kinesis does not look inside the data; it simply accepts the record (via PutRecord or PutRecords) from a producer and puts it into a stream.

We launched Kinesis with support records that could be as large as 50 KB. With today’s update we are raising this limit by a factor of 20; individual records can now be as large as 1 MB. This gives you a lot more flexibility and opens the door to some interesting new ways to use Kinesis. For example, you can now send larger log files, semi-structured documents, email messages, and other data types without having to split them in to small chunks.

Price Reduction for Put Calls
Up until now, pricing for Put operations was based on the number of records, with a charge of $0.028 for every million records.

Going forward, pricing for Put operations will be based on the size of the payload, as expressed in “payload units” of 25 KB. The charge will be $0.014 per million units. In other words, Putting small records (25 KB or less) now costs half as much as it did before. The vast majority of our customers use Kinesis in this way today and they’ll benefit from the price reduction.

For more information, take a look at the Kinesis Pricing page.

Kinesis Producer Library (KPL)
I’ve saved the biggest news for last!

You can use Kinesis to handle the data streaming needs of many different types of applications including websites (clickstream data), ad servers (publisher data), mobile apps (customer engagement data), and so forth.

In order to achieve high throughput, you should combine multiple records into a single call to PutRecords. You should also consider aggregating multiple user records into a single Kinesis record, and then de-aggregating them immediately prior to consumption. Finally, you will need code to detect and retry failed calls.

The new Kinesis Producer Library (KPL) will help you with all of the tasks that I identified above. It will allow you to write to one or more Kinesis streams with automatic and configurable retry logic; collect multiple records and write them in batch fashion using PutRecords; aggregate user records to increase payload size and throughput, and submit Amazon CloudWatch metrics (including throughput and error rates) on your behalf.

The KPL plays well with the Kinesis Client Library (KCL). The KCL takes care of many of the more complex tasks associated with consuming and processing streaming data in a distributed fashion, including load balancing across multiple instances, responding to instance failures, checkpointing processed records, and reacting to chances in sharding.

When the KCL receives an aggregated record with multiple KPL user records inside, it will automatically de-aggregate the records before making them available to the client application (you will need to upgrade to the newest version of the KCL in order to take advantage of this feature).

The KPL presents a Java API that is asynchronous and non-blocking; you simply hand records to it and receive a Future object in return Here’s a sample call to the addUserRecord method:

public void run() {
  ByteBuffer data = Utils.generateData(sequenceNumber.get(), DATA_SIZE);
  // TIMESTAMP is our partition key
  ListenableFuture f =
    producer.addUserRecord(STREAM_NAME, TIMESTAMP, Utils.randomExplicitHashKey(), data);
  Futures.addCallback(f, callback);
}

The core of the KPL takes the form of a C++ module; wrappers in other languages will be available soon.

KPL runs on Linux and OSX. Self-contained binaries are available for the Amazon Linux AMI, Ubuntu, RHEL, OSX, and OSX Server. Source code and unit tests are also available (note that the KCL and the KPL are made available in separate packages).

For more information, read about Developing Producers with KPL.

Jeff;

No comments:

Post a Comment