Key Differences in Treating Data in Motion (Confluent Focus) 

By Ankita Shephali | @intelia | November 11


In today’s age, data is omnipresent, and over time we have realised just how valuable it is. We are not only consumers of data but also producers of it. Thanks to mobile technology such as smartphones and tablets, the world now creates an estimated 2.5 quintillion bytes of data per day. People are not the only contributors to data creation; there are many other sources as well, such as fitness bands, sensors, trackers, banking transactions, payment processors, retail transactions, financial data, and regulatory and compliance programs. With this huge amount of data coming in, we also need to process it in a way that lets us make sense of it, utilise it and benefit from it.


What is streaming data?

Data can be broadly categorised as bounded (finite) or unbounded (infinite). In some scenarios the data is static: we receive the entire data set at the start, typically accumulated over a period of time. This is batch data. In other scenarios the data is dynamic: there is a continuous flow of data from sources that are subject to change, so it is impossible to store the entire data set before processing starts. This is streaming data. Because streaming data is continuous, it captures events and changes as and when they occur at the source. Its real-time nature allows data experts to deliver continuous insights to business users across the organisation, helping decision makers react and respond to crisis events much more quickly.
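The distinction can be made concrete with a toy sketch: a batch computation waits for the whole bounded data set, while a streaming computation emits a result after every event. The payment amounts below are hypothetical example data.

```python
def batch_total(events):
    """Batch: the whole (bounded) data set is available before processing."""
    return sum(events)

def streaming_totals(events):
    """Streaming: process each event as it arrives and emit a running result."""
    running = 0
    for event in events:          # `events` may be an unbounded iterator
        running += event
        yield running             # a fresh insight after every event

payments = [120, 45, 300]         # hypothetical payment amounts

print(batch_total(payments))              # 465 -- one answer, after all data is in
print(list(streaming_totals(payments)))   # [120, 165, 465] -- an answer per event
```

In a real system the streaming iterator never ends, which is exactly why results must be produced incrementally rather than after the fact.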

As businesses strive to gain an edge over their competitors, their ability to receive, understand and act on this data in real time is becoming ever more important.

How is streaming data treated differently from batch data?

Since streaming and batch data are different in nature, it would be unrealistic to expect to process both the same way. Batch processing extracts data sets from sources, transforms them to make them useful, and then loads them into a destination system. It works well for reporting and for applications that can tolerate a latency of hours or even days before data becomes available downstream. Streaming data pipelines, by contrast, execute continuously, all the time. They consume streams of messages; apply transformations, filters, aggregations or joins to the messages; and publish the processed messages to another stream. Typically, streaming data pipelines are deployed with:

  • A data source connector, which takes care of extracting data change events from a data store and writing them into the stream consumed by the data pipeline
  • A data sink connector, which extracts processed messages from the stream filled by the data pipeline and publishes them to a data store
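The source → transform → sink shape described above can be sketched, framework-free, as three composable stages. All names here are illustrative placeholders, not a real connector API:

```python
def source_connector(change_events):
    """Extracts change events from a data store and writes them to a stream."""
    for event in change_events:
        yield event

def transform(stream):
    """Applies a filter and a transformation to each message, continuously."""
    for event in stream:
        if event["amount"] > 0:                      # filter out empty events
            yield {**event, "amount_cents": event["amount"] * 100}

def sink_connector(stream, store):
    """Reads processed messages from the stream and publishes them to a store."""
    for event in stream:
        store.append(event)

changes = [{"id": 1, "amount": 5}, {"id": 2, "amount": 0}]   # hypothetical events
destination = []
sink_connector(transform(source_connector(changes)), destination)
print(destination)   # [{'id': 1, 'amount': 5, 'amount_cents': 500}]
```

Because every stage is a generator, messages flow through one at a time; nothing waits for the full data set to arrive, which is the defining property of a streaming pipeline.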

Why is Confluent a good way to deploy/utilise a streaming data solution?

Many data streaming platforms are available today that facilitate ingesting, storing and analysing continuously streaming data in real time. Confluent is one such platform: a complete, enterprise-grade distribution of Apache Kafka that lets users connect, process and react to data in real time.

At the core of Confluent Platform is Apache Kafka, the most popular open-source distributed streaming platform. The key capabilities of Kafka are:

  • Publish and subscribe to streams of records
  • Store streams of records in a fault-tolerant way
  • Process streams of records
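These three capabilities rest on one abstraction: a topic is an append-only log, and each subscriber tracks its own read position. The toy in-memory model below illustrates that idea only; it is not the Kafka API.

```python
class Topic:
    """Toy stand-in for a Kafka topic: an ordered, append-only record log."""

    def __init__(self):
        self.log = []                      # stored records, in publish order

    def publish(self, record):
        self.log.append(record)
        return len(self.log) - 1           # offset of the new record

class Subscriber:
    """Toy consumer: reads the log at its own pace via a private offset."""

    def __init__(self, topic):
        self.topic, self.offset = topic, 0

    def poll(self):
        """Return unread records and advance this subscriber's offset."""
        records = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return records

clicks = Topic()
clicks.publish({"user": "a", "page": "/home"})
analytics, audit = Subscriber(clicks), Subscriber(clicks)
print(analytics.poll())   # each subscriber reads the full stream
print(audit.poll())       # independent offset: same records, read separately
```

Because records stay in the log after being read, many independent consumers can process the same stream, which is what makes the publish/subscribe, storage and processing capabilities composable.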

Confluent expands the benefits of Kafka with enterprise-grade features while removing much of the burden of Kafka management and monitoring. By integrating historical and real-time data into a single, central source of truth, Confluent makes it easy to build an entirely new category of modern, event-driven applications, gain a universal data pipeline, and unlock powerful new use cases with full scalability, performance and reliability.

Confluent Platform lets you focus on deriving business value from your data rather than worrying about underlying mechanics, such as how data is transported or integrated between disparate systems. Specifically, it simplifies connecting data sources to Kafka (via built-in connectors for everything from legacy data platforms such as Oracle and SQL Server through to next-generation and SaaS applications like Salesforce and Zendesk), building streaming applications, and securing, monitoring and managing your Kafka infrastructure.

Today, the Confluent Platform is used for a wide array of use cases across numerous industries, from financial services, omnichannel retail and autonomous cars to fraud detection, microservices and IoT.
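To give a flavour of the connector-based approach, a Kafka Connect source connector is declared as configuration rather than code. A hypothetical JDBC source pulling an Oracle table into a Kafka topic might look like this (the connector name, host, table and column names are placeholders):

```json
{
  "name": "orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:oracle:thin:@//db-host:1521/ORCL",
    "table.whitelist": "ORDERS",
    "mode": "incrementing",
    "incrementing.column.name": "ORDER_ID",
    "topic.prefix": "oracle-"
  }
}
```

With `mode` set to `incrementing`, the connector polls the table and emits only rows whose `ORDER_ID` is higher than the last one seen, turning a static database table into a stream of change events with no custom extraction code.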