Written by Chris Latimer, Vice President, Product Management, DataStax
There is a lot of talk right now about the importance of data streams and event-driven architectures. You may have heard of them, but do you really know why they matter to so many organizations? Streaming technologies unlock the ability to capture insights and take instant action on the data flowing into your organization; they are a basic building block for developing applications that can respond in real time to user actions, security threats, or other events. In other words, they're an essential part of building great customer experiences and increasing revenue.
Here’s a quick breakdown of what streaming technologies do, and why they’re so important to organizations.
Data is in motion
Organizations have become very good at creating a relatively complete view of what’s called “data at rest” – the kind of information that is often captured in databases, data warehouses, and even data lakes for immediate (“real-time”) use or to feed applications and analysis later.
Increasingly, though, data driven by activities, actions, and events that occur in real time flows across the enterprise from mobile devices, retail systems, sensor networks, and call-routing systems.
While this data in motion may eventually be captured in a database or other store, much of its value lies in acting on it while it's still moving. For a bank, data in motion might make it possible to detect fraud in real time and act on it instantly. Retailers can recommend products based on a consumer's search or purchase history, or at the moment someone visits a webpage or clicks on a specific item.
Take, for example, Overstock, an online retailer in the United States. It needs to consistently deliver engaging customer experiences and capitalize on monetization opportunities in the moment. In other words, Overstock sought the ability to make very fast decisions based on data arriving in real time (in general, brands have about 20 seconds to engage customers before they move on to another website).
“It’s like a self-driving car,” says Thor Sigurjonsson, Head of Data Engineering at Overstock. “If you wait for feedback, you’ll drive off the road.”
To maximize the value of its data as it's generated, instead of waiting hours, days, or even longer to analyze it once it's at rest, Overstock needed a streaming and messaging platform that would enable real-time decision-making: delivering personalized experiences and recommending the products most likely to appeal to customers at the right time (really fast, in other words).
Data messaging and streaming are an essential part of an event-driven architecture, a software architecture or programming approach built around the capture, communication, processing, and persistence of events: mouse clicks, sensor outputs, and the like.
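To make that concrete, here is a minimal sketch of the producing side of an event-driven architecture, using the Apache Pulsar Python client (fitting, given the platforms discussed below, though any messaging client would do). The broker address and the 'clickstream' topic are assumptions for illustration only:

```python
import json
import time

import pulsar

# Connect to a Pulsar broker (assumes one is running locally on the default port).
client = pulsar.Client('pulsar://localhost:6650')

# 'clickstream' is a hypothetical topic name for this illustration.
producer = client.create_producer('clickstream')

# Publish a click event the moment it happens, rather than batching it for later.
event = {'user_id': 42, 'action': 'click', 'item': 'sku-1234', 'ts': time.time()}
producer.send(json.dumps(event).encode('utf-8'))

client.close()
```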
Stream processing involves taking action on a series of data points originating from a system that is constantly generating events. The ability to query that continuous stream, find anomalies, recognize that something important has happened, and act on it quickly and meaningfully is what streaming technology enables.
This stands in contrast to batch processing, where an application collects data, processes it, and then stores the result or forwards it to another application or tool. Processing may not start until after, say, 1,000 data points have been collected. That is too slow for the kind of apps that require engagement at the point of interaction.
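To sketch the streaming side of that contrast, here is what acting on each event as it arrives might look like with the Pulsar Python client. The 'transactions' topic, subscription name, and fraud rule are made up for illustration, and the same local broker as above is assumed:

```python
import json

import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Subscribe to a hypothetical 'transactions' topic.
consumer = client.subscribe('transactions', subscription_name='fraud-check')

while True:
    msg = consumer.receive()         # blocks until the next event arrives
    event = json.loads(msg.data())
    # Act on each event immediately -- e.g., flag a suspicious amount (made-up rule),
    # instead of waiting for a batch of 1,000 events to accumulate.
    if event.get('amount', 0) > 10_000:
        print(f"possible fraud: {event}")
    consumer.acknowledge(msg)        # tell the broker this event was handled
```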
It is worth pausing to unpack this idea:
- The point of interaction can be a mobile app or a system making an API call.
- Engagement means adding value to an interaction. That can be giving a customer a tracking number after they place an order, recommending a product based on a user's browsing history, or authorizing a payment or a service upgrade.
- Reaction means the engagement happens in real time or near-real time. That translates to hundreds of milliseconds for human interactions, while machine-to-machine interactions, such as those in a power facility's sensor network, may not require such near-real-time responses.
When the message queue is not enough
Some organizations have realized they need to derive value from their data in motion and have assembled event-driven architectures from a variety of technologies, including message-oriented middleware systems such as Java Message Service (JMS) or message queuing (MQ) platforms.
But these platforms were built on the basic premise that the data being processed was transient and should be discarded as soon as each message was delivered. This throws away a high-value asset: data that can be identified as arriving at a certain point in time. Time-series information is important for applications that involve asynchronous analysis, such as machine learning; data scientists cannot build machine learning models without it. A modern streaming system needs not only to pass events from one service to another, but also to store them so that they retain their value and can be used later.
The system also needs to be able to scale to manage terabytes of data and millions of messages per second. Legacy MQ systems were not designed to do either of these.
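As a sketch of what that retention buys you: Pulsar can replay retained events from an arbitrary point in the past via a reader, which is how a data scientist might pull a historical stream into a training pipeline. This assumes the local broker and the hypothetical 'transactions' topic from the earlier examples, with message retention configured on the namespace:

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# A reader replays retained events from the beginning of the topic --
# for example, to assemble a time-series training set for a machine
# learning model long after the events were first delivered.
reader = client.create_reader('transactions',
                              start_message_id=pulsar.MessageId.earliest)

while reader.has_message_available():
    msg = reader.read_next()
    print(msg.message_id(), msg.data())

client.close()
```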
Pulsar and Kafka: The Old Guard and the Unified Next-Generation Rival
As I touched on above, there are a lot of options out there when it comes to messaging and streaming technology.
These include many open source projects such as RabbitMQ, ActiveMQ, and NATS, along with proprietary solutions such as IBM MQ and Red Hat AMQ. Then there are two popular platforms purpose-built for handling real-time data: Apache Kafka, a widely adopted technology that has become almost synonymous with streaming, and Apache Pulsar, a newer streaming and message queuing platform.
These two technologies are designed to handle the high throughput and scalability demanded by many data-driven applications.
Kafka was developed at LinkedIn to facilitate data communication between the business networking company's different services, and it became an open source project in 2011. Over the years it has become a standard for many organizations looking for ways to extract value from real-time data.
Pulsar was developed at Yahoo! to solve messaging and data problems faced by applications like Yahoo! Mail; it became a top-level open source project in 2018. While it is still catching up with Kafka in popularity, it offers more features and functionality. And it carries a very important distinction: MQ solutions are messaging-only platforms, and Kafka handles only an enterprise's streaming needs, while Pulsar addresses both, making it the only unified platform available.
Pulsar can handle high-speed, real-time use cases the way Kafka does, but it is also a more complete, durable, and reliable solution compared with the older platform. To get both streaming and queuing (an asynchronous communication pattern that enables applications to talk to each other), for example, a Kafka user would need to install something like RabbitMQ or another solution. Pulsar, on the other hand, can handle many of the use cases of a traditional queuing system without add-ons.
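For instance, queue-style work sharing comes built in through Pulsar's shared subscriptions: several consumers attach to one subscription and the broker divides messages among them, like a traditional work queue. A minimal sketch, again assuming a local broker and a hypothetical 'orders' topic:

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# With a Shared subscription, every consumer attached to 'order-workers'
# receives a different subset of the messages -- classic queue semantics.
consumer = client.subscribe(
    'orders',
    subscription_name='order-workers',
    consumer_type=pulsar.ConsumerType.Shared,
)

while True:
    msg = consumer.receive()
    # ... process one unit of work ...
    consumer.acknowledge(msg)
```

Run several copies of this process and the broker spreads the messages across them, with no separate queuing system required.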
Pulsar holds other advantages over Kafka, including higher throughput, better scalability, and geo-replication, which is especially important when a data center or cloud region fails. Geo-replication allows an application to propagate events to another data center without interruption, which keeps the application from going down and prevents outages from affecting end users. (Here is a more technical comparison of Kafka and Pulsar.)
In Overstock's case, Pulsar was chosen as the retailer's streaming platform. With it, the company has built what Sigurjonsson describes as “an integrated layer of data and connected processes governed by a metadata layer that supports the deployment and use of integrated, reusable data in all environments.”
In other words, Overstock now has a way to understand and act on real-time data at the enterprise level, enabling the company to win over its customers with fast, relevant offers and personalized experiences.
As a result, teams can reliably transform data in flight in a way that is easy to use and requires less data engineering. That makes it easier to satisfy customers, and ultimately to generate more profit.
To learn more about DataStax, visit us here.
About Chris Latimer
Chris Latimer is a technical executive whose career spans more than twenty years in a variety of roles, including enterprise architecture, technical pre-sales, and product management. He is currently the Vice President of Product Management at DataStax, where he focuses on building the company's product strategy around cloud messaging and event streaming. Prior to joining DataStax, Chris was a Senior Product Manager at Google, focusing on APIs and API management in Google Cloud. Chris lives near Boulder, CO; when he's not working, he's an avid skater and musician and enjoys the endless variety of outdoor activities that Colorado has to offer with his family.