Case Study: Complex Data Transformation Using Real-time Stream Processing (2024)

If you’ve been onboarding data to Splunk for any amount of time, you have likely encountered formatting that initially appears straightforward but quickly becomes complicated as you dig deeper into the use case with your stakeholders. In this post, I’ll introduce you to a data scenario and the challenges and opportunities it presents, and then share a new way to make that data more valuable and usable.

At the surface level, onboarding data seems straightforward: where do I need to break lines? Where is the appropriate timestamp and what's its format? And if you are performance-minded, you'll be thinking about all of those "magic 8" settings. Over time, you see patterns emerge and these settings become fairly trivial to define.
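For readers newer to onboarding, those "magic 8" are the event-breaking and timestamp settings commonly called out in Splunk data-onboarding guidance. As a rough, illustrative sketch only (the sourcetype name is a placeholder of mine and the values are examples rather than tuned recommendations for this feed), a props.conf stanza for an event like the one we'll look at below might resemble:

# Illustrative only: [router_telemetry] is a placeholder sourcetype and these
# values are examples of the commonly cited settings, not a tuned configuration.
[router_telemetry]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
TRUNCATE = 10000
TIME_PREFIX = "timestamp":\s*
TIME_FORMAT = %s%3N
MAX_TIMESTAMP_LOOKAHEAD = 30
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)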

However, when you consider who will be using the data and why, you’ll often find an opportunity to pre-process the data, making it easier to consume and therefore more valuable. Some questions you might consider tackling in addition to line breaking and timestamping:

  • Why is this data valuable?
  • Who is searching the data?
  • Who are the consumers of the output?
  • What searches or results are most important to the consumers?
  • Is the raw data itself valuable, or are summaries enough? Metrics or events?
  • Is the data sensitive? Do we need to redact, deduplicate, or enrich it?

We’ll look at our stakeholder's feedback on these questions after we examine the raw data source. In the sample below, assume that those initial onboarding best practices have been followed and that what we're left with is a well-formatted JSON event.

{ "device": { "deviceId": "127334527887", "deviceSourceId": "be:f3:af:c2:01:f1", "deviceType": "IPGateway" }, "timestamp": 1489095004000, "rawAttributes": { "WIFI_TX_2_split": "325,650,390,150,150,780,293,135,325", "WIFI_RX_2_split": "123,459,345,643,234,534,123,134,656", "WIFI_SNR_2_split": "32, 18, 13, 43, 32, 50, 23, 12, 54", "ClientMac_split": "BD:A2:C9:CB:AC:F3,9C:DD:45:B1:16:53,1F:A7:42:DE:C1:4B,40:32:5D:4E:C3:A1,80:04:15:73:1F:D9,85:B2:15:B3:04:69,34:04:13:AA:4A:EC,4D:CB:0F:6B:3F:71,12:2A:21:13:25:D8" }}

At first glance, this onboarded data looks great:

  • We have a structured format.
  • Splunk will expose field names for easy data discovery.
  • The timestamp has its own field, so we can easily designate it as the event time for our record.

And now the extra detail from our stakeholder:

"These events are from our router. The device field at the top describes the router itself, and then the rawAttributes describes all of the downstream devices (ClientMac_split) that connect to the router and their respective performance values like transmit, receive, and signal to noise values. We want to be able to report on these individual downstream devices and associate those individual devices with the router that serviced them as well as investigate the metrics over time. We use this data to triage customer complaints and over time, improve the resiliency of our network.

This context helps us make some key decisions:

  • We now know that the SPL required to process this data at search time would be extensive, possibly complex, and would have to be executed every time the data is searched. We should pre-process these events, turning a single record containing many values into distinct records that carry the pertinent metadata. This simplifies the end-user search experience and reduces resource utilization (a minimal sketch of that reshaping follows this list).
  • Near-real-time data for investigations is important. The key data are the performance metrics, and those metrics get their dimensionality from within the record. As part of pre-processing, we should ensure that the data is consumed by Splunk as metrics, with the proper dimensions attached. The resulting metrics improve search performance through fast mstats commands and reduce time to investigate through simpler searches on more timely data.
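To make the reshaping concrete, here is a minimal Python sketch of the logic, using the sample event from above. This is an illustration only, not the stream pipeline itself, and the output field names (device_id, client_mac, wifi_tx, and so on) are placeholders of mine.

import json

# The sample router event shown earlier.
event = {
    "device": {
        "deviceId": "127334527887",
        "deviceSourceId": "be:f3:af:c2:01:f1",
        "deviceType": "IPGateway",
    },
    "timestamp": 1489095004000,
    "rawAttributes": {
        "WIFI_TX_2_split": "325,650,390,150,150,780,293,135,325",
        "WIFI_RX_2_split": "123,459,345,643,234,534,123,134,656",
        "WIFI_SNR_2_split": "32, 18, 13, 43, 32, 50, 23, 12, 54",
        "ClientMac_split": "BD:A2:C9:CB:AC:F3,9C:DD:45:B1:16:53,1F:A7:42:DE:C1:4B,"
                           "40:32:5D:4E:C3:A1,80:04:15:73:1F:D9,85:B2:15:B3:04:69,"
                           "34:04:13:AA:4A:EC,4D:CB:0F:6B:3F:71,12:2A:21:13:25:D8",
    },
}

attrs = event["rawAttributes"]

# Each *_split attribute is a comma-separated list, and position i in every
# list describes the same downstream client.
def split_attr(key):
    return [value.strip() for value in attrs[key].split(",")]

macs = split_attr("ClientMac_split")
tx = split_attr("WIFI_TX_2_split")
rx = split_attr("WIFI_RX_2_split")
snr = split_attr("WIFI_SNR_2_split")

# One router event becomes one record per downstream client, with the router
# and client identifiers kept as dimensions on every record.
records = [
    {
        "timestamp": event["timestamp"],
        "device_id": event["device"]["deviceId"],
        "client_mac": mac,
        "wifi_tx": float(t),
        "wifi_rx": float(r),
        "wifi_snr": float(s),
    }
    for mac, t, r, s in zip(macs, tx, rx, snr)
]

print(json.dumps(records[0], indent=2))  # nine per-client records from one event

That is exactly the shape we want the stream pipeline to emit, so end users never have to do this splitting in SPL at search time.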

While some of this can be done with traditional props or transforms, either on a heavyweight forwarder or on the indexers themselves, there is (I think) a better way to address these requirements. Stream processing, whether with the Data Stream Processor (DSP) on-prem or Stream Processor Services (SPS) on Splunk Cloud, offers us the ability to author powerful data pipelines to solve these complex data-processing challenges.

With stream processing, we can use familiar search processing language to apply the needed transformations in the stream before the data is indexed. This will remove complexity from the data, reduce search-time and index-time resource consumption, and improve data quality.
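To make "consumed as metrics with the proper dimensions attached" concrete, here is a small illustrative Python sketch that expresses one of those per-client records as individual metric data points in the HTTP Event Collector metrics shape. The metric names and the wifi_metrics index are assumptions of mine, and in this scenario the stream pipeline, not hand-written code, would be doing this work.

import json

# One reshaped per-client record (see the earlier sketch).
record = {
    "timestamp": 1489095004000,
    "device_id": "127334527887",
    "client_mac": "BD:A2:C9:CB:AC:F3",
    "wifi_tx": 325.0,
    "wifi_rx": 123.0,
    "wifi_snr": 32.0,
}

# Each measurement becomes its own metric data point: a metric_name/_value
# pair plus the identifying fields as dimensions that mstats can group by.
data_points = [
    {
        "time": record["timestamp"] / 1000,  # epoch seconds
        "event": "metric",
        "index": "wifi_metrics",             # assumed metrics index
        "fields": {
            "metric_name": metric,
            "_value": record[metric],
            "device_id": record["device_id"],
            "client_mac": record["client_mac"],
        },
    }
    for metric in ("wifi_tx", "wifi_rx", "wifi_snr")
]

print(json.dumps(data_points, indent=2))

Once the data lands in a metrics index in that shape, consumers can aggregate with mstats by client_mac or device_id without any splitting or parsing at search time.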

You can learn more about Splunk stream processing here or here. Then follow me over to Splunk Lantern for the step-by-step walkthrough of the pipeline I created to address this fun challenge of aligning the incoming data to business value.

So what do you think? Have you had similar data challenges? Let me know below in the comments, as I’d love to hear about your use cases!

Nick Zambo, Platform Architect
