Data enrichment at ingest-time, not search time, with Cribl
Disclaimer: I’m a Splunk employee, and I’m not a Cribl customer, but I do know the founders (including the author of the blog post). I figured I’d write this exploration up here rather than as an exceedingly long Twitter thread. My reactions are all to the content of the blog post, not actual use of the product.
If I’m reading this blog post from Cribl correctly, their product makes it easy to enrich events with metadata at ingest-time. This is relevant/exciting for me because when I’m ingesting music data for my side project, I’m only ever getting the initial slice that’s available from a specific REST endpoint or in a file.
I’ve been identifying and collecting additional data sources that I want to enrich my dataset with, but doing so requires extra calls to other endpoints in the same API, or other web services, which means I then need to figure out where I want to store all of that data.
It quickly turns into an architectural and conceptual headache that I delay handling, because I know I’d either be dumping a lot of data into lookups / the KV store, or having to seriously level up my Python skills and do data processing and enrichment in my code before sending it to Splunk Enterprise.
As a specific example, I use the Last.fm getRecentTracks endpoint to send my listening data to Splunk Enterprise, but to enrich that data with additional metadata like track duration or album release date, I’d have to hit two additional endpoints (track.getInfo and album.getInfo, respectively).
Deciding when in the data processing pipeline to hit those endpoints, how to hit them, and where to store that information to enrich my events has been a struggle that I’ve been avoiding dealing with.
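To make the per-event enrichment concrete, here’s a minimal Python sketch of what that extra call might look like. The function names and the shape of the event dict are my own assumptions; the track.getInfo method and the duration field (milliseconds, as a string) come from the public Last.fm API. The fetcher is injected so the merge logic is visible without real network calls, and error handling and rate limiting are omitted.

```python
# Sketch: enrich a single listening event with track duration at ingest time.
# `fetch_track_info` stands in for a real call to the Last.fm track.getInfo
# endpoint (http://ws.audioscrobbler.com/2.0/?method=track.getInfo&...);
# injecting it keeps the enrichment logic testable offline.

def enrich_event(event, fetch_track_info):
    """Return a copy of `event` with duration_ms merged in, if available."""
    info = fetch_track_info(artist=event["artist"], track=event["track"])
    enriched = dict(event)
    # Last.fm reports track duration in milliseconds, as a string.
    duration = info.get("track", {}).get("duration")
    if duration is not None:
        enriched["duration_ms"] = int(duration)
    return enriched


# A stand-in fetcher for illustration (real code would hit the API):
def fake_fetch(artist, track):
    return {"track": {"duration": "214000"}}


event = {"artist": "Carly Rae Jepsen", "track": "Run Away With Me"}
print(enrich_event(event, fake_fetch))
# → {'artist': 'Carly Rae Jepsen', 'track': 'Run Away With Me', 'duration_ms': 214000}
```

The catch, of course, is that this is one extra HTTP round-trip per event per metadata source, which is exactly the pipeline-design question I keep putting off.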
Collecting the metadata once and storing it in a lookup or the KV store does have an advantage: the metadata is relatively static, so it’s straightforward to call an endpoint, collect the data, and store it somewhere for when I need it. I then have the added flexibility to enrich my events with the extra data at search time when I want to, but not otherwise.
However, this means I’m making conceptual decisions at multiple points: when I collect the data, when I decide what format to store it in (and where), and when I enrich events at search time. It’s a lot of added complexity, but this type of enrichment doesn’t affect the size of my originally-indexed events, though the metadata might end up being indexed separately instead.
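The fetch-once approach could be sketched like this: walk the distinct tracks in my listening data, call the metadata endpoint once per track, and write the results out as a CSV that Splunk can use as a lookup at search time. The field names and output path are illustrative assumptions, and the fetcher is again a stand-in for a real track.getInfo call.

```python
# Sketch: collect static track metadata once and write it as a CSV that
# could back a Splunk lookup at search time. Field names and path are
# assumptions, not anything Splunk or Last.fm mandates.
import csv

def build_lookup(tracks, fetch_track_info, path="track_metadata.csv"):
    """Fetch metadata for each (artist, track) pair and write a lookup CSV."""
    seen = set()
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["artist", "track", "duration_ms"])
        writer.writeheader()
        for artist, track in tracks:
            if (artist, track) in seen:  # metadata is static, so fetch once
                continue
            seen.add((artist, track))
            info = fetch_track_info(artist=artist, track=track)
            writer.writerow({
                "artist": artist,
                "track": track,
                "duration_ms": info.get("track", {}).get("duration", ""),
            })
    return path
```

With the CSV configured as a lookup (say, named track_metadata), a search could then join the metadata on demand with something like `| lookup track_metadata artist, track OUTPUT duration_ms` — enrichment only when I ask for it, at the cost of maintaining the lookup.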
But with Cribl’s solution, I’d be making that choice once. That does mean I lose potential flexibility about when and which events I can enrich with the data, but it also means that the conceptual decisions aren’t something I have to belabor. I can enrich my listening data at ingest-time with additional metadata about the album, artist, and track, then send it on to be indexed. Then when I’m searching and want to perform additional work with the metadata, it’s all right there with my events already.
This is a convenient, if imperfect, solution for my use case. But my use case is pretty basic: enrich events with static information that might be shared across many events. That’s a use case with a lot of potential solutions. I could use this approach if I didn’t care about reducing the amount of data that I indexed to the bare minimum, and focused instead on convenience and context for my data ingestion, allowing me to save time when searching my data.
This solution is much more exciting for use cases other than mine, where you’re enriching events with dynamic information that is relevant and true only for specific events at index-time. The blog post includes an example of this: combining web access logs with context from proxy logs, shortening time-to-discovery for investigations that rely on web access logs.
There is flexibility in combining data at search time, but there is complexity in that approach as well. Cribl shows that there is convenience in creating that context at index-time, too.