Because Hadoop is a batch processing framework, it is hard to get real-time analytics out of big data with Hadoop alone. Apache Spark addresses this limitation by providing distributed computation in which events can be processed in a streaming fashion. In this post, we will explore Spark's streaming capability to process Flume event data and generate the top search query strings, or the top product views, of the last hour.
This post continues the previous posts in the series:
- Customer product search clicks analytics using big data
- Flume: Gathering customer product search clicks data using Apache Flume
- Hive: Query customer top search query and product views count using Apache Hive
- ElasticSearch-Hadoop: Indexing product views count and customer top search query from Hadoop to ElasticSearch
- Oozie: Scheduling Coordinator/Bundle jobs for Hive partitioning and ElasticSearch indexing
So far we have used Hadoop's batch capabilities to process large amounts of data. Batch processing, however, introduces latency, which may or may not be acceptable depending on your data. This is where Spark comes into the picture. Here we will explore Spark's streaming capability to get real-time analytics that can be used on the website for display or for monitoring.
Spark
Apache Spark “is a fast and general engine for large-scale data processing.”
Functionality
As shown in the examples above, we have the customer search clicks data available to us. We have a Flume system in place to collect the data and store it in Hadoop for later processing. Now consider a scenario in which you want to display real-time customer behavior on the website, showing what other customers are doing:
- What are other customers searching for?
  - Other customers are also searching for…
  - Top search query strings on the website in the last hour
- What are other customers viewing?
  - Other customers are also viewing these products…
  - Top product views in the last hour
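To make the "top N in the last hour" idea concrete, here is a minimal sketch in plain Python. It does not use the Spark or Flume APIs; it only illustrates the sliding-window counting that a streaming job would have to perform. The class name `SlidingTopN`, its methods, and the sample query strings are hypothetical, and the sketch assumes that events arrive in timestamp order.

```python
from collections import Counter, deque

WINDOW_SECONDS = 3600  # "the last hour"


class SlidingTopN:
    """Counts keys (e.g. search query strings) seen within a sliding time window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.events = deque()    # (timestamp, key) pairs, oldest first
        self.counts = Counter()  # key -> occurrences inside the window

    def add(self, timestamp, key):
        # Assumption: timestamps are in seconds and never decrease.
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that are a full window (or more) older than `now`.
        while self.events and self.events[0][0] <= now - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n, now):
        # Return the n most frequent keys still inside the window.
        self._evict(now)
        return self.counts.most_common(n)


searches = SlidingTopN()
searches.add(0, "laptop")
searches.add(10, "phone")
searches.add(20, "laptop")
print(searches.top(2, now=30))  # [('laptop', 2), ('phone', 1)]
```

The same structure would apply to product views by feeding product identifiers instead of query strings. In a distributed streaming engine the counting is spread across nodes and batches, but the result exposed to the website is the same kind of ranked list.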