Jai’s Weblog – Tech, Security & Fun…

  • Jaibeer Malik

Archive for the ‘Flume’ Category

HBase: Generating search click events statistics for customer behavior

Posted by Jai on July 9, 2014

In this post we explore using HBase to store customer search click event data and using that data to derive customer behavior information from search query strings and facet filter clicks. We cover MiniHBaseCluster, HBase schema design, and Flume integration using HBaseSink to store JSON data.

In the previous posts we explored storing search click event data in Hadoop and querying it with different technologies. Here we will use HBase to achieve the same:

  •  HBase mini cluster setup
  •  HBase template using Spring Data
  •  HBase Schema Design
  •  Flume Integration using HBaseSink
  •  HBaseJsonSerializer to serialize json data
  •  Query the top 10 search query strings in the last hour
  •  Query the top 10 search facet filters in the last hour
  •  Get the recent search query strings for a customer in the last 30 days
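
To make the "recent queries for a customer" item concrete, here is a minimal plain-Java sketch of one common HBase row key pattern: the customer id followed by a reversed timestamp, so that a customer's newest events sort first. This is not necessarily the schema used in the full post; the key layout, the class and the method names are assumptions for illustration, and a TreeMap stands in for HBase's sorted rows so the example runs without a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Sketch only: a TreeMap emulates HBase's lexicographically sorted row keys. */
final class RowKeySketch {

    /** Assumed key layout: customerId + "-" + (Long.MAX_VALUE - eventTimeMillis), zero-padded to 19 digits. */
    static String rowKey(String customerId, long eventTimeMillis) {
        return customerId + "-" + String.format("%019d", Long.MAX_VALUE - eventTimeMillis);
    }

    /** Returns the customer's query strings with event time >= sinceMillis, most recent first. */
    static List<String> recentQueries(TreeMap<String, String> table, String customerId, long sinceMillis) {
        String start = rowKey(customerId, Long.MAX_VALUE); // smallest possible key for this customer
        String stop = rowKey(customerId, sinceMillis);      // key of the oldest event still wanted
        // A range read over sorted keys, similar in spirit to a start/stop row scan.
        return new ArrayList<>(table.subMap(start, true, stop, true).values());
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        long now = 1_000_000L;
        table.put(rowKey("c1", now - 10), "red shoes");
        table.put(rowKey("c1", now - 500), "blue jeans");
        table.put(rowKey("c1", now - 5000), "old query");
        table.put(rowKey("c2", now - 20), "laptop");

        // Only c1's events from the last 1000 ms, newest first.
        System.out.println(recentQueries(table, "c1", now - 1000)); // prints [red shoes, blue jeans]
    }
}
```

Because the reversed timestamp grows as events get older, a range read that starts at the customer's smallest key returns the newest events first and can stop as soon as it passes the cutoff.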


Read the rest of this entry »


Posted in Architecture, Flume, Hadoop, HBase, Java, Spring Data | 1 Comment »

Spark: Real time analytics for big data for top search queries and top product views

Posted by Jai on June 4, 2014

Because Hadoop is a batch processing framework, getting real-time analytics out of big data with it is hard. Apache Spark addresses this with distributed computation and the ability to process events as a stream. In this post we explore Spark's streaming capability to process Flume event data and generate the top search query strings or the top product views of the last hour.

In the previous posts we used Hadoop's batch capabilities to process large amounts of data. Batch processing introduces latency, though, and how much depends on your data. This is where Spark comes into the picture. We will explore Spark's streaming capability here to get real-time analytics that can be displayed on the website or used for monitoring.


Apache Spark “is a fast and general engine for large-scale data processing.”


As shared in the examples above, the customer search click data is available to us, and Flume is in place to process it and store it in Hadoop for later processing. Suppose you want to display real-time customer behavior on the website, showing what other customers are doing:

  • What are other customers searching for?
  • Other customers also searched for…
  • Top search query strings on the website in the last hour
  • What are other customers viewing?
  • Other customers also viewed these products…
  • Top product views in the last hour
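
The "top search query strings in the last hour" computation can be sketched in plain Java as follows. This is not Spark code and not the implementation from the full post; it only shows the single-JVM logic (filter events to a time window, count per query, sort by count) that a streaming job would run over a sliding window. The `SearchEvent` class and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: windowed top-N counting over an in-memory list of events. */
final class TopQueriesSketch {

    /** Hypothetical event shape: a query string and the time it was issued. */
    static final class SearchEvent {
        final String query;
        final long timeMillis;

        SearchEvent(String query, long timeMillis) {
            this.query = query;
            this.timeMillis = timeMillis;
        }
    }

    /** Returns up to n query strings seen in (nowMillis - windowMillis, nowMillis], most frequent first. */
    static List<String> topQueries(List<SearchEvent> events, long nowMillis, long windowMillis, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (SearchEvent e : events) {
            if (e.timeMillis > nowMillis - windowMillis && e.timeMillis <= nowMillis) {
                counts.merge(e.query, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        long now = 10 * hour;
        List<SearchEvent> events = Arrays.asList(
                new SearchEvent("shoes", now - 1_000),
                new SearchEvent("laptop", now - 2_000),
                new SearchEvent("shoes", now - 3_000),
                new SearchEvent("jeans", now - 4_000),
                new SearchEvent("laptop", now - 5_000),
                new SearchEvent("shoes", now - 6_000),
                new SearchEvent("tent", now - 2 * hour),  // outside the window, ignored
                new SearchEvent("tent", now - 3 * hour)); // outside the window, ignored

        System.out.println(topQueries(events, now, hour, 2)); // prints [shoes, laptop]
    }
}
```

The same shape applies to top product views: replace the query string with a product id and keep the window and counting logic unchanged.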

Read the rest of this entry »

Posted in Flume, Hadoop, Java, Spark | 2 Comments »

Flume: Gathering customer product search clicks data using Apache Flume

Posted by Jai on May 19, 2014

This post covers using Apache Flume to gather customer product search clicks and store the information using Hadoop and ElasticSearch sinks. The data may consist of different product search events such as filtering on different facets, sorting, and pagination, as well as the products viewed and those marked as favorites by customers. In later posts we will analyze the data further and use it for display and analytics.

Product Search Functionality

Any eCommerce platform offers different products to customers, and search functionality is one of its basics. Guided navigation using different facets/filters and free-text search over the content are standard parts of any existing search functionality.


Consider a similar scenario where a customer searches for a product and we capture the product search behavior with the following information:

Read the rest of this entry »

Posted in ElasticSearch, Flume, Hadoop, Java | 5 Comments »

Customer product search clicks analytics using big data

Posted by Jai on May 14, 2014

The application demonstrates how to set up customer product search click analytics using big data technologies: Hadoop, Hive, Pig, Oozie, ElasticSearch, Akka, Spring Data, etc.

Github Repository

URL: https://github.com/jaibeermalik/searchanalytics-bigdata

Analyzing Search Clicks Data Using Flume, Hadoop, Hive, Pig, Oozie, ElasticSearch, Akka, Spring Data.

The repository contains unit/integration test cases that generate analytics from click events related to product search on an e-commerce website.


Getting Started

The project is a Maven project and can be built with Eclipse. Check the pom dependencies for the relevant version of each application. It uses the Cloudera Hadoop distribution, version 2.3.0-cdh5.0.0.


The scenario covered in the application for search analytics using big data is as follows:
Read the rest of this entry »

Posted in Akka, ElasticSearch, Flume, Hadoop, Hive, Java, Oozie, Pig, Spring, Spring Data | 5 Comments »
