Posted by Jai on July 9, 2014
In this post we will explore using HBase to store customer search click event data and using it to derive customer behavior information based on search query strings and facet filter clicks. We will cover using MiniHBaseCluster, HBase schema design, and integration with Flume using HBaseSink to store JSON data.
In continuation of the previous posts on,
We have explored how to store search click event data in Hadoop and how to query it using different technologies. Here we will use HBase to achieve the same:
- HBase mini cluster setup
- HBase template using Spring Data
- HBase Schema Design
- Flume Integration using HBaseSink
- HBaseJsonSerializer to serialize JSON data
- Query the top 10 search query strings in the last hour
- Query the top 10 search facet filters in the last hour
- Get the recent search query strings for a customer in the last 30 days
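To make the schema design topic a bit more concrete, here is a minimal sketch of one common HBase row key pattern that suits a "recent searches for a customer" query: prefix the key with the customer id and append a reversed timestamp, so that the newest events for a customer sort first in a scan. The class name `ClickEventRowKey`, the key layout, and the zero-byte separator are assumptions for illustration only, not the schema used in the post, and the sketch uses plain Java rather than the HBase client API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not taken from the original project.
class ClickEventRowKey {

    // Assumed layout: customerId bytes + 0x00 separator + (Long.MAX_VALUE - eventTime)
    // written as 8 big-endian bytes. Reversing the timestamp makes newer events
    // produce smaller keys, so they come first in a lexicographic scan.
    static byte[] rowKey(String customerId, long eventTimeMillis) {
        byte[] id = customerId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + 1 + Long.BYTES)
                .put(id)
                .put((byte) 0)
                .putLong(Long.MAX_VALUE - eventTimeMillis)
                .array();
    }

    // HBase orders row keys as unsigned bytes, lexicographically;
    // this comparison mirrors that ordering for local checks.
    static int compare(byte[] left, byte[] right) {
        return Arrays.compareUnsigned(left, right);
    }
}
```

With a key like this, a scan starting at the customer id prefix would return that customer's most recent events first, which is roughly what the "last 30 days" query needs.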
Read the rest of this entry »
Posted in Architecture, Flume, Hadoop, HBase, Java, Spring Data | Tagged: Flume, HBase, HBaseSink | 1 Comment »
Posted by Jai on June 4, 2014
Hadoop, being a batch processing framework, makes it a little hard to get real-time analytics for big data. Apache Spark overcomes this batch nature by providing distributed computation over events processed in a streaming fashion. In this post, we will explore Spark's streaming capability to process Flume event data and generate the top search query strings or the top product views in the last hour.
In continuation of the previous posts on
We have so far used Hadoop's batch capabilities to process huge amounts of data. But batch operation introduces latency, and how much depends on your data. This is where Spark comes into the picture. We will explore Spark's streaming capability here to get some real-time analytics, which can be used on the website for display or for monitoring.
Spark
Apache Spark “is a fast and general engine for large-scale data processing.”
Functionality
As shared in the examples above, we have the customer search click data available to us. We have a Flume system in place to process the data and store it in Hadoop for later processing. Take a scenario where you want to display real-time customer behavior on the website, showing what other customers are doing:
- What are other customers searching for?
- Other customers are also searching for…
- Top search query strings on the website in the last hour
- What are other customers viewing?
- Other customers are also viewing products…
- Top product views in the last hour
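The "top search query strings in the last hour" idea boils down to a windowed count followed by a sort. Below is a minimal plain-Java sketch of that aggregation, without the Spark Streaming API, just to show the logic. The `SearchEvent` class, its fields, and the `topQueries` method are hypothetical names introduced for illustration and are not from the project; a streaming job would apply the same filter, count, and sort steps over a sliding window.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the windowed top-N aggregation.
class TopQueries {

    // Assumed minimal event shape: the query string and when it was issued.
    static final class SearchEvent {
        final String queryString;
        final long timestampMillis;

        SearchEvent(String queryString, long timestampMillis) {
            this.queryString = queryString;
            this.timestampMillis = timestampMillis;
        }
    }

    // Counts query strings whose timestamp falls in (now - window, now]
    // and returns the most frequent ones, ties broken alphabetically.
    static List<Map.Entry<String, Long>> topQueries(
            List<SearchEvent> events, long nowMillis, long windowMillis, int limit) {
        Map<String, Long> counts = events.stream()
                .filter(e -> e.timestampMillis > nowMillis - windowMillis
                        && e.timestampMillis <= nowMillis)
                .collect(Collectors.groupingBy(e -> e.queryString, Collectors.counting()));

        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```

Calling `topQueries(events, System.currentTimeMillis(), 3_600_000L, 10)` would give the ten most frequent query strings of the past hour for an in-memory list of events.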
Read the rest of this entry »
Posted in Flume, Hadoop, Java, Spark | Tagged: Apache Spark, Flume, Hadoop, Java, Spark, Spark Streaming | 2 Comments »
Posted by Jai on May 19, 2014
This post covers using Apache Flume to gather customer product search clicks and store the information using Hadoop and ElasticSearch sinks. The data may consist of different product search events such as filtering on different facets, sorting information, pagination information, the products viewed, and the products marked as favorites by customers. In later posts we will analyze the data further and use the same information for display and analytics.
Product Search Functionality
Any eCommerce platform offers different products to customers, and search functionality is one of its basics. Allowing users guided navigation using different facets/filters, or free text search over the content, is a standard part of any existing search functionality.
SearchQueryInstruction
Consider a similar scenario where a customer can search for a product, which allows us to capture the product search behavior with the following information,
Read the rest of this entry »
Posted in ElasticSearch, Flume, Hadoop, Java | Tagged: ElasticSearch, Flume, Hadoop | 6 Comments »
Posted by Jai on May 14, 2014
The application demonstrates how to set up customer product search click analytics using big data tools: Hadoop, Hive, Pig, Oozie, ElasticSearch, Akka, Spring Data, etc.
Github Repository
URL: https://github.com/jaibeermalik/searchanalytics-bigdata
Analyzing Search Clicks Data Using Flume, Hadoop, Hive, Pig, Oozie, ElasticSearch, Akka, Spring Data.
The repository contains unit/integration test cases to generate analytics based on click events related to product search on any e-commerce website.
Getting Started
The project is a Maven project and can be built with Eclipse. Check the pom dependencies for the relevant version of each application. It uses the Cloudera Hadoop distribution version 2.3.0-cdh5.0.0.
Functionality
The scenario covered in the application for search analytics using big data is as follows,
Read the rest of this entry »
Posted in Akka, ElasticSearch, Flume, Hadoop, Hive, Java, Oozie, Pig, Spring, Spring Data | Tagged: Akka, Big Data, ElasticSearch, Flume, Hadoop, Hive, Oozie, Pig, Spring Data | 6 Comments »