Archive for the ‘Hadoop’ Category

HBase: Generating search click events statistics for customer behavior

Posted by Jai on July 9, 2014

In this post we will explore using HBase to store customer search click events data and using that data to derive customer behavior information based on search query strings and facet filter clicks. We will cover using MiniHBaseCluster, HBase schema design, and integration with Flume using HBaseSink to store JSON data.

In continuation of the previous posts on:

Customer product search clicks analytics using big data,
Flume: Gathering customer product search clicks data using Apache Flume,
Hive: Query customer top search query and product views count using Apache Hive,
ElasticSearch-Hadoop: Indexing product views count and customer top search query from Hadoop to ElasticSearch,
Oozie: Scheduling Coordinator/Bundle jobs for Hive partitioning and ElasticSearch indexing,
Spark: Real time analytics for big data for top search queries and top product views

We have explored storing search click events data in Hadoop and querying it using different technologies. Here we will use HBase to achieve the same:

HBase mini cluster setup
HBase template using Spring Data
HBase Schema Design
Flume Integration using HBaseSink
HBaseJsonSerializer to serialize json data
Query top 10 search query strings in the last hour
Query top 10 search facet filters in the last hour
Get recent search query strings for a customer in the last 30 days
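The last item in the list, recent queries per customer, can be sketched without a running cluster. The snippet below is only an illustration in plain Java, not the schema used in the full post: a `TreeMap` stands in for an HBase table (both keep keys sorted), and the row key layout `<customerId>|<Long.MAX_VALUE - timestamp>` is an assumption chosen so that a customer's newest events sort first. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for an HBase table: a TreeMap keeps its keys
// sorted lexicographically, much like HBase keeps rows sorted by row key.
class RecentQueriesSketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    // Assumed row key layout: <customerId>|<reversed timestamp, zero padded>.
    // Reversing the timestamp makes newer events sort before older ones.
    static String rowKey(String customerId, long timestampMillis) {
        return customerId + "|" + String.format("%019d", Long.MAX_VALUE - timestampMillis);
    }

    void put(String customerId, long timestampMillis, String queryString) {
        rows.put(rowKey(customerId, timestampMillis), queryString);
    }

    // Range scan from "now" back to the window cutoff; results are newest first.
    List<String> recentQueries(String customerId, long nowMillis, long windowMillis) {
        String start = rowKey(customerId, nowMillis);
        String stop = rowKey(customerId, nowMillis - windowMillis);
        return new ArrayList<>(rows.subMap(start, true, stop, true).values());
    }
}
```

With such a key layout, "the last 30 days for one customer" becomes a single contiguous range scan rather than a filter over the whole table.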

Read the rest of this entry »

Posted in Architecture, Flume, Hadoop, HBase, Java, Spring Data | Tagged: Flume, HBase, HBaseSink | 1 Comment »

Spark: Real time analytics for big data for top search queries and top product views

Posted by Jai on June 4, 2014

Hadoop, being a batch processing framework, makes it a little hard to get real-time analytics for big data. Apache Spark overcomes this batch nature and provides distributed computation capabilities with events processed in a streaming fashion. In this post, we will explore Spark's streaming capability to process Flume events data and generate the top search query strings or the top product views in the last hour.

In continuation of the previous posts, we have so far utilized the batch capabilities of the Hadoop system to process huge amounts of data. But batch processing introduces latency, and how much depends on your data. This is where Spark comes into the picture. We will explore Spark's streaming capability here to get some real-time analytics, which can be used on the website for display or for monitoring.

Spark

Apache Spark “is a fast and general engine for large-scale data processing.”

Functionality

As shared in the examples above, we have the customer search clicks data available to us. We have the Flume system in place to process the data and store it in Hadoop for later processing. Take a scenario where you want to display real-time customer behavior on the website, showing what other customers are doing:

What are other customers searching for?
Other customers are also searching for…
Top search query strings on the website in the last hour
What are other customers viewing?
Other customers are also viewing products…
Top product views in the last hour
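To make the "top search query strings in the last hour" idea concrete, here is a minimal sliding-window counter in plain Java. It is not Spark code and does not use the Spark API; it only sketches the kind of windowed counting a streaming job would perform. The class name `SlidingTopQueries` and its methods are hypothetical, and events are assumed to arrive in timestamp order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sliding-window counter; events are assumed to arrive in time order.
class SlidingTopQueries {
    private static final class Event {
        final long timestampMillis;
        final String query;
        Event(long timestampMillis, String query) {
            this.timestampMillis = timestampMillis;
            this.query = query;
        }
    }

    private final long windowMillis;
    private final Deque<Event> events = new ArrayDeque<>();
    private final Map<String, Integer> counts = new HashMap<>();

    SlidingTopQueries(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Record one search event and bump the count for its query string.
    void add(long timestampMillis, String query) {
        events.addLast(new Event(timestampMillis, query));
        counts.merge(query, 1, Integer::sum);
    }

    // Drop events that fell out of the window and decrement their counts.
    private void evictOlderThan(long cutoffMillis) {
        while (!events.isEmpty() && events.peekFirst().timestampMillis < cutoffMillis) {
            Event old = events.removeFirst();
            counts.computeIfPresent(old.query, (k, v) -> v == 1 ? null : v - 1);
        }
    }

    // Top n query strings inside the window, by count (ties broken alphabetically).
    List<String> top(int n, long nowMillis) {
        evictOlderThan(nowMillis - windowMillis);
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

The same structure would work for top product views by feeding product ids instead of query strings.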

Read the rest of this entry »

Posted in Flume, Hadoop, Java, Spark | Tagged: Apache Spark, Flume, Hadoop, Java, Spark, Spark Streaming | 2 Comments »

Oozie: Scheduling Coordinator/Bundle jobs for Hive partitioning and ElasticSearch indexing

Posted by Jai on May 28, 2014

This post covers using Oozie to schedule a Hive add-partition step every hour with the help of Coordinator jobs, and to automatically update the ElasticSearch data served to customers through nightly jobs using the Bundle jobs functionality. The automated procedure using Oozie jobs helps to update the statistical data used on the website to display product views count and top search query strings.

In continuation of the previous posts, the Hive partitioning strategy described earlier is based on the current time, and so is the ElasticSearch indexing of the analytic data. Here we will automate the process using Oozie, adding a Hive partition once data is available in the Hadoop directory.
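As a rough illustration of the hourly step being scheduled, the sketch below builds a HiveQL `ALTER TABLE ... ADD PARTITION` statement for a given hour. The table name `search_clicks` and the partition columns `year`, `month`, `day`, `hour` are hypothetical and are not taken from the earlier posts; the data location is passed in by the caller. In a real setup the Oozie coordinator would supply the time and path.

```java
import java.time.LocalDateTime;

// Hypothetical helper: the table name and partition column names are assumptions.
class HivePartitionSketch {
    // Builds the HiveQL that registers one hourly directory as a partition.
    static String addPartitionStatement(String table, LocalDateTime hour, String location) {
        return String.format(
            "ALTER TABLE %s ADD IF NOT EXISTS PARTITION "
                + "(year='%04d', month='%02d', day='%02d', hour='%02d') LOCATION '%s'",
            table,
            hour.getYear(), hour.getMonthValue(), hour.getDayOfMonth(), hour.getHour(),
            location);
    }
}
```

`IF NOT EXISTS` keeps the statement safe to re-run if a scheduled job is retried for the same hour.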

Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Read the rest of this entry »

Posted in ElasticSearch, Hadoop, Hive, Java, Oozie | Tagged: ElasticSearch-Hadoop, Hadoop, Hive, Oozie | 3 Comments »

ElasticSearch-Hadoop: Indexing product views count and customer top search query from Hadoop to ElasticSearch

Posted by Jai on May 22, 2014

This post covers using ElasticSearch-Hadoop to read data from the Hadoop system and index it in ElasticSearch. The functionality covered is indexing the product views count and the top search query per customer in the last n days. The analyzed data can further be used on the website to display a customer's recently viewed items, product views count and top search query strings.

In continuation of the previous posts, we already have customer search clicks data gathered using Flume and stored in Hadoop HDFS and ElasticSearch, and we have seen how to analyze the same data using Hive and generate statistical data. Here we will further see how to use the analyzed data to enhance the customer experience on the website and make it relevant for the end customers.

ElasticSearch-Hadoop

Elasticsearch for Apache Hadoop allows Hadoop jobs to interact with ElasticSearch with a small library and easy setup.

elasticsearch-hadoop-hive allows access to ElasticSearch using Hive. As shared in the previous post, we have the product views count and the customer top search query data extracted in Hive tables. We will read and index the same data to ElasticSearch so that it can be used for display on the website.

Read the rest of this entry »

Posted in ElasticSearch, Hadoop, Java, Spring Data | Tagged: ElasticSearch, ElasticSearch-Hadoop, Hadoop, Spring Data | 4 Comments »

Hive: Query customer top search query and product views count using Apache Hive

Posted by Jai on May 20, 2014

This post covers using Apache Hive to query the search clicks data stored under Hadoop. We will take examples that generate the customer top search query and statistics on total product views.

In continuation of the previous posts, we already have customer search clicks data gathered using Flume in Hadoop HDFS.

Here we will go further and use Hive to query the data stored under Hadoop.

Hive

Hive allows us to query big data using the SQL-like language HiveQL.

Hadoop Data

As shared in the last post, we have search clicks data stored under Hadoop in the following format: “/searchevents/2014/05/15/16/”. The data is stored in a separate directory created per hour.

The files are created as:

hdfs://localhost.localdomain:54321/searchevents/2014/05/06/16/searchevents.1399386809864
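The hourly layout can be sketched as a small Java helper that maps an event time to its directory. Two assumptions are made here and are not confirmed by the post: that the numeric file suffix (`1399386809864`) is an epoch-millisecond timestamp, and that the directory hour follows a UTC+2 local time, which is the offset that happens to reproduce the `/16/` directory shown above. The class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that maps an event time to its hourly directory.
class HourlyPathSketch {
    private static final DateTimeFormatter HOURLY = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // The time zone is an assumption; pass the zone your Flume agent writes with.
    static String hourlyDirectory(long epochMillis, ZoneId zone) {
        return "/searchevents/" + HOURLY.format(Instant.ofEpochMilli(epochMillis).atZone(zone)) + "/";
    }
}
```

For example, `HourlyPathSketch.hourlyDirectory(1399386809864L, ZoneOffset.ofHours(2))` yields `/searchevents/2014/05/06/16/`, while the same instant in UTC falls into `/searchevents/2014/05/06/14/`.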

Read the rest of this entry »

Posted in Hadoop, Hive, Java, Spring, Spring Data | Tagged: Hadoop, Hive, Spring Data, Spring Hadoop | 4 Comments »

Jai’s Weblog – Tech, Security & Fun…

Jaibeer Malik
