Jai’s Weblog – Tech, Security & Fun…

  • Jaibeer Malik

Archive for the ‘Hive’ Category

Oozie: Scheduling Coordinator/Bundle jobs for Hive partitioning and ElasticSearch indexing

Posted by Jai on May 28, 2014


This post covers how to use Oozie Coordinator jobs to schedule a Hive add-partition step every hour, and how to use Oozie Bundle jobs to automatically update the ElasticSearch data served to customers based on nightly jobs. Automating the procedure with Oozie jobs keeps the statistical data shown on the website, namely product view counts and top search query strings, up to date.

In continuation to the previous posts on

As described in earlier posts, the Hive partitioning strategy is based on the current time, and so is the ElasticSearch indexing of the analytics data. Here we automate the process using Oozie, adding a Hive partition once the data is available in the Hadoop directory.
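To make the time-based partitioning concrete, here is a minimal sketch of the kind of HiveQL statement an hourly job could issue. It is not taken from the original post: the table name `search_clicks` and the partition columns `year`, `month`, `day` and `hour` are hypothetical, and only the `/searchevents/yyyy/MM/dd/HH/` directory layout comes from this series.

```python
from datetime import datetime


def add_partition_statement(table: str, base_dir: str, when: datetime) -> str:
    """Build a HiveQL statement that registers one hourly directory as a partition.

    The partition column names (year, month, day, hour) are assumptions made
    for this sketch; adapt them to the actual table definition.
    """
    year, month, day, hour = (when.strftime(fmt) for fmt in ("%Y", "%m", "%d", "%H"))
    location = f"{base_dir}/{year}/{month}/{day}/{hour}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}', hour='{hour}') "
        f"LOCATION '{location}'"
    )


# Hypothetical table name; the hour matches the example directory used in this series.
statement = add_partition_statement("search_clicks", "/searchevents", datetime(2014, 5, 15, 16))
print(statement)
```

In a scheduled setup, the timestamp would come from the scheduler's nominal time for the run rather than a hard-coded value.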

Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

[Figure: oozie-hive-coord-bundle-job]

Read the rest of this entry »

Posted in ElasticSearch, Hadoop, Hive, Java, Oozie | 2 Comments

Hive: Query customer top search query and product views count using Apache Hive

Posted by Jai on May 20, 2014


This post covers how to use Apache Hive to query the search clicks data stored in Hadoop. As examples, we will generate the customers' top search queries and statistics on total product views.

In continuation to the previous posts on

we already have customer search clicks data gathered using Flume in Hadoop HDFS.

Here we will go further and use Hive to query the data stored in Hadoop.

Hive

Hive allows us to query big data using HiveQL, a SQL-like language.
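As an illustration of the two aggregations this post is about (top search query and product view counts), the sketch below performs them in plain Python on a few in-memory events. It is not HiveQL and not the author's code: the event fields `query` and `product_id` and the sample values are hypothetical. In Hive, the same idea would be a GROUP BY with a count over the events table.

```python
from collections import Counter

# Hypothetical click events; the real events in this series are collected via Flume.
events = [
    {"query": "laptop", "product_id": "p1"},
    {"query": "laptop", "product_id": "p2"},
    {"query": "phone", "product_id": "p1"},
    {"query": "laptop", "product_id": "p1"},
]

# Count how often each search query string occurs.
query_counts = Counter(event["query"] for event in events)

# Count how often each product was viewed.
view_counts = Counter(event["product_id"] for event in events)

# The most frequent query and its count.
top_query, top_count = query_counts.most_common(1)[0]
print(top_query, top_count)
print(dict(view_counts))
```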

[Figure: hive-query-search-events]

Hadoop Data

As shared in the last post, the search clicks data is stored in Hadoop under paths of the form “/searchevents/2014/05/15/16/”, with a separate directory created per hour.

The files are created as follows:

hdfs://localhost.localdomain:54321/searchevents/2014/05/06/16/searchevents.1399386809864
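Since the directory layout encodes year, month, day and hour, those values can be recovered from a file path. The following is a minimal sketch under that assumption; the function name `partition_values` and the returned key names are hypothetical and only illustrate how the hourly directories map to time-based partition values.

```python
import re

# Matches the hourly layout /searchevents/yyyy/MM/dd/HH/ described above.
HOURLY_PATH = re.compile(r"/searchevents/(\d{4})/(\d{2})/(\d{2})/(\d{2})/")


def partition_values(path: str) -> dict:
    """Extract the year, month, day and hour components from an hourly data path."""
    match = HOURLY_PATH.search(path)
    if match is None:
        raise ValueError(f"not an hourly searchevents path: {path}")
    year, month, day, hour = match.groups()
    return {"year": year, "month": month, "day": day, "hour": hour}


values = partition_values(
    "hdfs://localhost.localdomain:54321/searchevents/2014/05/06/16/searchevents.1399386809864"
)
print(values)
```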

Read the rest of this entry »

Posted in Hadoop, Hive, Java, Spring, Spring Data | 4 Comments

Customer product search clicks analytics using big data

Posted by Jai on May 14, 2014


The application demonstrates how to set up customer product search clicks analytics using big data technologies: Hadoop, Hive, Pig, Oozie, ElasticSearch, Akka, Spring Data, etc.

Github Repository

URL: https://github.com/jaibeermalik/searchanalytics-bigdata

Analyzing Search Clicks Data Using Flume, Hadoop, Hive, Pig, Oozie, ElasticSearch, Akka, Spring Data.

The repository contains unit/integration test cases that generate analytics from click events related to product search on any e-commerce website.

[Figure: bigdata-tech-analytics]

Getting Started

The project is a Maven project and can be built with Eclipse. Check the pom dependencies for the relevant version of each application. It uses the Cloudera Hadoop distribution, version 2.3.0-cdh5.0.0.

Functionality

The scenario covered in the application for search analytics using big data is as follows:
Read the rest of this entry »

Posted in Akka, ElasticSearch, Flume, Hadoop, Hive, Java, Oozie, Pig, Spring, Spring Data | 5 Comments

 