[Solved-1 Solution] Running Pig queries over data stored in Hive?
What is a Pig query?
- Pig can be used to run a query that finds the rows exceeding a threshold value, or to join two different datasets on a key (see the sketch after this list).
- Pig can be used to run iterative algorithms over a dataset. It is well suited for ETL (Extract, Transform, Load) work because it lets you spell out, step by step, how the data is to be transformed, and it can handle data with inconsistent schemas.
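As a minimal sketch of both kinds of query, in Pig Latin (the paths, field names, and threshold below are hypothetical):

    -- Load two hypothetical tab-delimited datasets
    users  = LOAD '/data/users'  USING PigStorage('\t') AS (user_id:int, name:chararray);
    visits = LOAD '/data/visits' USING PigStorage('\t') AS (user_id:int, visit_count:long);

    -- Keep only the rows whose count exceeds a threshold value
    heavy = FILTER visits BY visit_count > 1000;

    -- Join the two datasets on a key
    joined = JOIN heavy BY user_id, users BY user_id;
    DUMP joined;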
What is Apache Hive?
- Apache Hive provides a SQL-like layer on top of the Hadoop Distributed File System and MapReduce. It lets SQL developers write Hive Query Language (HiveQL) statements that are similar to standard SQL.
Problem:
How do we run Pig queries over data stored in Hive's format?
We have configured Hive to store compressed data. Before that, we simply used Pig's normal load function with Hive's default delimiter (^A). Now Hive stores the data in compressed sequence files. Which load function should we use?
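For reference, the old approach looked roughly like the sketch below, reading Hive's uncompressed text format with PigStorage and the ^A delimiter; the warehouse path and schema are placeholders:

    -- '\u0001' is ^A, Hive's default field delimiter for text tables
    a = LOAD '/user/hive/warehouse/my_table'
        USING PigStorage('\u0001')
        AS (id:int, user_id:int, address:chararray);

This stops working once Hive writes compressed sequence files, which is what the solution below addresses.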
Solution 1:
Here's what we found out: using HiveColumnarLoader makes sense if the data is stored as an RCFile.
To load a table with this loader, we need to register some jars first:
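A sketch of the registration and load; the jar paths and versions, the table path, and the column list are placeholders for your installation, and the exact set of Hive jars needed depends on your Hive version. HiveColumnarLoader takes the table's Hive column definitions as a single string:

    -- Register PiggyBank and the Hive jars that HiveColumnarLoader depends on
    REGISTER /srv/pigs/piggybank.jar;
    REGISTER /usr/lib/hive/lib/hive-exec.jar;
    REGISTER /usr/lib/hive/lib/hive-common.jar;

    -- Load an RCFile-backed Hive table; the string describes the Hive schema
    a = LOAD '/user/hive/warehouse/my_table'
        USING org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, user_id int, address string');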
- To load data from a sequence file, we have to use PiggyBank (as in the previous example). PiggyBank's SequenceFile loader should handle compressed files:
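A sketch with a placeholder path and schema; note that the AS clause uses Pig types (chararray rather than Hive's string):

    -- PiggyBank's SequenceFileLoader can read Hive's compressed sequence files
    REGISTER /srv/pigs/piggybank.jar;
    a = LOAD '/user/hive/warehouse/my_table'
        USING org.apache.pig.piggybank.storage.SequenceFileLoader()
        AS (id:int, user_id:int, address:chararray);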
- This may not work with Pig 0.7, which cannot read the BytesWritable type and cast it to a Pig type; in that case the load fails with a type-translation exception.