[Solved-1 Solution] Running Pig queries over data stored in Hive?
What is a Pig query?
- Pig can be used to run a query that finds the rows exceeding a threshold value, or to join two different datasets on a key (see the sketch after this list).
- Pig can be used to run iterative algorithms over a dataset. It is well suited for ETL (Extract, Transform, Load) work because it lets you spell out, step by step, how the data is to be transformed, and it can handle data with inconsistent schemas.
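As a minimal sketch of both kinds of query, in Pig Latin (the paths, field names, and threshold below are hypothetical):

    -- Load two hypothetical tab-delimited datasets
    users  = LOAD '/data/users'  USING PigStorage('\t') AS (user_id:int, name:chararray);
    visits = LOAD '/data/visits' USING PigStorage('\t') AS (user_id:int, visit_count:long);

    -- Keep only the rows whose count exceeds a threshold value
    heavy = FILTER visits BY visit_count > 1000;

    -- Join the two datasets on a key
    joined = JOIN heavy BY user_id, users BY user_id;
    DUMP joined;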
What is Apache Hive?
- Apache Hive provides a SQL-like layer on top of the Hadoop Distributed File System and MapReduce. It lets SQL developers write Hive Query Language (HiveQL) statements that are similar to standard SQL.
Problem:
How do we run Pig queries over data stored in Hive's format?
We have configured Hive to store compressed data. Before that, we simply used Pig's normal load function with Hive's default delimiter (^A). Now Hive stores the data in compressed sequence files. Which load function should we use?
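For reference, the old approach looked roughly like the sketch below, reading Hive's uncompressed text format with PigStorage and the ^A delimiter; the warehouse path and schema are placeholders:

    -- '\u0001' is ^A, Hive's default field delimiter for text tables
    a = LOAD '/user/hive/warehouse/my_table'
        USING PigStorage('\u0001')
        AS (id:int, user_id:int, address:chararray);

This stops working once Hive writes compressed sequence files, which is what the solution below addresses.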
Solution 1:
Here's what we found out: using HiveColumnarLoader makes sense if the data is stored as an RCFile.
To load a table with this loader, we need to register some jars first:
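A sketch of the registration and load; the jar paths and versions, the table path, and the column list are placeholders for your installation, and the exact set of Hive jars needed depends on your Hive version. HiveColumnarLoader takes the table's Hive column definitions as a single string:

    -- Register PiggyBank and the Hive jars that HiveColumnarLoader depends on
    REGISTER /srv/pigs/piggybank.jar;
    REGISTER /usr/lib/hive/lib/hive-exec.jar;
    REGISTER /usr/lib/hive/lib/hive-common.jar;

    -- Load an RCFile-backed Hive table; the string describes the Hive schema
    a = LOAD '/user/hive/warehouse/my_table'
        USING org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, user_id int, address string');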
- To load data from a sequence file, we have to use PiggyBank (as in the previous example). PiggyBank's SequenceFile loader should handle compressed files:
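A sketch with a placeholder path and schema; note that the AS clause uses Pig types (chararray rather than Hive's string):

    -- PiggyBank's SequenceFileLoader can read Hive's compressed sequence files
    REGISTER /srv/pigs/piggybank.jar;
    a = LOAD '/user/hive/warehouse/my_table'
        USING org.apache.pig.piggybank.storage.SequenceFileLoader()
        AS (id:int, user_id:int, address:chararray);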
- This may not work with Pig 0.7, which cannot read the BytesWritable type and cast it to a Pig type; in that case the load fails with a type-translation exception.