[Solved-1 Solution] Hadoop Pig - Removing csv header ?
What is Hadoop ?
- Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets using the MapReduce programming model. It runs on computer clusters built from commodity hardware.
CSV ?
- CSV (comma-separated values) is a plain-text file format in which each line is one record and the fields of a record are separated by commas. The first line often holds the column names (the header). Such files can be stored in HDFS (the Hadoop Distributed File System) and loaded into Pig from there.
Problem :
The CSV files have a header in the first line. Loading them into Pig as-is breaks any subsequent function (such as SUM), because the header row is treated as data. One workaround is to apply a filter to the loaded data to remove the rows containing the header:
affaires = load 'affaires.csv' using PigStorage(',') as (NU_AFFA:chararray, date:chararray) ;
affaires = filter affaires by date matches '../../..';
Is there a way to tell Pig not to load the first line of the CSV, such as an "as_header" boolean parameter on the load function? What would be the best practice?
Solution 1:
- The CSVExcelStorage loader supports skipping the header row, so use CSVExcelStorage instead of PigStorage. Download piggybank.jar, register it in the script, and pass the SKIP_INPUT_HEADER option.
Example
input.csv
Name,Age,Location
a,10,chennai
b,20,banglore
Pig script (with the SKIP_INPUT_HEADER option):
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
DUMP A;
Output:
(a,10,chennai)
(b,20,banglore)
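Since the problem mentions SUM, here is a minimal sketch of how the sample file could be aggregated once the header is skipped. It assumes the same input.csv and the same /tmp/piggybank.jar path as the example. The AS schema and the alias names (people, grouped, total_age) are illustrative assumptions, not part of the original answer.

```pig
-- Register piggybank so that CSVExcelStorage is available (path taken from the example).
REGISTER '/tmp/piggybank.jar';

-- Skip the header line and declare a schema (column names and types are assumed).
people = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (name:chararray, age:int, location:chararray);

-- Put all rows into a single group so that SUM runs over the whole file.
grouped = GROUP people ALL;
total_age = FOREACH grouped GENERATE SUM(people.age);

DUMP total_age;
```

With the two data rows of input.csv, the ages 10 and 20 should add up to 30.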