[Solved-3 Solutions] How to incorporate the current input filename into my Pig Latin script ?
What is PigStorage() ?
- The
PigStorage()function loads and stores data as structured text files. It takes a delimiter using which each entity of a tuple is separated as a parameter. By default, it takesâ\tâas a parameter.
Syntax
- Given below is the syntax of the PigStorage() function.
grunt> PigStorage(field_delimiter)
Problem:
How to incorporate the current input filename into my Pig Latin script ?
Solution 1:
We can use PigStorage by specify -tagsource as following
A = LOAD 'input' using PigStorage(',','-tagsource');
B = foreach A generate INPUT_FILE_NAME;
The first field in each Tuple will contain input path (INPUT_FILE_NAME)
Solution 2:
- The Pig wiki as an example of PigStorageWithInputPath which had the filename in an additional chararray field:
Example:
A = load '/directory/of/files/*' using PigStorageWithInputPath()
as (field1:chararray, field2:int, field3:chararray);
UDF
// Note that there are several versions of Path and FileSplit. These are intended:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.builtin.PigStorage;
import org.apache.pig.data.Tuple;
public class PigStorageWithInputPath extends PigStorage {
Path path = null;
@Override
public void prepareToRead(RecordReader reader, PigSplit split) {
super.prepareToRead(reader, split);
path = ((FileSplit)split.getWrappedSplit()).getPath();
}
@Override
public Tuple getNext() throws IOException {
Tuple myTuple = super.getNext();
if (myTuple != null)
myTuple.append(path.toString());
return myTuple;
}
}Solution 3:
- tagSource is deprecated in Pig 0.12.0 . Instead use
- -tagFile - Appends input source file name to beginning of each tuple.
- -tagPath - Appends input source file path to beginning of each tuple.
A = LOAD '/user/myFile.TXT' using PigStorage(',','-tagPath');
DUMP A ;
It will gives the full file path as first column
( hdfs://myserver/user/blo/input/2015.TXT,439,43,05,4,NAVI,PO,P &C,P &CR,UC,40)