[Solved-1 Solution] Equivalent of Linux 'diff' in Apache Pig?
Problem :
We want to be able to do a standard diff on two large files.
The code we have works, but it is not nearly as quick as diff on the command line.
Solution 1:
We use the following two approaches. If the original attempt was slow, perhaps it was running with only one reducer;
Pig now has an algorithm that adjusts the number of reducers.
- Both approaches we use are within a few percent of each other in performance but do not treat duplicates the same
- The JOIN approach collapses duplicates
- The UNION approach works like the Unix diff(1) tool and will return the correct number of extra duplicates for the correct file
- Unlike the Unix diff(1) tool, order is not important (effectively, the JOIN approach performs sort -u <foo.txt> | diff, while UNION performs sort <foo.txt> | diff).
- If a line is duplicated a very large number of times (on the order of thousands), things will slow down due to the joins.
- If your lines are very long (e.g. >1KB in size), it is advisable to use the DataFu MD5 UDF, diff over the hashes only, and then JOIN with your original files to get the original rows back before outputting.
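To make the hashing idea in the last bullet concrete, here is a minimal in-memory sketch in Python rather than Pig. It uses the standard hashlib module, not the DataFu MD5 UDF, and the names md5_hex and diff_by_hash are hypothetical. It compares only the digests and then looks the original rows back up, which is the role the final JOIN plays in the Pig version.

```python
import hashlib


def md5_hex(line):
    """Return the MD5 hex digest of a line (stand-in for an MD5 UDF)."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()


def diff_by_hash(first_lines, second_lines):
    """Diff two collections of long lines by comparing only their hashes."""
    # hash -> original line, so rows can be recovered after the comparison
    first_by_hash = {md5_hex(line): line for line in first_lines}
    second_by_hash = {md5_hex(line): line for line in second_lines}

    # Compare the short digests instead of the long lines
    first_only = first_by_hash.keys() - second_by_hash.keys()
    second_only = second_by_hash.keys() - first_by_hash.keys()

    # "Join back" to the original rows before outputting
    return (
        sorted(first_by_hash[h] for h in first_only),
        sorted(second_by_hash[h] for h in second_only),
    )


if __name__ == "__main__":
    print(diff_by_hash(["a", "b", "c"], ["b", "c", "d"]))  # (['a'], ['d'])
```

Because the dictionaries are keyed by hash, this sketch collapses duplicates; it only illustrates the hash-then-recover step, not the duplicate handling of either approach.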
The first approach uses JOIN.
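As a rough model of the behaviour described above for the JOIN approach (sort -u on each file, then diff), the following Python sketch treats each file as a set of lines. It is an in-memory illustration, not a Pig script, and join_diff is a hypothetical name.

```python
def join_diff(first_lines, second_lines):
    """Model the JOIN approach as described: duplicates are collapsed."""
    first = set(first_lines)    # equivalent of sort -u on the first file
    second = set(second_lines)  # equivalent of sort -u on the second file
    # Lines with no partner on the other side of the join
    return sorted(first - second), sorted(second - first)


if __name__ == "__main__":
    # "a" appears three times in the first file and once in the second,
    # but the extra copies are not reported.
    print(join_diff(["a", "a", "a", "b"], ["a", "c"]))  # (['b'], ['c'])
```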
The second approach uses UNION.
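The duplicate handling described for the UNION approach can be modelled in memory by counting how often each line occurs in each file and emitting the surplus copies for the file that has more. This is a Python sketch of those semantics only, not a Pig script; union_diff is a hypothetical name.

```python
from collections import Counter


def union_diff(first_lines, second_lines):
    """Model the UNION approach: report extra duplicates per file."""
    first_counts = Counter(first_lines)
    second_counts = Counter(second_lines)
    # Counter subtraction keeps only positive counts, i.e. the surplus
    # copies of each line on that side.
    first_only = sorted((first_counts - second_counts).elements())
    second_only = sorted((second_counts - first_counts).elements())
    return first_only, second_only


if __name__ == "__main__":
    # "a" appears three times in the first file and once in the second,
    # so two extra copies are reported for the first file.
    print(union_diff(["a", "a", "a", "b"], ["a", "c"]))
    # (['a', 'a', 'b'], ['c'])
```

As in the bullets above, input order does not matter in this model; only the per-line counts do.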
- It takes roughly 10 minutes to diff over 200GB (1,055,687,930 rows) using LZO-compressed input with 18 nodes.
- Each approach only takes one Map/Reduce cycle.
- This results in roughly 1.8GB diffed per node, per minute (not a great throughput, but on my system it seems diff(1) only operates in-memory, while Hadoop leverages streaming disks).