[Solved-2 Solutions] Too many filter matches in Pig?
What is FILTER?
- The FILTER operator is used to select the required tuples from a relation based on a condition.
Syntax
- The general syntax of the FILTER operator is: alias = FILTER alias BY expression;
Problem:
- We have a list of about 1000 filter keywords, and we need to filter a field of a relation in Pig using this list.
- Initially, we declared these keywords like this:
We are doing filtering like:
Assume the source relation is SRC and the filter applies to its first field, i.e. $0. With 100-200 filter conditions the script works fine, but when the number of conditions grows to 1000 it stops working. How can we solve this problem?
Solution 1:
We can write a simple filter UDF like below
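A minimal sketch of the matching logic such a UDF could hold, written as plain Java so it runs without the Pig libraries. The class name KeywordFilter, the method accept, and the choice of exact (not substring or regex) matching are assumptions made for illustration. In an actual Pig UDF this logic would typically sit inside a class extending org.apache.pig.FilterFunc, with exec(Tuple) returning a Boolean for the field being tested.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: holds the keyword list in a hash set so each lookup
// costs about the same whether there are 100 keywords or 1000.
public class KeywordFilter {
    private final Set<String> keywords;

    public KeywordFilter(List<String> keywords) {
        this.keywords = new HashSet<>(keywords);
    }

    // Assumption: a record is kept when the field exactly equals a keyword.
    // Null fields are rejected.
    public boolean accept(String field) {
        return field != null && keywords.contains(field);
    }

    public static void main(String[] args) {
        KeywordFilter filter = new KeywordFilter(Arrays.asList("alpha", "beta"));
        System.out.println(filter.accept("beta"));   // prints "true"
        System.out.println(filter.accept("gamma"));  // prints "false"
    }
}
```

With this approach the Pig script carries a single filter condition that calls the UDF, instead of a long chain of conditions, one per keyword.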
Solution 2:
- One simple approach is to divide the filtering into stages: filter on keywords 1 to 100 in stage one, then on the next 100, and so on (about 10 stages for 1000 keywords). However, given more details of your data, there is probably a better solution than this.
- To implement this staged approach, we can wrap the Pig script in a shell script that parcels out the input and starts a run for each keyword subset being filtered.
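The staging idea can be sketched as follows. This is in Java rather than shell, purely to illustrate how the keyword list is split. The class name KeywordBatcher and the batch size of 100 are assumptions, and the step that launches one Pig run per batch is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a keyword list into fixed-size batches,
// one batch per filtering stage.
public class KeywordBatcher {
    public static List<List<String>> partition(List<String> keywords, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < keywords.size(); start += batchSize) {
            int end = Math.min(start + batchSize, keywords.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(keywords.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> keywords = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            keywords.add("keyword" + i);
        }
        List<List<String>> batches = partition(keywords, 100);
        System.out.println(batches.size() + " stages");  // prints "10 stages"
    }
}
```

How the per-stage outputs are combined depends on the filter: if a record is kept when it matches any keyword, the stage outputs are unioned. If records matching any keyword are dropped, each stage must read the previous stage's output.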