[Solved-3 Solutions] Could not infer COUNT function in pig ?
What is Count() ?
- The COUNT() function of Pig Latin is used to get the number of elements in a bag. While counting the number of tuples in a bag, the COUNT() function ignores (will not count) the tuples having a NULL value in the FIRST FIELD
- We trying to write a pig latin script to pull the count of a dataset that we have filtered.
Syntax
- Given below is the syntax of the COUNT() function
grunt> COUNT(expression)
Problem :
We are trying to write a pig latin script to pull the count of a dataset that we have filtered.
Here is the Script:
/* scans by title */
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
scancount = FOREACH productscans GENERATE COUNT($0);
DUMP scancount;
- For some reason, we get the error:
Could not infer the matching function for org.apache.pig.builtin.COUNT as multiple or none of them fit. Please use an explicit cast.
Solution 1:
- Group by all to bring everything into one bag, then count the item
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
grouped = GROUP productscans ALL;
count = FOREACH grouped GENERATE COUNT(productscans);
dump count;
- COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.
Solution 2:
- We can use this below code to count function in pig
/* scans by title */
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
scancount = FOREACH productscans GENERATE COUNT(productscans);
DUMP scancount;
Solution 3:
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
grouped = GROUP productscans ALL;
count = FOREACH grouped GENERATE COUNT(productscans);
DUMP scancount;
Alternative way to solve:
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
grouped = GROUP productscans ALL;
count = FOREACH grouped GENERATE COUNT($1);
DUMP scancount;