I have a huge input on HDFS and use Pig to calculate several unique metrics. To explain the problem more easily, assume the input file has the following schema:
userid:chararray, dimensiona_key:chararray, dimensionb_key:chararray, dimensionc_key:chararray, activity:chararray, ...
Each record represents an activity performed by a userid.
Based on the value in the activity field, each activity record is mapped to one or more categories. There are 10 categories in total.
Now I need to count the number of unique users for different dimension combinations (i.e. A, B, C, A+B, A+C, B+C, A+B+C) for each activity category.
What are the best practices to perform such a calculation?
I have tried several ways. Although I can get the results I want, it takes a very long time (i.e. days). I found that most of the time is spent on the map phase. It looks like the script tries to load the huge input file every time it calculates one unique count. Is there a way to improve this behavior?
I tried something similar to the script below, but it looks like it reaches the memory cap of a single reducer and gets stuck at the last reducer step.
source = load ... as (userid:chararray, dimensiona_key:chararray, dimensionb_key:chararray, dimensionc_key:chararray, activity:chararray, ...);
a = group source by (dimensiona_key, dimensionb_key);
b = foreach a {
    -- the UDF returns the original userid if the activity should be mapped to category1, and null otherwise
    userid1 = udf.newuseridforcategory1(userid, activity);
    userid2 = udf.newuseridforcategory2(userid, activity);
    userid3 = udf.newuseridforcategory3(userid, activity);
    ...
    userid10 = udf.newuseridforcategory10(userid, activity);
    generate flatten(group), count(userid1), count(userid2), count(userid3), ..., count(userid10);
}
store b into ...;
Thanks. T.E.
If I understand what you're looking for, it's the CUBE operation, which lets you group by many dimension combinations in a single pass - it is available in Pig 0.11.
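A rough sketch of what that could look like with the schema above (this leaves out the split into activity categories, which you could still handle with a FILTER per category or your existing UDFs before counting; the output path is just a placeholder):

cubed = cube source by cube(dimensiona_key, dimensionb_key, dimensionc_key);
result = foreach cubed {
    -- for each dimension combination, the CUBE operator puts the matching input tuples in a bag named "cube"
    users = distinct cube.userid;
    generate flatten(group) as (dimensiona_key, dimensionb_key, dimensionc_key), COUNT(users) as unique_users;
}
store result into 'unique_users_by_dims';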
If you can't use Pig 0.11, you may try to group by every combination separately and union the grouped results.
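Under that approach, a sketch for two of the combinations might look like the following (the remaining combinations follow the same pattern; relation and output names are placeholders, and the per-category handling is again left out):

grp_a = group source by dimensiona_key;
cnt_a = foreach grp_a {
    -- unique users for dimension A alone
    users = distinct source.userid;
    generate 'A' as combo, group as dim_key, COUNT(users) as unique_users;
}

grp_ab = group source by CONCAT(dimensiona_key, CONCAT('|', dimensionb_key));
cnt_ab = foreach grp_ab {
    -- unique users for A+B, with the two keys concatenated so every branch shares one schema
    users = distinct source.userid;
    generate 'A+B' as combo, group as dim_key, COUNT(users) as unique_users;
}

-- repeat for C, A+C, B+C and A+B+C, then combine and store
all_counts = union onschema cnt_a, cnt_ab;
store all_counts into 'unique_users_by_combo';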