I have a huge input on HDFS and use Pig to calculate several unique metrics. To explain the problem more easily, assume the input file has the following schema:
userid:chararray, dimensiona_key:chararray, dimensionb_key:chararray, dimensionc_key:chararray, activity:chararray, ...
Each record represents an activity performed by a userid.
Based on the value in the activity field, each activity record is mapped to one or more categories. There are 10 categories in total.
Now I need to count the number of unique users for different dimension combinations (i.e. A, B, C, A+B, A+C, B+C, A+B+C) for each activity category.
What are the best practices to perform such a calculation?
I have tried several ways. Although I can get the results I want, it takes a very long time (i.e. days). I found that most of the time is spent on the map phase. It looks like the script tries to load the huge input file every time it calculates one unique count. Is there a way to improve this behavior?
I tried something similar to the script below, but it looks like it reaches the memory cap of a single reducer and gets stuck at the last reducer step.
source = load ... as (userid:chararray, dimensiona_key:chararray, dimensionb_key:chararray, dimensionc_key:chararray, activity:chararray, ...);
a = group source by (dimensiona_key, dimensionb_key);
b = foreach a {
    -- the UDF returns the original userid if the activity should be mapped to category1, and null otherwise
    userid1 = udf.newuseridforcategory1(userid, activity);
    userid2 = udf.newuseridforcategory2(userid, activity);
    userid3 = udf.newuseridforcategory3(userid, activity);
    ...
    userid10 = udf.newuseridforcategory10(userid, activity);
    generate flatten(group), count(userid1), count(userid2), count(userid3), ..., count(userid10);
}
store b into ...;
Thanks. T.E.
If I understand what you're looking for, it's the CUBE operation, which lets you group by many dimension combinations in a single pass - it is available in Pig 0.11.
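A rough sketch of what that could look like with the schema above (this leaves out the split into activity categories, which you could still handle with a FILTER per category or your existing UDFs before counting; the output path is just a placeholder):

cubed = cube source by cube(dimensiona_key, dimensionb_key, dimensionc_key);
result = foreach cubed {
    -- for each dimension combination, the CUBE operator puts the matching input tuples in a bag named "cube"
    users = distinct cube.userid;
    generate flatten(group) as (dimensiona_key, dimensionb_key, dimensionc_key), COUNT(users) as unique_users;
}
store result into 'unique_users_by_dims';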
If you can't use Pig 0.11, you may try to group by every combination separately and union the grouped results.
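Under that approach, a sketch for two of the combinations might look like the following (the remaining combinations follow the same pattern; relation and output names are placeholders, and the per-category handling is again left out):

grp_a = group source by dimensiona_key;
cnt_a = foreach grp_a {
    -- unique users for dimension A alone
    users = distinct source.userid;
    generate 'A' as combo, group as dim_key, COUNT(users) as unique_users;
}

grp_ab = group source by CONCAT(dimensiona_key, CONCAT('|', dimensionb_key));
cnt_ab = foreach grp_ab {
    -- unique users for A+B, with the two keys concatenated so every branch shares one schema
    users = distinct source.userid;
    generate 'A+B' as combo, group as dim_key, COUNT(users) as unique_users;
}

-- repeat for C, A+C, B+C and A+B+C, then combine and store
all_counts = union onschema cnt_a, cnt_ab;
store all_counts into 'unique_users_by_combo';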