[Solved-1 Solution] Apache Pig - URL parsing into a map?
What's a URL?
- Uniform Resource Locators (URLs) provide a way to locate a resource using a specific scheme, most often but not limited to HTTP. Just think of a URL as an address to a resource, and the scheme as a specification of how to get there.
Parsing a url
- Java's URL class (java.net.URL) provides several methods that let you query URL objects. You can get the protocol, authority, host name, port number, path, query, file name, and reference from a URL.
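As a rough illustration of those accessors, here is a minimal sketch in plain Java. The class name UrlPartsDemo, the helper parts, and the example.com address are made up for this example and are not part of the original solution.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class, not part of the original solution.
class UrlPartsDemo {
    // Returns the pieces of a URL in a fixed order:
    // protocol, authority, host, port, path, query, file, ref.
    static String[] parts(String spec) {
        try {
            URL u = URI.create(spec).toURL();
            return new String[] {
                u.getProtocol(), u.getAuthority(), u.getHost(),
                String.valueOf(u.getPort()), u.getPath(), u.getQuery(),
                u.getFile(), u.getRef()
            };
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // Made-up address used only for illustration.
        String spec = "http://example.com:8080/docs/index.html?user=3553&friend=2042#top";
        for (String part : parts(spec)) {
            System.out.println(part);
        }
    }
}
```

The query part (user=3553&friend=2042 here) is the piece that the Pig script later splits into keys and values.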
Problem:
How do you parse a URL into a map in Apache Pig?
Solution 1:
Use of flatten
- The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. FLATTEN un-nests tuples as well as bags. The idea is the same, but the operation and the result are different for each type of structure.
- FLATTEN the result of STRSPLIT so that there is no useless level of nesting in tuples, and FLATTEN again inside the nested foreach
- Also, STRSPLIT has an optional third argument to give the maximum number of output strings. Use that to guarantee a schema for its output.
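The effect of the limit argument can be sketched with Java's String.split, which takes a regular expression and a limit in the same way. This assumes STRSPLIT follows the same semantics; the class SplitLimitDemo and the sample tokens are invented for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original solution.
class SplitLimitDemo {
    // Split one "key=value" token at the first '=' only,
    // so the result never has more than two elements.
    static String[] splitPair(String token) {
        return token.split("=", 2);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitPair("user=3553"))); // [user, 3553]
        System.out.println(Arrays.toString(splitPair("q=a=b")));     // [q, a=b]
        System.out.println(Arrays.toString(splitPair("flag")));      // [flag]
    }
}
```

With a limit of 2, a value that itself contains '=' stays in one piece, which is why the output can be declared as exactly two fields. A token with no '=' at all yields a single element, so such input may need separate handling.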
The code below handles the URL parsing:
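-- Load each line as an id and its '&'-separated key=value string
A = load 'test.log' as (f:chararray, url:chararray);
-- Split the string on '&' into a bag of key=value tokens
B = foreach A generate f, TOKENIZE(url,'&') as attr;
C = foreach B {
    -- Split each token at the first '=' into (key, val)
    D = foreach attr generate FLATTEN(STRSPLIT($0,'=',2)) AS (key:chararray, val:chararray);
    generate f, FLATTEN(D);
};
-- Group by id and key, then build a one-entry map per group
E = foreach (group C by (f, key)) generate group.f, TOMAP(group.key, C.val);
dump E;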
Output
(test1,[user#{(3553)}])
(test1,[friend#{(2042)}])
(test1,[system#{(262)}])
(test2,[user#{(12523),(205)}])
(test2,[friend#{(26546),(3525),(353)}])
(test2,[browser#{(firfox)}])
- After splitting out the tags and values, group by the tag as well to get a bag of values for each tag, then put that bag into a map. Note that this assumes that two lines with the same id (test2, here) should be combined.
- Unfortunately, there is apparently no way to combine maps without resorting to a UDF, but this should be just about the simplest of all possible UDFs.
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class COMBINE_MAPS extends EvalFunc<Map> {
    @Override
    @SuppressWarnings("unchecked")
    public Map<String, Object> exec(Tuple input) throws IOException {
        if (input == null || input.size() != 1) { return null; }
        // Input tuple is a singleton containing the bag of maps
        DataBag b = (DataBag) input.get(0);
        // Create the map that we will construct and return
        Map<String, Object> m = new HashMap<String, Object>();
        // Iterate through the bag, adding the elements from each map
        Iterator<Tuple> iter = b.iterator();
        while (iter.hasNext()) {
            Tuple t = iter.next();
            m.putAll((Map<String, Object>) t.get(0));
        }
        return m;
    }
}
With a UDF like that, we can do:
F = foreach (group E by f) generate COMBINE_MAPS(E.$1);
For more robust URL parsing, add error-checking code to the UDF.
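One way to read "error checking" here is to skip anything in the bag that is missing or is not a map, instead of letting a cast fail. The sketch below shows that idea with plain java.util collections rather than Pig's Tuple and DataBag types, so it is not a drop-in UDF. The class SafeMergeDemo and its merge method are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: the merge step of COMBINE_MAPS with defensive checks.
class SafeMergeDemo {
    // Merge items left to right. Nulls and non-map items are skipped,
    // and entries with a null key are ignored.
    static Map<String, Object> merge(List<?> items) {
        Map<String, Object> out = new HashMap<>();
        if (items == null) {
            return out;
        }
        for (Object item : items) {
            if (!(item instanceof Map)) {
                continue; // covers null as well as unexpected types
            }
            for (Map.Entry<?, ?> e : ((Map<?, ?>) item).entrySet()) {
                if (e.getKey() != null) {
                    out.put(e.getKey().toString(), e.getValue());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<?> items = Arrays.asList(
            Collections.singletonMap("user", "3553"),
            null,    // a missing entry
            "oops",  // an entry of the wrong type
            Collections.singletonMap("friend", "2042"));
        System.out.println(merge(items).size()); // 2
    }
}
```

Inside the real UDF the same checks would apply to the value returned by input.get(0) and to each tuple's first field. If two maps share a key, the later value replaces the earlier one, as with putAll.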