say have following array of strings input:
foo-139875913 foo-aeuefhaiu foo-95hw9ghes barbazabejgoiagjaegioea barbaz8gs98ghsgh9es8h 9a8efa098fea0 barbaza98fyae9fghaefag bazfa90eufa0e9u bazgeajga8ugae89u bazguea9guae aifeaufhiuafhe
there 3 different prefixes used here, "foo-", "barbaz" , "baz" - these prefixes not known ahead of time (they different).
how establish different common prefixes grouped by? made bit tricky since in data i've provided there's 2 start "bazg" , 1 starts "bazf" of course "baz" prefix.
what i've tried far sorting them alphabetical order, , looping through them in order , counting how many characters in row identical previous. if number different or when 0 characters identical, starts new group. problem falls on @ "bazg" , "bazf" problem mentioned earlier , separates 2 different groups (one 1 element in it)
edit: alright, let's throw few more rules in:
- longer potential groups should preferred on shorter ones, unless there closely matching group of less x characters difference in length. (so x 2, baz preferred on bazg)
- a group must have @ least y elements in or not group @ all
- it's okay throw away elements don't match of 'groups' within rules above.
to clarify first rule in relation second, if x 0 , y 2, 2 'bazg' entries in group, , 'bazf' thrown away because on own.
well, here's quick hack, o(something_bad)
:
ienumerable<tuple<string, ienumerable<string>>> guessgroups(ienumerable<string> source, int minnamelength=0, int mingroupsize=1) { // todo: error checking return innerguessgroups(new stack<string>(source.orderbydescending(x => x)), minnamelength, mingroupsize); } ienumerable<tuple<string, ienumerable<string>>> innerguessgroups(stack<string> source, int minnamelength, int mingroupsize) { if(source.any()) { var tuple = extracttuple(getbestgroup(source, minnamelength), source); if (tuple.item2.count() >= mingroupsize) yield return tuple; foreach (var element in guessgroups(source, minnamelength, mingroupsize)) yield return element; } } tuple<string, ienumerable<string>> extracttuple(string prefix, stack<string> source) { return tuple.create(prefix, popwithprefix(prefix, source).tolist().asenumerable()); } ienumerable<string> popwithprefix(string prefix, stack<string> source) { while (source.any() && source.peek().startswith(prefix)) yield return source.pop(); } string getbestgroup(ienumerable<string> source, int minnamelength) { var s = new stack<string>(source); var counter = new dictionarywithdefault<string, int>(0); while(s.any()) { var g = getcommonprefix(s); if(!string.isnullorempty(g) && g.length >= minnamelength) counter[g]++; s.pop(); } return counter.orderby(c => c.value).last().key; } string getcommonprefix(ienumerable<string> coll) { return (from len in enumerable.range(0, coll.min(s => s.length)).reverse() let possiblematch = coll.first().substring(0, len) coll.all(f => f.startswith(possiblematch)) select possiblematch).firstordefault(); } public class dictionarywithdefault<tkey, tvalue> : dictionary<tkey, tvalue> { tvalue _default; public tvalue defaultvalue { { return _default; } set { _default = value; } } public dictionarywithdefault() : base() { } public dictionarywithdefault(tvalue defaultvalue) : base() { _default = defaultvalue; } public new tvalue this[tkey key] { { return base.containskey(key) ? base[key] : _default; } set { base[key] = value; } } }
example usage:
string[] input = { "foo-139875913", "foo-aeuefhaiu", "foo-95hw9ghes", "barbazabejgoiagjaegioea", "barbaz8gs98ghsgh9es8h", "barbaza98fyae9fghaefag", "bazfa90eufa0e9u", "bazgeajga8ugae89u", "bazguea9guae", "9a8efa098fea0", "aifeaufhiuafhe" }; guessgroups(input, 3, 2).dump();
Comments
Post a Comment