c# - Grouping by an unknown initial prefix -


say have following array of strings input:

foo-139875913 foo-aeuefhaiu foo-95hw9ghes barbazabejgoiagjaegioea barbaz8gs98ghsgh9es8h 9a8efa098fea0 barbaza98fyae9fghaefag bazfa90eufa0e9u bazgeajga8ugae89u bazguea9guae aifeaufhiuafhe 

there 3 different prefixes used here, "foo-", "barbaz" , "baz" - these prefixes not known ahead of time (they different).

how establish different common prefixes grouped by? made bit tricky since in data i've provided there's 2 start "bazg" , 1 starts "bazf" of course "baz" prefix.

what i've tried far sorting them alphabetical order, , looping through them in order , counting how many characters in row identical previous. if number different or when 0 characters identical, starts new group. problem falls on @ "bazg" , "bazf" problem mentioned earlier , separates 2 different groups (one 1 element in it)

edit: alright, let's throw few more rules in:

  • longer potential groups should preferred on shorter ones, unless there closely matching group of less x characters difference in length. (so x 2, baz preferred on bazg)
  • a group must have @ least y elements in or not group @ all
  • it's okay throw away elements don't match of 'groups' within rules above.

to clarify first rule in relation second, if x 0 , y 2, 2 'bazg' entries in group, , 'bazf' thrown away because on own.

well, here's quick hack, o(something_bad):

ienumerable<tuple<string, ienumerable<string>>> guessgroups(ienumerable<string> source, int minnamelength=0, int mingroupsize=1) {     // todo: error checking     return innerguessgroups(new stack<string>(source.orderbydescending(x => x)), minnamelength, mingroupsize); }  ienumerable<tuple<string, ienumerable<string>>> innerguessgroups(stack<string> source, int minnamelength, int mingroupsize) {     if(source.any())     {         var tuple = extracttuple(getbestgroup(source, minnamelength), source);         if (tuple.item2.count() >= mingroupsize)             yield return tuple;         foreach (var element in guessgroups(source, minnamelength, mingroupsize))             yield return element;        } }  tuple<string, ienumerable<string>> extracttuple(string prefix, stack<string> source) {     return tuple.create(prefix, popwithprefix(prefix, source).tolist().asenumerable()); }  ienumerable<string> popwithprefix(string prefix, stack<string> source) {     while (source.any() && source.peek().startswith(prefix))         yield return source.pop(); }  string getbestgroup(ienumerable<string> source, int minnamelength) {     var s = new stack<string>(source);     var counter = new dictionarywithdefault<string, int>(0);     while(s.any())     {         var g = getcommonprefix(s);         if(!string.isnullorempty(g) && g.length >= minnamelength)             counter[g]++;         s.pop();     }     return counter.orderby(c => c.value).last().key; }  string getcommonprefix(ienumerable<string> coll) {     return (from len in enumerable.range(0, coll.min(s => s.length)).reverse()             let possiblematch = coll.first().substring(0, len)             coll.all(f => f.startswith(possiblematch))             select possiblematch).firstordefault(); }  public class dictionarywithdefault<tkey, tvalue> : dictionary<tkey, tvalue> {   tvalue _default;   public tvalue defaultvalue {     { return _default; }     set { _default = value; }   }   public dictionarywithdefault() : base() { }   public dictionarywithdefault(tvalue defaultvalue) : base() {     _default = defaultvalue;   }   public new tvalue this[tkey key]   {     { return base.containskey(key) ? base[key] : _default; }     set { base[key] = value; }   } } 

example usage:

string[] input = {     "foo-139875913",     "foo-aeuefhaiu",     "foo-95hw9ghes",     "barbazabejgoiagjaegioea",     "barbaz8gs98ghsgh9es8h",     "barbaza98fyae9fghaefag",     "bazfa90eufa0e9u",     "bazgeajga8ugae89u",     "bazguea9guae",     "9a8efa098fea0",     "aifeaufhiuafhe" };  guessgroups(input, 3, 2).dump(); 

enter image description here


Comments