@MISC{_supportingtext, author = {}, title = {Supporting Text 1. Algorithm Details}, year = {} }

Abstract

Consider a corpus of $m$ sentences (sequences) of variable length, each expressed in terms of a lexicon of finite size $N$. The sentences in the corpus correspond to $m$ different paths in a pseudograph (a nonsimple graph in which both loops and multiple edges are permitted) whose vertices are the unique lexicon entries, augmented by two special symbols, begin and end. Each of the $N$ nodes has a number of incoming paths that is equal to the number of outgoing paths. Fig. 4 illustrates the type of structure that we seek, namely, the bundling of paths, signifying a relatively high probability associated with a sub-structure that can be identified as a pattern. To extract it from the data, two probability functions are defined over the graph for any given search path $S = (e_1 \to e_2 \to \cdots \to e_k) = (e_1; e_k)$. The first one, $P_R(e_i; e_j)$, is the right-moving ratio of fan-through flux of paths at $e_j$ to fan-in flux of paths at $e_{j-1}$, starting at $e_i$ and moving along the subpath $e_i \to e_{i+1} \to e_{i+2} \to \cdots \to e_{j-1}$:

$$P_R(e_i; e_j) = p(e_j \mid e_i e_{i+1} e_{i+2} \ldots e_{j-1}) = \frac{l(e_i; e_j)}{l(e_i; e_{j-1})}, \qquad [1]$$

where $l(e_i; e_j)$ is the number of occurrences of subpaths $(e_i; e_j)$ in the graph. Proceeding in the opposite direction, from the right end of the path to the left, we define the left-going probability function $P_L$ and note that