java - How does LCP help in finding the number of occurrences of a pattern? -
i have read longest common prefix (lcp) used find number of occurrences of pattern in string.
specifically, need create suffix array of text, sort it, , instead of doing binary search find range can figure out number of occurrences, compute lcp each successive entry in suffix array.
although using binary search find number of occurrences of pattern obvious can't figure out how lcp helps find number of occurrences here.
for example suffix array banana
:
lcp suffix entry n/a 1 ana 3 anana 0 banana 0 na 2 nana
how lcp find number of occurrences of substring "banana" or "na" not obvious me.
any figuring out how lcp helps here?
i not know way of using lcp array instead of carrying out binary search, believe refer technique described udi manber , gene myers in suffix arrays: new method on-line string searches.
the idea this: in order find number of occurrences of given string p (length m) in text t (length n),
- you use binary search against suffix array of t (just suggested)
- but speed up using lcp array auxiliary data structure. more specifically, generate special version of lcp array (i call lcp-lr below) , use that.
the issue using standard binary search (without lcp information) in each of o(log n) comparisons need make, compare p current entry of suffix array, means full string comparison of m characters. complexity o(m*log n).
the lcp-lr array helps improve o(m+log n), in following way:
- at point during binary search algorithm, consider, usual, range (l,...,r) of suffix array , central point m, , decide whether continue search in left sub-range (l,...,m) or in right sub-range (m,...,r).
- in order make decision, compare p string @ m. if p identical m, done, if not, have compared first k characters of p , decided whether p lexicographically smaller or larger m. let's assume outcome p larger m.
-
so, in next step, consider (m,...,r) , new central point m' in middle:
m ...... m' ...... r | know: lcp(p,m)==k
the trick lcp-lr precomputed such o(1)-lookup tells longest common prefix of m , m', lcp(m,m').
you know (from previous step) m has prefix of k characters in common p: lcp(p,m)=k. there 3 possibilities:
- case 1: k < lcp(m,m'), i.e. p has fewer prefix characters in common m m has in common m'. means (k+1)-th character of m' same of m, , since p lexicographically larger m, must lexicographically larger m', too. continue in right half (m',...,r).
- case 2: k > lcp(m,m'), i.e. p has more prefix characters in common m m has in common m'. consequently, if compare p m', common prefix smaller k, , m' lexicographically larger p, so, without making comparison, continue in left half (m,...,m').
- case 3: k == lcp(m,m'). m , m' both identical p in first k characters. decide whether continue in left or right half, suffices compare p m' starting (k+1)-th character.
we continue recursively.
the overall effect no character of p compared character of text more once. total number of character comparisons bounded m, total complexity indeed o(m+log n).
obviously, key remaining question how did precompute lcp-lr able tell in o(1) time lcp between 2 entries of suffix array? said, standard lcp array tells lcp of consecutive entries only, i.e. lcp(x-1,x) x. m , m' in description above not consecutive entries, how done?
the key realize ranges (l,...,r) ever occur during binary search: starts (0,...,n) , divides @ center, , continues either left or right , divide half again , forth. if think of it: every entry of suffix array occurs central point of 1 possible range during binary search. there n distinct ranges (l...m...r) can possibly play role during binary search, , suffices precompute lcp(l,m) , lcp(m,r) n possible ranges. 2*n distinct precomputed values, hence lcp-lr o(n) in size.
moreover, there straight-forward recursive algorithm compute 2*n values of lcp-lr in o(n) time standard lcp array – i'd suggest posting separate question if need detailed description of that.
to sum up:
- it possible compute lcp-lr in o(n) time , o(2*n)=o(n) space lcp
- using lcp-lr during binary search helps accelerate search procedure o(m*log n) o(m+log n)
- as suggested, can use 2 binary searches determine left , right end of match range p, , length of match range corresponds number of occurrences p.
Comments
Post a Comment