
TECHNISCHE UNIVERSITEIT EINDHOVEN Department of Mathematics and Computer Science

MASTER'S THESIS

Two-dimensional pattern matching


by M.G.W.H. van de Rijdt

Supervisors:

dr. ir. G. Zwaan
prof. dr. B.W. Watson

Eindhoven, August 2005

Abstract

This thesis contains formal derivations of several two-dimensional pattern matching algorithms. The two-dimensional pattern matching problem is to find all exact occurrences of a given two-dimensional pattern matrix within a larger matrix. Two-dimensional pattern matching is mostly applied in image processing (and image recognition in particular), although there are other applications as well. We give a formal derivation (and correctness proof) for several known algorithms in this field, as well as a few improvements to some of these algorithms.

Samenvatting

This graduation thesis contains formal derivations of several algorithms for two-dimensional pattern matching. The two-dimensional pattern matching problem consists of finding the occurrences of a given two-dimensional pattern matrix within a larger matrix. Two-dimensional pattern matching is mainly applied in image processing (and image recognition in particular), but there are other applications as well. We give a formal derivation (which also serves as a correctness proof) of several known algorithms that solve this problem. In addition, we introduce some improvements to a number of these algorithms.

Preface

This document is my Master's Thesis, written to complete my education in Technische Informatica (Technical Computer Science). It is the result of my research for the Software Construction group at the Department of Mathematics and Computer Science at the Eindhoven University of Technology (TU/e), under the supervision of dr. ir. Gerard Zwaan and prof. dr. Bruce Watson.

I thank Bruce Watson, for initially suggesting the topic of two-dimensional pattern matching and involving me in the FASTAR (Finite Automata Systems Theoretical and Applied Research) group. I would also like to thank Gerard Zwaan, for his reviews of countless draft versions of this thesis and for offering many corrections and suggestions that improved this document greatly. I thank Loek Cleophas for his reviews of several early versions of this document, as well as some pointers to relevant articles. Many thanks go to my good friend Remi Bosman, for all the brainstorming and his reviews of this document's near-final drafts. I thank LaQuSo (the Laboratory for Quality Software at the TU/e), for providing me with a workspace for many months.

Finally, I would like to thank my friends and family, particularly my parents, for all their support during the time of this research.

Martijn van de Rijdt, August 2005.

Contents
0 Introduction
1 Preliminaries
  1.0 Two-dimensional arrays
  1.1 Composed matrices
  1.2 One-dimensional pattern matching
2 Problem
3 Naive algorithm
  3.0 Algorithm structure
  3.1 Match function
4 Filter-based approach
  4.0 The filter function
  4.1 One-dimensional pattern matching in one direction
  4.2 Efficient computation and storage of the reduced text
  4.3 Baker and Bird
  4.4 Takaoka and Zhu
  4.5 Generalisations
5 Baeza-Yates and Régnier
  5.0 Algorithm structure
  5.1 Baeza-Yates and Régnier's CheckMatch approach
  5.2 Inspecting fewer pattern rows
  5.3 Inspecting only matching pattern rows
  5.4 Computation of unique row indices
  5.5 Generalisations
6 Polcar
  6.0 Introduction
  6.1 Derivation
  6.2 Algorithm structure
  6.3 Precomputation
    6.3.0 Representing sets of matrices by lists of maximal elements
    6.3.1 Precomputation algorithm structure
    6.3.2 Case analysis
    6.3.3 Entire precomputation algorithm
    6.3.4 Computation of the failure function
  6.4 Entire algorithm
  6.5 Remarks
7 Conclusions and future work
  7.0 Conclusions
  7.1 Future work
A Properties of div and mod
B Properties of pref and suff (for strings)
C Properties of pref and suff (for matrices)
D Lists

List of Figures
0 A two-dimensional array
1 Every match intersects with a row i · m1 − 1
2 Submatrix of the text, inspected when a match occurs in row i · m1 − 1
3 The Baeza-Yates algorithm idea, applied in three dimensions
4 A pattern occurrence as a suffix of a prefix of the text

0 Introduction

In this text we will formally derive a number of known two-dimensional pattern matching algorithms. The two-dimensional pattern matching problem consists of finding all occurrences of a given two-dimensional matrix, the so-called pattern, in a larger two-dimensional matrix, the text.

We provide formal derivations of these algorithms for a number of reasons. First of all, a formal derivation is also a correctness proof. This method also ensures that all algorithms are presented in a (more-or-less) uniform way, which is independent of implementation details and choice of programming language. And finally, this presentation also highlights the major design decisions during the algorithms' construction. Variations on these decisions may give rise to new solutions to the two-dimensional pattern matching problem.

Part of the original goal of this research was to construct a taxonomy of algorithms. A taxonomy is a structured classification of algorithms; see for example [Wat95, WZ92, WZ93, WZ95, WZ96]. However, as we will see, the differences between most of the algorithms discussed here are so great that the corresponding taxonomy would have a very coarse structure and therefore would not provide much additional value.

Section 1 introduces some definitions and notations used in the rest of the thesis. In section 2, we formally define the two-dimensional pattern matching problem. Section 3 contains a description of a very straightforward, but inefficient, solution to the problem: the naive algorithm. In section 4 the so-called filter-based approaches are discussed; most notably: Baker and Bird's algorithm ([Bak78, Bir77]) and Takaoka and Zhu's algorithm ([TZ89, TZ94]). Section 5 contains the description of Baeza-Yates and Régnier's algorithm ([BYR93]) and in section 6 we will derive Polcar's algorithm ([Pol04, MP04]). Section 7 will contain the conclusions and suggestions for future work. Finally, in the appendices we list some definitions, notations and properties that are useful in the derivations in the main text, but not an essential part of the derivations themselves.


1 Preliminaries

1.0 Two-dimensional arrays

Say we have a two-dimensional array, or matrix, M. Such a matrix can be visualised as shown in figure 0. The size of M is determined by its number of rows, denoted by row(M), and its number of columns, col(M). Let row(M) = l1 and col(M) = l2. Rows are numbered from 0 to l1 − 1, columns from 0 to l2 − 1. We call M[i][0 .. l2), or simply M[i], the (i + 1)-th row of M. Similarly, M[0 .. l1)[i] is the (i + 1)-th column of M. The set of all two-dimensional matrices over Σ is denoted by M2(Σ).

[Figure 0: A two-dimensional array]

We call a matrix for which the number of rows or the number of columns (or both) is equal to 0 an empty matrix. This is a special kind of matrix because, since it has no elements, it is completely defined by its size. We denote the empty matrix of size k1 × k2 by E_{k1,k2} (where k1 = 0 ∨ k2 = 0). We call the set of all empty matrices ES:

  ES = ⟨set j1, j2 : 0 ≤ j1 ∧ 0 ≤ j2 ∧ (j1 = 0 ∨ j2 = 0) : E_{j1,j2}⟩

We can also give the following alternative definition:

  ES = ⟨set j : 0 ≤ j : E_{j,0}⟩ ∪ ⟨set j : 0 ≤ j : E_{0,j}⟩

We introduce the following notation for a set of empty matrices up to a certain size, for 0 ≤ i1, 0 ≤ i2:

  ES_{i1,i2} = ⟨set j1, j2 : 0 ≤ j1 ≤ i1 ∧ 0 ≤ j2 ≤ i2 ∧ (j1 = 0 ∨ j2 = 0) : E_{j1,j2}⟩
  ES_{i1,i2} = ⟨set j : 0 ≤ j ≤ i1 : E_{j,0}⟩ ∪ ⟨set j : 0 ≤ j ≤ i2 : E_{0,j}⟩

Furthermore, for 0 ≤ i1 ≤ l1 − k1 and 0 ≤ i2 ≤ l2 − k2, the following is a k1 × k2 submatrix of M:

  M[i1 .. i1 + k1)[i2 .. i2 + k2)


For 0 ≤ j1 < k1 and 0 ≤ j2 < k2:

  (M[i1 .. i1 + k1)[i2 .. i2 + k2))[j1, j2] = M[i1 + j1, i2 + j2]

If i1 and i2 are both equal to 0, we call the submatrix a prefix of M. If i1 + k1 = l1 and i2 + k2 = l2, we call the submatrix a suffix of M. More formally, a prefix of M is an element of the set pref(M). A suffix is an element of suff(M). We define the sets pref(M) and suff(M) similarly to the one-dimensional case:

  pref(M) = ⟨set i1, i2 : 0 ≤ i1 ≤ l1 ∧ 0 ≤ i2 ≤ l2 : M[0 .. i1)[0 .. i2)⟩
  suff(M) = ⟨set i1, i2 : 0 ≤ i1 ≤ l1 ∧ 0 ≤ i2 ≤ l2 : M[i1 .. l1)[i2 .. l2)⟩

Note that both pref(M) and suff(M) include the following empty matrices: ES_{l1,l2}.

Two matrices M and N are equal if they have the same size, say k1 × k2, and:

  ⟨∀ h1, h2 : 0 ≤ h1 < k1 ∧ 0 ≤ h2 < k2 : M[h1, h2] = N[h1, h2]⟩

1.1 Composed matrices

Suppose we have l1 · l2 matrices, called M_{i1,i2} (0 ≤ i1 < l1 and 0 ≤ i2 < l2), for which the following holds:

  ⟨∀ i1, i2 : 0 ≤ i1 < l1 ∧ 0 ≤ i2 < l2 : row(M_{i1,i2}) = row(M_{i1,0}) ∧ col(M_{i1,i2}) = col(M_{0,i2})⟩

Then we can introduce the following composed matrix:

  M_{0,0}      M_{0,1}      ...  M_{0,l2−1}
  M_{1,0}      M_{1,1}      ...  M_{1,l2−1}
  ...          ...               ...
  M_{l1−1,0}   M_{l1−1,1}   ...  M_{l1−1,l2−1}

The meaning of this notation should be intuitively obvious. For a formal definition, we first introduce an auxiliary definition. For 0 ≤ i1 ≤ l1 and 0 ≤ i2 ≤ l2:

  r(i1) = ⟨Σ j : 0 ≤ j < i1 : row(M_{j,0})⟩
  c(i2) = ⟨Σ j : 0 ≤ j < i2 : col(M_{0,j})⟩

Let us call the composed matrix N. It is the matrix for which the following holds:

  row(N) = r(l1) ∧ col(N) = c(l2)
  ⟨∀ j1, j2 : 0 ≤ j1 < l1 ∧ 0 ≤ j2 < l2 : N[r(j1) .. r(j1 + 1))[c(j2) .. c(j2 + 1)) = M_{j1,j2}⟩

We will discuss two special cases of the matrix composition. First we have the composition of A and B side by side (which is only defined if row(A) = row(B)). This is known as the column concatenation and sometimes

written in the literature as A ⦶ B or [A B]. The other special case is the composition of A above B (only defined if col(A) = col(B)): the row concatenation, which is sometimes denoted in the literature by A ⊖ B or [A; B]. For a more extensive description of row and column concatenation and how these two operators can be used to define two-dimensional regular expressions and two-dimensional languages, we refer to [RS97].

Using the matrix composition, we can give an alternate definition for pref and suff, which is equivalent to the definitions given in section 1.0, but expressed in terms of the matrix composition, as opposed to indices:

  pref(M) = ⟨set A, B, C, D : M = [A B; C D] : A⟩
  suff(M) = ⟨set A, B, C, D : M = [A B; C D] : D⟩
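To make the two concatenation operators concrete, here is a minimal Python sketch, assuming matrices are represented as lists of rows; the function names are ours, not from the literature:

```python
def col_concat(A, B):
    """Column concatenation [A B]: glue B to the right of A.
    Only defined if row(A) = row(B)."""
    assert len(A) == len(B), "row counts must agree"
    return [ra + rb for ra, rb in zip(A, B)]

def row_concat(A, B):
    """Row concatenation [A; B]: glue B below A.
    Only defined if col(A) = col(B)."""
    assert (not A or not B) or len(A[0]) == len(B[0]), "column counts must agree"
    return A + B
```

For example, col_concat([[1], [2]], [[3], [4]]) yields [[1, 3], [2, 4]], while row_concat([[1, 2]], [[3, 4]]) yields [[1, 2], [3, 4]].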

1.2 One-dimensional pattern matching

In some of the two-dimensional pattern matching algorithms to be discussed, we will use a one-dimensional pattern matching algorithm (for example, on rows or columns of the text). When it is not relevant which one-dimensional pattern matching algorithm is used, we will refer to function PM1, with the following specification (for strings p and t):

  PM1(p, t) = ⟨set l, r : t = l p r : |l|⟩

In the same spirit we introduce the multipattern matching function MPM1, with the following specification (for a set of strings PS and string t):

  MPM1(PS, t) = ⟨set l, p, r : p ∈ PS ∧ t = l p r : (|l|, p)⟩
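As an executable reference for these specifications, here is a naive Python transcription; PM1 and MPM1 are our names for the specification functions, and any real implementation would use Knuth-Morris-Pratt, Aho-Corasick, or similar:

```python
def PM1(p, t):
    """All positions |l| with t = l p r, i.e. start positions of p in t."""
    m = len(p)
    return {l for l in range(len(t) - m + 1) if t[l:l + m] == p}

def MPM1(PS, t):
    """Multipattern variant: all pairs (|l|, p) with p in PS and t = l p r."""
    return {(l, p) for p in PS for l in PM1(p, t)}
```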


2 Problem

In normal (one-dimensional) pattern matching, the problem is to find all occurrences of a pattern p in a text t, where both p and t are strings over an alphabet Σ. In the two-dimensional case, instead of matching strings, we are matching two-dimensional arrays. Our pattern P and text T are matrices over Σ, with row(P) = m1, col(P) = m2, row(T) = n1 and col(T) = n2. In practical applications, the text is often a picture. However, we continue using the term text in this overview to emphasise the similarities to one-dimensional pattern matching.

The problem is to find all exact matches of pattern P in text T. More formally, our postcondition is R:

  O = ⟨set i1, i2 : 0 ≤ i1 ≤ n1 − m1 ∧ 0 ≤ i2 ≤ n2 − m2 ∧ T[i1 .. i1 + m1)[i2 .. i2 + m2) = P : (i1, i2)⟩

Note that we identify occurrences of the pattern by their upper-left corner. Since the problem we are focusing on is exact two-dimensional pattern matching, the size of each occurrence is the same and equal to the size of the pattern. Therefore, any point can be used to represent an occurrence. We have chosen the upper-left corner because it is convenient for most of the algorithms we will discuss, but we could just as easily have reported the lower-right corner, or any other point, instead.
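A direct, spec-level transcription of R into Python may help make the postcondition concrete; this is a sketch of the specification itself, not of any of the algorithms derived later:

```python
def occurrences(P, T):
    """Postcondition R: the set of upper-left corners (i1, i2) of all
    exact occurrences of P in T. P and T are non-empty lists of
    equal-length rows."""
    m1, m2, n1, n2 = len(P), len(P[0]), len(T), len(T[0])
    return {(i1, i2)
            for i1 in range(n1 - m1 + 1)
            for i2 in range(n2 - m2 + 1)
            if all(T[i1 + j1][i2 + j2] == P[j1][j2]
                   for j1 in range(m1) for j2 in range(m2))}
```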


3 Naive algorithm

3.0 Algorithm structure

The simplest way of establishing R is by checking, for all positions (i1, i2) in the text, whether there is a match starting at that position. We introduce a set of index pairs D ⊆ ℕ × ℕ. We will maintain the following invariant:

  P0 : O = ⟨set i1, i2 : (i1, i2) ∈ D ∧ T[i1 .. i1 + m1)[i2 .. i2 + m2) = P : (i1, i2)⟩

Invariant P0 is trivially established by the assignment O, D := ∅, ∅. We have established R when:

  D = [0, n1 − m1] × [0, n2 − m2]

Ad P0(D := D ∪ {(j1, j2)}):

    ⟨set i1, i2 : (i1, i2) ∈ D ∪ {(j1, j2)} ∧ T[i1 .. i1 + m1)[i2 .. i2 + m2) = P : (i1, i2)⟩
  =   { split off (i1, i2) = (j1, j2), P0 }
    O ∪ {(j1, j2)}   if T[j1 .. j1 + m1)[j2 .. j2 + m2) = P
    O                if T[j1 .. j1 + m1)[j2 .. j2 + m2) ≠ P
  =   { matchT,P(i1, i2) ≡ T[i1 .. i1 + m1)[i2 .. i2 + m2) = P }
    O ∪ {(j1, j2)}   if matchT,P(j1, j2)
    O                if ¬matchT,P(j1, j2)

Now we can give the so-called naive algorithm:

  O, D := ∅, ∅;
  for j1, j2 : 0 ≤ j1 ≤ n1 − m1 ∧ 0 ≤ j2 ≤ n2 − m2 →
    { inv. P0 }
    if matchT,P(j1, j2) → O := O ∪ {(j1, j2)}
    [] ¬matchT,P(j1, j2) → skip
    fi;
    D := D ∪ {(j1, j2)}
  rof

Note that the assignments to set D can be removed from the algorithm without harming its correctness; D is never inspected.

3.1 Match function

Recall our specification of matchT,P:

  matchT,P(i1, i2) ≡ T[i1 .. i1 + m1)[i2 .. i2 + m2) = P

Again, we introduce a set of index pairs, C ⊆ ℕ × ℕ. We will maintain the following invariant:

  Q0 : res ≡ ⟨∀ j1, j2 : (j1, j2) ∈ C : T[i1 + j1, i2 + j2] = P[j1, j2]⟩

Invariant Q0 is initially trivially established by res, C := true, ∅. When the following holds, we have established res = matchT,P(i1, i2):

  C = [0, m1⟩ × [0, m2⟩

Ad Q0(C := C ∪ {(k1, k2)}):

    ⟨∀ j1, j2 : (j1, j2) ∈ C ∪ {(k1, k2)} : T[i1 + j1, i2 + j2] = P[j1, j2]⟩
  ≡   { split off (j1, j2) = (k1, k2), Q0 }
    res ∧ T[i1 + k1, i2 + k2] = P[k1, k2]

The complete implementation becomes the following:

  func matchT,P(i1, i2 : integer) : boolean
  { pre: 0 ≤ i1 ≤ n1 − m1 ∧ 0 ≤ i2 ≤ n2 − m2 }
  { result: T[i1 .. i1 + m1)[i2 .. i2 + m2) = P }
  |[
    res, C := true, ∅;
    for k1, k2 : 0 ≤ k1 < m1 ∧ 0 ≤ k2 < m2 →
      res := res ∧ T[i1 + k1, i2 + k2] = P[k1, k2];
      C := C ∪ {(k1, k2)}
    rof;
    return(res)
  ]|

Note that we can improve on this algorithm: we can stop the computation as soon as res becomes false. We can also omit set C, since it is never inspected.
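A minimal Python version of matchT,P with both improvements applied (early exit, and no bookkeeping set C) could look as follows; the function name and the representation of matrices as lists of rows are our own:

```python
def match(T, P, i1, i2):
    """match_{T,P}(i1, i2): does P occur in T with upper-left corner
    (i1, i2)? Stops at the first mismatch, as suggested above."""
    for k1 in range(len(P)):
        for k2 in range(len(P[0])):
            if T[i1 + k1][i2 + k2] != P[k1][k2]:
                return False  # early exit: res would become false here
    return True
```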


4 Filter-based approach

4.0 The filter function

The idea is to reduce the two-dimensional pattern matching problem to normal (one-dimensional) pattern matching, by means of a filter function. More specifically, we want to reduce the problem to matching occurrences of a pattern string p in the columns of a matrix t, where a detected match corresponds to a (possible) occurrence of our pattern P in text T. In order to do this, we reduce each row of P to a single value, using our filter function. Then p is simply the concatenation of these values. Note that the two-dimensional pattern matching problem is symmetrical in both dimensions. We could just as well reduce columns of the pattern to a single value and then search for their occurrence in the rows of the text. (This incidentally corresponds to applying our method to the transposed pattern and text.)

We introduce column vector p of length m1 over X (we will later implicitly interpret p as a string), matrix t over X, of size n1 × (n2 − m2 + 1), and function fP : Σ^{m2} → X (where X is some still unspecified set), with the following relationship:

  p[i] = fP(P[i])                        (0 ≤ i < m1)
  t[i1, i2] = fP(T[i1][i2 .. i2 + m2))   (0 ≤ i1 < n1, 0 ≤ i2 ≤ n2 − m2)

We write the subscript P in fP because in some (but not all) of the algorithms that we will describe, the value of fP will depend on P. Now we have:

    T[i1][i2 .. i2 + m2) = P[i]
  ⇒   { fP is a function }
    fP(T[i1][i2 .. i2 + m2)) = fP(P[i])
  ≡   { spec. p, t }
    t[i1, i2] = p[i]

Therefore, we can conclude:

  T[i1 .. i1 + m1)[i2 .. i2 + m2) = P ⇒ t[i1 .. i1 + m1)[i2] = p        (0)

We can now use one-dimensional pattern matching techniques for matching p against the columns of t. As we have seen, there can only be a match of P in T at those positions where p and t match. On the other hand, in the general case, when we detect a match of p in t, we still need to check whether there is an actual match of P at that position in T. To do this, we can use the function matchT,P, as described in section 3.1.

We introduce the following invariant:

  P0 : O = ⟨set i1, i2 : (i1, i2) ∈ D ∧ T[i1 .. i1 + m1)[i2 .. i2 + m2) = P : (i1, i2)⟩

This is the same invariant we used in the naive algorithm (see section 3.0). Termination: when D = [0, n1 − m1] × [0, n2 − m2]. However, our update of D will be slightly different; instead of adding one pair of indices to D at a time, we will add [0, n1 − m1] × {j}.

    ⟨set i1, i2 : (i1, i2) ∈ D ∪ ([0, n1 − m1] × {j}) ∧ T[i1 .. i1 + m1)[i2 .. i2 + m2) = P : (i1, i2)⟩
  =   { split off: [0, n1 − m1] × {j}, use: P0 }
    O ∪ ⟨set i1 : 0 ≤ i1 ≤ n1 − m1 ∧ T[i1 .. i1 + m1)[j .. j + m2) = P : (i1, j)⟩
  =   { (0) }
    O ∪ ⟨set i1 : 0 ≤ i1 ≤ n1 − m1 ∧ t[i1 .. i1 + m1)[j] = p ∧ T[i1 .. i1 + m1)[j .. j + m2) = P : (i1, j)⟩
  =   { spec. PM1, spec. matchT,P }
    O ∪ ⟨set i1 : i1 ∈ PM1(p, t[0 .. n1)[j]) ∧ matchT,P(i1, j) : (i1, j)⟩

Our algorithm now becomes:

  construct p;
  construct t;
  O, D := ∅, ∅;
  for j : 0 ≤ j ≤ n2 − m2 →
    { inv. P0 }
    for i : i ∈ PM1(p, t[0 .. n1)[j]) →
      if matchT,P(i, j) → O := O ∪ {(i, j)}
      [] ¬matchT,P(i, j) → skip
      fi
    rof;
    D := D ∪ ([0, n1 − m1] × {j})
  rof

Again, set D can be omitted without loss of correctness. Note that, depending on the choice of filter function, i ∈ PM1(p, t[0 .. n1)[j]) may imply that certain elements of P and T are equal, rendering some or all of the comparisons in matchT,P(i, j) unnecessary.

Now we only need to decide how to construct p and t, which depends on our choice of fP. Depending on this choice, we can get several different algorithms. The simplest filter function is, for x ∈ Σ^{m2} and an arbitrary value a:

  fP(x) = a

This essentially results in the naive algorithm, as described in section 3. In the remainder of this section we will investigate several other filter functions, which will give rise to different algorithms.
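The following Python sketch captures the structure of this filter-based algorithm, parameterised by the filter function fP and by any one-dimensional matcher pm1 with the PM1 interface; the names and the list-of-rows representation are assumptions of this sketch:

```python
def filter_match(P, T, fP, pm1):
    """Generic filter-based matcher. fP maps a length-m2 row window
    (a tuple of symbols) to a filter value; pm1(p, t) returns the
    start positions of list p in list t."""
    m1, m2, n1, n2 = len(P), len(P[0]), len(T), len(T[0])
    p = [fP(tuple(P[i])) for i in range(m1)]                  # reduced pattern
    O = set()
    for j in range(n2 - m2 + 1):
        col = [fP(tuple(T[i][j:j + m2])) for i in range(n1)]  # column j of t
        for i in pm1(p, col):
            # verify the candidate with the brute-force check match_{T,P}
            if all(T[i + k][j:j + m2] == list(P[k]) for k in range(m1)):
                O.add((i, j))
    return O
```

With the trivial filter fP(x) = a this degenerates into the naive algorithm, exactly as noted above.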

4.1 One-dimensional pattern matching in one direction

Another very simple choice for fP is the following, for all x ∈ Σ^{m2} and some i, 0 ≤ i < m2:

  fP(x) = x[i]

Here, the codomain of function fP is Σ. Obviously this is only a valid choice if we assume 0 < m2; that is, our pattern P is not the empty matrix E. What this boils down to is searching for one column of the pattern in the text's columns using a one-dimensional pattern matching technique. If such a column is found, we use a brute-force check to match the other columns.

Note that this filter function's values do not depend on the first i and the last m2 − 1 − i columns of both pattern P and text T.

4.2 Efficient computation and storage of the reduced text

As we have seen, the columns of matrix t are inspected one by one in our filter-based algorithms. The order in which they are inspected has been left unspecified so far. In the remaining filter-based algorithms, it is possible to efficiently compute the values of column i2 + 1 from those of column i2. So we will decide to inspect the columns of t in increasing order. In this case, it is not even necessary to precompute and store the entire matrix t; we can simply precompute the first column and then compute the next column on the fly. Recall our relationship between t and fP:

  t[i1, i2] = fP(T[i1][i2 .. i2 + m2))   (0 ≤ i1 < n1, 0 ≤ i2 ≤ n2 − m2)

We can make this improvement whenever the value of fP(x b) (for x ∈ Σ^{m2−1} and b ∈ Σ) can be expressed in terms of fP(a x), a and b (for any a ∈ Σ). In other words, when we can find a function gP : X × Σ × Σ → X, with the following property:

  fP(x b) = gP(fP(a x), a, b)

To replace t, we introduce s[0 .. n1), which has the following relationship with t (for 0 ≤ i < n1):

  s[i] = t[i, j]

That is, we have the following invariant:

  s[i] = fP(T[i][j .. j + m2))        (1)

Then our algorithm becomes:

  construct p;
  construct initial value of s;
  O := ∅;
  j := 0;
  do j ≠ n2 − m2 →
    for i : i ∈ PM1(p, s) →
      if matchT,P(i, j) → O := O ∪ {(i, j)}
      [] ¬matchT,P(i, j) → skip
      fi
    rof;
    for i : 0 ≤ i < n1 →
      s[i] := gP(s[i], T[i, j], T[i, j + m2])
    rof;
    j := j + 1
  od;
  for i : i ∈ PM1(p, s) →
    if matchT,P(i, j) → O := O ∪ {(i, j)}
    [] ¬matchT,P(i, j) → skip
    fi
  rof

The initial value of s[i] is fP(T[i][0 .. m2)), for 0 ≤ i < n1. To avoid inspection of the elements of the (nonexistent) column n2 of the text, we have peeled off the last iteration of the main repetition here. This is only possible if m2 < n2 (the text is wider than the pattern).

This approach, using function gP, allows us to exploit the possibility of efficiently computing the next required values of fP. It also provides a space improvement: we replaced matrix t of size n1 × (n2 − m2 + 1) by vector s of size n1. This improvement was included in the original description ([TZ89, TZ94]) of the Takaoka-Zhu algorithm (which is the filter-based algorithm we will discuss in section 4.4). In a recently published paper, [MZ05], Bořivoj Melichar and Jan Žďárek propose the same space improvement for the Baker and Bird algorithm (discussed in section 4.3). We had already generalised the improvement to be applied to Baker and Bird's algorithm, independently of Melichar and Žďárek.
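A Python sketch of this space-improved variant, assuming fP and gP satisfy property (1)'s sliding relationship and pm1 implements PM1; as before, the names are ours:

```python
def filter_match_sliding(P, T, fP, gP, pm1):
    """Filter matcher that stores only the current reduced column s
    and advances it with gP, where fP(x b) = gP(fP(a x), a, b)."""
    m1, m2, n1, n2 = len(P), len(P[0]), len(T), len(T[0])
    p = [fP(tuple(P[i])) for i in range(m1)]
    s = [fP(tuple(T[i][0:m2])) for i in range(n1)]   # column 0 of t
    O = set()
    for j in range(n2 - m2 + 1):
        for i in pm1(p, s):
            if all(T[i + k][j:j + m2] == list(P[k]) for k in range(m1)):
                O.add((i, j))
        if j < n2 - m2:                               # slide one column right
            for i in range(n1):
                s[i] = gP(s[i], T[i][j], T[i][j + m2])
    return O
```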

4.3 Baker and Bird

The idea is to construct the optimal Aho-Corasick automaton (introduced in [AC75]), where the pattern set contains the rows of P, and to use this automaton's states as the result values of function fP. More formally, our pattern set is the set PR (pattern rows), defined by:

  PR = ⟨set i : 0 ≤ i < m1 : P[i]⟩

The optimal Aho-Corasick automaton based on pattern row set PR is a deterministic finite automaton (DFA): (Q, Σ, δ, q0, F). We will not go into the details of constructing such an automaton; instead we refer to [WZ92] (pages 11-16), or to [WZ93].⁰ Our filter function becomes:

  fP(x) = δ*(q0, x)

In this case, the codomain of fP is state set Q. For the computation of δ*(q0, x) we can use the following program fragment:

  j, r := 0, q0;
  do j ≠ |x| →
    r := δ(r, x[j]);
    j := j + 1
  od
  { r = δ*(q0, x) }

We introduce the following abbreviation for this program fragment:

0 In these articles, a Moore machine is presented. However, we do not need the properties of a Moore machine here; a DFA suffices for our purposes. In fact, we do not even need the set of final states F.


  r := δ*(q0, x)
  { r = δ*(q0, x) }

Vector p

    p[i]
  =   { spec. p, def. fP }
    δ*(q0, P[i])

The program for 'construct p' simply becomes the following:

  for i : 0 ≤ i < m1 →
    p[i] := δ*(q0, P[i])
  rof

Vector s

The initial computation of s is very similar to that of p, since our specification gives:

  s[i] = δ*(q0, T[i][0 .. m2))

Now, for our update of s, we will try to find a function gP satisfying the following property, for x ∈ Σ^{m2−1} and a, b ∈ Σ (see section 4.2):

  fP(x b) = gP(fP(a x), a, b)

    fP(x b)
  =   { def. fP }
    δ*(q0, x b)
  =   { m2 ≤ |x b|, δ*(q, a x b) = δ*(q, x b) }
    δ*(q0, a x b)
  =   { δ* }
    δ(δ*(q0, a x), b)
  =   { def. fP }
    δ(fP(a x), b)

So we define:

  gP(q, a, b) = δ(q, b)

Note that the value of gP is independent of parameter a in this case. In order to prove that the preceding derivation is correct, we still need to prove:

  δ*(q, a x) = δ*(q, x), for m2 ≤ |x|

Here we will use that our state set Q is in fact P(pref(PR)) and the following definition for state q:

  q = suff(w_q) ∩ pref(PR)

In this definition, w_q is the string that consists of the labels on the shortest path from q0 to q.

    δ*(q, a x)
  =   { δ* }
    suff(w_q · a x) ∩ pref(PR)
  =   { theorem B.1 (on page 92) }
    (suff(w_q) · a x ∪ suff(x)) ∩ pref(PR)
  =   { m2 < |a x|, therefore: suff(w_q) · a x ∩ pref(PR) = ∅; distributivity }
    suff(x) ∩ pref(PR)
  =   { x ∈ suff(x) }
    ({x} ∪ suff(x)) ∩ pref(PR)
  =   { m2 ≤ |x|, therefore: suff(w_q) · x ∩ pref(PR) = {x} ∩ pref(PR); distributivity }
    (suff(w_q) · x ∪ suff(x)) ∩ pref(PR)
  =   { theorem B.0 (on page 92) }
    suff(w_q · x) ∩ pref(PR)
  =   { δ* }
    δ*(q, x)

Therefore, the call to matchT,P in our main algorithm is not necessary; when we detect a match of p in s we can immediately conclude that there is a match of P in T . We will give the complete algorithm one more time. for i : 0 i < m1 p[i] := (q0 , P [i]) rof ; for i : 0 i < n1 s[i] := (q0 , T [i][0 .. m2 )) rof ; O := ?; j := 0; do j = n2 m2 for i : i P M1 (p, s) O := O {(i, j)} rof ; for i : 0 i < n1 s[i] := (s[i], T [i, j + m2 ]) rof ; j := j + 1 od; for i : i P M1 (p, s) O := O {(i, j)} rof This algorithm was rst discovered, independently, by Bird ([Bir77]) and Baker ([Bak78]). Another presentation is given in [CR02]. In the original presentations, the method for one-dimensional pattern matching was explicitly selected: the Knuth-Morris-Pratt algorithm ([KMP77]), which is the single-string version of the Aho-Corasick algorithm ([AC75]). I have not explicitly chosen an algorithm here, instead using P M1 .

4.4 Takaoka and Zhu

Here we choose for fP the hash function, as used in Karp and Rabin's one-dimensional pattern matching algorithm ([KR87]). For x ∈ Σ^{m2} we define:

  fP(x) = ⟨Σ j : 0 ≤ j < m2 : ord(x[j]) · |Σ|^{m2−1−j}⟩ mod q        (2)

Here, q is some large prime number. The function ord is a bijective function Σ → [0 .. |Σ|⟩; that is, a function for which the following holds for all a, b ∈ Σ:

  ord(a) = ord(b) ≡ a = b

An interesting special case of this choice of fP is the one where Σ is equal to {0, 1} (and therefore |Σ| = 2) and ord is simply the identity function. In this case, our text and pattern are matrices over bits. The filter function fP will then look like this:

  fP(x) = ⟨Σ j : 0 ≤ j < m2 : x[j] · 2^{m2−1−j}⟩ mod q

Because we are multiplying by (powers of) 2, this may lead to efficient implementations of the algorithm (possibly using bit parallelism). In most practical applications, converting any given matrix into such a form is rather easy. Each element in the original matrix is represented by a subrow (or subcolumn) in our new bit matrix. Using this technique, the size of our matrix increases by the number of bits necessary to represent all values of the original alphabet. In practice we do not need to duplicate the entire matrix; we can use its internal bit representation. In [KR87], Karp and Rabin examine the same special case of Σ = {0, 1} for their one-dimensional pattern matching algorithm as well. For the remainder of this section, we will consider the more general definition of fP, as given in (2).

Vector p

We will need to compute p[i], for all i (0 ≤ i < m1). The postcondition for p[i] := fP(P[i]) is the following:

  p[i] = ⟨Σ j : 0 ≤ j < m2 : ord(P[i, j]) · |Σ|^{m2−1−j}⟩ mod q

We introduce the following invariants:

  Q0 : 0 ≤ k ≤ m2
  Q1 : p[i] = ⟨Σ j : 0 ≤ j < k : ord(P[i, j]) · |Σ|^{k−1−j}⟩ mod q

Ad Q1(k := k + 1):

    ⟨Σ j : 0 ≤ j < k + 1 : ord(P[i, j]) · |Σ|^{k+1−1−j}⟩ mod q
  =   { split off: j = k }
    (⟨Σ j : 0 ≤ j < k : ord(P[i, j]) · |Σ|^{k+1−1−j}⟩ + ord(P[i, k]) · |Σ|⁰) mod q
  =   { math }
    (⟨Σ j : 0 ≤ j < k : ord(P[i, j]) · |Σ|^{k−1−j}⟩ · |Σ| + ord(P[i, k])) mod q
  =   { (a · b + c) mod q = ((a mod q) · b + c) mod q (theorem A.2 on page 90), Q1 }
    (p[i] · |Σ| + ord(P[i, k])) mod q

Our algorithm 'construct p' becomes:

  for i : 0 ≤ i < m1 →
    k, p[i] := 0, 0;
    do k ≠ m2 →
      p[i] := (p[i] · |Σ| + ord(P[i, k])) mod q;
      k := k + 1
    od
  rof

Because our update of p[i] is always computed modulo q, p[i] < q is an invariant of the computation. Therefore we know that all intermediate results in the computation will be at most q · |Σ|.

Vector s

The initial value of s can be computed using an algorithm similar to 'construct p'. For the update of s, we have the following derivation (for x ∈ Σ^{m2−1} and a, b ∈ Σ):

    fP(x b)
  =   { def. fP }
    ⟨Σ j : 0 ≤ j < m2 : ord((x b)[j]) · |Σ|^{m2−1−j}⟩ mod q
  =   { split off: j = m2 − 1, use: |x b| = m2 }
    (⟨Σ j : 0 ≤ j < m2 − 1 : ord(x[j]) · |Σ|^{m2−1−j}⟩ + ord(b)) mod q
  =   { dummy transformation: j := j − 1 }
    (⟨Σ j : 1 ≤ j < m2 : ord(x[j − 1]) · |Σ|^{m2−j}⟩ + ord(b)) mod q
  =   { domain expansion: j = 0, use: |a x| = m2 }
    (⟨Σ j : 0 ≤ j < m2 : ord((a x)[j]) · |Σ|^{m2−j}⟩ − ord(a) · |Σ|^{m2} + ord(b)) mod q
  =   { math }
    (⟨Σ j : 0 ≤ j < m2 : ord((a x)[j]) · |Σ|^{m2−1−j}⟩ · |Σ| − ord(a) · |Σ|^{m2} + ord(b)) mod q
  =   { theorem A.2 (on page 90) }
    ((⟨Σ j : 0 ≤ j < m2 : ord((a x)[j]) · |Σ|^{m2−1−j}⟩ mod q) · |Σ| − ord(a) · |Σ|^{m2} + ord(b)) mod q
  =   { def. fP }
    (fP(a x) · |Σ| − ord(a) · |Σ|^{m2} + ord(b)) mod q

So we get the following definition for gP (we write its first parameter as h, to avoid a clash with the prime number q):

  gP(h, a, b) = (h · |Σ| − ord(a) · |Σ|^{m2} + ord(b)) mod q

This corresponds to the rehash function from the original article by Takaoka and Zhu ([TZ89, TZ94]). Note that |Σ|^{m2} is a constant and can therefore be precomputed.

Remarks

We present the complete algorithm.

  for i : 0 ≤ i < m1 →
    k, p[i] := 0, 0;
    do k ≠ m2 →
      p[i] := (p[i] · |Σ| + ord(P[i, k])) mod q;
      k := k + 1
    od
  rof;
  for i : 0 ≤ i < n1 →
    k, s[i] := 0, 0;
    do k ≠ m2 →
      s[i] := (s[i] · |Σ| + ord(T[i, k])) mod q;
      k := k + 1
    od
  rof;
  O := ∅;
  j := 0;
  do j ≠ n2 − m2 →
    for i : i ∈ PM1(p, s) →
      if matchT,P(i, j) → O := O ∪ {(i, j)}
      [] ¬matchT,P(i, j) → skip
      fi
    rof;
    for i : 0 ≤ i < n1 →
      s[i] := (s[i] · |Σ| − ord(T[i, j]) · |Σ|^{m2} + ord(T[i, j + m2])) mod q
    rof;
    j := j + 1
  od;
  for i : i ∈ PM1(p, s) →
    if matchT,P(i, j) → O := O ∪ {(i, j)}
    [] ¬matchT,P(i, j) → skip
    fi
  rof

This algorithm is presented in [TZ89, TZ94]. In the original article, the authors wrote the following update of s[i], which differs slightly from the one presented above:¹

  s[i] := ((s[i] + |Σ| · q − ord(T[i, j]) · (|Σ|^{m2−1} mod q)) · |Σ| + ord(T[i, j + m2])) mod q

However, we can prove that the two expressions are equal, using theorem A.2.

    ((s[i] + |Σ| · q − ord(T[i, j]) · (|Σ|^{m2−1} mod q)) · |Σ| + ord(T[i, j + m2])) mod q
  =   { · distributes over + }
    (s[i] · |Σ| + |Σ|² · q − ord(T[i, j]) · (|Σ|^{m2−1} mod q) · |Σ| + ord(T[i, j + m2])) mod q
  =   { (|Σ|² · q) mod q = 0 }
    (s[i] · |Σ| − ord(T[i, j]) · (|Σ|^{m2−1} mod q) · |Σ| + ord(T[i, j + m2])) mod q
  =   { theorem A.2, math }
    (s[i] · |Σ| − ord(T[i, j]) · |Σ|^{m2} + ord(T[i, j + m2])) mod q

The differences between the two expressions are the following:

- Where we suggested that the value of |Σ|^{m2} can be precomputed, Takaoka and Zhu chose to precompute |Σ|^{m2−1} mod q and then multiply it by |Σ|. The advantage of storing the constant modulo q is that its value is at most q. In an implementation, this can be useful for preventing an overflow. By replacing the constant |Σ|^{m2} with |Σ|^{m2} mod q in the algorithm text we presented, we can achieve the same advantage.
- Takaoka and Zhu included an extra term |Σ| · q (or, more accurately, |Σ|² · q). It is possible that this term was included to ensure that the computation only has nonnegative intermediate results. (The only negative term in the expression is −ord(T[i, j]) · (|Σ|^{m2−1} mod q). We know: ord(T[i, j]) < |Σ| and |Σ|^{m2−1} mod q < q, so ord(T[i, j]) · (|Σ|^{m2−1} mod q) < |Σ| · q.)

Another difference from the original article is that the authors, like Baker and Bird, explicitly chose the Knuth-Morris-Pratt algorithm ([KMP77]) for one-dimensional pattern matching, where we referred to PM1.
1 In an attempt to improve readability, we have used brackets of different sizes in the expression.
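For concreteness, the following Python sketch builds the pair (fP, gP) of equation (2) and the rehash function, precomputing |Σ|^{m2} mod q as discussed in the first remark; the names and interface are our own, and the resulting functions plug directly into a sliding filter loop such as the one in section 4.2:

```python
def make_tz_filters(sigma, m2, q, ord_):
    """sigma = |Sigma|, q a large prime, ord_ a bijection Sigma -> [0, sigma).
    Returns (fP, gP) for the Takaoka-Zhu hash filter."""
    shift = pow(sigma, m2, q)          # |Sigma|^m2 mod q, precomputed
    def fP(x):                         # Horner evaluation of (2)
        h = 0
        for a in x:
            h = (h * sigma + ord_(a)) % q
        return h
    def gP(h, a, b):                   # rehash: drop symbol a, append b
        return (h * sigma - ord_(a) * shift + ord_(b)) % q
    return fP, gP
```

Python's % operator already yields a nonnegative result, so the extra |Σ| · q term of the original article is not needed here.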


4.5 Generalisations

We can easily use these filter-based techniques for two-dimensional multipattern matching, where we have k pattern matrices: P0, ..., P_{k−1}. We can use our filter function fP to reduce each of these patterns to a column vector p_i. Then we use one-dimensional multipattern matching to match these column vectors with the reduced text. To apply fP to all rows of all patterns, the pattern matrices need to have the same width (number of columns). However, their lengths (that is: number of rows) may vary; one-dimensional multipattern matching strategies do not depend on the patterns all being of the same length.

Pattern matching in more than two dimensions is also possible. For matching in n + 1 dimensions (1 ≤ n), we reduce our pattern to an n-dimensional matrix using our filter function fP and then apply an n-dimensional pattern matching technique for matching the reduced pattern and the reduced text.


5 Baeza-Yates and Régnier

5.0 Algorithm structure

In section 4 we described algorithms where we reduce an entire row of the pattern to a single value, in order to be able to use one-dimensional pattern matching techniques on the resulting column vector of values. Here we will present a different kind of filter algorithm. It is based on the knowledge that every match of pattern P in text T intersects with exactly one row of T with a number of the form i · m1 − 1, since our pattern has exactly m1 rows. See figure 1.

[Figure 1: Every match intersects with a row i · m1 − 1]

In this algorithm description, we will use the set PR (as in Baker and Bird, in section 4.3), which is the set of rows of the pattern P:

  PR = ⟨set i : 0 ≤ i < m1 : P[i]⟩

Recall our postcondition R:

  O = ⟨set i1, i2 : 0 ≤ i1 ≤ n1 − m1 ∧ 0 ≤ i2 ≤ n2 − m2 ∧ T[i1 .. i1 + m1)[i2 .. i2 + m2) = P : (i1, i2)⟩

We start our derivation with the predicate 'a match of P in T exists':

    ⟨∃ i1, i2 : 0 ≤ i1 ≤ n1 − m1 ∧ 0 ≤ i2 ≤ n2 − m2 : T[i1 .. i1 + m1)[i2 .. i2 + m2) = P⟩
  ⇒   { i1 ≤ (i1 div m1 + 1) · m1 − 1 < i1 + m1 (see theorem A.3 on page 90); PR contains all rows of P }
    ⟨∃ i1, i2 : 0 ≤ i1 ≤ n1 − m1 ∧ 0 ≤ i2 ≤ n2 − m2 : T[(i1 div m1 + 1) · m1 − 1][i2 .. i2 + m2) ∈ PR⟩
  ≡   { specification MPM1 }


    ⟨∃ i1 : 0 ≤ i1 ≤ n1 − m1 : MPM1(PR, T[(i1 div m1 + 1) · m1 − 1]) ≠ ∅⟩
  ⇒   { property ∃; dummy transformation: i = i1 div m1 + 1 }
    ⟨∃ i : 1 ≤ i ≤ n1 div m1 : MPM1(PR, T[i · m1 − 1]) ≠ ∅⟩

For 0 ≤ k < m1, 1 ≤ i ≤ n1 div m1 and 0 ≤ j ≤ n2 − m2:

    (j, P[k]) ∈ MPM1(PR, T[i · m1 − 1])
  ≡   { specification MPM1 }
    (j, P[k]) ∈ ⟨set l, p, r : p ∈ PR ∧ T[i · m1 − 1] = l p r : (|l|, p)⟩
  ≡   { set calculus, |P[k]| = m2, P[k] ∈ PR }
    T[i · m1 − 1][j .. j + m2) = P[k]
  ⇐   { (i + 1) · m1 − 1 − k ≤ n1 }
    T[i · m1 − 1 − k .. (i + 1) · m1 − 1 − k)[j .. j + m2) = P

If the above assumption, (i + 1) · m1 − 1 − k ≤ n1, does not hold, then there can also be no match starting at row i · m1 − 1 − k, because it would not fit within the bounds of the text. So our algorithm will be of the following form:

  for i : 1 ≤ i ≤ n1 div m1 →
    for j, p : (j, p) ∈ MPM1(PR, T[i · m1 − 1]) →
      O := O ∪ ⟨set k : 0 ≤ k < m1 ∧ (i + 1) · m1 − 1 − k ≤ n1 ∧ P[k] = p ∧
                T[i · m1 − 1 − k .. (i + 1) · m1 − 1 − k)[j .. j + m2) = P : (i · m1 − 1 − k, j)⟩
    rof
  rof

Note that the multipattern matching function MPM1 is always called with pattern set PR as its first parameter. Any precomputation that depends on PR, necessary for the multipattern matching algorithm, needs to occur only once. A simple implementation of this algorithm is the following:

  for i : 1 ≤ i ≤ n1 div m1 →
    for j, p : (j, p) ∈ MPM1(PR, T[i · m1 − 1]) →
      for k : 0 ≤ k < m1 ∧ (i + 1) · m1 − 1 − k ≤ n1 →
        if P[k] = p →
          if matchT,P(i · m1 − 1 − k, j) → O := O ∪ {(i · m1 − 1 − k, j)}
          [] ¬matchT,P(i · m1 − 1 − k, j) → skip
          fi
        [] P[k] ≠ p → skip
        fi
      rof
    rof
  rof
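A Python sketch of this simple implementation, using any multipattern matcher mpm1 with the MPM1 interface; the names and list-of-rows representation are ours:

```python
def baeza_yates_regnier(P, T, mpm1):
    """Scan only rows i*m1 - 1 of T; every reported pattern-row
    occurrence spawns candidate matches that are verified brute-force."""
    m1, m2, n1, n2 = len(P), len(P[0]), len(T), len(T[0])
    PR = {tuple(row) for row in P}
    O = set()
    for i in range(1, n1 // m1 + 1):
        for j, p in mpm1(PR, tuple(T[i * m1 - 1])):
            for k in range(m1):
                r = i * m1 - 1 - k                    # candidate top row
                if (i + 1) * m1 - 1 - k <= n1 and tuple(P[k]) == p:
                    if all(tuple(T[r + h][j:j + m2]) == tuple(P[h])
                           for h in range(m1)):
                        O.add((r, j))
    return O
```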

5.1 Baeza-Yates and Régnier's CheckMatch approach

While the algorithm we presented is correct, the call to matchT,P can lead to the same comparisons being repeated several times. For instance, from our call to MPM1 and the guard P[k] = p we can conclude that T[i · m1 − 1][j .. j + m2) = P[k], yet each call to matchT,P will repeat the comparison. Besides that, if more than one row of P is equal to string p, the matchT,P function is called multiple times and there is some overlap in the rows of T that are being matched each time. This should be obvious if we look at the inner repetition of the algorithm, with an implementation of matchT,P filled in:

  for k : 0 ≤ k < m1 ∧ (i + 1) · m1 − 1 − k ≤ n1 →
    if P[k] = p →
      h, res := 0, true;
      do h ≠ m1 ∧ res →
        { inv.: res ≡ T[i · m1 − 1 − k .. i · m1 − 1 − k + h)[j .. j + m2) = P[0 .. h) }
        res := T[i · m1 − 1 − k + h][j .. j + m2) = P[h];
        h := h + 1
      od;
      if res → O := O ∪ {(i · m1 − 1 − k, j)}
      [] ¬res → skip
      fi
    [] P[k] ≠ p → skip
    fi
  rof

This matchT,P implementation differs slightly from the one presented in section 3.1. Here we consider entire rows in a fixed order, instead of single elements in an unspecified order. We also terminate the computation as soon as a mismatch is discovered, as suggested at the end of section 3.1.

To avoid the aforementioned unnecessary comparisons, we will keep a record of the rows of T that have already been matched against the rows of P. To do this, we first introduce a unique index for each row of the pattern P. We introduce function g : Σ* → ℕ:

  g(x) = ⟨↓ l : 0 ≤ l < m1 ∧ x = P[l] : l⟩   if x ∈ PR        (3)
  g(x) = m1                                   if x ∉ PR

For strings that occur as a row in the pattern P, the g-value is equal to the row number of their first occurrence. This is a unique numbering for these strings; that is, for all x, y ∈ PR:

  x = y ≡ g(x) = g(y)

In fact, we have the following property for g:

Theorem 5.0 For strings x and y, with g(y) ≠ m1, we have: x = y ≡ g(x) = g(y)

Proof Property x = y ⇒ g(x) = g(y) follows directly from the fact that g is a function. We only need to prove g(x) = g(y) ⇒ x = y, assuming g(y) ≠ m1.

Case: g(x) = m1.


[Figure 2: Submatrix of the text, inspected when a match occurs in row i · m1 − 1]

    g(x) = g(y)
  ⇒   { g(x) = m1, g(y) ≠ m1 }
    false
  ⇒   { predicate calculus }
    x = y

Case: g(x) ≠ m1.

    g(x) = g(y)
  ⇒   { g(x) ≠ m1, g(y) ≠ m1, spec. g }
    ⟨∃ l : 0 ≤ l < m1 : x = P[l] ∧ y = P[l]⟩
  ⇒   { distributivity, transitivity = }
    x = y

We can precompute the g-values for the rows of the pattern. We introduce auxiliary array r[0 .. m1) for this purpose, with (for 0 ≤ h < m1): r[h] = g(P[h]). We get:

  r[h] = ⟨↓ l : 0 ≤ l < m1 ∧ P[h] = P[l] : l⟩        (4)

Now, we note that the computation in the for-k loop attempts matches in the following submatrix of the text: T[(i − 1) · m1 .. (i + 1) · m1 − 1)[j .. j + m2) (a band surrounding row T[i · m1 − 1]; see figure 2). This submatrix has 2 · m1 − 1 rows. We introduce array f[0 .. 2 · m1 − 1) to store the results of the row comparisons that have already been made. We will maintain the following invariant:

  ⟨∀ h : 0 ≤ h < 2 · m1 − 1 : f[h] = ⊥ ∨ f[h] = g(T[(i − 1) · m1 + h][j .. j + m2))⟩

If f[h] = ⊥, that means that the corresponding value of g has not been computed yet. Initially f[h] = ⊥, for all h (0 ≤ h < 2 · m1 − 1). Now we have, for 0 ≤ h < 2 · m1 − 1 and 0 ≤ l < m1, assuming f[h] ≠ ⊥:

    T[(i − 1) · m1 + h][j .. j + m2) = P[l]
  ≡   { theorem 5.0 (page 34) }
    g(T[(i − 1) · m1 + h][j .. j + m2)) = g(P[l])
  ≡   { f[h] ≠ ⊥, invariant, spec. r }
    f[h] = r[l]

The algorithm now becomes:

  for h : 0 ≤ h < 2 · m1 − 1 → f[h] := ⊥ rof;
  { T[i · m1 − 1][j .. j + m2) = p }
  f[m1 − 1] := g(p);
  for k : 0 ≤ k < m1 ∧ (i + 1) · m1 − 1 − k ≤ n1 →
    { inv. ⟨∀ h : 0 ≤ h < 2 · m1 − 1 : f[h] = ⊥ ∨ f[h] = g(T[(i − 1) · m1 + h][j .. j + m2))⟩ }
    if r[k] = g(p) → { P[k] = p }
      h, res := 0, true;
      do h ≠ m1 ∧ res →
        { inv.: res ≡ T[i · m1 − 1 − k .. i · m1 − 1 − k + h)[j .. j + m2) = P[0 .. h) }
        if f[m1 − 1 − k + h] = ⊥ →
          f[m1 − 1 − k + h] := g(T[i · m1 − 1 − k + h][j .. j + m2))
        [] f[m1 − 1 − k + h] ≠ ⊥ → skip
        fi;
        { f[m1 − 1 − k + h] ≠ ⊥ }
        res := f[m1 − 1 − k + h] = r[h];
        h := h + 1
      od;
      if res → O := O ∪ {(i · m1 − 1 − k, j)}
      [] ¬res → skip
      fi
    [] r[k] ≠ g(p) → { P[k] ≠ p }
      skip
    fi
  rof

This is essentially the so-called CheckMatch function in Baeza-Yates and Régnier's algorithm, as presented in [BYR93]. The difference is that we have not specified how to compute the g-function yet. We will get to that later, in section 5.4. Note that string variable p by itself is now no longer relevant; we are only interested in its index, g(p).

We should note that we added f[m1 − 1] := g(p) to the initialisation of f. We use the value of g(p) in several other places in the algorithm (and f[m1 − 1] is likely to be inspected), so we may as well initialise that value of f immediately.

5.2 Inspecting fewer pattern rows

In the version of the repetition presented in section 5.1, we examine, for every value of k in the range [0 ↑ ((i + 1) · m1 − 1 − n1), m1⟩ (where ↑ denotes maximum), whether row k of the pattern matches string p. If so, we attempt to match the surrounding rows of the text with the corresponding rows of the pattern and conclude whether or not there is a match of our pattern in the text. However, after such an attempted match, we may be able to use the last inspected value of f to conclude that


we can safely skip a number of values of k. This can speed up the computation and prevent some unnecessary inspections of the text. In the previous section we left the order in which the values of k are examined unspecified, by writing a for-statement. An easy way to incorporate the idea of skipping certain subranges is to consider the values of k in increasing order and, when possible, increase k by more than 1. If we replace the for-statement by a do-statement we get the following algorithm:

  for h : 0 ≤ h < 2 · m1 − 1 → f[h] := ⊥ rof;
  f[m1 − 1] := g(p);
  k := 0 ↑ ((i + 1) · m1 − 1 − n1);
  do k < m1 →
    if r[k] = g(p) →
      h, res := 0, true;
      do h ≠ m1 ∧ res →
        if f[m1 − 1 − k + h] = ⊥ →
          f[m1 − 1 − k + h] := g(T[i · m1 − 1 − k + h][j .. j + m2))
        [] f[m1 − 1 − k + h] ≠ ⊥ → skip
        fi;
        res := f[m1 − 1 − k + h] = r[h];
        h := h + 1
      od;
      if res → O := O ∪ {(i · m1 − 1 − k, j)}
      [] ¬res → skip
      fi
      { f[m1 − 2 − k + h] ≠ ⊥ }
    [] r[k] ≠ g(p) → skip
    fi;
    k := k + 1
  od

We added an extra assertion after the inner repetition: f[m1 − 2 − k + h] ≠ ⊥. Its correctness follows from the program structure and 0 < m1; we assume we do not have an empty pattern matrix. If f[m1 − 2 − k + h] is not equal to ⊥, we know its value must be either in the interval [0, m1⟩ or equal to m1. We will investigate both of these cases.

    0 ≤ f[m1 − 2 − k + h] < m1
  ≡   { spec. f }
    f[m1 − 2 − k + h] = g(T[i · m1 − 2 − k + h][j .. j + m2)) ∧ 0 ≤ f[m1 − 2 − k + h] < m1
  ≡   { spec. g }
    f[m1 − 2 − k + h] = ⟨↓ l : 0 ≤ l < m1 ∧ T[i · m1 − 2 − k + h][j .. j + m2) = P[l] : l⟩ ∧ 0 ≤ f[m1 − 2 − k + h] < m1
  ⇒   { property ↓ }
    ⟨∀ l : 0 ≤ l < f[m1 − 2 − k + h] : T[i · m1 − 2 − k + h][j .. j + m2) ≠ P[l]⟩
  ≡   { equality of matrices }
    ⟨∀ l : 0 ≤ l < f[m1 − 2 − k + h] : T[i · m1 − 2 − k + h − l .. (i + 1) · m1 − 2 − k + h − l)[j .. j + m2) ≠ P⟩
  ≡   { dummy transformation }
    ⟨∀ l : i · m1 − 1 − k + h − f[m1 − 2 − k + h] ≤ l < i · m1 − 1 − k + h : T[l .. l + m1)[j .. j + m2) ≠ P⟩

So in this case, we can disregard the next (f[m1 − 2 − k + h] − h + 1) ↑ 1 values of k.

    f[m1 − 2 − k + h] = m1
  ≡   { invariant }
    g(T[i · m1 − 2 − k + h][j .. j + m2)) = m1
  ≡   { spec. g }
    T[i · m1 − 2 − k + h][j .. j + m2) ∉ PR
  ⇒   { PR contains all rows of P }
    ⟨∀ l : (i − 1) · m1 − 1 − k + h ≤ l < i · m1 − 1 − k + h : T[l .. l + m1)[j .. j + m2) ≠ P⟩

So if we find an f-value of m1, we know we can skip the next m1 − h + 1 values of k. (We know that this is a positive number, since h ≤ m1.) We can rewrite this expression as follows:

    m1 − h + 1
  =   { h ≤ m1 }
    (m1 − h + 1) ↑ 1
  =   { f[m1 − 2 − k + h] = m1 }
    (f[m1 − 2 − k + h] − h + 1) ↑ 1

This is the same expression we obtained in the case where f[m1 − 2 − k + h] was in the interval [0, m1⟩. The algorithm now becomes:

  for h : 0 ≤ h < 2 · m1 − 1 → f[h] := ⊥ rof;
  f[m1 − 1] := g(p);
  k := 0 ↑ ((i + 1) · m1 − 1 − n1);
  do k < m1 →
    if r[k] = g(p) →
      h, res := 0, true;
      do h ≠ m1 ∧ res →
        if f[m1 − 1 − k + h] = ⊥ →
          f[m1 − 1 − k + h] := g(T[i · m1 − 1 − k + h][j .. j + m2))
        [] f[m1 − 1 − k + h] ≠ ⊥ → skip
        fi;
        res := f[m1 − 1 − k + h] = r[h];
        h := h + 1
      od;
      if res → O := O ∪ {(i · m1 − 1 − k, j)}
      [] ¬res → skip
      fi;
      k := k + ((f[m1 − 2 − k + h] − h + 1) ↑ 1)
    [] r[k] ≠ g(p) → k := k + 1
    fi
  od

5.3 Inspecting only matching pattern rows

So far, we have iterated over values of k in the range [0, m1⟩, even though we are only interested in those values of k for which P[k] is equal to a given string. With a little extra administration, we can make sure we only inspect such values of k. For this purpose, we introduce array e[0 .. m1), with the following specification (for 0 ≤ h < m1):

  e[h] = ⟨↓ l : h < l < m1 ∧ P[h] = P[l] : l⟩   if ⟨∃ l : h < l < m1 : P[h] = P[l]⟩        (5)
  e[h] = m1                                      if ¬⟨∃ l : h < l < m1 : P[h] = P[l]⟩

A shorter, equivalent specification is the following:

  e[h] = ⟨↓ l : h < l < m1 ∧ P[h] = P[l] : l⟩ ↓ m1        (6)

Note that we have, for 0 ≤ h < m1: h < e[h]. Using our earlier definition of function g ((3) on page 34), we can conclude that all occurrences of a string x as a row in our pattern are g(x), e[g(x)], e[e[g(x)]], ..., until we encounter m1. To prove this claim formally, we denote this set of row numbers by D(g(x)), where D is defined as follows:

  D(m1) = ∅
  D(h) = {h} ∪ D(e[h])        for 0 ≤ h < m1

We need to prove: D(g(x)) = ⟨set l : 0 ≤ l < m1 ∧ x = P[l] : l⟩. First we prove the following lemma.

Lemma 5.1 For 0 ≤ h ≤ m1, we have: D(h) = ⟨set l : h ≤ l < m1 ∧ P[h] = P[l] : l⟩

Technically, since h = m1 is included in this lemma, this expression is defined only if P[m1] exists. For this purpose we can assume that there is an imaginary row P[m1]; its value is irrelevant.

Proof We prove this lemma by induction, with decreasing values of h.

Case h = m1:

    ⟨set l : m1 ≤ l < m1 ∧ P[h] = P[l] : l⟩
  =   { empty domain }
    ∅
  =   { def. D }
    D(m1)

Case 0 ≤ h < m1: Induction hypothesis: D(e[h]) = ⟨set l : e[h] ≤ l < m1 ∧ P[e[h]] = P[l] : l⟩. Recall that h < e[h], for 0 ≤ h < m1, and that we assumed that an imaginary row P[m1] exists.

    ⟨set l : h ≤ l < m1 ∧ P[h] = P[l] : l⟩
  =   { h < m1; split off: l = h }
    {h} ∪ ⟨set l : h < l < m1 ∧ P[h] = P[l] : l⟩
  =   { from spec. e: ⟨∀ l : h < l < e[h] : P[h] ≠ P[l]⟩; h < e[h] }
    {h} ∪ ⟨set l : e[h] ≤ l < m1 ∧ P[h] = P[l] : l⟩
  =   { case 0 ≤ e[h] < m1: P[h] = P[e[h]]; case e[h] = m1: empty domain }
    {h} ∪ ⟨set l : e[h] ≤ l < m1 ∧ P[e[h]] = P[l] : l⟩
  =   { induction hypothesis }
    {h} ∪ D(e[h])
  =   { def. D }
    D(h)

Theorem 5.2 For any string x, we have: D(g(x)) = ⟨set l : 0 ≤ l < m1 ∧ x = P[l] : l⟩

Proof Case x ∈ PR:

    ⟨set l : 0 ≤ l < m1 ∧ x = P[l] : l⟩
  =   { from spec. g: ⟨∀ l : 0 ≤ l < g(x) : x ≠ P[l]⟩; 0 ≤ g(x) }
    ⟨set l : g(x) ≤ l < m1 ∧ x = P[l] : l⟩
  =   { spec. g, x ∈ PR: x = P[g(x)] }
    ⟨set l : g(x) ≤ l < m1 ∧ P[g(x)] = P[l] : l⟩
  =   { lemma 5.1 }
    D(g(x))

Case x ∉ PR:

    ⟨set l : 0 ≤ l < m1 ∧ x = P[l] : l⟩
  =   { x ∉ PR }
    ∅
  =   { def. D }
    D(m1)
  =   { x ∉ PR, thus g(x) = m1 }
    D(g(x))

So in conclusion, the only values of k we need to inspect are the ones in the set D(g(p)); that is: g(p), e[g(p)], e[e[g(p)]], ..., until we find a value that is equal to m1. This leads to the following version of the inner repetition of our Baeza-Yates and Régnier algorithm:

  for h : 0 ≤ h < 2 · m1 − 1 → f[h] := ⊥ rof;
  f[m1 − 1] := g(p);
  k := g(p);
  do k ≠ m1 →
    if (i + 1) · m1 − 1 − k ≤ n1 →
      h, res := 0, true;
      do h ≠ m1 ∧ res →
        if f[m1 − 1 − k + h] = ⊥ →
          f[m1 − 1 − k + h] := g(T[i · m1 − 1 − k + h][j .. j + m2))
        [] f[m1 − 1 − k + h] ≠ ⊥ → skip
        fi;
        res := f[m1 − 1 − k + h] = r[h];
        h := h + 1
      od;
      if res → O := O ∪ {(i · m1 − 1 − k, j)}
      [] ¬res → skip
      fi
    [] n1 < (i + 1) · m1 − 1 − k → skip
    fi;
    k := e[k]
  od

The improvement we introduced in section 5.2 can be applied here as well. This allows us to adjust the algorithm as follows:

  for h : 0 ≤ h < 2 · m1 − 1 → f[h] := ⊥ rof;
  f[m1 − 1] := g(p);
  k := g(p);
  do k ≠ m1 →
    if (i + 1) · m1 − 1 − k ≤ n1 →
      h, res := 0, true;
      do h ≠ m1 ∧ res →
        if f[m1 − 1 − k + h] = ⊥ →
          f[m1 − 1 − k + h] := g(T[i · m1 − 1 − k + h][j .. j + m2))
        [] f[m1 − 1 − k + h] ≠ ⊥ → skip
        fi;
        res := f[m1 − 1 − k + h] = r[h];
        h := h + 1
      od;
      if res → O := O ∪ {(i · m1 − 1 − k, j)}
      [] ¬res → skip
      fi;
      k′ := k;
      k := e[k];
      do k < (k′ + f[m1 − 2 − k′ + h] − h + 1) ↓ m1 →
        k := e[k]
      od
    [] n1 < (i + 1) · m1 − 1 − k →
      k := e[k]
    fi
  od

Computation of e

So far we have ignored the question how to initialise the values of array e. We could easily come up with an implementation that, for each pattern row, searches for that row's next occurrence in the pattern using a bounded linear search. Using array r, this would give rise to an O(m1²) algorithm. However, we can give an O(m1) algorithm. Our postcondition is the specification of e, as given in (5) and, equivalently, (6) on page 39. We introduce integer variable i, along with the following invariants:

  P0 : 0 ≤ i ≤ m1
  P1 : ⟨∀ h : 0 ≤ h < m1 : e[h] = ⟨↓ l : h < l < i ∧ P[h] = P[l] : l⟩ ↓ m1⟩

Then the postcondition is established when i = m1. Initially, we can choose i = 0 and e[h] = m1, for 0 ≤ h < m1.

Ad P1(i := i + 1):

For h: i ≤ h < m1:

    ⟨↓ l : h < l < i + 1 ∧ P[h] = P[l] : l⟩ ↓ m1
  =   { i ≤ h: empty domain }
    m1
  =   { i ≤ h: empty domain }
    ⟨↓ l : h < l < i ∧ P[h] = P[l] : l⟩ ↓ m1
  =   { P1 }
    e[h]

For h: 0 ≤ h < i:

    ⟨↓ l : h < l < i + 1 ∧ P[h] = P[l] : l⟩ ↓ m1
  =   { h < i; split off l = i }
    ⟨↓ l : h < l < i ∧ P[h] = P[l] : l⟩ ↓ m1        if P[h] ≠ P[i]
    ⟨↓ l : h < l < i ∧ P[h] = P[l] : l⟩ ↓ m1 ↓ i    if P[h] = P[i]
  =   { P1 }
    e[h]        if P[h] ≠ P[i]
    e[h] ↓ i    if P[h] = P[i]
  =   { def. ↓, predicate calculus }
    e[h]        if P[h] ≠ P[i] ∨ e[h] < i
    i           if P[h] = P[i] ∧ i ≤ e[h]
  =   { P1: e[h] ≠ m1 ⇒ e[h] < i; i ≤ m1 }
    e[h]        if P[h] ≠ P[i] ∨ e[h] ≠ m1
    i           if P[h] = P[i] ∧ e[h] = m1

So the only values of h for which e[h] needs to be updated are the ones for which the following holds: P[h] = P[i] ∧ e[h] = m1.


    P[h] = P[i] ∧ e[h] = m1
  ≡   { P1 }
    P[h] = P[i] ∧ ⟨↓ l : h < l < i ∧ P[h] = P[l] : l⟩ ↓ m1 = m1
  ≡   { transitivity = }
    P[h] = P[i] ∧ ⟨↓ l : h < l < i ∧ P[i] = P[l] : l⟩ ↓ m1 = m1
  ≡   { i ≤ m1, definition ↓ }
    P[h] = P[i] ∧ ⟨∀ l : h < l < i : P[i] ≠ P[l]⟩
  ≡   { def. ↑, 0 ≤ h < i }
    h = ⟨↑ l : 0 ≤ l < i ∧ P[i] = P[l] : l⟩
  ≡   { g(P[i]) ≠ m1, theorem 5.0 (see page 34) }
    h = ⟨↑ l : 0 ≤ l < i ∧ g(P[i]) = g(P[l]) : l⟩
  ≡   { spec. r }
    h = ⟨↑ l : 0 ≤ l < i ∧ r[i] = r[l] : l⟩

We conclude that there is at most one h for which e[h] needs to be updated, specifically: ⟨↑ l : 0 ≤ l < i ∧ r[i] = r[l] : l⟩, if this value is greater than −∞. To find this value, we introduce auxiliary array aux[0 .. m1), with the following invariant:

  P2 : ⟨∀ j : 0 ≤ j < m1 : aux[j] = ⟨↑ l : 0 ≤ l < i ∧ j = r[l] : l⟩⟩

Initially, we have i = 0 and therefore aux[j] = −∞, for 0 ≤ j < m1.

Ad P2(i := i + 1):

    ⟨↑ l : 0 ≤ l < i + 1 ∧ j = r[l] : l⟩
  =   { split off: l = i }
    ⟨↑ l : 0 ≤ l < i ∧ j = r[l] : l⟩        if j ≠ r[i]
    ⟨↑ l : 0 ≤ l < i ∧ j = r[l] : l⟩ ↑ i    if j = r[i]
  =   { P2, math }
    aux[j]    if j ≠ r[i]
    i         if j = r[i]

In conclusion, this gives rise to the following initialisation algorithm for e:

  for i : 0 ≤ i < m1 →
    e[i], aux[i] := m1, −∞
  rof;
  i := 0;
  do i ≠ m1 →
    if aux[r[i]] ≠ −∞ → e[aux[r[i]]] := i
    [] aux[r[i]] = −∞ → skip
    fi;
    aux[r[i]] := i;
    i := i + 1
  od

In an implementation we can represent −∞ by any negative number (or even any number outside of the range [0, m1)), for example: −1.
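In Python, the initialisation of e can be sketched as follows, with None standing for −∞; this is a hypothetical helper, not part of the original presentation:

```python
def compute_e(r, m1):
    """O(m1) initialisation of e from the unique row indices r,
    following the derivation above."""
    e = [m1] * m1
    aux = [None] * m1              # None plays the role of minus infinity
    for i in range(m1):
        if aux[r[i]] is not None:  # a previous row with the same index
            e[aux[r[i]]] = i
        aux[r[i]] = i
    return e
```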

5.4 Computation of unique row indices

As we have seen in section 5.1, the value of string variable p is no longer of interest to us. Instead, we only use g(p) (defined in (3) on page 34). For this reason, we want to introduce a new specification for one-dimensional multipattern matching, which returns the value of g(p) immediately:

  MPM1(PS, t) = ⟨set l, p, r : p ∈ PS ∧ t = l p r : (|l|, g(p))⟩

The question remains how to compute g-values. The method presented in [BYR93] is to build the Aho-Corasick automaton based on set PR, with an output function that returns g-values. More formally, our Aho-Corasick automaton of PR is a Moore machine (Q, Σ, Δ, δ, λ, q0). As in section 4.3, we will not go into every detail of how an Aho-Corasick automaton can be constructed; instead we refer to [WZ92], [WZ93] or [NR02]. Our output alphabet Δ is equal to [0, m1]. For each state q, we want to establish: λ(q) = g(w_q), where w_q is the string consisting of the labels of the shortest path from start state q0 to q. For non-accepting states q we can simply define λ(q) = m1; for final states q, λ(q) needs to be equal to the index of the recognised row of P. If we construct our Aho-Corasick automaton like a trie, by adding the rows of PR one by one, in increasing order, we can establish this quite easily. Whenever we add a new final state q while processing row l of our pattern matrix, we set λ(q) = l. Now we have: g(p) = λ(δ*(q0, p)).

We can also use this Aho-Corasick automaton for our implementation of MPM1. During the construction of the automaton we can also fill array r (as specified in (4) on page 35). After processing a row l of the pattern, the value of g(P[l]) is known and, by specification, this is the required value of r[l].
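Since full Aho-Corasick machinery is not needed to illustrate the idea, the following sketch computes array r with a dictionary standing in for the trie lookup; this simplification is an assumption of the sketch, not the method of [BYR93]:

```python
def unique_row_indices(P):
    """r[h] = g(P[h]): the index of the first row of P equal to row h."""
    first, r = {}, []
    for h, row in enumerate(P):
        key = tuple(row)
        first.setdefault(key, h)   # record the first occurrence only
        r.append(first[key])
    return r
```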

5.5 Generalisations

The Baeza-Yates and Régnier algorithm can be generalised to two-dimensional multipattern matching. Say we have l m1 × m2 two-dimensional patterns: P0, P1, ..., P_{l−1}. Then our postcondition becomes:

  O = ⟨set i1, i2, j : 0 ≤ i1 ≤ n1 − m1 ∧ 0 ≤ i2 ≤ n2 − m2 ∧ 0 ≤ j < l ∧ T[i1 .. i1 + m1)[i2 .. i2 + m2) = Pj : (i1, i2, j)⟩

We introduce a new auxiliary pattern matrix Q[0 .. l · m1)[0 .. m2), where (for 0 ≤ i < l · m1):

  Q[i] = P_{i div m1}[i mod m1]

Similarly to PR in the single-pattern searching algorithm, we introduce the set QR:

  QR = ⟨set i : 0 ≤ i < l · m1 : Q[i]⟩

[Figure 3: The Baeza-Yates algorithm idea, applied in three dimensions]

Or, equivalently:

  QR = ⟨set i, j : 0 ≤ i < m1 ∧ 0 ≤ j < l : Pj[i]⟩

Now we can apply one-dimensional multipattern matching to search for the strings in QR. Once a match with one of these strings q ∈ QR has been found in one of the text's rows, we need to check whether there is a match at that position in the text, with any of the pattern matrices in which q occurs as a row. That is, for each row Q[i] that is equal to q, we check if there is a match with pattern matrix P_{i div m1}.

This boils down to only a slight modification of the Baeza-Yates and Régnier algorithm (or any of the variations discussed). We will however need a slightly different definition for our g function and auxiliary array r (and e, for the two algorithms presented in section 5.3). For any string x and 0 ≤ i < l · m1:

  g(x) = ⟨↓ j : 0 ≤ j < l · m1 ∧ x = Q[j] : j⟩    if x ∈ QR
  g(x) = l · m1                                    if x ∉ QR
  r[i] = g(Q[i])
  e[i] = ⟨↓ j : i < j < l · m1 ∧ Q[i] = Q[j] : j⟩  if ⟨∃ j : i < j < l · m1 : Q[i] = Q[j]⟩
  e[i] = l · m1                                    if ¬⟨∃ j : i < j < l · m1 : Q[i] = Q[j]⟩

The values of g, r and e can still be computed in the ways presented in sections 5.3 and 5.4.

Matching in multiple dimensions is also possible using the Baeza-Yates and Régnier approach. For matching in n + 1 dimensions (1 ≤ n) we use an n-dimensional multipattern matching algorithm. Here, too, we need to choose a different definition of g, namely a function that works on n-dimensional shapes. Also, if we are matching in more than two dimensions, we cannot simply use an Aho-Corasick automaton for multipattern matching and for computing our g-function. For example, if we are matching in three dimensions, we regard the m1 × m2 × m3 pattern as a sequence of m1 two-dimensional matrices and the n1 × n2 × n3 text as a sequence of n1 two-dimensional matrices. We then use two-dimensional multipattern matching to match every (m1)-th matrix of the text against the matrices of the pattern. (See also figure 3.) For every match found we check if there are any actual matches in three dimensions at that position.

dimensional matrices. We then use two-dimensional multipattern matching to match every mth 1 matrix of the text against the matrices of the pattern. (See also gure 3.) For every match found we check if there are any actual matches in three dimensions on that position.

46

47

6
6.0

Polcar
Introduction

In this section, we will present the algorithm that was presented by Tom Polcar in [Pol04]; a as shorter version of the same article can be found in [MP04]. The algorithm uses tessellation automata, a data structure introduced in [IN77]. In this presentation however, we will attempt to derive the algorithm by manipulating sets of submatrices of the text.

6.1

Derivation

Figure 4: A pattern occurrence as a sux of a prex of the text We start by rewriting our postcondition, so that we view an occurrence of the pattern as a sux of a prex of the text (see gure 4). set i1 , i2 : 0 i1 n1 m1 0 i2 n2 m2 T [i1 .. i1 + m1 )[i2 .. i2 + m2 ) = P : (i1 , i2 ) = { dummy transformation } set i1 , i2 : m1 i1 n1 m2 i2 n2 T [i1 m1 .. i1 )[i2 m2 .. i2 ) = P : (i1 m1 , i2 m2 ) = { 0 i1 m1 , 0 i2 m2 , def. su } set i1 , i2 : m1 i1 n1 m2 i2 n2 P su(T [0 .. i1 )[0 .. i2 )) : (i1 m1 , i2 m2 ) = { P pref(P ) } set i1 , i2 : m1 i1 n1 m2 i2 n2 P su(T [0 .. i1 )[0 .. i2 )) pref(P ) : (i1 m1 , i2 m2 ) = { theorem C.0: ESm1 ,m2 su(T [0 .. i1 )[0 .. i2 )) pref(P ) } set i1 , i2 : m1 i1 n1 m2 i2 n2 P (su(T [0 .. i1 )[0 .. i2 )) pref(P )) ESm1 ,m2 : (i1 m1 , i2 m2 ) = { introduction of function f } set i1 , i2 : m1 i1 n1 m2 i2 n2 P f (i1 , i2 ) : (i1 m1 , i2 m2 )

48

We introduce function f : [0, n1 ] [0, n2 ] P(M2 ()), with the following specication: f (i1 , i2 ) = (su(T [0 .. i1 )[0 .. i2 )) pref(P )) ESm1 ,m2 We have introduced this function because we can dene f recursively and then (hopefully) construct an algorithm to compute f . Before we get to that, we introduce the following lemmas. Lemma 6.0 concerns how the set of suxes of a prex of T can be split into a set of empty matrices and a set of nonempty matrices. Lemma 6.1 shows how we can conclude whether a submatrix of the text is a prex of the pattern. Lemma 6.0 For 0 i1 < n1 and 0 i2 < n2 : su(T [0 .. i1 + 1)[0 .. i2 + 1)) = set j1 , j2 : 0 j1 i1 0 j2 i2 : T [j1 .. i1 + 1)[j2 .. i2 + 1) ESi1 +1,i2 +1 Proof su(T [0 .. i1 + 1)[0 .. i2 + 1)) = = { def. su } set j1 , j2 : 0 j1 i1 + 1 0 j2 i2 + 1 : T [j1 .. i1 + 1)[j2 .. i2 + 1) { split o j1 = i1 + 1 and j2 = i2 + 1 } set j1 , j2 : 0 j1 i1 0 j2 i2 : T [j1 .. i1 + 1)[j2 .. i2 + 1) set j2 : 0 j2 i2 + 1 : E0,i2 +1j2 set j1 : 0 j1 i1 + 1 : Ei1 +1j1 ,0 = { dummy transformation: j1 := i1 + 1 j } set j1 , j2 : 0 j1 i1 0 j2 i2 : T [j1 .. i1 + 1)[j2 .. i2 + 1) set j : 0 j i2 + 1 : E0,j set j : 0 j i1 + 1 : Ej,0 = { def. ES } set j1 , j2 : 0 j1 i1 0 j2 i2 : T [j1 .. i1 + 1)[j2 .. i2 + 1) ESi1 +1,i2 +1

Lemma 6.1 For i1 , j1 , i2 , j2 N, with j1 i1 < n1 and j2 i2 < n2 : if i1 + 1 j1 m1 i2 + 1 j2 m2 : T [j1 .. i1 + 1)[j2 .. i2 + 1) pref(P ) T [j1 .. i1 )[j2 .. i2 + 1) pref(P ) T [j1 .. i1 + 1)[j2 .. i2 ) pref(P ) T [i1 , i2 ] = P [i1 j1 , i2 j2 ] if m1 < i1 + 1 j1 m2 < i2 + 1 j2 : T [j1 .. i1 + 1)[j2 .. i2 + 1) pref(P ) / Proof Case: i1 + 1 j1 m1 i2 + 1 j2 m2 (our submatrix is not larger than the pattern) 49

T [j1 .. i1 + 1)[j2 .. i2 + 1) pref(P ) { def. pref } T [j1 .. i1 + 1)[j2 .. i2 + 1) set l1 , l2 : 0 l1 m1 0 l2 m2 : P [0 .. l1 )[0 .. l2 ) { i1 j1 < m1 , i2 j2 < m2 } T [j1 .. i1 + 1)[j2 .. i2 + 1) = P [0 .. i1 + 1 j1 )[0 .. i2 + 1 j2 ) { j1 i1 , j2 i2 , def. equality of matrices } T [j1 .. i1 )[j2 .. i2 + 1) = P [0 .. i1 j1 )[0 .. i2 + 1 j2 ) T [j1 .. i1 + 1)[j2 .. i2 ) = P [0 .. i1 + 1 j1 )[0 .. i2 j2 ) T [i1 , i2 ] = P [i1 j1 , i2 j2 ] { i1 j1 < m1 , i2 j2 < m2 , def. pref } T [j1 .. i1 )[j2 .. i2 + 1) pref(P ) T [j1 .. i1 + 1)[j2 .. i2 ) pref(P ) T [i1 , i2 ] = P [i1 j1 , i2 j2 ] Case: m1 < i1 + 1 j1 m2 < i2 + 1 j2 (our submatrix is larger than the pattern) T [j1 .. i1 + 1)[j2 .. i2 + 1) pref(P ) { def. pref } T [j1 .. i1 + 1)[j2 .. i2 + 1) set l1 , l2 : 0 l1 m1 0 l2 m2 : P [0 .. l1 )[0 .. l2 ) { m1 i1 j1 m2 i2 j2 } false

Now we can get back to nding a recursive denition for f . For 0 i2 n2 we derive: f (0, i2 ) = = = = = = { spec. f } (su(T [0 .. 0)[0 .. i2 )) pref(P )) ESm1 ,m2 { def. submatrix } (su(E0,i2 ) pref(P )) ESm1 ,m2 { def. su, one point rule, def. submatrix } ( set j : 0 j i2 : E0,j pref(P )) ESm1 ,m2 { def. ES } (ES0,i2 pref(P )) ESm1 ,m2 { theorem C.1 } ES0,i2 m2 ESm1 ,m2 { 0 m1 , i2 m2 m2 } ESm1 ,m2 Symmetrically, we know that f (i1 , 0) = ESm1 ,m2 (for 0 i1 n1 ). We will now examine expression f (i1 + 1, i2 + 1), for 0 i1 < n1 and 0 i2 < n2 . f (i1 + 1, i2 + 1) = { spec. f } 50

(su(T [0 .. i1 + 1)[0 .. i2 + 1)) pref(P )) ESm1 ,m2 = { lemma 6.0 } (( set j1 , j2 : 0 j1 i1 0 j2 i2 : T [j1 .. i1 + 1)[j2 .. i2 + 1) ESi1 +1,i2 +1 ) pref(P )) ESm1 ,m2 = { over ; theorem C.1 } ( set j1 , j2 : 0 j1 i1 0 j2 i2 : T [j1 .. i1 + 1)[j2 .. i2 + 1) pref(P )) ES(i1 +1)m1 ,(i2 +1)m2 ESm1 ,m2 = = { def. ES } ( set j1 , j2 : 0 j1 i1 0 j2 i2 : T [j1 .. i1 + 1)[j2 .. i2 + 1) pref(P )) ESm1 ,m2 { property } set j1 , j2 : 0 j1 i1 0 j2 i2 T [j1 .. i1 + 1)[j2 .. i2 + 1) pref(P ) : T [j1 .. i1 + 1)[j2 .. i2 + 1) ESm1 ,m2 = { lemma 6.1 } set j1 , j2 : 0 j1 i1 0 j2 i2 i1 j1 < m1 i2 j2 < m2 T [j1 .. i1 )[j2 .. i2 + 1) pref(P ) T [j1 .. i1 + 1)[j2 .. i2 ) pref(P ) T [i1 , i2 ] = P [i1 j1 , i2 j2 ] : T [j1 .. i1 + 1)[j2 .. i2 + 1) ESm1 ,m2 = { prop. matrices } set j1 , j2 : 0 j1 i1 0 j2 i2 i1 j1 < m1 i2 j2 < m2 T [j1 .. i1 )[j2 .. i2 ) T [j1 .. i1 )[i2 ] pref(P ) T [j1 .. i1 )[j2 .. i2 ) pref(P ) T [i1 ][j2 .. i2 ) T [j1 .. i1 )[j2 .. i2 ) T [j1 .. i1 )[i2 ] ESm1 ,m2 T [i1 , i2 ] = P [i1 j1 , i2 j2 ] : T [i1 ][j2 .. i2 ) T [i1 , i2 ] = { one-point rule } set j1 , j2 , A, b, c : 0 j1 i1 0 j2 i2 i1 j1 < m1 i2 j2 < m2 A = T [j1 .. i1 )[j2 .. i2 ) b = T [j1 .. i1 )[i2 ] c = T [i1 ][j2 .. i2 ) A A b pref(P ) pref(P ) T [i1 , i2 ] = P [i1 j1 , i2 j2 ] : c A b ESm1 ,m2 c T [i1 , i2 ] = { prop. matrices } set j1 , j2 , A, b, c : 0 j1 i1 0 j2 i2 i1 j1 < m1 i2 j2 < m2 A b = T [j1 .. i1 )[j2 .. i2 + 1) row(b) = row(A) col(b) = 1 A = T [j1 .. i1 + 1)[j2 .. i2 ) row(c) = 1 col(c) = col(A) c A A b pref(P ) pref(P ) T [i1 , i2 ] = P [i1 j1 , i2 j2 ] : c A b ESm1 ,m2 c T [i1 , i2 ] = { one-point rule: j1 = i1 row(A), j2 = i2 col(A); def. su } set A, b, c : row(A) < m1 col(A) < m2 row(b) = row(A) col(b) = 1 row(c) = 1 col(c) = col(A) A A b su(T [0 .. i1 )[0 .. i2 + 1)) su(T [0 .. i1 + 1)[0 .. i2 )) c A A b pref(P ) pref(P ) T [i1 , i2 ] = P [row(A), col(A)] : c

51

A c =

b T [i1 , i2 ]

ESm1 ,m2

{ set calculus } set A, b, c : row(A) < m1 col(A) < m2 row(b) = row(A) col(b) = 1 row(c) = 1 col(c) = col(A) A b su(T [0 .. i1 )[0 .. i2 + 1)) pref(P ) A su(T [0 .. i1 + 1)[0 .. i2 )) pref(P ) T [i1 , i2 ] = P [row(A), col(A)] : c A b ESm1 ,m2 c T [i1 , i2 ]

{ (?) } set A, b, c : row(A) < m1 col(A) < m2 row(b) = row(A) col(b) = 1 row(c) = 1 col(c) = col(A) A b (su(T [0 .. i1 )[0 .. i2 + 1)) pref(P )) ESm1 ,m2 A (su(T [0 .. i1 + 1)[0 .. i2 )) pref(P )) ESm1 ,m2 T [i1 , i2 ] = P [row(A), col(A)] : c A b ESm1 ,m2 c T [i1 , i2 ]

{ def. f } set A, b, c : row(A) < m1 col(A) < m2 row(b) = row(A) col(b) = 1 row(c) = 1 col(c) = col(A) A A b f (i1 , i2 + 1) f (i1 + 1, i2 ) T [i1 , i2 ] = P [row(A), col(A)] : c A b ESm1 ,m2 c T [i1 , i2 ]

{ introduction of } (f (i1 , i2 + 1), f (i1 + 1, i2 ), T [i1 , i2 ])

Ad (?): we still need to justify that we can introduce ESm1 ,m2 in this step of the derivation. We assume we have A, b and c, satisfying: row(A) < m1 col(A) < m2 row(b) = row(A) col(b) = 1 row(c) = 1 col(c) = col(A) A b (su(T [0 .. i1 )[0 .. i2 + 1)) pref(P )) ESm1 ,m2 A (su(T [0 .. i1 + 1)[0 .. i2 )) pref(P )) ESm1 ,m2 c We derive: A A A b b b ESm1 ,m2 set j : 0 j m1 : Ej,0 set j : 0 j m2 : E0,j b )} set j : 0 j m2 : E0,j b ) m2

{ denition ES } { col(b) = 1, therefore 0 < col( A { set calculus, denition E } row( A b ) = 0 0 < col( A { def. row, col; col(b) = 1 } 52

row(A) = 0 0 col(A) < m2 { def. row, col; row(c) = 1 } row( { A c A c row( } row( A c ) = 1 0 col( A c ) < i2 + 1 m2 ) = 1 0 col( A c ) < m2

(su(T [0 .. i1 + 1)[0 .. i2 )) pref(P )) ESm1 ,m2 ; A c )=1 A c ESm1 ,m2 col( A c )=0

{ def. row, col; row(c) = 1 } row(A) = 0 0 col(A) < i2 + 1 m2 { def. row, col; col(b) = 1 } row( A A A b b b ) = 0 0 < col( A ES0,i2 +1m2 su(T [0 .. i1 )[0 .. i2 + 1)) pref(P ) A c b ) i2 + 1 m2 { denition ES } { theorem C.0 }

Symmetrically,

A ESm1 ,m2 c marked (?) is therefore correct.

su(T [0 .. i1 + 1)[0 .. i2 )) pref(P ). The step

We introduce function : P(pref(P )) P(pref(P )) P(pref(P )), with the following specication: (Q, R, ) = set A, b, c : SIZE (A, b, c) = P [row(A), col(A)] : A c b Q A c R

ESm1 ,m2

Here we have used the auxiliary predicate SIZE : M2 () M2 () M2 () B, which is dened as follows: SIZE (A, b, c) row(A) < m1 col(A) < m2 row(b) = row(A) col(b) = 1 row(c) = 1 col(c) = col(A)

The expression SIZE (A, b, c) means that matrices A, b and c have such sizes that, for any additional A b symbol , the matrix is well-dened and not larger, in either dimension, than the c pattern P .

53

6.2

Algorithm structure

In summary of what we have seen so far, we have specied function f and derived the correctness of the following way to compute f : f (0, i2 ) = f (i1 , 0) = f (i1 + 1, i2 + 1) = ESm1 ,m2 ESm1 ,m2 (f (i1 , i2 + 1), f (i1 + 1, i2 ), T [i1 , i2 ]) (0 i2 n2 ) (0 i1 n1 ) (0 i1 < n1 , 0 i2 < n2 )

We are interested in those values of i1 and i2 , with m1 i1 n1 and m2 i2 n2 , where the following holds: P f (i1 , i2 ). A solution to the two-dimensional pattern matching problem is to compute all values of f (i1 , i2 ) (for 0 i1 n1 , 0 i2 n2 ) and then do a membership test on the relevant values. This gives rise to the following algorithm: O := ?; for i1 : 0 i1 n1 f (i1 , 0) := ESm1 ,m2 rof ; for i2 : 0 i2 n2 f (0, i2 ) := ESm1 ,m2 rof ; i1 := 0; do i1 = n1 i2 := 0; do i2 = n2 f (i1 + 1, i2 + 1) := (f (i1 , i2 + 1), f (i1 + 1, i2 ), T [i1 , i2 ]); i2 := i2 + 1 od; i1 := i1 + 1 od; for i1 , i2 : m1 i1 n1 m2 i2 n2 if P f (i1 , i2 ) O := O {(i1 m1 , i2 m2 )} [] P f (i1 , i2 ) skip / fi rof However, if we assume that the pattern is not an empty matrix2 , we can report the matches while computing f . To show this we will further rewrite the postcondition. First we introduce a lemma, stating that there can be no occurrence of P in a prex of T that is too small in either dimension. Lemma 6.2 For 0 i1 < m1 and 0 i2 < m2 : P f (i1 , i2 ) / Proof P f (i1 , i2 ) { denition f } P (su(T [0 .. i1 )[0 .. i2 )) pref(P )) ESm1 ,m2 { P ES } /
2 If P is an empty matrix, the two-dimensional pattern matching problem is fairly trivial in itself, so (as in previous algorithms) we can safely assume that P ES; we will not consider the case of an empty pattern in the / remainder of this section.

54

P su(T [0 .. i1 )[0 .. i2 )) pref(P ) { P pref(P ) } P su(T [0 .. i1 )[0 .. i2 )) { i1 < m1 i2 < m2 } false

Now we can derive: set i1 , i2 : m1 i1 n1 m2 i2 n2 P f (i1 , i2 ) : (i1 m1 , i2 m2 ) = { P ES, therefore: 1 m1 and 1 m2 ; lemma 6.2 } / set i1 , i2 : 1 i1 n1 1 i2 n2 P f (i1 , i2 ) : (i1 m1 , i2 m2 ) This gives rise to the following algorithm: O := ?; for i1 : 0 i1 n1 f (i1 , 0) := ESm1 ,m2 rof ; for i2 : 0 i2 n2 f (0, i2 ) := ESm1 ,m2 rof ; i1 := 0; do i1 = n1 i2 := 0; do i2 = n2 f (i1 + 1, i2 + 1) := (f (i1 , i2 + 1), f (i1 + 1, i2 ), T [i1 , i2 ]); if P f (i1 + 1, i2 + 1) O := O {(i1 + 1 m1 , i2 + 1 m2 )} [] P f (i1 + 1, i2 + 1) skip / f i; i2 := i2 + 1 od; i1 := i1 + 1 od Here we can introduce a space improvement, similarly to what we have done for the lter-based algorithms in section 4.2. Instead of storing all (n1 +1)(n2 +1) values of f , an array of length n2 +1 suces. Let us call this array e and introduce the following invariant: j : 0 j i2 : e[j] = f (i1 + 1, j) j : i2 < j n2 : e[j] = f (i1 , j) This gives us the following algorithm: O := ?; for j : 0 j n2 e[j] := ESm1 ,m2 rof ; i1 , i2 := 0, 0; do i1 = n1 do i2 = n2 e[i2 + 1] := (e[i2 + 1], e[i2 ], T [i1 , i2 ]); if P e[i2 + 1] O := O {(i1 + 1 m1 , i2 + 1 m2 )} [] P e[i2 + 1] skip / 55

f i; i2 := i2 + 1 od; i2 := 0; i1 := i1 + 1 od

6.3
6.3.0

Precomputation
Representing sets of matrices by lists of maximal elements

The algorithm given in section 6.2 is correct, if we can nd a way to compute . However, computation of (Q, R, ) can be rather inecient, so repeating this for each element of the, potentially very large, text is undesirable. We would prefer precomputing the values of . We do not need to precompute (Q, R, ) for all Q, R pref(P ), since not all such sets Q and R can actually occur. We only need to examine reachable sets of prexes of P . Denition 6.3 (Reachability) The set of all reachable sets of prexes of P is the smallest set V satisfying: ESm1 ,m2 V ; if Q, R V and then (Q, R, ) V . We will now prove two properties of reachable sets of prexes of P . Theorem 6.4 Let S be a reachable set of prexes of P . Then we have: A, B : A S B su(A) pref(P ) : B S Proof We prove this by structural induction. S = ESm1 ,m2 . A ESm1 ,m2 B su(A) pref(P ) { def. ES, def. su } ((row(A) = 0 col(A) m2 ) (row(A) m1 col(A) = 0)) row(B) row(A) col(B) col(A) { pred. calc, math } (row(B) = 0 col(B) m2 ) (row(B) m1 col(B) = 0) { def. ES } B ESm1 ,m2 S = (Q, R, ), for some reachable Q, R pref(P ) and . Our induction hypothesis is: A, B : A Q B su(A) pref(P ) : B Q A, B : A R B su(A) pref(P ) : B R 56

We must prove: A, B : A (Q, R, ) B su(A) pref(P ) : B (Q, R, ) We assume A (Q, R, ) B su(A) pref(P ) and prove B (Q, R, ). B ESm1 ,m2 . Then we have, by denition of : B (Q, R, ). B ESm1 ,m2 . This means that B is not an empty matrix and can therefore be written / B b0 as . From B su(A) we know that A is not an empty matrix either and b1 A a0 . We derive: can be written as a1 A a1 a0 (Q, R, ) A a1 a0 B b1 b0 su( A a1 a0 ) pref(P )

{ def. , A B b1 a0 b0

ESm1 ,m2 ; property } / A a1 R = B b1 b0 su( A a1 a0 )

pref(P )

{ property pref } A B a0 b0 Q A a1 R = B b1 B b1 b0 su( A a1 a0 )

pref(P )

pref(P ) P [row(B ), col(B )] =

{ property su } A B B a0 b0 b0 Q A a1 R = a0 ) B b1 B b1 su( A a1 )

su( A pref(P )

pref(P ) P [row(B ), col(B )] =

{ property } a0 Q B b0 su( A a0 ) pref(P ) A B A R su( ) pref(P ) P [row(B ), col(B )] = a1 b1 a1 = A

{ induction hypothesis } B b0 Q B b1 R P [row(B ), col(B )] = =

{ def. } B b1 b0 b0 (Q, R, ) =

{ substitution } B b1 (Q, R, )

57

In other words, theorem 6.4 tells us that every reachable set S pref(P ) is a so-called ideal of the partial order (pref(P ), s ) (where A s B A su(B)). Theorem 6.5 Let S be a reachable set of prexes of P . Then we have: A, B : A, B S A pref(B) : A su(B) Proof Again we use structural induction. S = ESm1 ,m2 . In this case A is an empty matrix. The denitions of pref and su then give us A pref(B) A su(B). S = (Q, R, ), for some reachable Q, R pref(P ) and . Our induction hypothesis is: A, B : A, B Q A pref(B) : A su(B) A, B : A, B R A pref(B) : A su(B) Then we should still prove: A, B : A, B (Q, R, ) A pref(B) : A su(B) We use the same proof structure as in theorem 6.4: we assume we have A, B (Q, R, ), with A pref(B), and try to prove A su(B). Again we distinguish two cases. A ESm1 ,m2 . In this case the theorem holds trivially, again because the denitions of pref and su tell us: A pref(B) A su(B). A ESm1 ,m2 . As in theorem 6.4, we can then write the following: / A= B= A a1 B b1 a0 b0 B b1 b0 B b1 ) b0

We get the following derivation: A a1 A a1 A a1 A a0 a0 a0 a0 (Q, R, ) pref( B b1 (Q, R, )

{ property pref } (Q, R, ) pref( B b0 (Q, R, ) A a1 pref( B b1 )

b0 )

{ def. } A A a0 a0 Q A a1 R b0 ) B A a1 A a1 b0 Q pref( B b1 B b1 ) R

pref( B

{ induction hypothesis } A a0 su( B b0 ) 58 su( B b1 )

{ property su } A a1 a0 su( B b1 b0 )

These properties suggest that a reachable set of prexes of P can be represented by its maximal elements. Denition 6.6 (Maximal element) A maximal element of a reachable set S pref(P ) is a matrix A S for which the following holds: B : B S B = A : A su(B) Note that it is possible (and likely) for a set to have more than one maximal element, since s is not a total order. For A, B in some reachable set of prexes of P , we derive: A su(B) { theorem 6.5 } A pref(B) { A, B pref(P ) } row(A) row(B) col(A) col(B) Since the denition of su also tells us: A su(B) row(A) row(B) col(A) col(B), we can now conclude that for A, B in some reachable set S, we have: A su(B) row(A) row(B) col(A) col(B) So we note that for each pair of dierent maximal elements A, B: (row(A) < row(B) col(B) < col(A)) (row(B) < row(A) col(A) < col(B)) We want to represent a reachable set by a list of its maximal elements, ordered by increasing number of columns and decreasing number of rows. We can then express in terms of these lists of only maximal elements, instead of entire sets of matrices. The advantage of this is that a set of prexes of P can have (m1 + 1) (m2 + 1) elements, whereas such a list has at most m1 m2 elements (which follows from the way it is ordered). Also, we can (hopefully) use knowledge about the list to come up with an ecient way to compute . Besides the improvements in space and time complexity of the precomputation step, there is another good reason to examine this approach, where sets of matrices are represented by a list of their maximal elements. The Polcar algorithm is essentially rather similar to the Aho-Corasick and Knuth-Morris-Pratt algorithms ([AC75, KMP77]). As seen in [WZ92, WZ93], there too we reason in terms of strings of maximal length, as opposed to sets of strings. The main dierence is that in the one-dimensional case, there is one unique string of maximum length, for each nonempty set of prexes of the pattern. But as we have seen, there can be several dierent maximal elements in a set of prexes of our two-dimensional pattern. 59

For an overview of the notation we use for lists and list operations, we refer to appendix D. We introduce abstraction function q : L(M2 ()) P(M2 ()), with the following recursive denition: q([ ]) q(A As) = =

?
(su(A) pref(P )) q(As) (7)

It is easy to see that the following properties hold for function q (proof omitted): q(As A) = q(As + Bs) = + q(As) (su(A) pref(P )) q(As) q(Bs) (8) (9)

Next we introduce the predicate OL, to express that a list is ordered. OL([ ]) OL([A]) OL(A0 A1 As) true true row(A1 ) < row(A0 ) col(A0 ) < col(A1 ) OL(A1 As) (10)

Again, we have several useful properties of OL (and again, we omit their proof): OL(As OL((As A0 A1 ) Bs)) OL(As A0 ) row(A1 ) < row(A0 ) col(A0 ) < col(A1 ) OL(As A) row(B) < row(A) col(A) < col(B) OL(B Bs)

(11) (12)

A) + (B +

So for each reachable set S pref(P ) we want to nd a list As, with q(As) = S and OL(As). Using our assumption that P is not an empty matrix, the list corresponding to ESm1 ,m2 is [Em1 ,0 , E0,m2 ]: q([Em1 ,0 , E0,m2 ]) = = { def. q } (su(Em1 ,0 ) pref(P )) (su(E0,m2 ) pref(P )) { def. su, def. pref, def. ES, row(P ) = m1 col(P ) = m2 } ESm1 ,m2 OL([Em1 ,0 , E0,m2 ]) { def. OL } 0 < m1 0 < m2 { P ESm1 ,m2 } / true Note that if the pattern P were an empty matrix, then the list [P ] would suce.

60

6.3.1

Precomputation algorithm structure

Now what remains is to nd an algorithm to compute a list corresponding to (Q, R, ), given the lists corresponding to Q and R and the symbol . More formally, we initially have two lists Bs, Cs L(pref(P )) with: q(Bs) = Q OL(Bs) q(Cs) = R OL(Cs) We now want to compute a list Ds L(pref(P )), satisfying: q(Ds) = (Q, R, ) OL(Ds) We introduce the following invariants: P0 : P1 : P2 : P3 : q(Ds) (q(Bs), q(Cs), ) = (Q, R, ) OL(Bs) OL(Cs) OL(Ds)

Initially, we can choose Ds = [ ]. Invariants P 1, P 2 and P 3 then all hold trivially; ad P 0: q([ ]) (q(Bs), q(Cs), ) = = { def. q } (q(Bs), q(Cs), ) { q(Bs) = Q, q(Cs) = R } (Q, R, ) We can terminate the computation when Bs = [ ] Cs = [ ]: q(Ds) (q([ ]), q(Cs), ) = = q(Ds) (?, q(Cs), ) { def. } q(Ds) ESm1 ,m2 q(Ds) (q(Bs), q([ ]), ) = = q(Ds) (q(Bs), ?, ) { def. } q(Ds) ESm1 ,m2 61 { def. q } { (7) (def. q) }

So after termination of the algorithm we have: q(Ds) ESm1 ,m2 = (Q, R, ). Then we still need to ensure that Ds also represents ESm1 ,m2 . There are two possible cases: Ds = [ ] and Ds = [ ]. If Ds = [ ] we can obviously conclude: ESm1 ,m2 * q(Ds). We have already seen that in that case (since we assumed that P is not an empty matrix), ESm1 ,m2 can be represented by the list [Em1 ,0 , E0,m2 ], so we can then simply write Ds := [Em1 ,0 , E0,m2 ]. If Ds = [ ], then we can write Ds as D note: ESm1 ,m2 q(Ds) { def. ES, property } ESm1 ,0 q(Ds) ES0,m2 q(Ds) ESm1 ,0 q(D Ds ) Ds , but also as Ds D . We make the following

{ (7) (def. q), property , } ESm1 ,0 (su(D ) pref(P )) q(Ds ) { set calculus } ESm1 ,0 su(D ) ESm1 ,0 pref(P ) { def. pref, row(P ) = m1 } ESm1 ,0 su(D ) { def. ES, def. su } m1 row(D )

If this does not hold, we need to replace Ds by Em1 ,0 Ds: ESm1 ,0 = = = { def. ES , def. su } su(Em1 ,0 ) { row(P ) = m1 } su(Em1 ,0 ) pref(P ) { set calculus } (su(Em1 ,0 ) pref(P )) q(D { def. q } q(Em1 ,0 D OL(Em1 ,0 D Ds ) Ds ) Ds ) Ds )

{ (10) (def. OL) } row(D ) < m1 0 < col(D ) OL(D { row(D ) < m1 , OL(D 0 < col(D ) Ds ) }

We still need to be sure that 0 < col(D ) holds. In the remainder of the derivation of our algorithm we will see that this is not a problem, since we will only add nonempty matrices to Ds. 62

The observation for ES0,m2 q(Ds D ) is completely symmetrical; instead of the denitions of q and OL we can use properties (8) and (11). We conclude that if col(D ) < m2 , we need to replace Ds by Ds E0,m2 . So in conclusion, after termination we need to add the following program fragment: if Ds = [ ] Ds := [Em1 ,0 , E0,m2 ] [] Ds :: D Ds Ds :: Ds D if row(D ) < m1 Ds := Em1 ,0 Ds [] m1 row(D ) skip f i; if col(D ) < m2 Ds := Ds E0,m2 [] m2 col(D ) skip fi fi So we have seen initialisation and termination of our algorithm; what remains is the update of Bs, Cs and Ds. As we have seen, the algorithm terminates when Bs or Cs is empty, so we can assume Bs = B Bs and Cs = C Cs , for some B , C M2 () and Bs , Cs L(pref(P )). We will examine the following cases: 0. row(B ) + 1 = row(C ) col(B ) = col(C ) + 1 P [row(B ), col(C )] = ; 1. row(C ) < row(B ) + 1; 2. row(B ) + 1 < row(C ); 3. row(B ) + 1 = row(C ) col(C ) + 1 < col(B ); 4. row(B ) + 1 = row(C ) col(B ) < col(C ) + 1; 5. row(B ) + 1 = row(C ) col(B ) = col(C ) + 1 P [row(B ), col(C )] = .

6.3.2 Case 0

Case analysis

The rst case is: row(B ) + 1 = row(C ) col(B ) = col(C ) + 1 P [row(B ), col(C )] = . Informally, this is the case where a combination of B , C and actually occurs as a prex of P . First we examine the left-hand side of invariant P 0: q(Ds) (q(B = { def. } q(Ds) set A, b, c : SIZE (A, b, c) A b q(B Bs ) A A b q(C Cs ) = P [row(A), col(A)] : ESm1 ,m2 c c = { (7) (def. q) } q(Ds) set A, b, c : SIZE (A, b, c) A b (su(B ) pref(P )) q(Bs ) A A b (su(C ) pref(P )) q(Cs ) = P [row(A), col(A)] : c c 63 Bs ), q(C Cs ), )

ESm1 ,m2 = { set calculus } q(Ds) set A, b, c : SIZE (A, b, c) A b su(B ) pref(P ) A b A (su(C ) pref(P )) q(Cs ) = P [row(A), col(A)] : c c set A, b, c : SIZE (A, b, c) A b (su(B ) pref(P )) q(Bs ) A A b su(C ) pref(P ) = P [row(A), col(A)] : c c set A, b, c : SIZE (A, b, c) A A b q(Bs ) q(Cs ) = P [row(A), col(A)] : c A b ESm1 ,m2 c = { def. } q(Ds) set A, b, c : SIZE (A, b, c) A b su(B ) pref(P ) A A b (su(C ) pref(P )) q(Cs ) = P [row(A), col(A)] : c c set A, b, c : SIZE (A, b, c) A b (su(B ) pref(P )) q(Bs ) A A b su(C ) pref(P ) = P [row(A), col(A)] : c c (q(Bs ), q(Cs ), )

To further simplify this complex expression, we make the following observation, for matrices A, b A R: and c with SIZE (A, b, c) and c A b su(B ) pref(P ) b ) row(B ) col( A b ) col(B )

{ denition su } row( A { denition row, col } row(A) row(B ) col(A) < col(B ) { row(B ) + 1 = row(C ) col(B ) = col(C ) + 1 } row(A) < row(C ) col(A) col(C ) { def. row, col } row( A c A c ) row(C ) col( , C R, therefore A c A c ) col(C ) , C pref(P ) }

{ A c

pref(C ) A c , C R; R is a reachable set }

{ theorem 6.5; A c su(C ) A c

pref(P ) }

64

A c

su(C ) pref(P )

And symmetrically, with A c

Q:

su(C ) pref(P )

{ def. su, row, col } row(A) < row(C ) col(A) col(C ) { row(B ) + 1 = row(C ) col(B ) = col(C ) + 1 } row(A) row(B ) col(A) < col(B ) { def. row, col } row( A { A A { A A b b A b b ) row(B ) col( A b ) col(B ) b , B Q, therefore A b A b , B pref(P ) } pref(B ) , B Q; Q is a reachable set } su(B ) b pref(P ) } su(B ) pref(P )

{ theorem 6.5;

So we know that in this case the left hand side of P 0 is equal to: q(Ds) set A, b, c : SIZE (A, b, c) A b su(B ) pref(P ) A A b su(C ) pref(P ) = P [row(A), col(A)] : c c (q(Bs ), q(Cs ), ) Again, for the following derivation, we look at a subexpression, assuming we have matrices A, b and c with SIZE (A, b, c): A c

su(B ) pref(P )

su(C ) pref(P ) = P [row(A), col(A)]

{ prop. } A b su(B ) A c su(C ) A b pref(P ) A c pref(P )

= P [row(A), col(A)] { property pref } A b su(B ) A c su(C ) A c b pref(P )

{ B , C pref(P ) } A b su(P [0 .. row(B ))[0 .. col(B ))) A su(P [0 .. row(C ))[0 .. col(C ))) c A c b

pref(P )

{ row(B ) + 1 = row(C ), col(B ) = col(C ) + 1 } 65

b su(P [0 .. row(C ) 1)[0 .. col(B ))) A su(P [0 .. row(C ))[0 .. col(B ) 1)) c b b

A c A b c

pref(P )

{ P [row(B ), col(C )] = ; property su } A c su(P [0 .. row(C ))[0 .. col(B ))) pref(P )

{ property } A c su(P [0 .. row(C ))[0 .. col(B ))) pref(P )

Now we can get back to our main derivation: q(Ds) set A, b, c : SIZE (A, b, c) A b su(P [0 .. row(C ))[0 .. col(B ))) pref(P ) : c (q(Bs ), q(Cs ), ) = { def. SIZE } q(Ds) set A, b, c : row(A) < m1 col(A) < m2 row(b) = row(A) col(b) = 1 row(c) = 1 col(c) = col(A) A b A b su(P [0 .. row(C ))[0 .. col(B ))) pref(P ) : c c (q(Bs ), q(Cs ), ) = { row(C ) m1 , col(B ) m2 } q(Ds) set A : A ES A su(P [0 .. row(C ))[0 .. col(B ))) pref(P ) : A / (q(Bs ), q(Cs ), ) = { row(C ) m1 and col(B ) m2 , therefore ESrow(C
),col(B )

A b c

(q(Bs ), q(Cs ), ) }

q(Ds) set A : A su(P [0 .. row(C ))[0 .. col(B ))) pref(P ) : A (q(Bs ), q(Cs ), ) = = { set calculus } q(Ds) su(P [0 .. row(C ))[0 .. col(B ))) pref(P ) (q(Bs ), q(Cs ), ) { property (8) } q(Ds P [0 .. row(C ))[0 .. col(B ))) (q(Bs ), q(Cs ), )

This leads to the following program fragment: Ds := Ds Bs := Bs ; Cs := Cs P [0 .. row(C ))[0 .. col(B ));

Of course we still need to verify that this does not violate the other invariants P 1, P 2 and P 3. From (10), the denition of OL, it is not hard to see that P 1 and P 2 still hold after removing the rst elements of Bs and Cs. For P 3, there are two relevant cases: Ds = [ ] and Ds = Ds D. OL([ ] true 66 Ds)

{ (10) }

OL(Ds

P [0 .. row(C ))[0 .. col(B )))

{ property (11) } OL(Ds D ) row(P [0 .. row(C ))[0 .. col(B ))) < row(D ) col(D ) < col(P [0 .. row(C ))[0 .. col(B )))

{ Ds

D = Ds, P 2; def. row, col }

row(C ) < row(D ) col(D ) < col(B ) { idempotence of , row(B ) = row(C ) + 1, col(B ) + 1 = col(C ) } row(B ) + 1 < row(D ) col(D ) < col(B ) row(C ) < row(D ) col(D ) < col(C ) + 1 { OLr (Ds true We introduce the following two extra invariants: P4 : P5 : OLr (Ds, Bs) OLc (Ds, Cs) D ,B Bs ), OLc (Ds D ,C Cs ) }

Predicates OLr and OLc are dened as follows: OLr ([ ], As) OLr (Es, [ ]) OLr (Es E, A As) true true row(A) + 1 < row(E) col(E) < col(A)

(13)

OLc ([ ], As) OLc (Es, [ ]) OLc (Es E, A As)

true true row(A) < row(E) col(E) < col(A) + 1

(14)

From these denitions we can see that extra invariants P 4 and P 5 hold initially, since [ ] is the initial value of Ds. It should also be obvious that OLr (Ds P [0 .. row(C ))[0 .. col(B )), Bs ) follows from row(B ) + 1 = row(C ) and OL(B Bs ) (which is invariant P 1). Symmetrically, we know that OLc (Ds P [0 .. row(C ))[0 .. col(B )), Cs ) follows from col(B ) = col(C ) + 1 and OL(C Cs ). In case 0 we have seen how to update Ds, Bs and Cs when a combination of B , C and actually occurs as a prex of the text. The remaining cases are the various situations in which there is no way to combine B , C and into such a prex. Case 1 Here we will take a look at the case where row(C ) < row(B ) + 1. Again we start by examining the left-hand side of invariant P 0. q(Ds) (q(B = { denition } q(Ds) set A, b, c : SIZE (A, b, c) A b q(B Bs ) Bs ), q(C Cs ), )

67

A q(C c ESm1 ,m2

Cs ) = P [row(A), col(A)] :

A c

Here too, we will take a look at a subexpression in detail. We assume we have A, b and c satisfying SIZE (A, b, c) and we derive: A c A c

q(C

Cs )

{ denition q } (su(C ) pref(P )) q(Cs ) Cs ) }

{ property su, invariant P 2: OL(C row( A c A c ) row(C )

{ row(C ) < row(B ) + 1 } row( ) < row(B ) + 1

{ denition row } row( A A b b ) < row(B ) =B { property of functions }

We conclude: A b q(B Bs )

{ (7) (denition q) } A b { A A b A b (su(B ) pref(P )) q(Bs ) b =B } ((su(B ) pref(P )) \ {B }) q(Bs ) q(r(B )) q(Bs )

{ spec. r }

Here we introduce function r : pref(P ) L(pref(P )), which, given A pref(P ), returns the list of maximal elements of (su(A) pref(P )) \ {A}. In other words, r(A) is the list satisfying the following properties: q(r(A)) = (su(A) pref(P )) \ {A} OL(r(A)) The values of function r can be precomputed for every prex of P . We will get back to this later, in section 6.3.4; for now we just assume we have such a function r. We can now conclude that the left-hand-side of P 0 is in this case equal to:

68

q(Ds) set A, b, c : SIZE (A, b, c) A b q(r(B )) q(Bs ) A A b q(C Cs ) = P [row(A), col(A)] : ESm1 ,m2 c c This suggests that we might want to replace Bs by r(B ) + Bs . That would be sucient + to ensure that P 0 holds again. However, there is no way to guarantee that OL(r(B ) + Bs ) + and OLr (Ds, r(B ) + Bs ) hold. + For example, r(B ) may very well contain a matrix R that is a strict sux of an element of Bs . Then R is not a maximal element of q(r(B ) + Bs ) and that means that OL(r(B ) + Bs ) does + + not hold. However, in this case we also know that every matrix in su(R) is already represented by Bs , so we can simply omit matrix R. Because of invariant OL(B Bs ) and OL(r(B )), we know that the elements of r(B ) that are also a sux of some element of Bs form a tail of r(B ). So eliminating these elements of r(B ) consists of a linear search starting at the last element of r(B ), eliminating every element until one is found that is not a sux of some element of Bs . Similarly, we cannot guarantee that OLr (Ds, r(B ) + Bs ) holds, but we do know that we can + omit the matrices in r(B) which have already been considered in the computation of Ds. We also know that those matrices occur at the beginning of r(B ). First we will focus on OL and try to compute a list Rs, such that q(Rs + +Bs ) = q(r(B ) + +Bs ), but also OL(Rs + +Bs ). We distinguish two possible cases for Bs : Bs = [ ] and Bs = [ ]. If Bs = [ ], we can make the following observation: OL(r(B ) + [ ]) + { denition + } + OL(r(B )) { specication r } true So in this case, the choice Rs = r(B ) does suce. If Bs is not empty, we can write it as B Bs , for some matrix B and list Bs . To compute a list Rs satisfying q(Rs + +Bs ) = q(r(B ) + +Bs ) and OL(Rs + +Bs ), we initially choose Rs = r(B ) and introduce the following invariants: Q0 Q1 OL(Rs) q(Rs + +Bs ) = q(r(B ) + +Bs )

Concerning termination, we again look at two dierent cases: Rs = [ ] and Rs = [ ]. Assuming Rs = [ ], we derive: OL([ ] + +(B OL(B OL(B true 69 Bs ) B Bs ) Bs )) { denition + } + { denition OL } { P1 }

So if Rs is empty, we can terminate this computation. If R = [ ], we can write Rs as Rs OL((RS OL(RS R ) + (B + R ) OL(B R.

Bs )) Bs ) row(B ) < row(R ) col(R ) < col(B )

{ property (12) } { Q0, P 1 } row(B ) < row(R ) col(R ) < col(B ) { R r(B ), therefore R su(B ) and col(R ) col(B ) } row(B ) < row(R ) col(B ) < col(B ) { P 1 : OL(B B Bs ) } row(B ) < row(R )

If row(B ) < row(R ), we can terminate the computation as well. Otherwise, we can derive: row(R ) row(B ) { col(R ) < col(B ) (see previous derivation); R , B pref(P ) } R pref(B ) { R , B Q, theorem 6.5 } R su(B ) { property su } su(R ) su(B ) { set calculus } su(R ) pref(P ) su(B ) pref(P ) We can use this in the following derivation, which starts with the left-hand side of invariant Q1: q((Rs = = = R ) + (B + Bs ))

{ (7), (8) and (9) } q(Rs ) (su(R ) pref(P )) (su(B ) pref(P )) q(Bs ) { su(R ) pref(P ) su(B ) pref(P ) } q(Rs ) (su(B ) pref(P )) q(Bs ) { (7), (9) } q(Rs + (B + Bs ))

So the algorithm to compute Rs becomes: Rs := r(B ); if Bs = [ ] skip [] Bs :: B Bs do Rs :: Rs R cand row(R ) < row(B ) Rs := Rs od fi

70

Now we still need to establish OLr (Ds, Rs + Bs ). We do this by an algorithm similar to the + one presented above. We will only eliminate elements of Rs, so we can be sure that this algorithm will not falsify OL(Rs + Bs ). + Again, we will consider two cases: Ds = [ ] and Ds = [ ]. If Ds = [ ], we can simply observe: OLr ([ ], Rs + +Bs ) { (13) } true If Ds = [ ], it is of the form Ds of Rs. If Rs = [ ], we get: OLr (Ds OLr (Ds true If Rs = [ ], we can write Rs as R OLr (Ds { (13) } row(R ) + 1 < row(D ) col(D ) < col(R ) { R su(B ) } row(B ) + 1 < row(D ) col(D ) < col(R ) { P4 } col(D ) < col(R ) If this does not hold (that is, in the case that col(R ) col(D )), we want to eliminate R from the list, much like in the previous algorithm. We will now show that this does not falsify P 0, with Bs replaced by Rs + +Bs . We get the following derivation: q(Ds = D ) (q(R Rs + +Bs ), q(C Cs ), ) D , (R Rs . We get the following derivation: D ,[] + +Bs ) D , Bs ) D . We distinguish two dierent cases for the structure

{ denition + } + { P 1, P 4 }

Rs ) + +Bs )

{ denition } q(Ds D ) set A, b, c : SIZE (A, b, c) A b q(R Rs + +Bs ) A A b q(C Cs ) = P [row(A), col(A)] : c c ESm1 ,m2

{ domain split, denition } q(Ds D ) set A, b, c : SIZE (A, b, c) A b su(R ) pref(P ) A A b q(C Cs ) = P [row(A), col(A)] : c c

71

(q(Rs + +Bs ), q(C = { see below } q(Ds

Cs ), ) Cs ), )

D ) (q(Rs + +Bs ), q(C

For the step that we still need to prove, we assume we have matrices A, b and c, which satisfy: SIZE (A, b, c) A q(C c A b su(R ) pref(P )

Cs ) = P [row(A), col(A)]

We get the following derivation: A b su(R ) pref(P ) b ) row(R ) col( A b ) col(R ) b ) row(R ) col( A b ) col(D ) b ) row(B ) col( A D ,B Bs ) } b ) col(D ) b ) col(D )

{ property su } row( A row( A row( A row( A { col(R ) col(D ) } { R su(B ) } { P 4 : OLr (Ds { def. row, col } row( A b c A b ) < row(D ) col( pref(P ), A c b A c A b c ) col(D )

b ) < row(D ) 1 col( A

pref(P ) and = P [row(A), col(A)],

therefore: } A c { b A c

pref(P ); D pref(P )

pref(D ) b , D (Q, R, ); theorem 6.5 }

A b c { A c A c A c

su(D ) pref(P ), A c pref(P ) and = P [row(A), col(A)] }

A b b b b

su(D ) pref(P )

{ set calculus } q(Ds ) (su(D ) pref(P ))

{ property (8) } q(Ds D)

72

So the following program fragment suces: if Ds = [ ] skip [] Ds :: Ds D do Rs :: R Rs cand col(R ) col(D ) Rs := Rs od fi Case 2 We will now examine the case: row(B ) + 1 < row(C ). q(Ds) (q(B = { denition } q(Ds) set A, b, c : SIZE (A, b, c) A b q(B Bs ) A A b q(C Cs ) = P [row(A), col(A)] : c c ESm1 ,m2 Here, we can derive, somewhat symmetrically to case 1: A b q(B Bs ) (su(B ) pref(P )) q(Bs ) Bs ) } b ) row(B ) Bs ), q(C Cs ), )

{ (7) (denition q) } A b { OL(B row( A

{ row(B ) + 1 < row(C ) } row( A b ) < row(C ) 1 { def. row } row( A c ) < row(C )

{ property functions } A c =C

And we can conclude: A c A c { q(C Cs )

{ (7) (denition q) } (su(C ) pref(P )) q(Cs ) A c =C }

73

A c A c

((su(C ) pref(P )) \ {C }) q(Cs )

{ specication r } r(C ) q(Cs )

To ensure that P 2 and P 5 hold, we need to eliminate certain elements of r(C ). The derivation of such an algorithm is symmetrical to case 1 and will be omitted here; instead we just present the entire algorithm immediately. Rs := r(C ); if Cs = [ ] skip [] Cs :: C Cs do Rs :: Rs R cand row(R ) row(C ) Rs := Rs od f i; if Ds = [ ] skip [] Ds :: Ds D do Rs :: R Rs cand col(R ) < col(D ) Rs := Rs od fi After computing Rs using this algorithm, we can safely replace Cs with Rs + +Cs . Case 3 Here we examine the case: row(B ) + 1 = row(C ) col(C ) + 1 < col(B ). Then: q(Ds) (q(B = { denition } q(Ds) set A, b, c : SIZE (A, b, c) A b q(B Bs ) A A b q(C Cs ) = P [row(A), col(A)] : c c ESm1 ,m2 A c Bs ), q(C Cs ), )

As in case 1, we can now show: satisfying SIZE (A, b, c)). A c A c

q(C

Cs )

= B (for A, b and c

q(C

Cs )

{ (7) } (su(C ) pref(P )) Cs

{ property } 74

A c

su(C ) pref(P )

A c

Cs

We proceed to show that both disjuncts imply A c

=B.

su(C ) pref(P )

{ property su } col( A c ) col(C )

{ denition col } col( A b ) col(C ) + 1 { col(C ) + 1 < col(B ) } col( A b ) < col(B ) { property functions } A b =B

A c

q(Cs ) Cs ) }

{ P 2: OL(C row( A c A c

) < row(C )

{ row(B ) + 1 = row(C ) } row( ) < row(B ) + 1

{ denition row } row( A A b b ) < row(B ) =B { property functions }

Now we can draw the same conclusion as in case 1: A A { b b A q(B Bs )

{ (7) } (su(B ) pref(P )) q(Bs ) b =B } ((su(B ) pref(P )) \ {B }) q(Bs ) q(r(B )) q(Bs )

A b A b

{ spec. r }

Note that in the rest of the derivation in case 1 we never use the assumption row(C ) row(B ). That means that we can now use the same algorithm for computing a list Rs from list r(B ), such that we can replace Bs by Rs + +Bs .

75

Case 4 Case 4 is: row(B ) + 1 = row(C ) col(B ) < col(C ) + 1. We derive: q(Ds) (q(B = Bs ), q(C Cs ), )

{ denition , (7) (denition q) } q(Ds) set A, b, c : SIZE (A, b, c) A b q(B Bs ) A b A q(C Cs ) = P [row(A), col(A)] : c c ESm1 ,m2

For A, b and c satisfying SIZE (A, b, c), we derive: A b q(B Bs ) A b q(Bs ) A c

{ (7), property } A b su(B ) pref(P )

We examine both disjuncts seperately to show that both imply

=C:

su(B ) pref(P ) b ) col(B ) b ) col(C )

{ property su } col( A col( A A c { col(B ) col(C ) } { denition col } col( ) < col(C )

{ property functions } A c =C b q(Bs ) Bs ) } b ) < row(B )

{ OL(B row( A A c A c

{ denition row } row( ) < row(B ) + 1

{ row(B ) + 1 = row(C ) } row( ) < row(C )

{ property functions } A c =C

76

Therefore we have the same conclusion as in case 2: A c A c { A c A c q(C Cs )

{ (7) } (su(C ) pref(P )) q(Cs ) A c =C }

((su(C ) pref(P )) \ {C }) q(Cs )

{ specication r } r(C ) q(Cs )

And again, we can use the same algorithm presented in case 2 to compute a list Rs, so we can replace Cs by Rs + +Cs . Case 5 Our nal case is the case where the sizes of B and C do match, but does not occur in the right spot in the pattern: row(B ) + 1 = row(C ) col(B ) = col(C ) + 1 P [row(B ), col(C )] = . q(Ds) (q(B = { denition } q(Ds) set A, b, c : SIZE (A, b, c) A b q(B Bs ) A A b q(C Cs ) = P [row(A), col(A)] : c c ESm1 ,m2 We assume we have matrices A, b and c, satisfying: SIZE (A, b, c) A q(C c A b q(B Bs ) Bs ), q(C Cs ), )

Cs ) = P [row(A), col(A)]

Then the assumption P [row(B ), col(C )] = gives us the following: P [row(B ), col(C )] = { P [row(A), col(A)] = } row(A) = row(B ) col(A) = col(C ) We examine both of these disjuncts: row(A) = row(B ) { idempotence , denition row } 77

row( A

b ) = row(B ) row(

A c A c

) = row(B ) + 1

{ row(B ) + 1 = row(C ) } row( A b ) = row(B ) row( ) = row(C )

{ property functions } A b =B A c =C

col(A) = col(C ) { idempotence , denition col } col( A b ) = col(C ) + 1 col( A c ) = col(C )

{ col(B ) = col(C ) + 1 } col( A b ) = col(B ) col( A c ) = col(C )

{ property functions } A b =B A c =C

Using

A A b b

= B , we can draw the same conclusion we did in case 1 and case 3: q(B Bs )

{ (7) } A { (su(B ) pref(P )) q(Bs ) b =B } ((su(B ) pref(P )) \ {B }) q(Bs ) q(r(B )) q(Bs ) A c A

A b A b

{ spec. r }

However, using A c A c { A c A c

= C , the observation seen in case 2 and case 4 is also correct:

q(C

Cs )

{ (7) } (su(C ) pref(P )) q(Cs ) A c =C }

((su(C ) pref(P )) \ {C }) q(Cs )

{ specication r } r(C ) q(Cs )

78

Now it is possible that to either update Bs as in case 1, or to update Cs as in case 2. However, a third option is to update both Bs and Cs here, by replacing Bs by Qs + +Bs and Cs by Rs + +Cs , where Qs is the result of the algorithm we presented in case 1 and Rs the result of the algorithm in case 2. We will not provide the full proof why, after replacing Cs like this, we can still replace Bs as well (or, symmetrically, the other way around). Referring to the correctness proof of the algorithm A A presented in case 1, we can see that the only property of that we have used is: R. c c A in the set q(Rs + Cs ) instead + This property is not violated by referring to matrices c of q(C Cs ), because q(Rs + Cs ) is in fact a subset of q(C Cs ). +

6.3.3

Entire precomputation algorithm

In this section we have derived a method of computing a list Ds which represents (Q, R, ), given lists Bs and Cs, representing respectively Q and R, and symbol . Here we will give the entire program text. Ds := [ ]; do Bs :: B Bs Cs :: C Cs if row(B ) + 1 = row(C ) col(B ) = col(C ) + 1 P [row(B ), col(C )] = Ds := Ds P [0 .. row(C ))[0 .. col(B )); Bs := Bs ; Cs := Cs [] row(C ) row(B ) (row(B ) + 1 = row(C ) col(C ) + 1 < col(B )) Bs := reduce r (B , Ds, Bs ) [] row(B ) + 1 < row(C ) (row(B ) + 1 = row(C ) col(B ) < col(C ) + 1) Cs := reduce c (C , Ds, Cs ) [] row(B ) + 1 = row(C ) col(B ) = col(C ) + 1 P [row(B ), col(C )] = Bs := reduce r (B , Ds, Bs ); Cs := reduce c (C , Ds, Cs ) od; if Ds = [ ] Ds := [Em1 ,0 , E0,m2 ] [] Ds :: D Ds Ds :: Ds D if row(D ) < m1 Ds := Em1 ,0 Ds [] m1 row(D ) skip f i; if col(D ) < m2 Ds := Ds E0,m2 [] m2 col(D ) skip fi fi Here the auxiliary function reduce r (B , Ds, Bs ) is the algorithm from case 1 in section 6.3.2: Rs := r(B ); if Bs = [ ] skip [] Bs :: B Bs do Rs :: Rs R cand row(R ) < row(B ) Rs := Rs od

79

f i; if Ds = [ ] skip [] Ds :: Ds D do Rs :: R Rs cand col(R ) col(D ) Rs := Rs od f i; return(Rs) Function reduce c (C , Ds, Cs ) is, of course, the algorithm from case 2: Rs := r(C ); if Cs = [ ] skip [] Cs :: C Cs do Rs :: Rs R cand row(R ) row(C ) Rs := Rs od f i; if Ds = [ ] skip [] Ds :: Ds D do Rs :: R Rs cand col(R ) < col(D ) Rs := Rs od f i; return(Rs) We have now seen how to compute the list corresponding to (Q, R, ) given symbol and lists Bs and Qs, with q(Bs) = Q and q(Cs) = R. We have not given this function a name yet; let us call it : L(pref(P )) L(pref(P )) L(pref(P )). Precomputing for all lists that represent reachable sets can be done using the following algorithm. We will not give a full correctness proof for this algorithm. Informally, the invariant of the main repetition is that (Bs, Cs, ) has been computed for all pairs Bs, Cs in Qclosed and all symbols . Qopened := {[Em1 ,0 , E0,m2 ]}; Qclosed := ?; do Qopened = ? Bs : Bs Qopened ; Qnew := ?; for : compute (Bs, Bs, ); Qnew := Qnew { (Bs, Bs, )}; for Cs : Cs Qclosed compute (Bs, Cs, ); Qnew := Qnew { (Bs, Cs, )}; compute (Cs, Bs, ); Qnew := Qnew { (Cs, Bs, )} rof rof ; Qclosed := Qclosed {Bs}; Qopened := (Qopened Qnew ) \ Qclosed od 80

Referring to [IN77, MP04, Pol04] for the denition of tessellation automata, we can say that this corresponds to precomputing a two-dimensional deterministic online tessellation automaton with the following properties: alphabet: ; state set: Qclosed ; transition function: ; initial state: [Em1 ,0 , E0,m2 ]; set of nal states: {[P ]}. (Note that su(P ) pref(P ) is indeed a reachable set and therefore [P ] Qclosed .) The matrices that occur in the lists in these algorithms are all prexes of P . If pattern P is known, we can represent a prex of P uniquely by its size. This means that in an implementation, it is not necessary to store lists of matrices; lists of integer pairs suce.

6.3.4

Computation of the failure function

As we have seen in section 6.3.2, we still need to nd a way to (pre)compute the values of auxiliary function r : pref(P ) L(pref(P )), specied as follows, for all A pref(P ): q(r(A)) = (su(A) pref(P )) \ {A} OL(r(A)) First we will consider the cases where A is an empty matrix in isolation. row(A) = 0 col(A) = 0: A = E0,0 . Then we can derive: (su(E0,0 ) pref(P )) \ {E0,0 } = = = { denition su } ({E0,0 } pref(P )) \ {E0,0 } { set calculus } { denition q } q([ ]) row(A) = 0 0 < col(A): A = E0,col(A) . (su(E0,col(A) ) pref(P )) \ {E0,col(A) } = = = { denition su, ES } (ES0,col(A) pref(P )) \ {E0,col(A) } { col(A) m1 } ES0,col(A) \ {E0,col(A) } { denition ES, set calculus, 0 < col(A) } 81

ES0,col(A)1 = = { denition pref, ES; col(A) m1 } ES0,col(A)1 pref(P ) { denition q } q([E0,col(A)1 ]) 0 < row(A) col(A) = 0. This is completely symmetrical to the previous case and gives us: (su(A) pref(P )) \ {A} = q([Erow(A)1,0 ]). The case where A is a nonempty matrix remains. That is, 0 < row(A) 0 < col(A). Informally, we want to nd a way to compute the list r(A). This list consists of the maximal elements of the set (su(A) pref(P )) \ {A}, ordered by decreasing number of rows and increasing number of columns. The method we will describe consists of running a (bounded) linear search for each j : 0 j row(A), to nd the maximal element with exactly j rows, if such a maximal element exists. We start with the following observation: (su(A) pref(P )) \ {A} = { property su, set calculus, 0 < row(A) 0 < col(A) } ({A} pref(P )) \ {A} (su(A[1 .. row(A))[0 .. col(A))) pref(P )) \ {A} (su(A[0 .. row(A))[1 .. col(A))) pref(P )) \ {A} = { set calculus, denition su } (su(A[1 .. row(A))[0 .. col(A))) pref(P )) (su(A[0 .. row(A))[1 .. col(A))) pref(P )) = { introduction of g } g(row(A) 1) (su(A[0 .. row(A))[1 .. col(A))) pref(P )) Here we introduce the following auxiliary function: g(j) = su(A[row(A) j .. row(A))[0 .. col(A))) pref(P ) It may seem that at this point we could simply conclude that A[0 .. row(A))[1 .. col(A)) is the rst element of r(A). However, we know that r(A) L(pref(P )) and in general, A[0 .. row(A))[1 .. col(A)) does not need to be a prex of P . We now introduce variables Rs : L(pref(P )) and i1 : N and the following tail invariant: P0 : q(r(A)) = q(Rs) g(i1 )

We could initially establish P 0 by choosing Rs = [ ] and i1 = row(A). However, it will be useful to consider the case of i1 = row(A) seperately. Therefore, we choose to initialise i1 = row(A)1. The initialisation of Rs then consists of a repetition; we introduce integer variable i2 and the following tail invariant: Q0 : q(r(A)) = g(row(A) 1) (su(A[0 .. row(A))[col(A) i2 .. col(A))) pref(P )) 82

As we have seen, we can initially choose i2 = col(A) 1. We can terminate the computation when A[0 .. row(A))[col(A) i2 .. col(A)) = P [0 .. row(A))[0 .. i2 ), because then the assignment Rs := [P [0 .. row(A))[0 .. i2 )] establishes P 0. We know that a value of i2 exists for which this termination condition holds, namely: i2 = 0. When A[0 .. row(A))[col(A) i2 .. col(A)) = P [0 .. row(A))[0 .. i2 ), we get: g(row(A) 1) (su(A[0 .. row(A))[col(A) i2 .. col(A))) pref(P )) = { 0 < row(A), 0 < i2 } g(row(A) 1) ({A[0 .. row(A))[col(A) i2 .. col(A))} pref(P )) (su(A[1 .. row(A))[col(A) i2 .. col(A))) pref(P )) (su(A[0 .. row(A))[col(A) i2 + 1 .. col(A))) pref(P )) = { A[0 .. row(A))[col(A) i2 .. col(A)) = P [0 .. row(A))[0 .. i2 ) } g(row(A) 1) (su(A[1 .. row(A))[col(A) i2 .. col(A))) pref(P )) (su(A[0 .. row(A))[col(A) i2 + 1 .. col(A))) pref(P )) = { denition g } g(row(A) 1) (su(A[0 .. row(A))[col(A) i2 + 1 .. col(A))) pref(P )) So the initialisation step of our algorithm now becomes: i2 := col(A) 1; { invariant: Q0 } do A[0 .. row(A))[col(A) i2 .. col(A)) = P [0 .. row(A))[0 .. i2 ) i2 := i2 1 od; R := P [0 .. row(A))[0 .. i2 ); Rs := [R ]; i1 := row(A) 1 { P0 } As we can see, we initialise Rs as a list containing one element, R . This list will only grow in the rest of the algorithm text; it will never be an empty list. As an extra invariant we ensure that R is always the last element of Rs: P1 : Rs : Rs L(pref(P )) : Rs = Rs R

Additionally, we also have as an invariant: P2 : i1 < row(R )

Now we can see that we can terminate the computation when col(A) col(R ): q(r(A)) = = { P0 } q(Rs) g(i1 ) { denition g } 83

q(Rs) (su(A[row(A) i1 .. row(A))[0 .. col(A))) pref(P )) = = { col(A) col(R ) } q(Rs) (su(A[row(A) i1 .. row(A))[col(A) col(R ) .. col(A))) pref(P )) { P2 } q(Rs) (su(A[row(A) row(R ) .. row(A))[col(A) col(R ) .. col(A))) pref(P )) { R su(A) } q(Rs) (su(R ) pref(P )) { P 1; (8) on page 60 } q(Rs) Invariant P 0 tells us q(Rs) q(r(A)), so we have q(Rs) = q(r(A)). And since q(r(A)) is the unique list corresponding to r(A) (as we have seen in section 6.3.0), we can then conclude: Rs = r(A). All that remains is to see how to update Rs and i1 . For this we again introduce a repetition, much like the one used for initialisation of i1 and Rs, with the following invariant: R0 : q(r(A)) = q(Rs) g(i1 1) (su(A[row(A) i1 .. row(A))[col(A) i2 .. col(A))) pref(P ))

Initially we choose i2 = col(A) and, like in the initialisation of i1 and Rs, we can stop the computation when we nd A[row(A) i1 .. row(A))[col(A) i2 .. col(A)) = P [0 .. i1 )[0 .. i2 ). However, we have an additional bound; we should also stop the computation when i2 col(R ): q(r(A)) = = = { R0 } q(Rs) g(i1 1) (su(A[row(A) i1 .. row(A))[col(A) i2 .. col(A))) pref(P )) { i2 col(R ) } q(Rs) g(i1 1) (su(A[row(A) i1 .. row(A))[col(A) col(R ) .. col(A))) pref(P )) { P2 } q(Rs) g(i1 1) (su(A[row(A) row(R ) .. row(A))[col(A) col(R ) .. col(A))) pref(P )) { R su(A) } q(Rs) g(i1 1) (su(R ) pref(P )) { P 1; (8) on page 60 } q(Rs) g(i1 1) All this gives rise to the following algorithm to compute r(A): if row(A) = 0 col(A) = 0 Rs := [ ] [] row(A) = 0 0 < col(A) Rs := [E0,col(A)1 ] [] 0 < row(A) col(A) = 0 Rs := [Erow(A)1,0 ] [] 0 < row(A) 0 < col(A) i2 := col(A) 1; 84

do A[0 .. row(A))[col(A) i2 .. col(A)) = P [0 .. row(A))[0 .. i2 ) i2 := i2 1 od; R := P [0 .. row(A))[0 .. i2 ); Rs := [R ]; i1 := row(A) 1; do col(R ) < col(A) i2 := col(A); do col(R ) < i2 A[row(A) i1 .. row(A))[col(A) i2 .. col(A)) = P [0 .. i1 )[0 .. i2 ) i2 := i2 1 od; if col(r ) < i2 R := P [0 .. i1 )[0 .. i2 ); Rs := Rs R [] i2 col(R ) skip f i; i1 := i1 + 1 od fi

6.4

Entire algorithm

In section 6.2 we have presented a preliminary version of Polcars algorithm, expressed in terms of sets of prexes of P . To make use of function , precomputed as described in section 6.3, these sets will need to be represented by the ordered lists of their maximal elements. To replace array e, we introduce array e : [0 .. n2 ] of L(pref(P )), with q(e [j]) = e[j] and OL(e [j]) (for 0 j n2 ). Furthermore, we note: P q(e [j]) { OL(e [j]); e L(pref(P )) } e [j] = [P ] We get the following algorithm: O := ?; for j : 0 j n2 e [j] := [Em1 ,0 , E0,m2 ] rof ; i1 , i2 := 0, 0; do i1 = n1 do i2 = n2 e [i2 + 1] := (e [i2 + 1], e [i2 ], T [i1 , i2 ]); if e [i2 + 1] = [P ] O := O {(i1 + 1 m1 , i2 + 1 m2 )} [] e [i2 + 1] = [P ] skip f i; i2 := i2 + 1 od; i2 := 0; i1 := i1 + 1 od

85

6.5

Remarks

There are a few dierences between this version of the algorithm and the original presentations in [Pol04, MP04]. In the original presentations, a so-called (nondeterministic) two-dimensional online tessellation automaton (2OTA) is constructed, where each state in the 2OTA corresponds to a prex of the pattern. This 2OTA is transformed into a two-dimensional deterministic online tessellation automaton (2DOTA), using an algorithm very similar to the subset construction.3 As a result, the states in the resulting 2DOTA (implicitly) each correspond to a set of prexes of P . In our presentation, we do not need to introduce tessellation automata. Nor do we rst construct an equivalent of the 2OTA; we immediately derive an algorithm to precompute the values of , which is the equivalent of the 2DOTAs transition function. As we have seen in section 6.3.0, we do start out our derivation of the transition function by examining sets of prexes of P , but we represent these sets by a list of their maximal elements. This approach improves the precomputations performance and is more similar to the one-dimensional pattern matching algorithms which Polcars algorithm is based on: Aho-Corasick and KnuthMorris-Pratt ([AC75, KMP77]). Even with these improvements, the precomputation for this algorithm can be very unecient. However, the precomputations performance depends only on the size of the pattern. As a result, Polcars algorithm is most useful when the pattern is relatively very small, compared to the text.
3 The subset construction is an algorithm used for transforming a (one-dimensional) nondeterministic nite automaton (NFA) into a deterministic nite automaton (DFA). See [ASU96], pages 117121.

86

87

7
7.0

Conclusions and future work


Conclusions

We have described several, very dierent, two-dimensional pattern matching algorithms. Some are very well-known, such as the Baker-Bird algorithm ([Bak78, Bir77]), the rst known solution to the problem. On the other hand we have also described the very recent Polcar algorithm ([Pol04, MP04]), which was rst presented in 2004. All of these algorithms have been formally derived. The derivations are formal proofs that these algorithms are correct solutions to the two-dimensional pattern matching problem, while their presentations in existing literature usually lack such full correctness proofs. The derivations also show where we have taken some of the major design decisions, such as the choice of a lter function in the lter-based algorithms (see section 4.0 on page 20). This can be very useful in discovering new two-dimensional pattern matching algorithms. We have described the similarities between the Baker-Bird and Takaoka-Zhu ([TZ89, TZ94]) algorithms, by presenting both using a uniform description, as so-called lter-based algorithms. Because of this it became immediately obvious that a space complexity improvement, rst suggested in the original presentation of Takaoka-Zhu, was also applicable to the Baker-Bird algorithm (and almost any other possible lter-based algorithm; see section 4.2 on page 22). We have derived Baeza-Yates and Rgniers algorithm ([BYR93]) and also proposed two improvee ments over its original presentation (section 5.2 on page 36 and section 5.3 on page 39). Both of these improvements use information obtained during a failed matching attempt to speed up the computation by skipping unnecessary row comparisons. Polcars algorithm was derived without using the tessellation automata data structure or any of its properties. In addition, we have shown an improvement in the precomputation step of this algorithm: where Polcars approach uses sets of matrices, we have represented these sets by lists of the maximal elements of these sets (see section 6.3 on page 56). This is an improvement of both the time and space complexity of the precomputation. It is also an interesting improvement because it is so similar to the approach of the one-dimensional Aho-Corasick and Knuth-MorrisPratt algorithms ([AC75, KMP77]): there too we reason in terms of strings of maximal length, as opposed to sets of strings (see also [WZ92, WZ93]). Part of the original goal of this research was to create a taxonomy of two-dimensional pattern matching algorithms. We have not done this yet, because the algorithms we have seen dier greatly from each other, with the exception of the lter-based algorithms: Baker-Bird and Takaoka-Zhu. Given these great dierences a taxonomy would only have a very course structure, not giving any additional information or insight.

7.1

Future work

Although we have derived several two-dimensional pattern matching algorithms, there are still known algorithms that have not been described in this thesis. Future research can include the derivation of more known algorithms. Also, there may be many solutions to the two-dimensional pattern matching problem that have not been discovered yet. In particular, exploring other choices for the lter function in the lter-based approach (see section 4 on page 20) may lead to new solutions. When more algorithms have been derived, constructing a taxonomy will be useful. Existing pattern matching toolkits, such as SPARE Time and SPARE Parts (see [Cle03]), can be expanded to include the two-dimensional pattern matching algorithms described in this thesis.

88

However, matrices are a rather dierent data structure than strings, so generalising existing toolkits to include two-dimensional pattern matching strategies may be dicult; in that case, a seperate toolkit for two-dimensional pattern matching can be developed, in the same style as SPARE Time and SPARE Parts. Once an ecient and practical implementation has been constructed, it is possible to perform a benchmarking; that is, a thorough practical performance analysis. A complete theoretical performance analysis (of memory and time complexity) can be done for the algorithms described here as well. We can refer to the original presentations of the algorithms for their space and time complexity; however, it is unknown how the improvements made to the Baeza-Yates and Rgnier algorithm and the Polcar precomputation algorithm aect their average-case theoretical e performance exactly. Further generalisations of the two-dimensional pattern matching problem are also possible and worth exploring. We have already briey discussed generalisation to multipattern matching and matching in more than two dimensions for most algorithms. Other generalisations include approximate two-dimensional pattern matching (where we need to dene a distance function in terms of matrices) and matching patterns of non-rectangular shapes. (This last generalisation can be solved using the algorithms presented here, by introducing a dont care symbol, which matches with every symbol of the text, and expanding the pattern to a bounding rectangle with this symbol. However, more ecient solutions may be possible.)

89

Properties of div and mod

Denition A.0 For a N and q N+ we dene div and mod by: a = (a div q) q + a mod q 0 a mod q < q Theorem A.1 For all a, b N and q N+ we have: (a q + b) mod q = b mod q Proof (a q + b) mod q = = = { def. div and mod : b = (b div q) q + b mod q } (a q + (b div q) q + b mod q) mod q { over + } ((a + b div q) q + b mod q) mod q { 0 b mod q < q; def. div and mod : X = Y q + Z 0 Z < q Z = X mod q } b mod q

Theorem A.2 For all a, b, c N and q N+ we have: (a b + c) mod q = ((a mod q) b + c) mod q Proof (a b + c) mod q = = = { a = (a div q) q + a mod q } (((a div q) q + a mod q) b + c) mod q { over + } ((a div q) q b + (a mod q) b + c) mod q { theorem A.1 } ((a mod q) b + c) mod q

Theorem A.3 For a ∈ ℕ and q ∈ ℕ⁺:

    a ≤ (a div q + 1)·q - 1 < a + q

90

Proof First we note:

    (a div q + 1)·q - 1 = (a div q)·q + q - 1

Now we can prove the theorem with the following derivation:

      a
    =    { definition div and mod }
      (a div q)·q + a mod q
    ≤    { a mod q < q }
      (a div q)·q + q - 1
    ≤    { 0 ≤ a mod q }
      (a div q)·q + a mod q + q - 1
    =    { definition div and mod }
      a + q - 1
    <    { math }
      a + q
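The identities above are also easy to check mechanically; the following small Python check (ours, for illustration) exercises theorems A.1 through A.3, using the fact that Python's // and % coincide with div and mod of definition A.0 for non-negative a and positive q:

# Exhaustive check of theorems A.1-A.3 over a small range of values.
for a in range(30):
    for b in range(30):
        for c in range(10):
            for q in range(1, 8):
                assert (a * q + b) % q == b % q                    # theorem A.1
                assert (a * b + c) % q == ((a % q) * b + c) % q    # theorem A.2
                assert a <= (a // q + 1) * q - 1 < a + q           # theorem A.3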


B Properties of pref and suff (for strings)

Theorem B.0 For x, y ∈ Σ*:

    suff(xy) = suff(x)y ∪ suff(y)

Proof In the following derivation, the bound variables v and w are both strings over Σ; we will omit their domains in our set quantifications.

      suff(x)y ∪ suff(y)
    =    { suff, twice }
      (set v, w : vw = x : w)y ∪ (set v, w : vw = y : w)
    =    { concatenation over set }
      (set v, w : vw = x : wy) ∪ (set v, w : vw = y : w)
    =    { dummy transformation, twice }
      (set v, w : |y| ≤ |w| ∧ vw = xy : w) ∪ (set v, w : |w| ≤ |y| ∧ vw = xy : w)
    =    { set calculus (note the overlap at |w| = |y|) }
      (set v, w : vw = xy : w)
    =    { suff }
      suff(xy)

Theorem B.1 For x, y ∈ Σ* and a ∈ Σ:

    suff(xay) = suff(x)ay ∪ suff(y)

Proof

      suff(xay)
    =    { theorem B.0 }
      suff(xa)y ∪ suff(y)
    =    { property suff }
      suff(x)ay ∪ {y} ∪ suff(y)
    =    { y ∈ suff(y) }
      suff(x)ay ∪ suff(y)
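Theorems B.0 and B.1 can likewise be checked mechanically on small cases; a short Python check (ours, for illustration), where suff(x) is computed directly as the set of all suffixes of x:

from itertools import product

def suff(x):
    # The set of all suffixes of x, including the empty string and x itself.
    return {x[i:] for i in range(len(x) + 1)}

alphabet = "ab"
words = ["".join(w) for n in range(4) for w in product(alphabet, repeat=n)]
for x in words:
    for y in words:
        # theorem B.0: suff(xy) = suff(x)y ∪ suff(y)
        assert suff(x + y) == {s + y for s in suff(x)} | suff(y)
        for a in alphabet:
            # theorem B.1: suff(xay) = suff(x)ay ∪ suff(y)
            assert suff(x + a + y) == {s + a + y for s in suff(x)} | suff(y)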



C Properties of pref and suff (for matrices)

Theorem C.0 For matrix A and integers i₁ and i₂, with i₁ ≤ row(A) and i₂ ≤ col(A), we have:

    ES_{i₁,i₂} ⊆ pref(A)    and    ES_{i₁,i₂} ⊆ suff(A)

Proof

      ES_{i₁,i₂}
    =    { def. ES }
      (set j₁, j₂ : 0 ≤ j₁ ≤ i₁ ∧ 0 ≤ j₂ ≤ i₂ ∧ (j₁ = 0 ∨ j₂ = 0) : E_{j₁,j₂})
    =    { i₁ ≤ row(A), i₂ ≤ col(A), def. submatrix }
      (set j₁, j₂ : 0 ≤ j₁ ≤ i₁ ∧ 0 ≤ j₂ ≤ i₂ ∧ (j₁ = 0 ∨ j₂ = 0) : A[0 .. j₁)[0 .. j₂))
    ⊆    { set calculus }
      (set j₁, j₂ : 0 ≤ j₁ ≤ i₁ ∧ 0 ≤ j₂ ≤ i₂ : A[0 .. j₁)[0 .. j₂))
    ⊆    { set calculus, i₁ ≤ row(A), i₂ ≤ col(A) }
      (set j₁, j₂ : 0 ≤ j₁ ≤ row(A) ∧ 0 ≤ j₂ ≤ col(A) : A[0 .. j₁)[0 .. j₂))
    =    { def. pref }
      pref(A)

For suff the derivation is analogous:

      ES_{i₁,i₂}
    =    { def. ES }
      (set j₁, j₂ : 0 ≤ j₁ ≤ i₁ ∧ 0 ≤ j₂ ≤ i₂ ∧ (j₁ = 0 ∨ j₂ = 0) : E_{j₁,j₂})
    =    { i₁ ≤ row(A), i₂ ≤ col(A), def. submatrix }
      (set j₁, j₂ : 0 ≤ j₁ ≤ i₁ ∧ 0 ≤ j₂ ≤ i₂ ∧ (j₁ = 0 ∨ j₂ = 0) : A[row(A) - j₁ .. row(A))[col(A) - j₂ .. col(A)))
    ⊆    { set calculus }
      (set j₁, j₂ : 0 ≤ j₁ ≤ i₁ ∧ 0 ≤ j₂ ≤ i₂ : A[row(A) - j₁ .. row(A))[col(A) - j₂ .. col(A)))
    ⊆    { set calculus, i₁ ≤ row(A), i₂ ≤ col(A) }
      (set j₁, j₂ : 0 ≤ j₁ ≤ row(A) ∧ 0 ≤ j₂ ≤ col(A) : A[row(A) - j₁ .. row(A))[col(A) - j₂ .. col(A)))
    =    { dummy transformation: j₁ := row(A) - j₁, j₂ := col(A) - j₂ }
      (set j₁, j₂ : 0 ≤ j₁ ≤ row(A) ∧ 0 ≤ j₂ ≤ col(A) : A[j₁ .. row(A))[j₂ .. col(A)))
    =    { def. suff }
      suff(A)
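For reference, the sets pref(A) and suff(A) used in this proof can be enumerated naively; a Python sketch (ours), with a matrix represented as a tuple of equal-length row tuples. Note that this representation collapses the empty matrices E_{0,j} for different j into a single value, so it only approximates the formal definitions:

def pref(A):
    # All prefix submatrices A[0 .. i1)[0 .. i2).
    rows, cols = len(A), (len(A[0]) if A else 0)
    return {tuple(r[:i2] for r in A[:i1])
            for i1 in range(rows + 1) for i2 in range(cols + 1)}

def suff(A):
    # All suffix submatrices A[j1 .. row(A))[j2 .. col(A)).
    rows, cols = len(A), (len(A[0]) if A else 0)
    return {tuple(r[j2:] for r in A[j1:])
            for j1 in range(rows + 1) for j2 in range(cols + 1)}

A = (("a", "b"), ("c", "d"))
assert (("a",),) in pref(A) and (("d",),) in suff(A)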


Theorem C.1 For all A ∈ M₂(Σ) and k₁, k₂ ∈ ℕ (writing ↓ for the minimum operator):

    ES_{k₁,k₂} ∩ pref(A) = ES_{k₁ ↓ row(A), k₂ ↓ col(A)}

Proof

      ES_{k₁,k₂} ∩ pref(A)
    =    { def. ES, def. pref }
      ((set j : 0 ≤ j ≤ k₁ : E_{j,0}) ∪ (set j : 0 ≤ j ≤ k₂ : E_{0,j}))
        ∩ (set i₁, i₂ : 0 ≤ i₁ ≤ row(A) ∧ 0 ≤ i₂ ≤ col(A) : A[0 .. i₁)[0 .. i₂))
    =    { ∩ over ∪, set calculus }
      (set j, i₁, i₂ : 0 ≤ j ≤ k₁ ∧ 0 ≤ i₁ ≤ row(A) ∧ 0 ≤ i₂ ≤ col(A) ∧ A[0 .. i₁)[0 .. i₂) = E_{j,0} : E_{j,0})
        ∪ (set j, i₁, i₂ : 0 ≤ j ≤ k₂ ∧ 0 ≤ i₁ ≤ row(A) ∧ 0 ≤ i₂ ≤ col(A) ∧ A[0 .. i₁)[0 .. i₂) = E_{0,j} : E_{0,j})
    =    { def. submatrix, math }
      (set j : 0 ≤ j ≤ k₁ ↓ row(A) : E_{j,0}) ∪ (set j : 0 ≤ j ≤ k₂ ↓ col(A) : E_{0,j})
    =    { def. ES }
      ES_{k₁ ↓ row(A), k₂ ↓ col(A)}


D Lists

Here we will briefly describe the notation we use for lists.

- [ ]: the empty list.
- As ++ Bs: the concatenation of lists As and Bs.
- A ◁ As: the list which consists of element A, followed by As; equivalently: [A] ++ As. In other terminology, A is the head of the list and As the tail.
- As ▷ A: the list which consists of As, followed by element A; equivalently: As ++ [A].
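For concreteness, the same operations rendered in Python (our mapping; Python has no separate cons and snoc operators, so both are expressed via concatenation with a singleton list):

As, Bs, A = [1, 2], [3], 0

empty = []            # [ ]
concat = As + Bs      # As ++ Bs
cons = [A] + As       # A ◁ As: head A, tail As
snoc = As + [A]       # As ▷ A

assert concat == [1, 2, 3] and cons == [0, 1, 2] and snoc == [1, 2, 0]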



References

[AC75] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, June 1975.

[ASU96] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: principles, techniques and tools. Addison-Wesley, 1996.

[Bak78] Theodore P. Baker. A technique for extending rapid exact-match string matching to arrays of more than one dimension. SIAM Journal on Computing, 7(4):533–541, 1978.

[Bir77] Richard Bird. Two dimensional pattern matching. Information Processing Letters, 6(5):168–170, 1977.

[BYR93] Ricardo Baeza-Yates and Mireille Régnier. Fast two-dimensional pattern matching. Information Processing Letters, 45(1):51–57, 1993.

[Cle03] Loek Cleophas. Towards SPARE Time. Master's thesis, Technische Universiteit Eindhoven, August 2003.

[CR02] Maxime Crochemore and Wojciech Rytter. Jewels of stringology. World Scientific, 2002.

[IN77] Katsushi Inoue and Akira Nakamura. Some properties of two-dimensional online tessellation acceptors. Information Sciences, 43:169–184, 1977.

[KMP77] Donald E. Knuth, James H. Morris, and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977.

[KR87] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, March 1987.

[MP04] Bořivoj Melichar and Tomáš Polcar. A two-dimensional online tessellation automata approach to two-dimensional pattern matching. In Loek Cleophas and Bruce W. Watson, editors, Proceedings of the Eindhoven FASTAR days 2004. Computer science report 04/40, Technische Universiteit Eindhoven, December 2004.

[MZ05] Bořivoj Melichar and Jan Žďárek. On two-dimensional pattern matching by finite automata. In J. Farré, I. Litovsky, and S. Schmitz, editors, CIAA 2005 pre-proceedings, pages 185–194, 2005.

[NR02] Gonzalo Navarro and Mathieu Raffinot. Flexible pattern matching in strings. Cambridge University Press, 2002.

[Pol04] Tomáš Polcar. Two-dimensional pattern matching. Postgraduate study report DC-PSR-04-05, Czech Technical University, January 2004.

[RS97] G. Rozenberg and A. Salomaa. Handbook of formal languages, volume 3: Beyond words, chapter 4: Two-dimensional languages, pages 215–267. Springer-Verlag, Berlin, 1997.

[TZ89] Tadao Takaoka and Rui Feng Zhu. A technique for two-dimensional pattern matching. Communications of the ACM, 32(9):1110–1120, 1989.

[TZ94] Tadao Takaoka and Rui Feng Zhu. A technique for two-dimensional pattern matching. In Jun-Ichi Aoe, editor, String pattern matching strategies, pages 220–230. IEEE Computer Society Press, 1994.

[Wat95] Bruce W. Watson. Taxonomies and toolkits of regular language algorithms. PhD thesis, Technische Universiteit Eindhoven, 1995.

[WZ92] Bruce W. Watson and Gerard Zwaan. A taxonomy of keyword pattern matching algorithms. Computing science report 92/27, Technische Universiteit Eindhoven, 1992.

[WZ93] Bruce W. Watson and Gerard Zwaan. A taxonomy of keyword pattern matching algorithms. In H. A. Wijshoff, editor, Proceedings Computing Science in the Netherlands 93, pages 25–39. SION, Stichting Mathematisch Centrum, 1993.

[WZ95] Bruce W. Watson and Gerard Zwaan. A taxonomy of sublinear keyword pattern matching algorithms. Computing science report 95/13, Technische Universiteit Eindhoven, 1995.

[WZ96] Bruce W. Watson and Gerard Zwaan. A taxonomy of sublinear multiple keyword pattern matching algorithms. Science of Computer Programming, 27(2):85–118, September 1996.

