2014
03-23

# String-Matching Automata

The finite state automaton (FSA) is an important model of behavioral investigations in computer science, linguistics and many other areas. A FSA can be typically modeled as a string pattern recognizer described by a quintuple <Σ, S, s0, δ, F>, where:

Σ is the input alphabet (a finite nonempty set of symbols).
S is a finite nonempty set of states.
s0 is an element in S designated as the initial state.
δ is a function δ: S × Σ → S known as the transition function.
F is a (possibly empty) subset of S whose elements are designated as the final states.

An FSA with the above description operates as follows:

At the beginning, the automaton starts in the initial state s0.
The automaton continuously reads symbols from its input, one symbol at a time, and transits between states according to the transition function δ. To be specific, let s be the current state and w the symbol just read, the automaton moves to the state given by δ(s, w).
When the automaton reaches the end of the input, if the current state belongs to F, the string consisting sequentially of the symbols read by the automaton is declared accepted, otherwise it is declared rejected.

Just as the name implies, a string-matching automaton is a FSA that is used for string matching and is very efficient: they examine each character exactly once, taking constant time per text character. The matching time used (after the automaton is built) is therefore Θ(n). However, the time to build the automaton can be large.

Precisely, there is a string-matching automaton for every pattern P that you search for in a given text string T. For a given pattern of length m, the corresponding automaton has (m + 1) states {q0, q1, …, qm}: q0 is the start state, qm is the only final state, and for each i in {0, 1, …, m}, if the automaton reaches state qi, it means the length of the longest prefix of P that is also a suffix of the input string is i. When we reaches state qm, it means P is a suffix of the currently input string, which suggest we find an occurrence of P.

The following graph shows a string-matching automaton for the pattern “ababaca”, and illustrates how the automaton works given an input string “abababacaba”.

Apparently, the matching process using string-matching automata is quite simple (also efficient). However, building the automaton efficiently seems to be tough, and that’s your task in this problem.

Several lines, each line has one pattern consist of only lowercase alphabetic characters. The length of the longest pattern is 10000. The input ends with a separate line of ‘0’.

Several lines, each line has one pattern consist of only lowercase alphabetic characters. The length of the longest pattern is 10000. The input ends with a separate line of ‘0’.

ababaca
0

0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 4 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

#include<cstdio>
#include<cstring>
#include<algorithm>
#include<iostream>
#include<cmath>
#include<queue>
#define inf 111111111
using namespace std;
const int maxn=10005;
struct node
{
int id,cnt;
int next[26],fail;
void init()
{
cnt=fail=0;
memset(next,0,sizeof(next));
}
}trie[maxn*3];
char str[maxn];
int n,m;
int cnt;
void insert(char *str)
{
int p=0,x;
for(int i=0;str[i];i++)
{
x=str[i]-'a';
if(!trie[p].next[x])
{
trie[++cnt].init();
trie[p].next[x]=cnt;
}
p=trie[p].next[x];
}
trie[p].cnt++;
}
void build_ac()
{
queue<int> q;
int p,tmp,child;
q.push(0);
while(!q.empty())
{
p=q.front();
q.pop();
for(int i=0;i<26;i++)
{
child=trie[p].next[i];
if(child)
{
q.push(child);
if(p)
{
tmp=trie[p].fail;
trie[child].fail=trie[tmp].next[i];
}
}
else
trie[p].next[i]=trie[trie[p].fail].next[i];
}
}
}
void solve()
{
for(int i=0;i<=cnt;i++)
{
printf("%d",i);
for(int j=0;j<26;j++)
{
printf(" %d",trie[i].next[j]);
}
printf("\n");
}

}
int main()
{
char tmp[maxn];
while(scanf("%s",tmp)&&tmp[0]!='0')
{
trie[0].init();
cnt=0;
insert(tmp);
build_ac();

solve();
}
return 0;
}

1. 代码是给出了，但是解析的也太不清晰了吧！如 13 abejkcfghid jkebfghicda
第一步拆分为 三部分 (bejk, cfghi, d) * C(13,3)，为什么要这样拆分，原则是什么？

2. 代码是给出了，但是解析的也太不清晰了吧！如 13 abejkcfghid jkebfghicda
第一步拆分为 三部分 (bejk, cfghi, d) * C(13,3)，为什么要这样拆分，原则是什么？

3. L（X [0 .. M-1]，Y [0 .. N-1]）= 1 + L（X [0 .. M-2]，Y [0 .. N-1]）这个地方也也有笔误
应改为L（X [0 .. M-1]，Y [0 .. N-1]）= 1 + L（X [0 .. M-2]，Y [0 .. N-2]）