首页 > ACM题库 > HDU-杭电 > HDU 4644-BWT-KMP-[解题报告]HOJ
2015
09-17

HDU 4644-BWT-KMP-[解题报告]HOJ

BWT

问题描述 :

When the problem to match S string in T string is mentioned, people always put KMP, Aho-Corasick and Suffixarray forward. But Mr Liu tells Canoe that there is an algorithm called Burrows�CWheeler Transform(BWT) which is quite amazing and high-efficiency to solve the problem.
But how does BWT work to solve the matching S-in-T problem? Mr Liu tells Canoe the firstly three steps of it.
Firstly, we append the ‘$’ to the end of T and for convenience, we still call the new string T. And then for every suffix of T string which starts from i, we append the prefix of T string which ends at (i �C 1) to its end. Secondly, we sort these new strings by the dictionary order. And we call the matrix formed by these sorted strings Burrows Wheeler Matrix. Thirdly, we pick characters of the last column to get a new string. And we call the string of the last column BWT(T). You can get more information from the example below.

GSM

Then Mr Liu tells Canoe that we only need to save the BWT(T) to solve the matching problem. But how and can it? Mr Liu smiles and says yes. We can find whether S strings like “aac” are substring of T string like “acaacg” or not only knowing the BWT(T)! What an amazing algorithm BWT is! But Canoe is puzzled by the tricky method of matching S strings in T string. Would you please help Canoe to find the method of it? Given BWT(T) and S string, can you help Canoe to figure out whether S string is a substring of string T or not?

输入:

There are multiple test cases.
First Line: the BWT(T) string (1 <= length(BWT(T)) <= 100086).
Second Line: an integer n ( 1 <=n <= 10086) which is the number of S strings.
Then n lines comes.
There is a S string (n * length(S) will less than 2000000, and all characters of S are lowercase ) in every line.

输出:

There are multiple test cases.
First Line: the BWT(T) string (1 <= length(BWT(T)) <= 100086).
Second Line: an integer n ( 1 <=n <= 10086) which is the number of S strings.
Then n lines comes.
There is a S string (n * length(S) will less than 2000000, and all characters of S are lowercase ) in every line.

样例输入:

gc$aaac
2
aac
gc

样例输出:

YES   
NO
Hint
A naive method will not be accepted.

看完题目你很容易想到,这个题目的关键点就是如何把给出的数组还原成原数组。

还原的原数组之后不管是AC自动机 还是 kmp都可以解决 – -虽然我觉得kmp会超时的感觉。

那么如何还原这个字符串就是在个题目的难点。。。

gc$aaac

1234567

排序之后变成了

$aaaccg

 3456271

然后你按照排序后的下标依次走过去

会发现

$->a->c->a->a->c->g 

  3     5   2   4    6    7

也就恢复了原串。

#include <cstdio>
#include <iostream>
#include <cstring>
#include <algorithm>
#define maxn 100186
using namespace std;

struct node
{
    char ch;
    int index;
    bool operator < (const node & cmp)const
    {
        return ch<cmp.ch;
    }
}save[maxn];
char t[maxn];
char str[maxn],txt[maxn];
int next[maxn];

void getnext(int len)
{
    next[0]=0;next[1]=0;
    for(int i=1;i<len;i++)
    {
        int j=next[i];
        while(j && txt[j]!=txt[i])j=next[j];
        next[i+1]=txt[j]==txt[i]?j+1:0;
    }
}

void find(int n,int m)
{
    int j=0;
    for(int i=0;i<n;i++)
    {
        while(j&&txt[j]!=str[i])j=next[j];
        if(txt[j]==str[i])j++;
        if(j==m){printf("YES\n");return ;}
    }
    printf("NO\n");
}

int main()
{
    while(scanf("%s",t)!=EOF)
    {
        int len=strlen(t);
        for(int i=0;i<len;i++)
        {
            save[i].ch=t[i];
            save[i].index=i;
        }

        stable_sort(save,save+len);

        int now=save[0].index;
        for(int i=0;i<len-1;i++)
        {
            str[i]=save[now].ch;
            now=save[now].index;
        }

        int q;
        scanf("%d",&q);
        while(q--)
        {
            scanf("%s",txt);
            int m=strlen(txt);
            getnext(m);

            find(len-1,m);
        }
    }
    return 0;
}

版权声明:本文为博主原创文章,未经博主允许不得转载。

参考:http://blog.csdn.net/u010709592/article/details/37669323