1
00:00:00,000 --> 00:00:04,505
Hello, every one!

2
00:00:04,505 --> 00:00:11,820
This lecture of data structure and algorithm is about KMP algorithm in the fourth chapter. 

3
00:00:11,820 --> 00:00:19,880
In the KMP algorithm, the main point is to reduce backtracking as possible as we can.

4
00:00:19,880 --> 00:00:27,660
KMP is, in fact, an algorithm without backtracking.

5
00:00:27,660 --> 00:00:34,220
Then which character pk in P should we use to continue to compare with ti when there is mismatch?

6
00:00:34,220 --> 00:00:39,020
Knuth et al

7
00:00:39,020 --> 00:00:43,520
found the value k only depending on the pattern itself

8
00:00:43,520 --> 00:00:49,470
but not related to the target. Let us discuss the process.

9
00:00:49,470 --> 00:00:55,590
For this target string and pattern string,

10
00:00:55,590 --> 00:00:59,400
the result of their comparison are all equal at first,

11
00:00:59,400 --> 00:01:05,720
but ti is not equal to pj.

12
00:01:05,720 --> 00:01:11,410
If it is simple matching, it goes back to

13
00:01:11,410 --> 00:01:17,390
ti-j+1 and continue to compare with p0. This is the matching process.

14
00:01:17,390 --> 00:01:22,580
If we know ahead of time that

15
00:01:22,580 --> 00:01:27,620
this string is not equal to this one. In fact,


16
00:01:27,620 --> 00:01:33,170
they are real substring of the head and tail. That is, when we find they are not equal,

17
00:01:33,170 --> 00:01:40,870
we will cover it. Then we will focus on the substring of the pattern P.

18
00:01:40,870 --> 00:01:48,740
If they are not equal, there is no point for the next comparison, as we will soon find they are not equal.

19
00:01:48,740 --> 00:01:54,620
Then let us go on. Here is this string.

20
00:01:54,620 --> 00:01:59,890
The length of the string shortens by one.

21
00:01:59,890 --> 00:02:05,410
But in fact we will slide to the right for two.

22
00:02:05,410 --> 00:02:11,750
If they are still not equal, there is still no need to compare.

23
00:02:11,750 --> 00:02:16,520
Until the k,

24
00:02:16,520 --> 00:02:21,060
which means the length of the substring of its head and tail,

25
00:02:21,060 --> 00:02:29,450
so that the substring of its head and tail are all equal for k characters correspondingly

26
00:02:29,450 --> 00:02:34,440
In this case,

27
00:02:34,440 --> 00:02:39,940
as they are equal for k characters and

28
00:02:39,940 --> 00:02:44,950
the substring of p's tail has

29
00:02:44,950 --> 00:02:49,070
been compared with target string just now, we then know

30
00:02:49,070 --> 00:02:55,850
the p0 and the ti-k as well as pk-1 and the ti-1

31
00:02:55,850 --> 00:03:01,210
are also equal correspondingly. As a result, we only need to

32
00:03:01,210 --> 00:03:05,660
compare with pk and ti.

33
00:03:05,660 --> 00:03:11,310
The whole pattern can slide to the right for j-k.

34
00:03:11,310 --> 00:03:16,630
Then let us focus on the feature vector of the string.

35
00:03:16,630 --> 00:03:21,230
In fact, it is a set of integers.

36
00:03:21,230 --> 00:03:27,230
It is also called next array by some documents.

37
00:03:27,230 --> 00:03:33,380
Let us look at the construction method. In fact, 

38
00:03:33,380 --> 00:03:38,335
it is to find out the longest one of

39
00:03:38,335 --> 00:03:44,520
the substring of its head and tail before it.

40
00:03:44,520 --> 00:03:52,180
In other special case, such as j=0, we stipulate it is equal to -1 just for convenience.

41
00:03:52,180 --> 00:03:57,970
If we can not find out substrings like this, the k is 0.

42
00:03:57,970 --> 00:04:05,410
Why should we find out the longest one? 

43
00:04:05,410 --> 00:04:11,925
Here is its characteristic value. The value at N=4

44
00:04:11,925 --> 00:04:18,440
is originally equal to 3 

45
00:04:18,440 --> 00:04:20,540
according to the substring of its head and tail.

46
00:04:20,540 --> 00:04:26,930
Now let it be 1. It is still right, as there are

47
00:04:26,930 --> 00:04:31,060
matching substring for the length of 1.

48
00:04:31,060 --> 00:04:35,750
However, if it is not the maximum value 3,

49
00:04:35,750 --> 00:04:40,530
we will miss some match in some cases.

50
00:04:40,530 --> 00:04:49,130
For example when we match this one and there is no match at P4

51
00:04:49,130 --> 00:04:53,430
we will see,

52
00:04:53,430 --> 00:05:00,050
because we give it as 1 but not the maximum one,

53
00:05:00,050 --> 00:05:05,140
we will compare with P1

54
00:05:05,140 --> 00:05:09,530
and the subscript 7 in the target string.

55
00:05:09,530 --> 00:05:15,150
As a result, it is a very long slide, which is j minus k,

56
00:05:15,150 --> 00:05:22,830
so it will slide for 3 when j=4 and k=1.

57
00:05:22,830 --> 00:05:29,430
In fact, it should slide for 1, which is 4 minus 3, so we miss a probable match.

58
00:05:29,430 --> 00:05:34,205
Let us look at the process of the KMP pattern match.

59
00:05:34,205 --> 00:05:38,980
We find out the feature vector for this pattern.

60
00:05:38,980 --> 00:05:43,560
When there is mismatch at a certain position,

61
00:05:43,560 --> 00:05:48,540
we will look up the feature vector directly

62
00:05:48,540 --> 00:05:53,300
and find out the k value.

63
00:05:53,300 --> 00:05:58,060
This Pk will compare with Ti directly.

64
00:05:58,060 --> 00:06:05,570
The match process is completed. It is very fast. Let us look at the KMP pattern match algorithm.

65
00:06:05,570 --> 00:06:15,570
We assume it has provided with a feature vector and start at the 0 position of pattern and a specific start position of target.

66
00:06:21,790 --> 00:06:29,940
The ij will move forward.

67
00:06:29,940 --> 00:06:36,260
If the corresponding characters are equal,

68
00:06:36,260 --> 00:06:40,340
i++, j++ we will compare for the next one.

69
00:06:40,340 --> 00:06:44,775
When at a position

70
00:06:44,775 --> 00:06:49,210
j==-1, in fact it is when next is 0.

71
00:06:49,210 --> 00:06:53,880
So this one is not equal.

72
00:06:53,880 --> 00:06:59,420
Then j+1 and this is equal to 0

73
00:06:59,420 --> 00:07:07,980
then i+1. That is move forward for another one to continue.

74
00:07:07,980 --> 00:07:14,360
There is also no backtracking. When there is mismatch, then the j is the next j in feature vector,

75
00:07:14,360 --> 00:07:19,520
which is k.

76
00:07:19,520 --> 00:07:25,770
The k value is less then j value.

77
00:07:25,770 --> 00:07:30,800
The P, j and T, i continue to compare.

78
00:07:30,800 --> 00:07:35,775
When this circulation is completed. If the j

79
00:07:35,775 --> 00:07:40,750
is bigger than or equal to the length of the pattern, the match succeeds,

80
00:07:40,750 --> 00:07:48,450
or the match fails. The algorithm to find out the feature vector

81
00:07:48,450 --> 00:07:53,990
is a process of recursion. For n0, we stipulate

82
00:07:53,990 --> 00:07:57,550
it is -1. We assume

83
00:07:57,550 --> 00:08:02,615
we have worked out some positions, which are before j.

84
00:08:02,615 --> 00:08:07,680
Assume we are nj, then it is equal to k.

85
00:08:07,680 --> 00:08:12,520
We want to work out n j+1, that is,

86
00:08:12,520 --> 00:08:17,060
this pj is not equal to pk.

87
00:08:17,060 --> 00:08:24,470
Then let k equal to nk. When there is mismatch

88
00:08:24,470 --> 00:08:31,440
we will continue to compare using the feature before it.

89
00:08:31,440 --> 00:08:38,410
So it is also a process of KMP. So it works out a child feature vector

90
00:08:38,410 --> 00:08:45,010
and use it to do the match.

91
00:08:45,010 --> 00:08:54,600
until the condition ｕｍ is not satisfied which means pj equal to pk.

92
00:08:54,600 --> 00:09:01,040
At this point, n j+1 is equal to k+1. For this feature vector, 

93
00:09:01,040 --> 00:09:07,470
the length plus one for the head and tail.

94
00:09:07,470 --> 00:09:14,810
For the feature vector, we should deal with the special cases at first. 

95
00:09:14,810 --> 00:09:22,590
For j=1, we can work it out using this circulation.

96
00:09:22,590 --> 00:09:27,140
For j=1, the feature vector value is equal to 0, 

97
00:09:27,140 --> 00:09:31,680
because there is no real substring.

98
00:09:31,680 --> 00:09:39,210
For this value got by it,

99
00:09:39,210 --> 00:09:47,430
the k value is consistent. If the P[j]

100
00:09:47,430 --> 00:09:53,490
is not equal to P[k], we will use the k value

101
00:09:53,490 --> 00:09:57,160
we have got before 

102
00:09:57,160 --> 00:10:04,100
to cover this k, k=next[k]. This must be smaller, because

103
00:10:04,100 --> 00:10:08,920
k is smaller than next[k] and j is also smaller than next[j].

104
00:10:08,920 --> 00:10:14,810
If they are equal, when j++, k++,

105
00:10:14,810 --> 00:10:21,030
we will continue to compare. The length of the matching string plus one.

106
00:10:21,030 --> 00:10:25,205
We let the value of next[j] equal to k.

107
00:10:25,205 --> 00:10:29,380
In fact, it is the k plus one.

108
00:10:29,380 --> 00:10:33,370
This is a demonstration to work out

109
00:10:33,370 --> 00:10:39,740
the feature vector. In this process, it always covers itself

110
00:10:39,740 --> 00:10:46,110
and looks at the head and tail of the substring.

111
00:10:46,110 --> 00:10:53,940
For this feature value, it changes gradually and plus one at most.

112
00:10:53,940 --> 00:10:58,460
If it drops down, it can drop down to 0

113
00:10:58,460 --> 00:11:02,980
or to a value smaller than the 3 and the k value.

114
00:11:02,980 --> 00:11:08,340
We look at the process to match the head and tail.

115
00:11:08,340 --> 00:11:14,870
Is there still room for opitimization?

116
00:11:14,870 --> 00:11:20,590
Let us have a look when Pj==Pk.

117
00:11:20,590 --> 00:11:24,260
Because Pj==Pk and Pj is not equal to ti just now.

118
00:11:24,260 --> 00:11:30,415
So as you can see, when you compare Pk with ti next time,

119
00:11:30,415 --> 00:11:36,570
it is certain they are not equal. So this can be omitted.

120
00:11:36,570 --> 00:11:44,280
For the case when they are not equal, we can record it in the feature array.

121
00:11:44,280 --> 00:11:51,290
Let us look at the example.

122
00:11:51,290 --> 00:11:56,170
The feature vector has been worked out. When there is mismatch at the P3,

123
00:11:56,170 --> 00:12:03,295
according to the feature vector,

124
00:12:03,295 --> 00:12:07,690
we use the K=0, that is the P[0].

125
00:12:07,690 --> 00:12:12,810
The P[0] is also a and equal to P[3].

126
00:12:12,810 --> 00:12:20,180
Obviously, those are not equal just now and these are still not equal. So this line

127
00:12:20,180 --> 00:12:27,930
marked red is redundancy. We can optimize at this point.

128
00:12:27,930 --> 00:12:34,060
If it is um, P[k]==P[j], we let the value of the next equal to next[k] directly.

129
00:12:34,060 --> 00:12:40,190
Because j is bigger than k, 

130
00:12:40,190 --> 00:12:44,840
we are sure to get a smaller k value.

131
00:12:44,840 --> 00:12:52,640
Then we have seen the difference of the non-optimized version and the optimized version.

132
00:12:52,640 --> 00:12:58,565
In fact, we can work out the k like this,

133
00:12:58,565 --> 00:13:04,490
the feature position of which is like this,namely the longest k.

134
00:13:04,490 --> 00:13:11,990
The optimizing is to find out whether Pk==Pj.

135
00:13:11,990 --> 00:13:18,090
If so, let next[j]=next[k].

136
00:13:18,090 --> 00:13:24,190
On the whole, some positions of the optimized next array

137
00:13:24,190 --> 00:13:30,760
are smaller than those of the non-optimized one.

138
00:13:30,760 --> 00:13:35,500
As for the time efficiency of the KMP algorithm,

139
00:13:35,500 --> 00:13:41,290
it looks like a double circulation.

140
00:13:41,290 --> 00:13:46,690
In the circulation, the most important sentence is j=Next[j],

141
00:13:46,690 --> 00:13:52,090
which can not be executed more than n.

142
00:13:52,090 --> 00:13:57,640
Because j minus 1 when this is executed.

143
00:13:57,640 --> 00:14:04,020
For the operation when j increases, there is only the j++.

144
00:14:04,020 --> 00:14:11,180
If j=N[j] is executed more than n, the value of j will be much smaller

145
00:14:11,180 --> 00:14:18,590
than -1.

146
00:14:18,590 --> 00:14:23,540
In j becomes -1 only by accident and will soon be back to 0.

147
00:14:23,540 --> 00:14:30,615
This is a dealing skill of the algorithm, so the process

148
00:14:30,615 --> 00:14:35,470
is going up to O(n) and the KMP algorithm is to O(n).

149
00:14:35,470 --> 00:14:43,280
The preprocess for the feature vector is also KMP.

150
00:14:43,280 --> 00:14:49,250
So the time is to m.

151
00:14:49,250 --> 00:14:55,085
So the whole time is n+m. Let us conclude about the efficiency.

152
00:14:55,085 --> 00:15:03,775
The simple algorithm does not need preprocess.

153
00:15:03,775 --> 00:15:08,150
But the time for matching is n*m.

154
00:15:08,150 --> 00:15:14,400
For the KMP algorithm, the preprocess is to m 

155
00:15:14,400 --> 00:15:18,897
and the whole matching is to n.

156
00:15:18,897 --> 00:15:25,070
Because it is the minimum. So it can be represented by theta in fact.

157
00:15:25,070 --> 00:15:31,260
You can study other pattern matching algorithm if you are interested.

158
00:15:31,260 --> 00:15:36,330
At last, leave this for you to think of.

159
00:15:36,330 --> 00:15:44,110
There is different stipulation as well as different methods for calculating the feature vector.

160
00:15:44,110 --> 00:15:49,500
The various methods fit various application.

161
00:15:49,500 --> 00:15:54,090
We can compare and realize them.

162
00:15:54,090 --> 00:16:00,190
Thank you!