1 00:00:00,000 --> 00:00:04,505 Hello, every one! 2 00:00:04,505 --> 00:00:11,820 This lecture of data structure and algorithm is about KMP algorithm in the fourth chapter. 3 00:00:11,820 --> 00:00:19,880 In the KMP algorithm, the main point is to reduce backtracking as possible as we can. 4 00:00:19,880 --> 00:00:27,660 KMP is, in fact, an algorithm without backtracking. 5 00:00:27,660 --> 00:00:34,220 Then which character pk in P should we use to continue to compare with ti when there is mismatch? 6 00:00:34,220 --> 00:00:39,020 Knuth et al 7 00:00:39,020 --> 00:00:43,520 found the value k only depending on the pattern itself 8 00:00:43,520 --> 00:00:49,470 but not related to the target. Let us discuss the process. 9 00:00:49,470 --> 00:00:55,590 For this target string and pattern string, 10 00:00:55,590 --> 00:00:59,400 the result of their comparison are all equal at first, 11 00:00:59,400 --> 00:01:05,720 but ti is not equal to pj. 12 00:01:05,720 --> 00:01:11,410 If it is simple matching, it goes back to 13 00:01:11,410 --> 00:01:17,390 ti-j+1 and continue to compare with p0. This is the matching process. 14 00:01:17,390 --> 00:01:22,580 If we know ahead of time that 15 00:01:22,580 --> 00:01:27,620 this string is not equal to this one. In fact, 16 00:01:27,620 --> 00:01:33,170 they are real substring of the head and tail. That is, when we find they are not equal, 17 00:01:33,170 --> 00:01:40,870 we will cover it. Then we will focus on the substring of the pattern P. 18 00:01:40,870 --> 00:01:48,740 If they are not equal, there is no point for the next comparison, as we will soon find they are not equal. 19 00:01:48,740 --> 00:01:54,620 Then let us go on. Here is this string. 20 00:01:54,620 --> 00:01:59,890 The length of the string shortens by one. 21 00:01:59,890 --> 00:02:05,410 But in fact we will slide to the right for two. 22 00:02:05,410 --> 00:02:11,750 If they are still not equal, there is still no need to compare. 23 00:02:11,750 --> 00:02:16,520 Until the k, 24 00:02:16,520 --> 00:02:21,060 which means the length of the substring of its head and tail, 25 00:02:21,060 --> 00:02:29,450 so that the substring of its head and tail are all equal for k characters correspondingly 26 00:02:29,450 --> 00:02:34,440 In this case, 27 00:02:34,440 --> 00:02:39,940 as they are equal for k characters and 28 00:02:39,940 --> 00:02:44,950 the substring of p's tail has 29 00:02:44,950 --> 00:02:49,070 been compared with target string just now, we then know 30 00:02:49,070 --> 00:02:55,850 the p0 and the ti-k as well as pk-1 and the ti-1 31 00:02:55,850 --> 00:03:01,210 are also equal correspondingly. As a result, we only need to 32 00:03:01,210 --> 00:03:05,660 compare with pk and ti. 33 00:03:05,660 --> 00:03:11,310 The whole pattern can slide to the right for j-k. 34 00:03:11,310 --> 00:03:16,630 Then let us focus on the feature vector of the string. 35 00:03:16,630 --> 00:03:21,230 In fact, it is a set of integers. 36 00:03:21,230 --> 00:03:27,230 It is also called next array by some documents. 37 00:03:27,230 --> 00:03:33,380 Let us look at the construction method. In fact, 38 00:03:33,380 --> 00:03:38,335 it is to find out the longest one of 39 00:03:38,335 --> 00:03:44,520 the substring of its head and tail before it. 40 00:03:44,520 --> 00:03:52,180 In other special case, such as j=0, we stipulate it is equal to -1 just for convenience. 41 00:03:52,180 --> 00:03:57,970 If we can not find out substrings like this, the k is 0. 42 00:03:57,970 --> 00:04:05,410 Why should we find out the longest one? 43 00:04:05,410 --> 00:04:11,925 Here is its characteristic value. The value at N=4 44 00:04:11,925 --> 00:04:18,440 is originally equal to 3 45 00:04:18,440 --> 00:04:20,540 according to the substring of its head and tail. 46 00:04:20,540 --> 00:04:26,930 Now let it be 1. It is still right, as there are 47 00:04:26,930 --> 00:04:31,060 matching substring for the length of 1. 48 00:04:31,060 --> 00:04:35,750 However, if it is not the maximum value 3, 49 00:04:35,750 --> 00:04:40,530 we will miss some match in some cases. 50 00:04:40,530 --> 00:04:49,130 For example when we match this one and there is no match at P4 51 00:04:49,130 --> 00:04:53,430 we will see, 52 00:04:53,430 --> 00:05:00,050 because we give it as 1 but not the maximum one, 53 00:05:00,050 --> 00:05:05,140 we will compare with P1 54 00:05:05,140 --> 00:05:09,530 and the subscript 7 in the target string. 55 00:05:09,530 --> 00:05:15,150 As a result, it is a very long slide, which is j minus k, 56 00:05:15,150 --> 00:05:22,830 so it will slide for 3 when j=4 and k=1. 57 00:05:22,830 --> 00:05:29,430 In fact, it should slide for 1, which is 4 minus 3, so we miss a probable match. 58 00:05:29,430 --> 00:05:34,205 Let us look at the process of the KMP pattern match. 59 00:05:34,205 --> 00:05:38,980 We find out the feature vector for this pattern. 60 00:05:38,980 --> 00:05:43,560 When there is mismatch at a certain position, 61 00:05:43,560 --> 00:05:48,540 we will look up the feature vector directly 62 00:05:48,540 --> 00:05:53,300 and find out the k value. 63 00:05:53,300 --> 00:05:58,060 This Pk will compare with Ti directly. 64 00:05:58,060 --> 00:06:05,570 The match process is completed. It is very fast. Let us look at the KMP pattern match algorithm. 65 00:06:05,570 --> 00:06:15,570 We assume it has provided with a feature vector and start at the 0 position of pattern and a specific start position of target. 66 00:06:21,790 --> 00:06:29,940 The ij will move forward. 67 00:06:29,940 --> 00:06:36,260 If the corresponding characters are equal, 68 00:06:36,260 --> 00:06:40,340 i++, j++ we will compare for the next one. 69 00:06:40,340 --> 00:06:44,775 When at a position 70 00:06:44,775 --> 00:06:49,210 j==-1, in fact it is when next is 0. 71 00:06:49,210 --> 00:06:53,880 So this one is not equal. 72 00:06:53,880 --> 00:06:59,420 Then j+1 and this is equal to 0 73 00:06:59,420 --> 00:07:07,980 then i+1. That is move forward for another one to continue. 74 00:07:07,980 --> 00:07:14,360 There is also no backtracking. When there is mismatch, then the j is the next j in feature vector, 75 00:07:14,360 --> 00:07:19,520 which is k. 76 00:07:19,520 --> 00:07:25,770 The k value is less then j value. 77 00:07:25,770 --> 00:07:30,800 The P, j and T, i continue to compare. 78 00:07:30,800 --> 00:07:35,775 When this circulation is completed. If the j 79 00:07:35,775 --> 00:07:40,750 is bigger than or equal to the length of the pattern, the match succeeds, 80 00:07:40,750 --> 00:07:48,450 or the match fails. The algorithm to find out the feature vector 81 00:07:48,450 --> 00:07:53,990 is a process of recursion. For n0, we stipulate 82 00:07:53,990 --> 00:07:57,550 it is -1. We assume 83 00:07:57,550 --> 00:08:02,615 we have worked out some positions, which are before j. 84 00:08:02,615 --> 00:08:07,680 Assume we are nj, then it is equal to k. 85 00:08:07,680 --> 00:08:12,520 We want to work out n j+1, that is, 86 00:08:12,520 --> 00:08:17,060 this pj is not equal to pk. 87 00:08:17,060 --> 00:08:24,470 Then let k equal to nk. When there is mismatch 88 00:08:24,470 --> 00:08:31,440 we will continue to compare using the feature before it. 89 00:08:31,440 --> 00:08:38,410 So it is also a process of KMP. So it works out a child feature vector 90 00:08:38,410 --> 00:08:45,010 and use it to do the match. 91 00:08:45,010 --> 00:08:54,600 until the condition um is not satisfied which means pj equal to pk. 92 00:08:54,600 --> 00:09:01,040 At this point, n j+1 is equal to k+1. For this feature vector, 93 00:09:01,040 --> 00:09:07,470 the length plus one for the head and tail. 94 00:09:07,470 --> 00:09:14,810 For the feature vector, we should deal with the special cases at first. 95 00:09:14,810 --> 00:09:22,590 For j=1, we can work it out using this circulation. 96 00:09:22,590 --> 00:09:27,140 For j=1, the feature vector value is equal to 0, 97 00:09:27,140 --> 00:09:31,680 because there is no real substring. 98 00:09:31,680 --> 00:09:39,210 For this value got by it, 99 00:09:39,210 --> 00:09:47,430 the k value is consistent. If the P[j] 100 00:09:47,430 --> 00:09:53,490 is not equal to P[k], we will use the k value 101 00:09:53,490 --> 00:09:57,160 we have got before 102 00:09:57,160 --> 00:10:04,100 to cover this k, k=next[k]. This must be smaller, because 103 00:10:04,100 --> 00:10:08,920 k is smaller than next[k] and j is also smaller than next[j]. 104 00:10:08,920 --> 00:10:14,810 If they are equal, when j++, k++, 105 00:10:14,810 --> 00:10:21,030 we will continue to compare. The length of the matching string plus one. 106 00:10:21,030 --> 00:10:25,205 We let the value of next[j] equal to k. 107 00:10:25,205 --> 00:10:29,380 In fact, it is the k plus one. 108 00:10:29,380 --> 00:10:33,370 This is a demonstration to work out 109 00:10:33,370 --> 00:10:39,740 the feature vector. In this process, it always covers itself 110 00:10:39,740 --> 00:10:46,110 and looks at the head and tail of the substring. 111 00:10:46,110 --> 00:10:53,940 For this feature value, it changes gradually and plus one at most. 112 00:10:53,940 --> 00:10:58,460 If it drops down, it can drop down to 0 113 00:10:58,460 --> 00:11:02,980 or to a value smaller than the 3 and the k value. 114 00:11:02,980 --> 00:11:08,340 We look at the process to match the head and tail. 115 00:11:08,340 --> 00:11:14,870 Is there still room for opitimization? 116 00:11:14,870 --> 00:11:20,590 Let us have a look when Pj==Pk. 117 00:11:20,590 --> 00:11:24,260 Because Pj==Pk and Pj is not equal to ti just now. 118 00:11:24,260 --> 00:11:30,415 So as you can see, when you compare Pk with ti next time, 119 00:11:30,415 --> 00:11:36,570 it is certain they are not equal. So this can be omitted. 120 00:11:36,570 --> 00:11:44,280 For the case when they are not equal, we can record it in the feature array. 121 00:11:44,280 --> 00:11:51,290 Let us look at the example. 122 00:11:51,290 --> 00:11:56,170 The feature vector has been worked out. When there is mismatch at the P3, 123 00:11:56,170 --> 00:12:03,295 according to the feature vector, 124 00:12:03,295 --> 00:12:07,690 we use the K=0, that is the P[0]. 125 00:12:07,690 --> 00:12:12,810 The P[0] is also a and equal to P[3]. 126 00:12:12,810 --> 00:12:20,180 Obviously, those are not equal just now and these are still not equal. So this line 127 00:12:20,180 --> 00:12:27,930 marked red is redundancy. We can optimize at this point. 128 00:12:27,930 --> 00:12:34,060 If it is um, P[k]==P[j], we let the value of the next equal to next[k] directly. 129 00:12:34,060 --> 00:12:40,190 Because j is bigger than k, 130 00:12:40,190 --> 00:12:44,840 we are sure to get a smaller k value. 131 00:12:44,840 --> 00:12:52,640 Then we have seen the difference of the non-optimized version and the optimized version. 132 00:12:52,640 --> 00:12:58,565 In fact, we can work out the k like this, 133 00:12:58,565 --> 00:13:04,490 the feature position of which is like this,namely the longest k. 134 00:13:04,490 --> 00:13:11,990 The optimizing is to find out whether Pk==Pj. 135 00:13:11,990 --> 00:13:18,090 If so, let next[j]=next[k]. 136 00:13:18,090 --> 00:13:24,190 On the whole, some positions of the optimized next array 137 00:13:24,190 --> 00:13:30,760 are smaller than those of the non-optimized one. 138 00:13:30,760 --> 00:13:35,500 As for the time efficiency of the KMP algorithm, 139 00:13:35,500 --> 00:13:41,290 it looks like a double circulation. 140 00:13:41,290 --> 00:13:46,690 In the circulation, the most important sentence is j=Next[j], 141 00:13:46,690 --> 00:13:52,090 which can not be executed more than n. 142 00:13:52,090 --> 00:13:57,640 Because j minus 1 when this is executed. 143 00:13:57,640 --> 00:14:04,020 For the operation when j increases, there is only the j++. 144 00:14:04,020 --> 00:14:11,180 If j=N[j] is executed more than n, the value of j will be much smaller 145 00:14:11,180 --> 00:14:18,590 than -1. 146 00:14:18,590 --> 00:14:23,540 In j becomes -1 only by accident and will soon be back to 0. 147 00:14:23,540 --> 00:14:30,615 This is a dealing skill of the algorithm, so the process 148 00:14:30,615 --> 00:14:35,470 is going up to O(n) and the KMP algorithm is to O(n). 149 00:14:35,470 --> 00:14:43,280 The preprocess for the feature vector is also KMP. 150 00:14:43,280 --> 00:14:49,250 So the time is to m. 151 00:14:49,250 --> 00:14:55,085 So the whole time is n+m. Let us conclude about the efficiency. 152 00:14:55,085 --> 00:15:03,775 The simple algorithm does not need preprocess. 153 00:15:03,775 --> 00:15:08,150 But the time for matching is n*m. 154 00:15:08,150 --> 00:15:14,400 For the KMP algorithm, the preprocess is to m 155 00:15:14,400 --> 00:15:18,897 and the whole matching is to n. 156 00:15:18,897 --> 00:15:25,070 Because it is the minimum. So it can be represented by theta in fact. 157 00:15:25,070 --> 00:15:31,260 You can study other pattern matching algorithm if you are interested. 158 00:15:31,260 --> 00:15:36,330 At last, leave this for you to think of. 159 00:15:36,330 --> 00:15:44,110 There is different stipulation as well as different methods for calculating the feature vector. 160 00:15:44,110 --> 00:15:49,500 The various methods fit various application. 161 00:15:49,500 --> 00:15:54,090 We can compare and realize them. 162 00:15:54,090 --> 00:16:00,190 Thank you!