1 00:00:04,681 --> 00:00:08,681 大家好。数据结构与算法 Hello, everyone. This is Data Structures and Algorithms. 2 00:00:08,681 --> 00:00:16,111 这一讲我们继续索引的学习。今天我们学习一种特别重要的索引结构,就是倒排索引。 In this lecture we continue our study of indexing. Today we will learn a particularly important index structure: the inverted index. 3 00:00:16,111 --> 00:00:27,298 倒排索引,它主要是针对我们经常要对一些非主码的这些信息进行检索的情况。 Inverted indexes mainly address the common case where we need to search on information that is not the primary key. 4 00:00:27,298 --> 00:00:38,046 对于这些非主码的信息,如果我们要进行检索,我们一般情况下,都是对主码进行了有效的一些索引。 When we want to search on this non-primary-key information, what we usually already have is an efficient index on the primary key. 5 00:00:38,046 --> 00:00:46,531 如果我们对这些属性项要进行检索的话, 我们当然也要对这些属性进行索引。 So if we want to search on these attributes, we naturally have to index the attributes as well. 6 00:00:46,531 --> 00:00:51,951 对这个属性索引,我们其实有两大类。 There are in fact two broad categories of attribute indexing. 7 00:00:51,951 --> 00:01:06,176 一类是对这种数据库这样的结构化的信息文件,进行索引的这个情况。还有一类,就是我们常见的这个正本文本文件,我们可能也要进行相关的索引。 One is indexing structured information files such as databases. The other is indexing the ordinary text files we see every day, which may also need to be indexed. 8 00:01:06,176 --> 00:01:21,192 我们来看一下对这样的一个结构化的数据库文件,我们通常情况下可能对它的这个主码,也就是在这里是职工号进行了这样的索引。 Let's look at a structured database file like this one. Typically we have indexed its primary key, which here is the staff number. 9 00:01:21,192 --> 00:01:36,117 如果我们要对一些不是主码的情况,比如说,对这个名字,我们要进行检索。这种情况,我们显然要对姓名,也进行一个相应的索引。 If we want to search on something that is not the primary key, for example the name, we clearly need to build a corresponding index on names as well. 10 00:01:36,117 --> 00:01:45,992 但是这个索引,因为我们譬如说姓名它可能会重名,我们的这个索引,跟这个主码的索引可能就不一样。 But because different people can share the same name, this index may differ from the primary key index. 11 00:01:45,992 --> 00:02:03,800 我们对属性的索引,往往是建立了一个属性 - 指针这样的对。而且这个指针,我们一般情况下,可能并不需要指向这个外存的这个主顺序文件里面个原始的记录的一个地址。 For an attribute index we usually build attribute-pointer pairs, and the pointer generally does not need to hold the address of the original record in the main sequential file on external storage.
12 00:02:03,800 --> 00:02:12,132 我们很可能就是对应到刚才的我们这样的一个主码的一个ID 就行。 It is usually enough for it to refer to the primary key ID we just mentioned. 13 00:02:12,132 --> 00:02:26,568 也就是说,我们通过主码的这个 ID ,我再去找在主顺序文件里面它的相应的这个纪录的信息。这样的话,我们在维护的时候,也会比较方便一些。 In other words, we use the primary key ID to look up the corresponding record in the main sequential file. This also makes the index easier to maintain. 14 00:02:26,568 --> 00:02:38,325 我们这样的一个按照属性来建立的这个索引表,你可以看到,是对这样的一个主顺序文件上建立了五花八门的各种各样的这个索引。 With index tables built by attribute like this, you can see that all sorts of different indexes are built on top of the main sequential file. 15 00:02:38,325 --> 00:02:48,114 而且,它就是这个索引,颠覆了我们原来按照主关键码排序的这么一个性质。 Moreover, such an index overturns the original property that records are ordered by primary key. 16 00:02:48,114 --> 00:02:51,756 因此,我们就把它叫做是倒排索引。 That is why we call it an inverted index. 17 00:02:51,756 --> 00:02:55,342 这个属性,我们很多情况下是离散型的。 In many cases these attributes are discrete. 18 00:02:55,342 --> 00:03:08,263 例如,像这个姓名,还有像我们,比如说我们系别的这样的一个属性,计算机系电子微电子等等这样的。 Examples are the name, or the department attribute, with values such as computer science, electronics, microelectronics, and so on. 19 00:03:08,263 --> 00:03:19,820 如果是连续型的,也许我们这样的一个倒排的这个形式,可能不是很合适,需要用 B 树等等这样的结构。 If the attribute is continuous, the inverted form may not be very suitable, and a structure such as a B-tree is needed instead. 20 00:03:19,820 --> 00:03:29,263 这一种倒排的这些信息,我们也是要存在文件里面的。这个文件,就称为倒排文件。 This inverted information also has to be stored in a file, and that file is called an inverted file. 21 00:03:29,263 --> 00:03:33,958 我们来看一下,刚才这个教师的这个数据库表。 Let's look at the teacher database table we just saw. 22 00:03:33,958 --> 00:03:41,200 我们可以根据各种检索的需求来建立这样的倒排表。 We can build inverted lists like these according to whatever retrieval needs we have. 23 00:03:41,200 --> 00:03:47,284 ,这个是对这个姓名来建立的这么一个倒排。 This one is an inverted list built on the name.
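The attribute index described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `staff` table, its staff numbers, and the helper name `build_attribute_index` are hypothetical and do not come from the lecture slides. Each attribute value maps to the list of primary keys (staff numbers) that carry it, rather than to record addresses.

```python
from collections import defaultdict

# Hypothetical staff records keyed by staff number (the primary key).
staff = {
    101: {"name": "Wang", "title": "Professor"},
    102: {"name": "Li", "title": "Lecturer"},
    103: {"name": "Wang", "title": "Lecturer"},
}


def build_attribute_index(records, attribute):
    """Map each attribute value to the sorted list of primary keys that have it."""
    index = defaultdict(list)
    for key in sorted(records):
        index[records[key][attribute]].append(key)
    return dict(index)


# One inverted list per attribute we want to search on.
name_index = build_attribute_index(staff, "name")
title_index = build_attribute_index(staff, "title")

# Duplicate names simply produce a longer list of staff numbers.
print(name_index["Wang"])
print(title_index["Lecturer"])
```

Because the lists hold primary keys, moving a record inside the main file does not require touching the attribute indexes, which is the maintenance advantage mentioned in the lecture.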
24 00:03:47,284 --> 00:03:56,018 然后这边是对职称,然后还有就是这个教师它的章程,是从这个讲课的这个角度来看了。 This one here is built on job title, and there is also one on the courses each teacher gives, which looks at the data from the teaching side. 25 00:03:56,018 --> 00:04:12,380 我们可以看到每一个这个属性后面的这个倒排列表,它对应的其实是相应的这个职工,它的一个主码,也就是它的这个职工号这个主关键码。 We can see that the inverted list following each attribute value actually holds the primary keys of the corresponding staff members, that is, their staff numbers. 26 00:04:12,380 --> 00:04:24,068 倒排文件,它建立了以后,我们当然可以进行比较高效的检索。但是建立这个倒排文件本身,我们需要额外的一些空间代价。 Once the inverted file has been built, we can of course search quite efficiently. But building the inverted file itself costs extra space. 27 00:04:24,068 --> 00:04:34,270 而且,当你这个数据库进行了修改的时候,我也需要对它进行相应的维护。所以也有额外的这种代价。 Also, whenever the database is modified, the inverted file has to be maintained accordingly, so there is an extra maintenance cost as well. 28 00:04:34,270 --> 00:04:42,078 我们再来看,人们生活中经常接触到的这样的一个文本的倒排。 Next, let's look at inverted indexes for text, which people encounter all the time in daily life. 29 00:04:42,078 --> 00:04:43,863 为什么说人们生活中经常接触到? Why do I say people encounter them all the time? 30 00:04:43,863 --> 00:04:51,702 就是我们平时跟搜索引擎打交道。我输入一个关键码,它给我返回一堆这个文档。 Because we deal with search engines every day. I type in a keyword and it returns a pile of documents. 31 00:04:51,702 --> 00:05:06,480 其实,就是对这个关键码后面,它出现在哪些文档里面的这么一个倒排列表返回给我了。当然,是经过了一些排序处理,就把最相关的返回给我。 What it returns is in fact the inverted list for that keyword, that is, the list of documents in which the keyword occurs. Of course, the list has been through some ranking so that the most relevant documents come back first. 32 00:05:06,480 --> 00:05:36,430 我们正文索引,或者是说就是正文的这个倒排,它其实就是以这个文章里面出现的这些词,就是word , 为一个索引项,然后,后面这个倒排列表是它在哪一个文章里面,出现在什么位置,或者是说,在哪一个文章里面出现了多少次等等。这些信息的一个索引结果。 A text index, or the inverted index of a text, uses the words that occur in the documents as index entries. The inverted list behind each entry records which documents the word occurs in and at which positions, or how many times it occurs in each document, and so on.
33 00:05:36,430 --> 00:05:45,812 我们在建立索引的时候,我们往往可以是以词为单位来建。就是建立这样的词索引。 When we build the index, we can usually build it with the word as the unit, that is, build a word index. 34 00:05:45,812 --> 00:05:52,040 这个词索引的话,我们往往会有一个有限的这么个词表。 For a word index we usually have a limited vocabulary. 35 00:05:52,040 --> 00:06:13,430 现在这个计算机的处理速度越来越快,我们这个词表的限制就越来越少。像 Google 这样的搜索引擎,它连一些数值,这些比如说电话号码呀,或者是门牌号码,它都给你索引上,因此,它的词表你可以看作是非常非常大的。 As computers get faster and faster, the vocabulary is subject to fewer and fewer limits. A search engine like Google even indexes numeric strings such as phone numbers or house numbers, so you can regard its vocabulary as extremely large. 36 00:06:13,430 --> 00:06:16,134 早期,还有一种叫做全文索引。 In the early days there was also something called a full-text index. 37 00:06:16,134 --> 00:06:29,418 全文索引它意思就是说,我不受你这个词表限制,我在这个把整个这个文章看作是一个长字母串。我在每一个词的出现的轴都建立一个索引项。 A full-text index is not restricted by a vocabulary. It treats the whole document as one long string of letters and creates an index entry at the place where each word occurs. 38 00:06:29,418 --> 00:06:42,976 就这两类方法,但是其实现在我们也不怎么提全文索引了。可以说,你只要你的处理能力能够支持,我们这个词表不受什么限制,你都可以建立这样的一个大的这样一个索引的表。 Those are the two kinds of methods, but nowadays we rarely talk about full-text indexes. As long as your processing power can support it, the vocabulary need not be restricted, and you can build one large index table like this. 39 00:06:42,976 --> 00:06:52,937 我们可以看到这个词索引,它其实就是有一个候选的一个关键词的列表。 We can see that a word index really just has a list of candidate keywords. 40 00:06:52,937 --> 00:07:20,714 然后,我们从这个文本当中,抽出对应到这个列表里面的这些词,然后我再在这个词的这样的一个索引项后面,我纪录这个词在某一个文档里面出现了什么样的信息,所有的这个文档,如果是有这个词的出现的话,它们的信息都挨个挨个给它列上,这就形成一个倒排表。 We then extract from the text the words that match this list, and after the index entry for each word we record information about how that word occurs in a given document. For every document in which the word occurs, this information is listed entry by entry, and that forms an inverted list.
41 00:07:20,714 --> 00:07:31,432 对于英文的这样的一个情况,就是它的词的划分,就是通过这样的的空格给它隔开的,所以它是非常的明显的。 In English, words are separated by spaces, so the word boundaries are obvious. 42 00:07:31,432 --> 00:07:55,142 英文里面它还有一些相应的其他的操作。比如说,要取词干,比如说像computer, computing, 这种东西,当我们取出的共有词干是comp ,这样我们通过取词干的操作,能够把一些共性的东西给它连作起来,不要使得我们的词表太大,一些本来相关的东西被隔离开了。 English also needs some other related operations, such as stemming. For example, for words like “computer” and “computing”, the common stem we extract is “comp”. Stemming lets us link words that have something in common, so that the vocabulary does not grow too large and words that are actually related do not end up separated. 43 00:07:55,142 --> 00:08:04,352 如果是对中文这样的东方文字,它可能还要经过这样的一个,就是切词的处理。 For East Asian scripts such as Chinese, the text may also have to go through word segmentation. 44 00:08:04,352 --> 00:08:17,088 我要把一个连续的句子,这个句子里面,它没有任何的这种空格的分隔,我要知道怎么去断句,也就是哪个地方要断出一个词来,个地方要断出一个词。 Given a continuous sentence with no spaces separating anything, I need to know how to break it up, that is, where each word should be cut out. 45 00:08:17,088 --> 00:08:32,522 ,全文索引,它是把这个把正文看作一个特别长的字符串,然后在每个的这个位置,我都做一个标记然后把它索引上。它需要的空间是更大的。 A full-text index treats the text as one very long character string, marks every position, and indexes it. It needs even more space. 46 00:08:32,522 --> 00:08:38,872 在早期,我们很多这种检索系统,它的这个词索引的词表非常有限。 In the early days, many retrieval systems had a very limited vocabulary for their word index. 47 00:08:38,872 --> 00:08:45,353 ,这种情况,有一些词,可能没有收到它词表里面,也就没有索引,你也就检索不到。 In that case, some words might not be included in the vocabulary, so they were not indexed and you could not search for them. 48 00:08:45,353 --> 00:08:55,565 后来,就是有很多检索系统号称是全能索引,也就是说,它索引得更全,每一个字符的可能位置都被它索引到。 Later, many retrieval systems claimed to be full-text indexes, meaning that they indexed more completely, with every possible character position indexed.
49 00:08:55,565 --> 00:09:01,851 但其实现在,我们更主流的还是回到这个词索引。 But today the mainstream has actually gone back to word indexes. 50 00:09:01,851 --> 00:09:06,745 只是说这个词表是越来越大,哪一个词出现,就被它记录,然后它再被索引住。 It is just that the vocabulary keeps growing: whenever a word occurs, it is recorded and then indexed. 51 00:09:06,745 --> 00:09:14,290 我们,来看我们现在倒排文件,我们其实本质上可以说是词索引。 So the inverted files we use today can essentially be described as word indexes. 52 00:09:14,290 --> 00:09:35,396 我们每一个关键词,它都对应到一个一些文档里面有这个关键词的出现,以及出现的是一些什么信息,比如说它的频率,位置,或者是权重等等这一些信息。我们这就一个关键词它拉出一个表,就是所谓倒排表。 Each keyword corresponds to the documents in which it occurs, together with information about those occurrences, such as frequency, position, or weight. Each keyword thus pulls out one list, the so-called inverted list. 53 00:09:35,396 --> 00:09:41,287 然后所有的这个倒排表,再合起来就形成了这样的一个倒排文件。 All of these inverted lists put together form the inverted file. 54 00:09:41,287 --> 00:09:51,240 我们来看一个建倒排索引的一个例子。假设,这是一个英文的儿歌啦。 Let's look at an example of building an inverted index. Suppose we have an English nursery rhyme, “Pease Porridge Hot”: “Pease porridge hot, pease porridge cold, pease porridge in the pot, nine days old. Some like it hot, some like it cold, some like it in the pot, nine days old.” 00:10:48,923 --> 00:10:56,923 这个儿歌,我们根据它的这样的一行一行我们把切分成六个文档,这样我们是一个小例子,就假设这些词在六个文档里面分布着。然后我们怎么来建索引,我就是首先我得到这一个词表,这个词表你也可以是说我先是扫描一趟这个文档我就把所有的词都得到了,而且我给它排序,然后得到这么一个词表。也可以说如果你这个词表,你就是只需要检几个特殊的词,一个是受限的词表,我就只支持这里面的词也是可以的。 We split this rhyme line by line into six documents. This is a small example, so just assume these words are distributed over the six documents. How do we build the index? First I obtain the vocabulary. One way is to scan the documents once, collect all the words, and sort them to get the vocabulary. Alternatively, if you only need to search for a few particular words, you can use a restricted vocabulary and support only the words in it.
00:11:18,032 --> 00:11:26,032 然后我在扫描这些相应的文档,每一个词在这个文档里面出现我就计到这样的一个相应的这个词的这么一个倒排的信息里面,然后我对这个文档各组词进行处理,就把这个文档的这个所有的词就给它解析到相应的这个词的倒排列表里面。 Then I scan the documents. Each time a word occurs in a document, I record it in the inverted information for that word. I process the words of the document one by one, so that every word in the document is parsed into the inverted list of the corresponding word. 00:11:35,574 --> 00:11:43,574 所谓倒排你可以看到我们现在这么处理以后把原来的这个在人类看来是一个自然的一个有语义的这么一串这样的信息,就被我颠覆的把它给它扔到这些词表里面去了。 You can see why it is called inverted: after this processing, what humans see as a natural, meaningful sequence of information has been turned upside down and scattered into the entries of the vocabulary. 00:11:45,795 --> 00:11:50,862 然后我对这个文档二以及文档三也相应的进行处理。 Then I process document 2 and document 3 in the same way. 00:11:55,460 --> 00:12:03,460 我们把整个这六个文档都处理完了以后,我们就得到了各个词的相应的这个倒排的列表。 After all six documents have been processed, we have the inverted list for each word. 00:12:07,067 --> 00:12:14,734 然后这些倒排列表我组合起来它就是整个这个文档集的一个对应的倒排文件。 Combining these inverted lists gives the inverted file for the whole document collection. 00:12:16,665 --> 00:12:22,798 我们建倒排文件的这个过程我们可以总结一下它大概有这些步骤。 We can summarize the process of building an inverted file in roughly the following steps. 00:12:27,039 --> 00:12:35,039 首先就是说我们对这个文件进行分割。其实我刚才文件就被我们是按照一行一行给分割,你也可以有各种分割的方法,比如说按章节、按段落等等这样来分。 First, we split the file. The file just now was split line by line, but you can split in various other ways, for example by chapter or by paragraph. 00:12:38,097 --> 00:12:41,497 然后分到每一个这个文档我其实给这个文档计一个标号, Then I give each resulting document an identifier. 00:12:46,255 --> 00:12:51,255 然后我们在 对这个文档里面相应的这个词我再进行处理。 Then we process the words in each document.
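The steps so far (split the text into documents, number them, then scan each document and append to each word's inverted list) can be sketched in Python as follows. This is a minimal sketch under assumptions: the exact six-way split of the rhyme is not visible in the transcript, so the line breaks below are a guess, and the names `tokenize` and `build_inverted_file` are illustrative. Each posting is a (document number, word position) pair.

```python
import re
from collections import defaultdict

# Assumed six-way split of the rhyme; documents are numbered 1..6.
documents = [
    "Pease porridge hot, pease porridge cold,",
    "Pease porridge in the pot,",
    "Nine days old.",
    "Some like it hot, some like it cold,",
    "Some like it in the pot,",
    "Nine days old.",
]


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_file(docs):
    """Return {word: [(doc_id, position), ...]} with the words in sorted order."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for position, word in enumerate(tokenize(text), start=1):
            postings[word].append((doc_id, position))
    return {word: postings[word] for word in sorted(postings)}


inverted_file = build_inverted_file(documents)
for word, plist in inverted_file.items():
    print(word, plist)
```

Scanning the documents in increasing order keeps every inverted list sorted by document number without an extra sort.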
00:13:19,706 --> 00:13:27,706 当然如果是对这个中英文的这个文档,我们都可以是去掉一些停用词,所谓停用词就是我不想要它出现在索引里面这些词,比如说英文里面的这种冠词、介词、连词,中文里面也是这种一些助词,还有一些连词,我觉得没有什么太多语义的我给它去掉。就是像英文的the、a、and,然后中文的的、得、地、一个、两个这种东西。 Of course, for both Chinese and English documents we can remove stop words. Stop words are words that I do not want to appear in the index, such as articles, prepositions, and conjunctions in English, or particles and conjunctions in Chinese, which I consider to carry little meaning, so I remove them. Examples are “the”, “a”, and “and” in English and “的、得、地、一个、两个” in Chinese. 抽词干,就是对英文来说我们可以把它的这个词根给找出来,然后一些其它的这种时态的变化这些词去掉,比如说computer、computed、computing等等我就给它抽出来都是comp。这个经过抽词干的处理以后,我们使得这些在语义上有关联的词它都可以在并在一个词表里面。而且,就是我的这些词表里面的词,它也可以缩小我们什么地方都认为它们是一个词。 Stemming means that for English we can find the root of a word and drop variations such as tense endings. For example, “computer”, “computed”, and “computing” are all reduced to “comp”. After stemming, words that are semantically related can be merged into a single vocabulary entry. The vocabulary also shrinks, because we treat them as one word wherever they occur. 66 00:14:00,126 --> 00:14:34,615 还有就是中文这样的东方语系,我们没有一个明显的切词的这样的一个过程,也就是没有空格可以把词给分割开来。这样的话,我们就需要用一些这个自然语言处理里面的切词的软件来进行处理。把这个文档切成一个个的这种词组成的序列。 Then there are East Asian languages such as Chinese, where word boundaries are not explicit, that is, there are no spaces to separate the words. In that case we need word segmentation software from natural language processing to cut the document into a sequence of words. 67 00:14:34,391 --> 00:14:40,124 比如说我们看一下这样一个有趣的中文切词的例子。 Let's look at an amusing example of Chinese word segmentation. 68 00:14:43,584 --> 00:14:51,584 “我知道你不知道我知道你不知道我知道你不知道”。这个切词的话其实是非常难切的。 “I know you don’t know I know you don’t know I know you don’t know.” This is actually very hard to segment. 69 00:14:56,692 --> 00:15:04,692 即使是我们自己人来切也是很困难的,你可以看到各种切法。这里这几种切法,我们可以看到其实它的这个语义还差不多都是在气对方的对吧。 Even for us humans it is hard to segment, and you can see that there are various ways to cut it.
Whichever of these segmentations we use, the meaning is more or less the same: the speaker is teasing the listener. 70 00:15:05,014 --> 00:15:13,014 还有一些这个句子你可以看到如果是我不同的切法,可能切出来的语义完全不一样都是可能的。 There are other sentences where different segmentations can give completely different meanings. 71 00:15:26,543 --> 00:15:34,543 我们来看在正常的情况下,就是对这么一个普通的文档。我们用这么一个网上的一个软件,我们来切出的词可能就是可以看到是一些比较有意义的这种词。 Now consider the normal case of an ordinary document. If we run it through a segmentation tool found online, we can see that the words it produces are fairly meaningful. 72 00:15:40,674 --> 00:15:48,674 我们对这些词,我在建立倒排,我就将来在检索的时候可以有效的检索到这个文档。比如说我要检索“毕业生”,这里面它明显地切出了”毕业生“这个词。 If I build an inverted index on these words, the document can be found efficiently at search time. For example, suppose I want to search for “graduate”. The segmenter has clearly produced the word “graduate” here. 73 00:15:47,839 --> 00:15:55,839 它就可以索引到这个文档里面有“毕业生”这个词,你在检索的时候你就能够检到这个文件。 So the index records that this document contains the word “graduate”, and a search will find this document. 74 00:16:16,988 --> 00:16:24,988 还有就是,我们得到这样的一个词表以及我对整个这个文档集都进行了这个词的划分以及去掉一些不太常用的这个停用词,还有我取出了这个主干的这样的词根以后我们就可以得到一个关键词的集合,但这个集合就会很大。 Next, once we have segmented the whole document collection into words, removed the stop words, and reduced words to their stems, we obtain a set of keywords, and this set can be very large. 75 00:16:34,854 --> 00:16:42,854 如果是说你的处理能力有限,或者说你不需要检索一些没有关系的这些关键词,你也可以来一个限定的一个词表,就是我只索引也只检索我这些有限的这些词的信息。 If your processing power is limited, or you do not need to search for irrelevant keywords, you can instead use a restricted vocabulary, so that only the information for this limited set of words is indexed and searched. 76 00:16:44,680 --> 00:16:52,213 然后我得到这个词表以后我再处理整个文档集,就一个一个文档来处理。 Once I have the vocabulary, I process the whole document collection, one document at a time.
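The term-normalization steps discussed above (dropping stop words, reducing words to a stem, and optionally keeping only a restricted vocabulary) might look roughly like this in Python. Everything here is an assumption for illustration: `STOP_WORDS` is a made-up short list, and `crude_stem` is a toy suffix stripper, not a real stemming algorithm. Note that it yields "comput" rather than the "comp" used in the lecture.

```python
# Illustrative stop-word list; real systems use much longer lists.
STOP_WORDS = {"the", "a", "an", "and", "in", "it"}

# Toy suffix rules, checked in this order; not a real stemmer.
SUFFIXES = ("ing", "ed", "er", "s")


def crude_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(tokens, vocabulary=None):
    """Turn raw tokens into index terms.

    Stop words are dropped, the rest are stemmed, and if a restricted
    vocabulary is given, only terms in that vocabulary are kept.
    """
    terms = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        term = crude_stem(token)
        if vocabulary is not None and term not in vocabulary:
            continue
        terms.append(term)
    return terms


print(normalize(["The", "computer", "and", "computing"]))
print(normalize(["the", "pot", "is", "computed"], vocabulary={"pot", "comput"}))
```

Passing `vocabulary=None` corresponds to the unrestricted case where every surviving term is indexed.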
77 00:17:07,316 --> 00:17:15,316 如果这个文挡集里面有某一个词,我就找到这个词表的这个倒排的个表,我把这个文档里面出现这个词的一些相关信息,例如它的出现位置、它的频率等等我都记录进去,然后我就得到了每一个关键词它对应的一个倒排表。 If a word occurs in a document of the collection, I find the inverted list for that word in the vocabulary and record the relevant information about its occurrence in this document, such as its position and frequency. That gives me the inverted list for each keyword. 78 00:17:15,028 --> 00:17:23,028 所有的这些关键词的倒排表我把它会合起来我就得到这个文档集的倒排文件。 Combining the inverted lists of all the keywords gives the inverted file of the document collection. 79 00:17:40,058 --> 00:17:48,058 然后我们对关键词进行检索的时候,我显然就是从这个倒排文件的这个词表里面我找到相应的它这么一个倒排列表,然后我找到了这个倒排列表以后,我就把这些相关的这个文档返回来。 When we search for a keyword, we find its inverted list through the vocabulary of the inverted file, and once we have the inverted list, we return the relevant documents. 80 00:17:57,370 --> 00:18:05,370 在搜索引擎里面它返回的时候它给了这个文档相应的一些这个快照或者是说它的一些摘要,你就可以通过这个来判断是不是相关,然后你再打开进去看看判断这个文档是不是你要的文档。 When a search engine returns results, it shows a snapshot or a summary of each document. You can use that to judge whether the document is relevant, and then open it to see whether it is the one you want. 81 00:18:09,377 --> 00:18:17,377 我们还要注意就是,因为这个倒排文件它的样一个关键词表可能特别的大,如果你是循序地挨个挨个找你可能找的效率会比较低。 We should also note that the keyword list of an inverted file can be extremely large, so scanning it sequentially, entry by entry, can be very inefficient. 82 00:18:16,602 --> 00:18:24,202 这种情况下,我们很可能是需要对这个关键词表再建立一个字典。 In that case, we will probably need to build a dictionary over the keyword list. 83 00:18:35,166 --> 00:18:43,166 这个建字典的方法你可以是散列或者是说trie等等这样的结构,我能够辅助人们快速地找到这样的相应的这个词,然后再根据这个词再找到它的一个倒排列表,然后把最相关的文件返回。 The dictionary can be a structure such as a hash table or a trie. It helps us quickly find the word, then follow the word to its inverted list, and finally return the most relevant documents.
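As one possible illustration of the dictionary idea, here is a small trie in Python whose word-ending nodes point at inverted lists. The class names `TrieNode` and `TrieDictionary` and the sample document numbers are hypothetical. A plain Python `dict` would play the role of the hash-table alternative; the trie is shown because its lookup cost depends on the length of the word rather than on the size of the vocabulary.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.postings = None  # set only on nodes that end a vocabulary word


class TrieDictionary:
    """Vocabulary dictionary: maps each word to its inverted list."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, postings):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.postings = postings

    def lookup(self, word):
        """Return the inverted list for word, or [] if it is not in the vocabulary."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.postings or []


# Hypothetical inverted lists holding document numbers.
dictionary = TrieDictionary()
dictionary.insert("hot", [1, 4])
dictionary.insert("pot", [2, 5])
dictionary.insert("porridge", [1, 2])

print(dictionary.lookup("pot"))   # inverted list for a vocabulary word
print(dictionary.lookup("po"))    # a prefix only, not a word
print(dictionary.lookup("cold"))  # not in this dictionary
```

Ranking the returned documents by relevance is a separate step that this sketch does not attempt.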
84 00:18:43,715 --> 00:18:50,515 倒排文件它显然是在文本数据库系统里面非常重要的一个技术。 Inverted files are clearly a very important technique in text database systems. 85 00:18:52,826 --> 00:19:00,826 我们现代的这些信息检索,特别是跟文本相关的检索,例如搜寻引擎等等,都是非常依赖于倒排技术的。 Modern information retrieval, especially text-related retrieval such as search engines, relies heavily on inverted indexing. 86 00:19:11,513 --> 00:19:19,513 但是倒排文件如果是你没有组织好的话也会有一些问题,例如你这个检索的词表如果你的这个检索能力或者存贮能力或者是计算能力不够,如果它的词表有限,你有些东西你可能检不到。 However, an inverted file that is not well organized also has problems. For example, if your retrieval, storage, or computing capacity is insufficient and the vocabulary is therefore limited, some things cannot be found. 87 00:19:19,315 --> 00:19:23,648 还有就是,倒排文件它本身是非常庞大的。 Also, the inverted file itself is very large. 88 00:19:33,815 --> 00:19:41,815 有可能是你这么一个文档集你建立倒排文件以后它整个倒排文件可能是原来文档集的好几倍。如果是组织得不是很好的话,这个倒排文件上头的可能也会是有一些困难的。 After you build an inverted file for a document collection, the inverted file may be several times the size of the original collection. If it is not organized well, working with the inverted file may also run into difficulties. 89 00:19:41,674 --> 00:19:48,674 所以对倒排文件的这样的一个有效的组织和检索也是非常重要的。 So organizing and searching the inverted file efficiently is also very important. 90 00:19:50,566 --> 00:19:58,566 最后留给大家一些思考,就是我们怎么样更高效的组织我们的这样的属性的倒排记录。 Finally, I leave you with some questions to think about. The first is how we can organize the inverted records for attributes more efficiently. 91 00:20:20,662 --> 00:20:28,662 第二个思考就是,如果我一个关键词在同一个文本文件里面出现了很多次,我们在前面看的倒排列表里面,它可能是有好几个这样的记录项,我们是不是能够进行合并? 我合并的时候要注意保留一些信息,或者是我怎么样能够有一些有效的合并,能够支持上一集更好的运算? The second is this: if a keyword occurs many times in the same text file, the inverted lists we saw earlier may contain several entries for it. Can we merge them? When merging, what information should we take care to keep, and how can we merge effectively so as to better support operations at the next level up? 92 00:20:28,567 --> 00:20:29,700 谢谢大家。 Thank you.
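As a starting point for the second question, here is one possible way to merge repeated entries, sketched in Python. It is an assumption for discussion, not the lecture's answer: the function name `merge_postings` and the sample data are hypothetical. All (document, position) entries for the same document are collapsed into a single entry that keeps the frequency and the sorted positions.

```python
from collections import defaultdict


def merge_postings(raw_postings):
    """Collapse (doc_id, position) pairs into one entry per document.

    Each merged entry is (doc_id, frequency, sorted positions), so the
    frequency is available for ranking and the positions for phrase queries.
    """
    by_doc = defaultdict(list)
    for doc_id, position in raw_postings:
        by_doc[doc_id].append(position)
    return [
        (doc_id, len(positions), sorted(positions))
        for doc_id, positions in sorted(by_doc.items())
    ]


# Hypothetical raw inverted list: the word occurs twice in document 1
# and once in document 2.
raw = [(1, 1), (1, 4), (2, 1)]
merged = merge_postings(raw)
print(merged)
```

Keeping the merged list sorted by document number also makes it straightforward to intersect the lists of two keywords with a single linear pass.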