1
00:00:01,000 --> 00:00:06,700
大家好，数据结构与算法这一讲
Hello everyone! Welcome to this lecture.

2
00:00:07,000 --> 00:00:14,500
我们继续第十一章索引的学习。我们今天简单地介绍位索引技术，
Today, we continue the 11th chapter, index, and briefly introduce bit index .

3
00:00:14,800 --> 00:00:24,400
.在我们前面介绍过的B数B加树
它只适合于我们查找几个记录的这种情况，
B+ tree introduced in preceding lecture only fit the situation where we search for several items.

4
00:00:25,300 --> 00:00:32,500
如果这个数据库里面
有大量的这样的 重复关键码，而且我们一查要查一堆数据的情况，
When we search for lots of data in the database with plenty of the duplicated keys,

5
00:00:32,800 --> 00:00:37,300
用B树B加数做索引就不是太方便了。
It is not useful to use B or B+ tree to index data.

6
00:00:37,600 --> 00:00:49,600
而且如果我们要做一些分组，还有一些聚合等等操作，
我们应该寻找一个更高效的技术。
We should find a more efficient technic to meet the need for the operations such as grouping and aggregation.

7
00:00:49,900 --> 00:00:55,000
我们来看这样的一个销售的记录。
When we come to these records of retail,

8
00:00:55,300 --> 00:00:59,500
这个销售记录里面，我们有若干域，
With some domain,

9
00:00:59,800 --> 00:01:07,000
可以看到这里面这些信息域，它有相当多的重复关键码的这样的情况，
We know that there is a lot of duplicated key.

10
00:01:07,300 --> 00:01:18,700
但是，如果我们比如说经常要对这个州的
信息进行查询我们必须要对她建立一个有效的索引
We, however, need an efficient index for the retrieval of data labeled with a certain state.

11
00:01:20,300 --> 00:01:28,701
才能够高效的查询。这是我们在讲检索，索引的时候
就一直在强调的。我们经常要查的这些信息，一定要建索引
The data we often access need to be indexed, which we have emphasized.

12
00:01:30,700 --> 00:01:43,300
这个里面因为有大多重复的关键码，而且我知道这个州名
其实就没有几个，你建这个B树的话，你想想， 它其实没有什么太多的意义。
Due to the duplication of the keys, just several names of states, B tree indexing is not really meaningful.

13
00:01:43,600 --> 00:01:48,700
这个时候我们可以采用这个位图的这个技术。
We can adopt bit map to tackle this problem.

14
00:01:49,000 --> 00:02:00,563
比如说我对这个，纽约州建立这个位向量
在对应的这样一个记录上头。，如果是
For example, let us build bit vector for New York.

15
00:02:01,463 --> 00:02:08,963
它的取值是1，就表示这个记录是表示纽约州
相关的这个记录，如果是0，就不是。
Marking 1 for New York and 0 for not,

16
00:02:08,963 --> 00:02:21,863
我们就是对各个州 都建立一个这样位相量我们就
，建立了位图，其实就是位相量的这样一个集合。
We get the bit map for New York and, in the same way, the other states to build the set of bitmaps.

17
00:02:15,563 --> 00:02:22,563
我们对其他的这个
数据域我也可以建立一些相应的索引，比如说
We can also index data for other attributes, such as the grade of retail.

18
00:02:23,763 --> 00:02:40,863
我对于这样的销售的等级我也建立一个索引，
对于 A类的这样销售记录我可以是
For class A, we can do like this.

19
00:02:41,163 --> 00:02:54,063
，有这样一个位相量我在将来要查询的时候
我要，问 在纽约州，它的等级是A的
When a query for items in New York with class A, comes

20
00:02:54,363 --> 00:03:08,763
这个销售记录是有哪几条，
我就把这两个位相量进行
交的这个运算，我就是一个，，按位与的一个操作，
Just one bit operation, jointing the two bit vectors

21
00:03:10,670 --> 00:03:13,463
就很快的得到这个第一条记录，
Will find the first record

22
00:03:13,763 --> 00:03:19,163
还有我们倒数第二条记录，就是我们 ，需要的这个记录。
And the last but one, which are what we need.

23
00:03:19,463 --> 00:03:30,863
有了这个位相量的索引，我们的检索就能更高效的进行。
，我们再来看一个，，文本文件的情况。
Here is another example of texture file.

24
00:03:31,463 --> 00:03:40,163
这个文本文件里面我们采用向量来表达
这个文本文件，我们也叫做，，签名文件。
This text file is represented by vector, so-called signature file.

25
00:03:40,163 --> 00:03:48,563
假设，对于这个 文档30,40,50，这是我们的文档标号了
Supposing this file , 30, 40, 50 , we numbered,

26
00:03:48,863 --> 00:03:54,563
，他们分别拥有这样的一些相应的关键词，
Has these corresponding key words with this order,

27
00:03:55,163 --> 00:04:05,663
假设这个关键词它的出现的顺序就是这样的一个顺序。
在这样的一个位图索引里面，我们并不需要知道
Which we do not need to know the order and frequency,

28
00:04:05,963 --> 00:04:21,563
这些词在文档里面它出现的顺序，
也不需要知道这个词在文件里面，它出现了几次，我只是表达，
它出现了，没有出现。因此的我们把这些相应的词，，
The signature file only expresses the existence of key words,

29
00:04:21,863 --> 00:04:24,563
建立一个这样的字典一样。
Just like building a dictionary like this.

30
00:04:24,563 --> 00:04:37,463
有这样的一些槽的这个位置，对于每一个文档，我就去检查，这个相应的词，是不是出现，
We mark vectors by checking whether the corresponding word is in the file

31
00:04:37,763 --> 00:04:43,163
是的话，我让它是1，
然后不是的话， 我可以让它是0。
1 for yes and 0 for no.

32
00:04:43,463 --> 00:04:51,863
你如果样按这个位来看的话，
其实也挺像我们以前的样一个倒排文件。
This is something we do like an inversed file。

33
00:04:52,163 --> 00:05:03,563
我们把这些取值为1的
文档给它串起来就是我们，，这些相应的词它的个倒排列的列表。
The slots valued 1 consist of the inverted list for such word,

34
00:05:03,863 --> 00:05:10,763
，位相量它特别适合于按位来存取数据，
Whose bitmap suit the store and access by bit.

35
00:05:11,063 --> 00:05:17,663
其实我们在数据库里面我们传统的一个看法是按行来，也就是说，
A traditional way in database is scanning by row,

36
00:05:19,463 --> 00:05:24,863
一条一条的销售记录或是像我们刚才的文本文件一个文件一个文件这样子按行来看。
Just Checking records or files one by one.

37
00:05:25,163 --> 00:05:31,163
我们按位来看，就是按照一个一个的数据域，我们垂直的来看。
But we now scanning the database by attributes.

38
00:05:31,463 --> 00:05:43,763
列的数据，因为我们用这个位来表示就是用比特位
来表示这个信息。当然它就是更方便进行压缩。
This representation of information is easy to be compressed,

39
00:05:44,363 --> 00:05:49,763
就像我们图像信息可以非常好的压缩一样。
Just like image with 50% save for disks.

40
00:05:50,363 --> 00:05:56,063
这个压缩能够节省大约50%的这个磁盘空间。
Moreover, its space complexity is less than B or B + tree.

41
00:05:56,363 --> 00:06:03,563
而且，它的索引空间是比B树，B加数要小的。
它的操作也更加便捷。
With more convenient operations.

42
00:06:03,863 --> 00:06:14,063
我们思考一下，请调研在当前
这些列数据库的技术里面，位图它的作用。
Please think about and investigate the bitmap’s usage in database after class.

43
00:06:14,663 --> 00:06:15,263
谢谢大家。
Thanks!