1 00:00:00,000 --> 00:00:06,160 大家好。 Hello everyone. 2 00:00:06,160 --> 00:00:12,320 数据结构与算法这一讲我们开始第9章文件和外排序的学习。 This lecture of data structure and algorithm starts talking about Chapter 9 file and external sorting. 3 00:00:12,320 --> 00:00:17,420 那我们介绍一下主存外存还有文件的组织和管理。 Then we introduce the main memory, external memory and file organization and management. 4 00:00:17,420 --> 00:00:22,590 我们计算机的存储器主要分为主存 The storage of computer is mainly divided into the main memory, 5 00:00:22,590 --> 00:00:27,030 还有外存。主存里面我们常见的就是RAM, and external memory. Main memory is commonly RAM. 6 00:00:27,030 --> 00:00:30,800 那它是一种随机访问的存储器。 It is a random access memory . 7 00:00:30,800 --> 00:00:35,740 另外还有高速缓存,视频存储这也是属于主存。 Another one is cache. The video storage is also main memory. 8 00:00:35,740 --> 00:00:41,180 还有外存呢,它最主流的就是磁盘。 The mainstream of external memory is disk. 9 00:00:41,180 --> 00:00:49,940 那还有我们也有一些磁带等 这样用于存储大量信息以及永久存储的这样的设备。 We also have some other devices such as tapes used to store large amounts of information and permanently. 10 00:00:49,940 --> 00:00:55,780 那我们看一下存储器的这样的一个结构。 Then we look at the structure of memory. 11 00:00:55,780 --> 00:00:59,925 主存和高速缓存呢,它们是 Main memory and cache are 12 00:00:59,925 --> 00:01:04,070 基本的存储器,它是跟CPU可以直接打交道的。 the basic storage. It is possible to deal directly with the CPU . 13 00:01:04,070 --> 00:01:10,200 在有些数据,如果它的数据量特别大,主存 In some cases, especially if it is a large amount of data , the main memory 14 00:01:10,200 --> 00:01:16,330 放不下,那我们就需要用到磁盘这样的辅助的存储设备。 does not fit, we need to use disk as auxiliary devices. 15 00:01:16,330 --> 00:01:23,770 另外呢,我们因为主存它的数据是断电以后就会消失的, In addition, our data in main memory will disappear if power fails. 16 00:01:23,770 --> 00:01:28,030 那我们如果这个数据处理了以后,我需要比较长久的 If after this data is dealt with, I need to a long-term 17 00:01:28,030 --> 00:01:32,290 保存以及传输,我们要放到磁盘里面。 preservation and transmission, we need to put it into the disk. 18 00:01:32,290 --> 00:01:39,705 那有些特别大量的数据, 我们需要永久的保留,那还要用到3G的 For some particularly large amount of data, we need to retain permanent, and that also need 3G 19 00:01:39,705 --> 00:01:43,920 存储设备,像磁带啊,光盘这样的结构。 storage devices, such as tape and light disc. 20 00:01:43,920 --> 00:01:49,390 那我们看一下磁盘的物理结构。 Then we look at the physical structure of the disk. 21 00:01:49,390 --> 00:01:54,860 磁盘呢,它是由这么一个桌面以及若干个碟片 Disk is made up by a desktop, several discs 22 00:01:54,860 --> 00:01:58,930 还有机械手臂来组成的。那我们碟片呢 and a mechanical arm. The disc 23 00:01:58,930 --> 00:02:03,000 围着这个主轴这样不断的转。 rotates around the spindle continually. 24 00:02:03,000 --> 00:02:09,385 而这个主轴上头呢,有这么若干个碟片,每个碟片上呢它都是 And on this spindle, there are several discs, each disc 25 00:02:09,385 --> 00:02:15,770 有一圈一圈的这个磁道,数据就在这个磁道上面。 has some tracks. The data is on this track. 26 00:02:15,770 --> 00:02:21,020 那为了访问不同磁道上面的这些数据,那我们 In order to access the data in different tracks, we 27 00:02:21,020 --> 00:02:26,270 会有这样的一个活动臂,它带着这个磁盘的读写头 have such a movable arm. It moves left and right with the disk read-write head. 28 00:02:26,270 --> 00:02:31,320 这样的左右的移动,然后我们就可以 Then we can 29 00:02:31,320 --> 00:02:36,370 准确的定位到哪一个碟片的哪一个磁道以及磁道上的 accurately go to a certain block in a certain track on a certain disc, 30 00:02:36,370 --> 00:02:39,900 哪一块,我们一般称为扇区。 generally called sector. 31 00:02:39,900 --> 00:02:46,680 那我们看一下这个磁盘的这样的信息的组织。 Then we look at the organization of information on the disk. 32 00:02:46,680 --> 00:02:53,560 在早期呢,这个磁盘的这些磁道上头这一圈圈上面的这些数据呢, In early, the data on these tracks 33 00:02:53,560 --> 00:02:58,550 我们每一个磁道上头存的数据的量 on the disk has the same amount. 34 00:02:58,550 --> 00:03:03,540 是一样的。这样的话,我们在内圈它就更密集些,外圈呢更稀疏。 In this case, it is more dense in the inner circle, and more sparse in the outer circle. 35 00:03:03,540 --> 00:03:09,960 那现在的发展呢,我们可以在这个外圈呢存储 Now, we can see there are more data in the outer circle. 36 00:03:09,960 --> 00:03:16,400 更多的数据,那就是不是这样的一个原来这种形式了。 It is not such an original form. 37 00:03:16,400 --> 00:03:23,670 那我们来看一下,在每一个磁道上头这些数据块的组织。 Then we look at the organization of these data blocks on each track. 38 00:03:23,670 --> 00:03:28,100 那我们是以这样的一个一个的页块来组织的。 We use such a one page block to organize. 39 00:03:28,100 --> 00:03:32,530 每一个页块呢,我们根据这个磁盘的 Each page block differs from the disks, 40 00:03:32,530 --> 00:03:39,010 这样的不同,以及操作系统或者是数据库它的这个文件管理的这个组织不同, the operating system or database file management, 41 00:03:39,010 --> 00:03:45,590 我们可能分的大小不一样,那主流的呢就是有512字节和 We may divide into different sizes, and the mainstream is to have 512 bytes or 42 00:03:45,590 --> 00:03:53,590 1024字节这样的一些大小。那我们 1024 bytes. Then the advantage 43 00:03:53,590 --> 00:03:57,615 分成这样的页块的好处呢就是我们没必要去 for us to divide it into blocks is that we do not need to 44 00:03:57,615 --> 00:04:01,640 在磁盘上按照一个一个字节的这种访问单位, access unit of a byte but 45 00:04:01,640 --> 00:04:06,120 我们按这一个一个页块,每次读若干个字节, the unit of a page, which is several bytes. 46 00:04:06,120 --> 00:04:10,600 那我们可以节省这样的定位到页块的时间。 Then we can save the time of this page positioning. 47 00:04:10,600 --> 00:04:16,595 那在早期呢,它的旋转速度是比较低的, In early, the rotational speed is relatively low, 48 00:04:16,595 --> 00:04:22,590 那我们读完这一块呢,我可以紧接着在下一块读,好像没什么问题。 If we read this one yet, I can immediately read the next one, it seems no problem. 49 00:04:22,590 --> 00:04:27,690 那我们随着磁盘技术的发展,它的旋转速度越来越快, With the development of disk technology, it's rotating faster and faster. 50 00:04:27,690 --> 00:04:33,750 那也就是说它的读取的速度也是越来越快了。那这样的话呢, That means its speed to read is also faster and faster.In this case, 51 00:04:33,750 --> 00:04:41,240 如果我们还是按照它的相邻页块这样的相邻的物理存储的话呢, if we follow the adjacent physical storage as the adjacent page , 52 00:04:41,240 --> 00:04:45,830 就不划算,因为我们读是一个电子的活动, it is not worthwhile , because the read is an electronic activity. 53 00:04:45,830 --> 00:04:50,420 而这样的一个旋转呢,它是一个机械的这样的运动。 And such rotating is a mechanical movement . 54 00:04:50,420 --> 00:04:58,730 那我们读完以后,要等着它 转到这个下一块,那如果你速度太快了,它已经转过去了, After reading that we should wait for it to go to the next one , if you are too fast, it has turn passed, 55 00:04:58,730 --> 00:05:06,290 因此呢,我们就有这样的一个交错因子。我让它的下一个相邻的这个块呢, So then, we have such an interleave factor. I let the next adjacent block 56 00:05:06,290 --> 00:05:10,690 只隔了若干个这样的区域,然后 separated by only several such area, and then 57 00:05:10,690 --> 00:05:15,090 它读完以后,我转过来正好就轮到下一个页块。 after reading it , I just turn on and turn to the next page block . 58 00:05:15,090 --> 00:05:19,885 那内存它是可以直接跟 It is possible that memory directly exchange data with 59 00:05:19,885 --> 00:05:24,680 CPU来进行数据的交换,它的访问速度非常的快。 CPU, and its access speed is very fast. 60 00:05:24,680 --> 00:05:31,345 但是呢,它的存储量非常小,而且呢断电以后呢这个数据会丢失, But its storage capacity is very small, and it loses data after power fails. 61 00:05:31,345 --> 00:05:38,010 它造价也比较大。当然内存我们一般来说。我们认为它是随机访问的, Its cost is relatively large. Of course, generally, we think the memory is randomly accessed, 62 00:05:38,010 --> 00:05:42,305 也就是说,我对一个地址的数据进行一个 In other words, the time to read data 63 00:05:42,305 --> 00:05:46,600 读取的时候呢,它是一个常数的一个时间。 is a time of a constant. 64 00:05:46,600 --> 00:05:52,685 那内存呢有它的相应的缺点,那因此呢我们需要用到外存。 It has its respective memory disadvantage so that we need to use an external memory. 65 00:05:52,685 --> 00:05:58,770 那外存呢,它可以长久的保留这些信息,而且它的 The external memory can retain this information for a long time and its 66 00:05:58,770 --> 00:06:04,100 数据量更大。那用它来跟这个内存呢进行 amount of data is much larger. Then we use it to exchange data with the memory. 67 00:06:04,100 --> 00:06:09,430 数据的这样的交换呢,我们可以处理更大量的这样的数据。 We can handle larger amounts of data. 68 00:06:09,430 --> 00:06:13,940 但是外存呢它的访问速度非常的慢。 But the access speed of the external memory is very slow. 69 00:06:13,940 --> 00:06:18,450 我们读一个字节的这样的一个访问, The visit of reading a byte, 70 00:06:18,450 --> 00:06:25,310 在内存里面呢,我们需要的是纳秒级的时间,也就是10亿分之一的时间, in memory, we need a nanosecond of time, that is one billion of one second, 71 00:06:25,310 --> 00:06:33,640 而在外存里面呢,相应的它需要 毫秒的时间,也就是千分之一秒的时间。 In the external memory, the corresponding time it takes is milliseconds, which is a thousand of a second. 72 00:06:33,640 --> 00:06:37,640 这两个量级之间的差呢,大概有100万倍。 There are about 100 million times between these two orders of magnitude. 73 00:06:37,640 --> 00:06:45,070 因此我们涉及到外存的这些计算机程序的处理的时候呢,我们都需要 So the computer programs related to the external memory, we need 74 00:06:45,070 --> 00:06:52,390 想办法减少访外的次数,通过内存对数据进行一些整理,然后呢使得 to find ways to reduce the number of times visiting external memory through clearing up data 75 00:06:52,390 --> 00:06:59,710 它尽量的不要去频繁访外,然后使得我们整体的这个处理效率呢 and make our overall efficiency of this process higher. 76 00:06:59,710 --> 00:07:07,000 更高。那我们看一下这个跟存储设备相关的一些单位。 Let us look at the units associated with the storage device . 77 00:07:07,000 --> 00:07:12,430 那像这个KB,兆,然后G,T等等。 It was like KB, MB, then G, T and so on. 78 00:07:12,430 --> 00:07:17,590 那特别是有这么一个数据,叫Googol,这实际上 There is such a particular data, called Googol, which actually 79 00:07:17,590 --> 00:07:22,750 是一个天文的数字。那它是10的100次方。 is an astronomical figure. It is 10 to the 100. 80 00:07:22,750 --> 00:07:28,545 文件呢它是记录的一个汇集, The file is a collection of records, 81 00:07:28,545 --> 00:07:34,340 但记录在底层它其实也是二进制的这样的数据来构成的。 The record is actually saved in the binary format. 82 00:07:34,340 --> 00:07:40,710 那我们实际上是通过程序员编写这样的程序 In fact, the programmer writes such programs 83 00:07:40,710 --> 00:07:44,895 来使得我们这些二进制的这些信息呢。 to make the binary information 84 00:07:44,895 --> 00:07:49,080 具有一些逻辑的结构。那是这一种记录的形式。 in some logical structure. It was in the form of records. 85 00:07:49,080 --> 00:07:54,620 那文件它的本质呢是一种线性的结构,那也就是说它的这个记录 The file is a linear structure essentially, and that means the layout of the record 86 00:07:54,620 --> 00:08:00,160 之间的这个排放呢,在这个存储设备的这个外存里面呢, in the external memory of the storage device, 87 00:08:00,160 --> 00:08:07,585 它都是挨个挨个进行的。那我们也可以看作是像我们老式的这种 is carried out one by one. It can be seen as old-fashioned tape, 88 00:08:07,585 --> 00:08:15,010 录音带那它这样的一个顺序的这个结构,那我们要读到相应的这个音乐的话呢, such a sequence structure. When we have to read the corresponding words, 89 00:08:15,010 --> 00:08:21,530 我就需要定位到那个带子的一个位置,然后我再开始挨个的读数据。 I need to navigate to a location of the tape, and then I began to read the data one by one . 90 00:08:21,530 --> 00:08:27,240 那文件的这样的组织呢,我们分为 Such organizations of the file is divided 91 00:08:27,240 --> 00:08:32,950 逻辑和物理两个层面。我们从程序员的角度来说, into logical and physical levels. From programmer's point of view, 92 00:08:32,950 --> 00:08:38,050 我是逻辑的来看待它的。也就是说我可以 It is seen logically. That means 93 00:08:38,050 --> 00:08:43,150 是看作是由记录组成的这样的汇集。 this is seen as a collection of records. 94 00:08:43,150 --> 00:08:48,065 那这个记录呢,当然它就类似于我们在这个 This record , of course, is similar as 95 00:08:48,065 --> 00:08:52,980 C语言里面的这种结构体,那它是由若干个数据域组成的。 the "structure" in the C language, and that it is composed of a plurality of data fields. 96 00:08:52,980 --> 00:08:59,690 但是呢,在物理上来说呢,它还是本质上是这样的0,1的信息。 But in the physical level, it is essentially the information of 0,1. 97 00:08:59,690 --> 00:09:04,690 但是我们并不是一个字节一个字节的来管理它,我们是 But we do not manage it byte by byte. We 98 00:09:04,690 --> 00:09:09,690 成块的把它分布在整个磁盘中。 put it into blocks distributed across the disk. 99 00:09:09,690 --> 00:09:16,090 那我们怎么样来管理这些磁盘里面的这些文件的 Then how do we manage these files in these discs. 100 00:09:16,090 --> 00:09:22,490 页块呢?那就是通过操作系统或者是数据库的文件管理器来组织的。 Page block? That is, the operating system or the file manager of database to organize it. 101 00:09:22,490 --> 00:09:26,990 那它们的作用就是把这个逻辑的这样的 Their role is to map the logic organization 102 00:09:26,990 --> 00:09:31,490 一个组织映射为它的物理的这个地址。 to the its physical address. 103 00:09:31,490 --> 00:09:37,345 文件从逻辑上来说主要有3种形式, Files logically have mainly three kinds of forms, 104 00:09:37,345 --> 00:09:43,200 就是顺利定长的,顺序变长的,还有按照关键码存取 Is a fixed-length sequence, variable-length sequence, as well as access according to the key. 105 00:09:43,200 --> 00:09:49,610 这样的形式。那我们常见的物理组织形式呢, Such forms.Our common form of physical organization, 106 00:09:49,610 --> 00:09:54,800 本质上它都是顺序的结构,也就是顺序文件。 essentially is the order of the structure, which is a sequential file. 107 00:09:54,800 --> 00:10:00,930 但是我们为了要加快对文件的一个操作处理, However, we want to speed up the process of file. 108 00:10:00,930 --> 00:10:07,465 那我们还有一类呢是可以计算寻址,也就是散列文件。 There is another kind that is possible to calculate the address, which is the hash file. 109 00:10:07,465 --> 00:10:14,000 我能够通过你所需要的经常进行查找的这样的一个数据与, I was able to hash the data that you frequently access, 110 00:10:14,000 --> 00:10:18,710 我把它散列到相应的这个文件的一个四方块里面。 into a square of the corresponding file. 111 00:10:18,710 --> 00:10:23,420 我们把那个文件的这个四方块呢,我们也可以说是散列桶。 The square can also be called as the hash bucket. 112 00:10:23,420 --> 00:10:30,860 那还有我们如果经常需要对某一些这样的关键码进行访问存取, Maybe we often need to carry out some of these key. 113 00:10:30,860 --> 00:10:34,790 我们就可以把这个文件之上呢 We can establish an index structure 114 00:10:34,790 --> 00:10:41,480 建立一种索引的结构,例如我们数据库里面常用的b树,b+树, upon the file, for example, the B and B+ tree often used in database, 115 00:10:41,480 --> 00:10:46,240 那还有呢,在对于这些文本文件 Besides, in the process of these text files, 116 00:10:46,240 --> 00:10:51,000 处理的时候,我们以关键码的形式形成倒排文件, we have formed inverted file in the form of key codes, 117 00:10:51,000 --> 00:11:00,210 然后使得我们在检索相应的关键码的时候,不需要一个一个文件去读,只需要从倒排文件里面拿出相应的记录。 That allows us to retrieve the corresponding key without the need of reading files one by one.just take the corresponding record from the inverted file. 118 00:11:00,210 --> 00:11:04,960 文件上头的操作跟我们 File operations is similar to 119 00:11:04,960 --> 00:11:10,680 一般的数据结构的操作呢是类似的,就我们常见的是增删查改。 the operations of general data structure, it is our common CRUD. 120 00:11:10,680 --> 00:11:14,860 那还有一个特别重要的就是排序, There is a particularly important sort, 121 00:11:14,860 --> 00:11:21,870 因为我们在前面介绍了对于这样的文件我们如果是需要加快它的一个 Because we have previously introduced in order to speed up the read of such files 122 00:11:21,870 --> 00:11:28,880 读取的速度呢,我们可能要建立索引。那这个索引就是对一些相应的数据域进行的。 we might want to create the index. This index is upon some of the corresponding data domain. 123 00:11:28,880 --> 00:11:35,820 那这些数据域,我们最好是事先对它进行了排序整理,这样我们的索引呢就更加高效。 These data fields need to be sort in advance, so that our index is even more efficient. 124 00:11:35,820 --> 00:11:41,710 那我们看一下文件的标准输入输出流。 Then we look at the standard input and output streams of file. 125 00:11:41,710 --> 00:11:49,380 那所谓标准输入,就是我们键盘的这个输入;所谓标准输出呢,就是屏幕上头的输出。 That so-called standard input is our keyboard input ; so-called standard output is the screen output. 126 00:11:49,380 --> 00:11:54,405 那相应于这个标准的输入输出, That corresponding to the standard input and output, 127 00:11:54,405 --> 00:11:59,430 我们派生出对文件的相应的输入输出。 We derive the corresponding file input and output . 128 00:11:59,430 --> 00:12:06,370 那我们再来看一下相关的文件类。 Then we take another look at the relevant file classes. 129 00:12:06,370 --> 00:12:12,420 那对文件的操作,我们最主要的首先是一个定位。 For the file operation, the most important is to position. 130 00:12:12,420 --> 00:12:18,685 我们把文件指针给它定到相应的这个位置,那这个位置呢, We set the file pointer to the appropriate location, and at this position, 131 00:12:18,685 --> 00:12:24,950 是可以进行这样的相应的读和写的操作。 we can perform corresponding read and write operations. 132 00:12:24,950 --> 00:12:29,125 那我们可以看到,这个函数里面呢 Then we can see , in this function 133 00:12:29,125 --> 00:12:33,300 就有好多个函数都是关于定位的操作。 there are a good number of functions about the positioning operating. 134 00:12:33,300 --> 00:12:37,310 另外还有读和写的操作。 There are also read and write operations. 135 00:12:37,310 --> 00:12:44,840 为了加快这些外存数据的访问速度, In order to speed up the access of these external memory data 136 00:12:44,840 --> 00:12:49,605 那我们就考虑,有可能我正在访问的或者是说 It is possible that the data I am visiting or 137 00:12:49,605 --> 00:12:54,370 前面曾经访问过的这些数据呢可能将来还要访问。 I had visited before may be visited again in the future. 138 00:12:54,370 --> 00:13:00,430 那我们就可以把它在内存里面呢开辟一个缓冲区。 Then we can open a buffer in memory and put it in. 139 00:13:00,430 --> 00:13:09,240 那我把这些数据放在那。那我将来再要访问的时候呢,我就可以直接从缓冲区里面去读。那若干个这样的 When I visit again in the future,I can read directly from the buffer. Several such 140 00:13:09,240 --> 00:13:13,650 缓冲区呢,它就组成了这样的缓冲池。 buffers form a buffer pool. 141 00:13:13,650 --> 00:13:18,735 我们如果是还有一些信息,还需要放到这个 When there is some information that we also need to put in 142 00:13:18,735 --> 00:13:23,820 缓冲区里头来的时候,也就是我这个缓冲区 the buffer and the buffer is overflow, 143 00:13:23,820 --> 00:13:31,230 有可能会出现一个溢出的情况,那我们就要有这样的淘汰的策略, we have to have this elimination strategy, 144 00:13:31,230 --> 00:13:36,300 把这个缓冲区里面的一些不常用的这种块呢 and relieve such blocks those are visited less frequently. 145 00:13:36,300 --> 00:13:41,370 给它释放掉,然后来存到我们新的页的信息。 and then keep the information in our new page. 146 00:13:41,370 --> 00:13:48,870 那有几种策略。先进先出呢就是我最先进来的我最先给它淘汰。 There are several strategies. FIFO is that the first in will be eliminated first. 147 00:13:48,870 --> 00:13:55,710 然后但这种情况呢我可能前面最先进来的,但是我马上我又轮到要访问了, But there is a condition that for the first to come in, I'll have to visit it again, 148 00:13:55,710 --> 00:14:02,870 那怎么办呢?我就有一个新的办法就是说按照最不频繁使用来淘汰。 How to do it? I have a new approach that is according to the least frequently used. 149 00:14:02,870 --> 00:14:10,120 但是如果我有一个块,可能被恶狠狠的使用了10000次,那后面呢我又不怎么用了, But if I have a block that may be used 10,000 times ferociously, but I will seldom visit it later. 150 00:14:10,120 --> 00:14:15,435 那那个用的次数最多的块呢,它永远不淘汰,这也是有问题的。 This used most block will never be eliminated, and this is a problem. 151 00:14:15,435 --> 00:14:20,750 那我们最常用的就是,“最近最少使用”这样的策略,也就是说, Our most popular is the "least recently used" strategy, that is, 152 00:14:20,750 --> 00:14:25,010 我不断的轮循,那些没有使用的呢我就给它 I do the polling constantly, and for those not used, I 'll not give it a mark 153 00:14:25,010 --> 00:14:29,270 不标记。那使用的呢我给它标记上。那在 For those used I will mark it on. 154 00:14:29,270 --> 00:14:34,760 这个进行这样的一个寻找可淘汰的缓冲区的时候呢, In this way to find those can be eliminated, 155 00:14:34,760 --> 00:14:40,250 我们那些“最近正在使用”的就可以避免被淘汰。 those "being recently used" can be avoided eliminated. 156 00:14:40,250 --> 00:14:47,060 那我们最后留一下思考。大家查询一下这些 Then leave you some thinking. You see about 157 00:14:47,060 --> 00:14:50,770 存储设备它们的每字节的价格。 how much is such storage device per byte. 158 00:14:50,770 --> 00:14:56,825 另外呢,特别差询一下主流的硬盘的一些性能指标。 In addition, particularly see about the performance indicator of the mainstream hard drive. 159 00:14:56,825 --> 00:15:02,880 我们这里面比如说,像这个磁盘旋转速度啊,还有 For example, the disk rotation speed , 160 00:15:02,880 --> 00:15:06,770 这样的寻道时间啊,交错因子啊等等这些指标。 and track seeking time, interleave factor, etc. 161 00:15:06,770 --> 00:15:12,640 谢谢大家。 Thank you.