
【Google論文】Google File System 中文翻譯(第1-2章)


The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google

【西周翻譯】


ABSTRACT 概述

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
我們設計并實現(xiàn)了Google File System(簡稱GFS),一個面向大型分布式數(shù)據(jù)密集型應用的可擴展分布式文件系統(tǒng)。它運行在廉價的普通商用硬件上并提供容錯能力,同時為大量的客戶端提供很高的整體性能。
While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.
GFS與此前的分布式文件系統(tǒng)具有許多相同的目標,但我們的設計是基于對我們的應用負載和技術環(huán)境的觀察而來,既包含當前狀況,也包含今后的發(fā)展,這與一些早期文件系統(tǒng)的設計假定有了明顯的不同。這促使我們重新考慮傳統(tǒng)的選擇,去探索完全不同的設計點。
The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
這個文件系統(tǒng)成功地滿足了我們的存儲需求。它在Google被廣泛部署,作為我們的業(yè)務生成和處理數(shù)據(jù)的存儲平臺,也用于需要大規(guī)模數(shù)據(jù)集的研究和開發(fā)工作。迄今為止最大的集群基于一千多臺機器上的數(shù)千個磁盤提供數(shù)百TB的存儲,并被數(shù)百個客戶端并發(fā)訪問。
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.
在這篇論文中,我們展示了為支持分布式應用而設計的文件系統(tǒng)接口擴展,討論了我們設計的許多方面,并報告了來自微基準測試和真實環(huán)境使用的測量數(shù)據(jù)。

1. INTRODUCTION 簡介

We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.
我們設計并實現(xiàn)了GFS,來應對Google快速增長的數(shù)據(jù)處理需求。GFS和此前的分布式文件系統(tǒng)具有許多相同的目標,如性能、可擴展性、可靠性和可用性。然而,GFS的設計是由我們的應用負載情況及技術環(huán)境(包括當前的和預期的)所驅(qū)動,與一些早期文件系統(tǒng)的設計假定有明顯的不同。我們重新考慮了傳統(tǒng)的選擇,并在設計空間中探索了完全不同的方向。
First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
第一,組件失效是常態(tài)而非例外。文件系統(tǒng)由成百上千臺基于廉價普通硬件的存儲機器構(gòu)成,同時被數(shù)量相當?shù)目蛻舳藱C器訪問。組件的數(shù)量和質(zhì)量幾乎決定了在任意時刻總會有一些組件失效,而其中一些無法從當前的失效狀態(tài)中恢復。我們曾見過由應用缺陷、操作系統(tǒng)缺陷、人為錯誤,以及磁盤、內(nèi)存、連接器、網(wǎng)絡、電源故障所引發(fā)的問題。因此,持續(xù)監(jiān)視、錯誤檢測、容錯和自動恢復必須成為系統(tǒng)不可分割的部分。
Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.
第二,以傳統(tǒng)標準衡量,我們的文件非常巨大,數(shù)GB的文件十分常見。每個文件通常包含許多應用對象,諸如Web文檔等。當我們經(jīng)常要處理由數(shù)十億個對象構(gòu)成、快速增長的TB級數(shù)據(jù)集時,即使文件系統(tǒng)能夠支持,管理數(shù)十億個KB大小的文件也是非常笨拙的。所以,設計假定和參數(shù),如I/O操作和塊大小等,必須被重新審視。
Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
第三,多數(shù)文件的變動是追加新的數(shù)據(jù),而非重寫已有數(shù)據(jù)。文件內(nèi)的隨機寫操作實際上幾乎不存在。一旦寫入完成,文件就只會被讀取,而且通常只是順序讀取。多種數(shù)據(jù)具有這樣的特征:有些構(gòu)成供數(shù)據(jù)分析程序掃描的大型存儲庫;有些是運行中的應用持續(xù)產(chǎn)生的數(shù)據(jù)流;有些是歸檔數(shù)據(jù);有些是在一臺機器上產(chǎn)生、在另一臺機器上(同時或稍后)處理的中間結(jié)果。鑒于這種針對巨大文件的訪問模式,追加成為性能優(yōu)化和原子性保證的焦點,而在客戶端緩存數(shù)據(jù)塊則失去了吸引力。
Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility. For example, we have relaxed GFS’s consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more details later in the paper.
第四,協(xié)同設計應用和文件系統(tǒng)API,通過增加靈活性使整個系統(tǒng)受益。例如,我們放寬了GFS的一致性模型,從而極大地簡化了文件系統(tǒng),又不會給應用帶來沉重的負擔。我們還引入了原子追加(Append)操作,使多個客戶端可以并發(fā)地向同一個文件追加內(nèi)容,而無需在它們之間進行額外的同步。這些將在論文的后續(xù)章節(jié)詳細討論。
Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.
目前部署了多個GFS集群,用于不同的用途。最大的集群擁有超過1000個存儲節(jié)點、超過300TB的磁盤存儲,并被分布在不同機器上的數(shù)百個客戶端持續(xù)密集地訪問。

2. DESIGN OVERVIEW 設計概覽

2.1 Assumptions 假定

In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more details.
在設計符合我們需求的文件系統(tǒng)時,我們以下述假定為指導,它們既帶來挑戰(zhàn),也帶來機會。前面我們提到過一些關鍵的觀察,現(xiàn)在將其更詳細地列出。
• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
系統(tǒng)由許多經(jīng)常失效的廉價普通組件構(gòu)成。它必須持續(xù)地自我監(jiān)視,并把檢測、容忍組件失效以及從中快速恢復作為例行工作。
• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
系統(tǒng)存儲一定數(shù)目的大型文件。我們預期有數(shù)百萬個文件,每個通常是100MB或更大。數(shù)GB的文件是常見情形,需要被有效地管理。小文件也必須支持,但我們無需為其優(yōu)化。
• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
系統(tǒng)的負載主要來自兩種讀操作:大型流式讀取和小型隨機讀取。在大型流式讀取中,單個操作通常讀取數(shù)百KB,更常見的是1MB或以上。來自同一客戶端的連續(xù)操作通常讀取一個文件的連續(xù)區(qū)間。小型隨機讀取通常在任意偏移處讀取若干KB的數(shù)據(jù)。關注性能的應用往往會將小型讀操作打包并排序,從而在文件中平穩(wěn)地向前推進,而非反復前后跳轉(zhuǎn)(本列表之后附有一段示意代碼)。
• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
系統(tǒng)的負載也包含許多向文件追加數(shù)據(jù)的大型順序?qū)懖僮鳌Mǔ?懙拇笮∨c讀取相近。一旦寫入完成,文件就幾乎不會再被修改。系統(tǒng)也支持在文件任意位置的小型寫入,但不必高效。
• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
對于多個客戶端并發(fā)向同一個文件追加數(shù)據(jù)的情況,系統(tǒng)必須高效地實現(xiàn)良好定義的語義。我們的文件常被用作“生產(chǎn)者-消費者隊列”或用于多路合并。數(shù)以百計的生產(chǎn)者(每臺機器上運行一個)會并發(fā)地向同一個文件追加數(shù)據(jù)。以最小的同步開銷實現(xiàn)原子性是必不可少的。文件可能在之后被讀取,也可能有消費者正在同時讀取它。
• High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
持續(xù)的高帶寬比低延遲更為重要。我們的多數(shù)目標應用重視以高速率批量處理數(shù)據(jù),只有少數(shù)應用對單個讀寫操作的響應時間有嚴格要求。
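As a concrete illustration of the batching-and-sorting point in the read-workload assumption above, here is a minimal sketch. It is not from the paper; the read_range callable is a hypothetical placeholder for any positional-read primitive.

```python
# Batch and sort small random reads so the client advances steadily through the file
# instead of seeking back and forth. Nothing here is GFS-specific.
from typing import Callable, List, Tuple

def batched_read(read_range: Callable[[int, int], bytes],
                 requests: List[Tuple[int, int]]) -> List[bytes]:
    """requests: (offset, length) pairs in arbitrary order; results keep the caller's order."""
    order = sorted(range(len(requests)), key=lambda i: requests[i][0])  # ascending offsets
    results: List[bytes] = [b""] * len(requests)
    for i in order:
        offset, length = requests[i]
        results[i] = read_range(offset, length)  # issue reads in forward order through the file
    return results
```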

2.2 Interface 接口

GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.
GFS提供了一套常見的文件系統(tǒng)接口,雖然它并沒有實現(xiàn)諸如POSIX這樣的標準API。文件在目錄中以層次化的形式進行組織,可以通過路徑名稱進行標識。我們提供了諸如創(chuàng)建、刪除、打開、關閉、讀和寫文件這樣的常見操作。
Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
GFS還提供快照(Snapshot)和記錄追加(Record Append)操作。快照以很低的成本創(chuàng)建一個文件或一個目錄樹的拷貝。記錄追加允許多個客戶端并發(fā)地向同一個文件追加數(shù)據(jù),同時確保每個客戶端各自追加操作的原子性。這對于實現(xiàn)多路合并結(jié)果以及“生產(chǎn)者-消費者隊列”非常有用,許多客戶端可以同時向其追加而無需額外加鎖。我們發(fā)現(xiàn)在構(gòu)造大型分布式應用時,這種類型的文件非常有價值。快照和記錄追加將分別在3.4節(jié)和3.3節(jié)中進一步討論。
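To make the interface concrete, here is a sketch of what a GFS-style client API could look like, covering only the operations named in this section. The class name and method signatures are assumptions for illustration; the paper does not specify the actual client library.

```python
# Hypothetical client-facing surface for the operations listed above.
class GFSClient:
    def create(self, path: str) -> None: ...
    def delete(self, path: str) -> None: ...
    def open(self, path: str) -> "FileHandle": ...
    def close(self, handle: "FileHandle") -> None: ...
    def read(self, handle: "FileHandle", offset: int, length: int) -> bytes: ...
    def write(self, handle: "FileHandle", offset: int, data: bytes) -> None: ...

    def snapshot(self, src_path: str, dst_path: str) -> None:
        """Cheaply copy a file or directory tree (Section 3.4)."""

    def record_append(self, handle: "FileHandle", record: bytes) -> int:
        """Atomically append `record` at an offset chosen by GFS; return that offset (Section 3.3)."""
```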


2.3 Architecture 架構(gòu)

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
一個GFS集群由單個master和多個塊服務器(chunkserver)組成,并被多個客戶端訪問,如圖1所示。這些機器通常都是運行著用戶態(tài)服務進程的普通商用Linux機器。只要機器資源允許,并且可以接受運行可能不穩(wěn)定的應用代碼所帶來的較低可靠性,塊服務器和客戶端也可以很容易地運行在同一臺機器上。

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.
文件被分割成固定大小的塊(chunk)。每個塊都由一個不可變的、全局唯一的64位塊句柄標識,該句柄由master在創(chuàng)建塊時分配。塊服務器以Linux文件的形式將塊存儲在本地磁盤上,并根據(jù)指定的塊句柄和字節(jié)范圍來讀寫塊數(shù)據(jù)。為了可靠性,每個塊被復制到多個塊服務器上。缺省情況下,我們保存三份復本,用戶也可以為文件名字空間的不同區(qū)域指定不同的復制級別。
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
Master維護所有的文件系統(tǒng)元數(shù)據(jù),包括名字空間、訪問控制信息、文件到塊的映射,以及塊的當前位置。它還控制系統(tǒng)層面的活動,諸如塊租約管理、孤立塊的垃圾回收、塊服務器之間的塊遷移。Master定期通過心跳(HeartBeat)消息與每個塊服務器通信,向其發(fā)送指令并收集其狀態(tài)。
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.
鏈接到每個應用中的GFS客戶端代碼實現(xiàn)了文件系統(tǒng)API,并與master和塊服務器通信,代替應用程序讀寫數(shù)據(jù)。客戶端與master交互以進行元數(shù)據(jù)操作,但所有承載數(shù)據(jù)的通信都直接與塊服務器進行。我們沒有提供POSIX API,因此無需掛接到Linux的vnode層。
Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.
客戶端和塊服務器都不緩存文件數(shù)據(jù)。客戶端緩存的益處極小,因為多數(shù)應用以流的方式讀寫巨大的文件,或者工作集太大而無法緩存。不做緩存消除了緩存一致性問題,使客戶端和整個系統(tǒng)都變得簡單。(不過,客戶端確實會緩存元數(shù)據(jù)。)塊服務器也無需緩存文件數(shù)據(jù),因為塊以本地文件的形式存放,Linux的緩沖區(qū)緩存已經(jīng)把頻繁訪問的數(shù)據(jù)保存在內(nèi)存中了。


2.4 Single Master 單Master

Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
單一的master極大地簡化了我們的設計,也使master可以基于全局知識做出復雜的塊放置和復制決策。但是我們必須使master在讀寫中的參與最小化,以避免它成為瓶頸。客戶端從不通過master讀寫文件數(shù)據(jù);相反,客戶端會詢問master它應該與哪些塊服務器交互,然后將這些信息緩存一段有限的時間,后續(xù)的許多操作都直接與塊服務器交互。
Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.
讓我們用圖1來解釋一下一個簡單的讀操作的交互過程。首先,使用固定的塊大小,客戶端將文件名和應用指定的字節(jié)偏移量轉(zhuǎn)換成文件內(nèi)部的塊索引。然后,客戶端向master發(fā)送一個請求,包含文件名和塊索引。master響應對應的塊句柄和復本的位置。客戶端將這些信息進行緩存,使用文件名和塊索引作為Key。
The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.
然后,客戶端向其中一個復本(通常是最近的一個)發(fā)送請求。請求指定了塊句柄和該塊內(nèi)部的一個字節(jié)范圍。在緩存信息過期或文件被重新打開之前,對同一個塊的后續(xù)讀取不再需要客戶端與master交互。事實上,客戶端通常會在同一個請求中詢問多個塊的信息,master也可以在響應中附上緊隨所請求塊之后那些塊的信息。這些額外的信息避免了將來若干次客戶端與master的交互,而幾乎沒有任何額外的代價。
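The read path above can be summarized in a short sketch, written under the paper's stated assumptions (fixed 64 MB chunks, lookups cached by file name and chunk index). The master stub, its find_chunk call, and read_from_replica are hypothetical names, not the real GFS interfaces.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size (Section 2.5)

class GFSReadClient:
    def __init__(self, master):
        self.master = master   # RPC stub to the single master (hypothetical)
        self.cache = {}        # (file_name, chunk_index) -> (chunk_handle, replica_locations)

    def read(self, file_name, offset, length, read_from_replica):
        chunk_index = offset // CHUNK_SIZE          # translate byte offset into a chunk index
        key = (file_name, chunk_index)
        if key not in self.cache:
            # one request to the master: file name + chunk index -> handle + replica locations
            self.cache[key] = self.master.find_chunk(file_name, chunk_index)
        handle, replicas = self.cache[key]
        # file data flows directly from a chunkserver (e.g. the closest replica), never via the master
        return read_from_replica(replicas[0], handle, offset % CHUNK_SIZE, length)
```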

2.5 Chunk Size 塊大小

Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
塊大小是關鍵的設計參數(shù)之一。我們選擇了64MB,這比典型的文件系統(tǒng)塊要大出很多。每個塊的復本在塊服務器上都存儲為一個普通的Linux文件,僅在需要時才進行擴展。“懶”空間分配避免了內(nèi)部碎片導致的空間浪費,而內(nèi)部碎片或許是反對如此之大的塊尺寸的最大理由。
A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
大塊尺寸有幾個重要的好處。首先,它減少了客戶端與master交互的需求,因為對同一個塊的讀寫只需要向master發(fā)送一次獲取塊位置信息的初始請求。對于我們的負載,這種減少尤為顯著,因為應用通常對大型文件進行順序讀寫。即使是小型隨機讀,客戶端也可以輕松地緩存一個數(shù)TB工作集的全部塊位置信息。第二,由于塊足夠大,客戶端更有可能在一個給定的塊上進行多次操作,這樣就可以在較長時間內(nèi)與塊服務器保持一個持久的TCP連接,從而降低網(wǎng)絡開銷。第三,它減少了存儲在master上的元數(shù)據(jù)的大小,使我們可以把元數(shù)據(jù)放入內(nèi)存,這又帶來了將在2.6.1節(jié)討論的其他好處。
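A quick back-of-the-envelope check of the caching claim above. The 64 bytes per cache entry is an assumed figure for illustration (a chunk handle plus a few replica addresses), not a number given in the paper.

```python
CHUNK_SIZE = 64 * 1024 ** 2              # 64 MB chunks
WORKING_SET = 10 * 1024 ** 4             # a 10 TB working set (assumed)
BYTES_PER_ENTRY = 64                     # assumed: handle + replica locations per cached chunk

chunks = WORKING_SET // CHUNK_SIZE       # 163,840 chunks
cache_bytes = chunks * BYTES_PER_ENTRY   # about 10 MB of client-side location cache
print(chunks, cache_bytes // 1024 ** 2)  # -> 163840 10
```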
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.
另一方面,即使采用“懶”空間分配,大塊尺寸也有它的缺點。一個小文件只包含少量的塊,甚至可能只有一個。如果許多客戶端訪問同一個文件,存儲這些塊的塊服務器就可能變成熱點。在實踐中,熱點尚未成為主要問題,因為我們的應用大多是順序地讀取大型的、由多個塊組成的文件。
However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.
但是,當GFS第一次被用于一個批處理隊列系統(tǒng)時,熱點還是出現(xiàn)了:一個可執(zhí)行文件作為單塊文件被寫入GFS,然后同時在數(shù)百臺機器上啟動。保存這個可執(zhí)行文件的少數(shù)幾臺塊服務器被數(shù)百個并發(fā)請求壓垮而過載。我們通過以更高的復制因子存儲此類可執(zhí)行文件,并讓批處理隊列系統(tǒng)錯開應用的啟動時間,解決了這個問題。一個潛在的長期解決方案是:允許客戶端在這種情況下從其他客戶端讀取數(shù)據(jù)。

2.6 Metadata 元數(shù)據(jù)

The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
Master存儲三種主要的元數(shù)據(jù):文件和塊的名字空間、文件到塊的映射,以及每個塊復本的位置。所有的元數(shù)據(jù)都保存在Master的內(nèi)存中。前兩類(名字空間和文件到塊的映射)還通過把變動記錄到操作日志中而得到持久化,該日志存儲在Master的本地磁盤上,并在遠程機器上保存復本。使用日志使我們能夠簡單、可靠地更新Master的狀態(tài),而不用擔心Master崩潰造成的不一致。Master不會持久化塊的位置信息;相反,Master在啟動時,以及每當有塊服務器加入集群時,會向塊服務器查詢它持有哪些塊。
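The three kinds of metadata can be pictured as three in-memory maps, only the first two of which are backed by the operation log. This is an illustrative sketch; the field and type names are assumptions, not the actual GFS data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ChunkHandle = int  # immutable, globally unique 64-bit handle

@dataclass
class MasterMetadata:
    # 1. file and chunk namespaces (with access control info) - persisted via the operation log
    namespace: Dict[str, dict] = field(default_factory=dict)
    # 2. mapping from files to chunks - persisted via the operation log
    file_chunks: Dict[str, List[ChunkHandle]] = field(default_factory=dict)
    # 3. current replica locations - NOT persisted; rebuilt by asking chunkservers (Section 2.6.2)
    chunk_locations: Dict[ChunkHandle, List[str]] = field(default_factory=dict)
```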

2.6.1 In-Memory Data Structures 內(nèi)存中的數(shù)據(jù)結(jié)構(gòu)

Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.
由于元數(shù)據(jù)保存在內(nèi)存中,Master的操作非常快。而且,Master可以簡單高效地定期在后臺掃描自己的全部狀態(tài)。這種周期性掃描被用于實現(xiàn)塊的垃圾回收、塊服務器失效時的重新復制,以及為平衡負載和磁盤空間使用而在塊服務器之間進行的塊遷移。4.3和4.4節(jié)將進一步討論這些活動。
One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
對于這種純內(nèi)存方式,一個潛在的憂慮是:塊的數(shù)量、乃至整個系統(tǒng)的容量,受限于Master擁有的內(nèi)存大小。實踐中這并不是一個嚴重的限制。對于每個64MB的塊,Master維護的元數(shù)據(jù)少于64字節(jié)。大多數(shù)塊是滿的,因為多數(shù)文件包含多個塊,只有最后一個塊可能被部分填充。類似地,文件名字空間數(shù)據(jù)通常每個文件也只需要不到64字節(jié),因為它使用前綴壓縮來緊湊地存儲文件名。
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.
如果需要支持一個更大的文件系統(tǒng),在Master上添加內(nèi)存只是很小的投入。將元數(shù)據(jù)放置于內(nèi)存中帶來了簡潔性、可靠性、高性能、靈活性等諸多好處。
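A worked example of the memory argument in this subsection, using the paper's figures of less than 64 bytes of metadata per 64 MB chunk and per file. The 1 PB of chunk data and 3 million files are assumed inputs chosen for illustration.

```python
CHUNK_SIZE = 64 * 1024 ** 2        # 64 MB
META_PER_CHUNK = 64                # upper bound from the paper, bytes per chunk
META_PER_FILE = 64                 # upper bound for a prefix-compressed file name

data_stored = 1024 ** 5            # assume 1 PB of chunk data
num_files = 3_000_000              # assume "a few million" files

chunk_meta = (data_stored // CHUNK_SIZE) * META_PER_CHUNK   # 16,777,216 chunks -> 1 GiB
file_meta = num_files * META_PER_FILE                       # ~183 MiB of namespace data
print(chunk_meta / 1024 ** 3, file_meta / 1024 ** 2)        # -> 1.0  183.1...
```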

2.6.2 Chunk Locations 塊位置

The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.
Master并不持久化記錄哪些塊服務器擁有某個塊的復本,它只是在啟動時向塊服務器輪詢這些信息。此后Master可以使自己保持最新,因為它控制著所有的塊放置操作,并通過定期的心跳(HeartBeat)消息監(jiān)視塊服務器的狀態(tài)。
We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.
我們起初曾嘗試在Master上持久化保存塊的位置信息,但后來認定,在啟動時(以及此后定期地)向塊服務器請求這些數(shù)據(jù)要簡單得多。這消除了在塊服務器加入或離開集群、更改名稱、失效、重啟等情況下保持Master與塊服務器同步的問題。在一個擁有數(shù)百臺服務器的集群中,這些事件經(jīng)常發(fā)生。
Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
理解這個設計決策的另一個角度是:對于哪些塊在自己的磁盤上、哪些不在,塊服務器擁有最終的發(fā)言權(quán)。試圖在Master上維護此信息的一致視圖是沒有意義的,因為塊服務器上的錯誤可能使塊自行消失(例如磁盤損壞而被停用),操作員也可能重命名一個塊服務器。
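The polling and heartbeat scheme above can be sketched as a small directory on the master that is rebuilt entirely from what chunkservers report and is never persisted. Message shapes and method names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set

class ChunkLocationDirectory:
    """Master-side view of replica locations, derived only from chunkserver reports."""
    def __init__(self):
        self.locations: Dict[int, Set[str]] = defaultdict(set)  # chunk handle -> server ids

    def on_report(self, server_id: str, handles: List[int]) -> None:
        """Startup report or HeartBeat listing the chunks a server currently holds."""
        for held in self.locations.values():   # the chunkserver has the final word,
            held.discard(server_id)            # so drop whatever we believed before
        for h in handles:
            self.locations[h].add(server_id)

    def on_server_lost(self, server_id: str) -> None:
        """A server stopped heartbeating; re-replication is handled elsewhere (Section 4.3)."""
        for held in self.locations.values():
            held.discard(server_id)
```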

2.6.3 Operation Log 操作日志

The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.
操作日志包含關鍵元數(shù)據(jù)變動的歷史記錄。它是GFS的核心:不僅是元數(shù)據(jù)唯一的持久化記錄,也充當定義并發(fā)操作順序的邏輯時間線。文件和塊,連同它們的版本(見4.5節(jié)),都由其創(chuàng)建時的邏輯時間唯一且永久地標識。
Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing thereby reducing the impact of flushing and replication on overall system throughput.
由于操作日志至關重要,我們必須可靠地保存它,并且只有在元數(shù)據(jù)的變動被持久化之后,這些變動才對客戶端可見。否則,即使塊本身沒有丟失,我們也可能丟掉整個文件系統(tǒng)或最近的客戶端操作。因此,我們把日志復制到多臺遠程機器上,并且只有在把相應的日志記錄刷寫到本地和遠程的磁盤之后,才響應客戶端的操作。Master會把多條日志記錄打包在一起再刷寫,從而降低刷寫和復制對整個系統(tǒng)吞吐量的影響。
The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small. The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.
Master通過重放操作日志來恢復其文件系統(tǒng)狀態(tài)。為了縮短啟動時間,我們必須讓日志保持較小。每當日志增長超過一定大小,Master就會對自己的狀態(tài)建立檢查點,這樣恢復時只需從本地磁盤加載最新的檢查點,再重放其后有限數(shù)量的日志記錄即可。檢查點采用一種緊湊的類B樹形式,可以直接映射到內(nèi)存中,用于名字空間查詢而無需額外的解析。這進一步加快了恢復速度并提高了可用性。
Because building a checkpoint can take a while, the master’s internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch. It can be created in a minute or so for a cluster with a few million files. When completed, it is written to disk both locally and remotely.
由于創(chuàng)建檢查點需要一些時間,Master的內(nèi)部狀態(tài)被組織成這樣一種形式:創(chuàng)建新檢查點時不會延遲正在到來的變動。Master會切換到一個新的日志文件,并在另一個線程中創(chuàng)建新的檢查點。新的檢查點包含切換之前的所有變動。對于一個包含數(shù)百萬文件的集群,檢查點可以在一分鐘左右創(chuàng)建完成。完成后,它會被寫入本地和遠程的磁盤。
Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.
恢復只需要最新的完整檢查點和其后的日志文件。更早的檢查點和日志文件可以隨意刪除,不過我們會保留最近的幾個以防災難性故障。創(chuàng)建檢查點時發(fā)生失敗不會影響正確性,因為恢復代碼能夠檢測并跳過不完整的檢查點。
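The logging and recovery discipline of this subsection can be sketched as follows: a mutation is acknowledged only after its log record is flushed to local disk and shipped to remote replicas, and recovery loads the latest checkpoint and replays the log records written after it. The file layout, JSON-encoded records, and the replicate callback are simplifications assumed for illustration, not the actual implementation.

```python
import json, os
from typing import Callable, List

class OperationLog:
    def __init__(self, path: str, replicate: Callable[[bytes], None]):
        self.path = path
        self.replicate = replicate        # e.g. ship the batch to remote log replicas
        self.pending: List[dict] = []

    def append(self, record: dict) -> None:
        """Queue a metadata mutation; it is not visible to clients until flush() returns."""
        self.pending.append(record)

    def flush(self) -> None:
        """Batch pending records, make them durable locally and remotely, then acknowledge."""
        data = b"".join(json.dumps(r).encode() + b"\n" for r in self.pending)
        with open(self.path, "ab") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # durable on the local disk
        self.replicate(data)              # and on remote machines
        self.pending.clear()

def recover(checkpoint_state: dict, log_path: str,
            apply: Callable[[dict, dict], None]) -> dict:
    """Load the latest complete checkpoint, then replay the log records written after it."""
    state = dict(checkpoint_state)
    with open(log_path, "rb") as f:
        for line in f:
            apply(state, json.loads(line))
    return state
```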

2.7 Consistency Model 一致性模型

GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. We now discuss GFS’s guarantees and what they mean to applications. We also highlight how GFS maintains these guarantees but leave the details to other parts of the paper.
GFS采用寬松的一致性模型,既能很好地支持我們高度分布式的應用,又保持了相對簡單、易于高效實現(xiàn)的優(yōu)點。我們現(xiàn)在討論GFS提供的保證,以及它們對應用程序意味著什么。我們也會著重說明GFS如何維持這些保證,但把細節(jié)留到論文的其他部分。

2.7.1 Guarantees by GFS GFS提供的保證

The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the result. A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times. We describe below how our applications can distinguish defined regions from undefined regions. The applications do not need to further distinguish between different kinds of undefined regions.
數(shù)據(jù)變動后文件范圍的狀態(tài)取決于變動的類型、是否成功,以及是否存在并發(fā)變動。表格1匯總了結(jié)果。如果所有的客戶端不管從哪個復本讀取,都始終看見相同的數(shù)據(jù),則這個文件范圍是一致的。在一次文件數(shù)據(jù)變動之后,如果它是一致的,并且客戶端能完整地看到該變動所寫入的內(nèi)容,則這個文件范圍是已定義的。當一個變動成功且沒有受到并發(fā)寫入者的干擾時,被影響的范圍是已定義的(隱含著一致性):所有的客戶端始終都能看見該變動寫入的數(shù)據(jù)。并發(fā)的成功變動使范圍保持一致但未定義:所有的客戶端看見相同的數(shù)據(jù),但其內(nèi)容可能并不反映任何單個變動所寫入的數(shù)據(jù),通常由來自多個變動的混合片段組成。一個失敗的變動使范圍變成不一致(因此也是未定義的):不同的客戶端在不同時間可能看見不同的數(shù)據(jù)。我們將在下文描述應用程序如何區(qū)分已定義和未定義的范圍;應用無需進一步區(qū)分不同種類的未定義范圍。
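Table 1 itself is not reproduced in this scrape of the translation; from the original paper, it summarizes the file region state after a mutation as follows (表1:變動后的文件范圍狀態(tài)):

```
                          Write 寫                      Record Append 記錄追加
Serial success            defined                       defined interspersed with inconsistent
(串行成功)                已定義                        已定義,但夾雜著不一致
Concurrent successes      consistent but undefined      defined interspersed with inconsistent
(并發(fā)成功)                一致但未定義                  已定義,但夾雜著不一致
Failure (失敗)            inconsistent 不一致           inconsistent 不一致
```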
Data mutations may be writes or record appends. A write causes data to be written at an application-specified file offset. A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS’s choosing (Section 3.3). (In contrast, a “regular” append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined region that contains the record. In addition, GFS may insert padding or record duplicates in between. They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.
數(shù)據(jù)變動可以是寫入或記錄追加。寫入會把數(shù)據(jù)寫到應用指定的文件偏移位置。記錄追加則保證數(shù)據(jù)(即“記錄”)即使在存在并發(fā)變動的情況下,也至少被原子性地追加一次,但偏移位置由GFS選擇(3.3節(jié))。(相比之下,“常規(guī)”追加只是在客戶端認為是文件當前末尾的偏移處進行的一次寫入。)該偏移量會返回給客戶端,并標記出包含這條記錄的已定義范圍的起始處。此外,GFS可能會在這些記錄之間插入填充數(shù)據(jù)或重復的記錄,它們占據(jù)的范圍被認為是不一致的,但其總量通常遠小于用戶數(shù)據(jù)。
After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5). Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. They are garbage collected at the earliest opportunity.
在一系列成功的變動之后,被變動的文件范圍保證是已定義的,并且包含最后一次變動所寫入的數(shù)據(jù)。GFS通過以下方法做到這一點:(a)對一個塊的各個變動,在它所有的復本上按相同的順序應用(3.1節(jié));(b)使用塊版本號來檢測復本是否因為其所在塊服務器宕機而錯過了變動,從而變得失效(4.5節(jié))。失效的復本不會再參與變動,客戶端向Master詢問塊位置時也不會得到它,它們會盡早被垃圾回收。
Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.
由于客戶端緩存了塊的位置,在信息刷新之前,它們可能會從一個失效的復本讀取數(shù)據(jù)。這個時間窗口受緩存項的超時時間和文件下一次打開的限制,文件再次打開時會清除緩存中該文件所有塊的信息。而且,由于我們的大多數(shù)文件是只追加的,失效的復本通常返回的是一個提前到來的塊結(jié)尾,而不是過期的數(shù)據(jù)。當讀取者重試并聯(lián)系Master時,它會立即得到當前的塊位置。
Long after a successful mutation, component failures can of course still corrupt or destroy data. GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming (Section 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. Even in this case, it becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.
在一次成功的變動之后很久,組件故障當然仍可能損壞或銷毀數(shù)據(jù)。GFS通過Master與所有塊服務器之間的定期握手來識別失效的塊服務器,并通過校驗和來檢測數(shù)據(jù)損壞(5.2節(jié))。一旦問題浮現(xiàn),數(shù)據(jù)會盡快從有效的復本中恢復(4.3節(jié))。只有當一個塊的所有復本在GFS作出反應(通常在幾分鐘之內(nèi))之前全部丟失,這個塊才會不可逆地丟失。即使在這種情況下,塊也只是變得不可用,而不是被損壞:應用收到的是明確的錯誤,而不是損壞的數(shù)據(jù)。

2.7.2 Implications for Applications 對應用的影響

GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
GFS的應用只需使用一些本來就因其他目的而需要的簡單技巧,就可以適應這種寬松的一致性模型:依賴追加而非覆寫、設置檢查點,以及寫入可自我驗證、自我標識的記錄。
Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. Regardless of consistency and concurrency issues, this approach has served us well. Appending is far more efficient and more resilient to application failures than random writes. Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application’s perspective.
實際上,我們所有的應用幾乎都是通過追加而非覆寫來變動文件。在一種典型用法中,寫入者從頭到尾生成一個文件:它在寫完所有數(shù)據(jù)后原子地把文件重命名為一個永久名稱,或者周期性地為已成功寫入的數(shù)據(jù)量建立檢查點。檢查點中也可以包含應用級的校驗和。讀取者只驗證和處理最近一個檢查點之前的文件范圍,這部分已知處于已定義狀態(tài)。無論一致性和并發(fā)問題如何,這種方式一直很好地服務于我們。追加比隨機寫高效得多,對應用失效也更有彈性。檢查點允許寫入者以增量方式重新開始,并避免讀取者處理那些雖已成功寫入、但從應用角度看仍不完整的文件數(shù)據(jù)。
In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append’s append-at-least-once semantics preserves each writer’s output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.
在另一種典型用法中,許多寫入者并發(fā)地向同一個文件追加數(shù)據(jù),用于合并結(jié)果或作為生產(chǎn)者-消費者隊列。記錄追加的“至少追加一次”語義保證了每個寫入者的輸出。讀取者按照下面的方法處理偶爾出現(xiàn)的填充數(shù)據(jù)和重復記錄:寫入者準備的每條記錄都包含諸如校驗和這樣的額外信息,因此記錄的有效性可以被驗證;讀取者可以利用校驗和識別并丟棄多余的填充數(shù)據(jù)和記錄片段。如果它不能容忍偶爾的重復(例如,如果這些重復會觸發(fā)非冪等操作),則可以利用記錄中的唯一標識符把它們過濾掉;這些標識符通常本來就需要用來命名相應的應用實體,例如Web文檔。這些記錄I/O功能(除去重復消除)都包含在我們各應用共享的庫代碼中,并且適用于Google內(nèi)部其他文件接口的實現(xiàn)。借助這些功能,相同的記錄序列(外加極少的重復)總是會被交付給記錄的讀取者。
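A sketch of the reader-side technique just described: each record carries a checksum (self-validating) and a unique identifier (self-identifying), so a reader can skip padding and corrupt fragments and drop duplicates left by at-least-once record append. The length/CRC/id framing is an assumed record format chosen for illustration, not the actual library's format.

```python
import struct, zlib
from typing import Iterator, Tuple

HEADER = struct.Struct("<IIQ")  # payload length, crc32 of payload, 64-bit record id

def pack_record(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload), record_id) + payload

def read_records(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (record_id, payload); skip padding/garbage and filter duplicate ids."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc, record_id = HEADER.unpack_from(data, pos)
        end = pos + HEADER.size + length
        payload = data[pos + HEADER.size:end]
        if end <= len(data) and zlib.crc32(payload) == crc:
            if record_id not in seen:      # drop the occasional duplicate
                seen.add(record_id)
                yield record_id, payload
            pos = end                      # advance past a valid record
        else:
            pos += 1                       # padding or a torn fragment: resynchronize
```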
