What is Fault Tolerance?
The single greatest challenge in designing any large system is coping with frequent component failures. Because commodity hardware quality is not always up to standard, handling component failures can be harder than originally anticipated: we cannot completely trust the machines that run the system, nor the disks attached to them. The worst outcomes of such failures are an unavailable system or corrupted data.
Because a GFS cluster contains so many components, some of them must be assumed to be unavailable at any given time. Given this constant risk of failure, the system combats the problem with two strategies: fast recovery and replication.
How it works
Both the master and the chunkservers are designed to restore their state and start in seconds, no matter how they terminated; in fact, servers are routinely shut down simply by killing the process. Clients and other servers experience only a minor hiccup as they time out on their outstanding requests: they reconnect to the restarted server and retry.
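This timeout-and-retry behavior can be sketched as follows; the function and parameter names are illustrative, not part of any actual GFS client library:

```python
import time

def call_with_retry(send_request, max_attempts=3, base_delay=0.1):
    """Resend a request that timed out while the server was restarting.

    `send_request` is a hypothetical callable that issues one RPC and
    raises TimeoutError if the server does not respond in time.
    """
    for attempt in range(max_attempts):
        try:
            return send_request()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # server still unavailable after all retries
            # brief pause: fast recovery means the server should be
            # back within seconds, so short backoffs suffice
            time.sleep(base_delay * (2 ** attempt))
```

The short backoff reflects the design assumption above: because servers restore their state in seconds, a client that simply waits briefly and retries will usually succeed on the next attempt.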
Each chunk is replicated on multiple chunkservers on different racks. Users can specify different replication levels for different parts of the file namespace (the default is three). As chunkservers go offline, or as replicas are detected to be corrupted, the master clones existing replicas as needed to keep each chunk fully replicated.
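The master's re-replication decision reduces to comparing live replica counts against the target level; the GFS paper notes that chunks furthest from their replication goal are prioritized. A minimal sketch of that bookkeeping, with assumed data shapes (a map from chunk id to its list of live replica servers, and a map of per-chunk targets):

```python
def chunks_to_reclone(replica_map, desired_level, default_level=3):
    """Return chunk ids whose live replica count is below target,
    most under-replicated first.

    replica_map:   {chunk_id: [chunkserver, ...]}  (live replicas)
    desired_level: {chunk_id: int}  (per-namespace replication targets)
    """
    deficits = []
    for chunk_id, live_replicas in replica_map.items():
        missing = desired_level.get(chunk_id, default_level) - len(live_replicas)
        if missing > 0:
            deficits.append((missing, chunk_id))
    # sort by deficit, largest first, so the riskiest chunks clone first
    deficits.sort(reverse=True)
    return [chunk_id for _, chunk_id in deficits]
```

For each chunk returned, the master would instruct a chunkserver holding a valid replica to copy it to another server, restoring the chunk to full replication.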
The master's state is also replicated for reliability: its operation log and checkpoints are replicated on multiple machines. For simplicity, one master process remains in charge of all mutations to this state, as well as the background activities that change the system internally; when it fails, it can restart almost instantly. "Shadow" masters provide read-only access to the file system even when the primary master is down. They enhance read availability for files that are not being actively mutated, and for applications that do not mind getting slightly stale results. These shadow masters depend on the primary master only for replica location updates resulting from the primary's decisions to create and delete replicas.
In GFS, each chunkserver uses checksumming to detect corruption of stored data. Given the sheer amount of data stored on GFS, failures are commonplace: disks regularly fail in ways that corrupt or destroy data on both the read and write paths. The data can be recovered from other chunk replicas, but replicas are not guaranteed to be byte-identical, so each chunkserver must independently verify the integrity of its own copy by maintaining checksums.
A chunk is broken up into 64 KB blocks, each with a corresponding 32-bit checksum. For reads, the chunkserver verifies the checksums of the data blocks that overlap the read range before returning any data to the requester, whether it is a client or another chunkserver; chunkservers therefore never propagate corruption to other machines. If a write overwrites an existing range of a chunk, the first and last blocks of the range being overwritten are read and verified first; then the write is performed, and the new checksums are computed and recorded.
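The read-path verification can be sketched concretely. This is a simplified illustration, not GFS's implementation: it uses CRC32 as the 32-bit checksum (the paper does not specify the algorithm) and verifies only the blocks that overlap the requested range:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # each 64 KB block is checksummed independently

def block_checksums(chunk_data):
    """Compute a 32-bit checksum (CRC32 here, as an assumption) per block."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify_read(chunk_data, checksums, offset, length):
    """Verify only the blocks overlapping [offset, offset+length) before
    returning data, so corruption is never propagated to the requester."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            # in GFS, the chunkserver would report the error and the
            # client would read from another replica
            raise IOError(f"checksum mismatch in block {b}")
    return chunk_data[offset:offset + length]
```

Because only overlapping blocks are checked, small reads touch at most a block or two of checksum work, keeping verification cheap relative to the I/O itself.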
Diagnostic tools such as logging are crucial for identifying problems throughout the system, and they benefit the system at minimal cost. These logs record many significant events as well as all RPC requests and replies. Conveniently, logs can be deleted without affecting the correctness of the system. The RPC logs include the requests and responses sent on the wire, excluding the file data being read or written, so by matching requests with replies, the entire interaction history can be reconstructed.
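The request/reply matching step can be sketched as follows; the log-entry format is assumed for illustration, with each record carrying an RPC id that pairs a request with its reply:

```python
def reconstruct_interactions(log_entries):
    """Pair request and reply records sharing an RPC id.

    Each entry is a tuple (rpc_id, kind, payload), where kind is
    'request' or 'reply' -- an assumed format, not GFS's actual one.
    Returns (rpc_id, request_payload, reply_payload) triples in the
    order the replies appear, i.e. the interaction history.
    """
    pending = {}
    interactions = []
    for rpc_id, kind, payload in log_entries:
        if kind == "request":
            pending[rpc_id] = payload
        elif kind == "reply" and rpc_id in pending:
            interactions.append((rpc_id, pending.pop(rpc_id), payload))
    return interactions
```

Because the logs omit bulk file data, this reconstruction stays cheap even for large transfers, which is part of why the logging cost is minimal.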