Understanding the Hadoop Distributed File System (HDFS)

“Write once, read many times, but never change the format of the data.”

The Hadoop Distributed File System (HDFS) is a versatile, resilient, clustered approach to managing files in a big data environment. HDFS is not the final destination for files. Rather, it is a data service that offers a unique set of capabilities needed when data volumes and velocity are high. Because data is written once and then read many times thereafter, rather than subjected to the constant read-writes of other file systems, HDFS is an excellent choice for supporting big data analysis. The service consists of a “NameNode” and multiple “data nodes” running on a commodity hardware cluster, and it provides the highest levels of performance when the entire cluster is in the same physical rack in the data center. HDFS stores large files, typically in the range of gigabytes to terabytes, across the machines of a commodity cluster. It is a scalable, portable file system written in Java for the Hadoop framework. In essence, the NameNode keeps track of where data is physically stored. Figure 9-1 depicts the basic architecture of HDFS.
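
Because HDFS exposes this write-once, read-many model through an ordinary file API, a short sketch can make it concrete. The following Java fragment uses the standard org.apache.hadoop.fs client classes; the NameNode address and file path are illustrative assumptions, not values from any particular cluster.

```java
// Minimal sketch: writing a file once and reading it back through the
// HDFS Java client API (org.apache.hadoop.fs).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this normally
        // comes from core-site.xml on the classpath.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/events.log");

        // Write once: an HDFS file is written sequentially and then closed.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("first and only write of this data");
        }

        // Read many times: subsequent access is read-only streaming.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```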

[Figure 9-1: The basic architecture of HDFS]

The components shown in the HDFS architecture above are described below:

  • NameNode

HDFS works by breaking large files into smaller pieces called blocks. The blocks are stored on data nodes, and it is the responsibility of the NameNode to know which blocks, on which data nodes, make up the complete file. The NameNode also acts as a “traffic cop,” managing all access to the files, including reads, writes, creates, deletes, and replication of data blocks on the data nodes. The complete collection of all the files in the cluster is sometimes referred to as the file system namespace, and it is the NameNode’s job to manage this namespace. Even though a strong relationship exists between the NameNode and the data nodes, they operate in a “loosely coupled” fashion. This allows the cluster elements to behave dynamically, adding (or subtracting) servers as demand increases (or decreases).
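
A client can see this block-to-data-node mapping for itself by asking the NameNode through the public FileSystem API. The sketch below is one way to do it, assuming a reachable cluster and a hypothetical file path; only namespace metadata is transferred, no file data.

```java
// Minimal sketch: asking the NameNode which data nodes hold the blocks
// of a file, via FileSystem.getFileBlockLocations().
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/events.log"));

        // The NameNode answers from its namespace metadata; the blocks
        // themselves are never read here.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```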

In a typical configuration, you find one NameNode and possibly a data node running on one physical server in the rack. Other servers run data nodes only. Data nodes are not very smart, but the NameNode is. The data nodes constantly ask the NameNode whether there is anything for them to do. This continuous behavior also tells the NameNode which data nodes are out there and how busy they are. The data nodes also communicate among themselves so that they can cooperate during normal file system operations. This is necessary because blocks for one file are likely to be stored on multiple data nodes. Since the NameNode is so critical for correct operation of the cluster, it can and should be replicated to guard against a single point of failure.
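
One way to observe the NameNode’s view of “what data nodes are out there” is through the DistributedFileSystem class, which can report the data nodes the NameNode currently knows about. This is a sketch under the assumption that the client’s default file system is an HDFS cluster.

```java
// Minimal sketch: asking the NameNode for the data nodes it currently
// tracks (the liveness picture it builds from their regular check-ins).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.printf("%s used=%d remaining=%d%n",
                        node.getHostName(), node.getDfsUsed(),
                        node.getRemaining());
            }
        }
    }
}
```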


  • Data nodes

Data nodes are not smart, but they are resilient. Within the HDFS cluster, data blocks are replicated across multiple data nodes, and access is managed by the NameNode. The replication mechanism is designed for optimal efficiency when all the nodes of the cluster are collected into a rack. In fact, the NameNode uses a “rack ID” to keep track of the data nodes in the cluster, which is why HDFS clusters are sometimes referred to as being “rack-aware.” Data nodes also provide “heartbeat” messages to detect and ensure connectivity between the NameNode and the data nodes. When a heartbeat is no longer present, the NameNode unmaps the data node from the cluster and keeps on operating as though nothing happened. When the heartbeat returns (or a new heartbeat appears), the data node is added back to the cluster transparently with respect to the user or application. As with all file systems, data integrity is a key feature, and HDFS supports a number of capabilities designed to provide it. As one might expect, when files are broken into blocks and then distributed across different servers in the cluster, any variation in the operation of any element could affect data integrity.
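
Because replication is managed per file by the NameNode, the degree of replication can be inspected and changed through the same client API. Below is a hedged sketch with a hypothetical path and an assumed target factor of 3; the NameNode then schedules the extra (or fewer) block copies across data nodes in the background.

```java
// Minimal sketch: inspecting and adjusting a single file's replication
// factor. Re-replication happens asynchronously on the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log");

        // How many copies of each block does the NameNode keep now?
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("current replication: " + current);

        // Request three copies of every block of this file.
        fs.setReplication(file, (short) 3);
    }
}
```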

HDFS uses transaction logs and checksum validation to ensure integrity across the cluster. Transaction logs are a very common practice in file system and database design; they keep track of every operation and are effective for auditing or rebuilding the file system should something untoward occur. Checksum validations are used to guarantee the contents of files in HDFS. When a client requests a file, it can verify the contents by examining the file’s checksum. If the checksum matches, the file operation can continue; if not, an error is reported. Checksum files are hidden to help avoid tampering. Data nodes use local disks in the commodity server for persistence. All the data blocks are stored locally, primarily for performance reasons. Data blocks are replicated across several data nodes, so the failure of one server may not necessarily corrupt a file. The degree of replication, the number of data nodes, and the HDFS namespace are established when the cluster is implemented. Because HDFS is dynamic, all of these parameters can be adjusted during the operation of the cluster.
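
HDFS also exposes its stored checksums to clients, which makes it possible to compare two files without streaming their contents. A small sketch follows, assuming two hypothetical paths; on file systems that do not keep checksums, the call can return null.

```java
// Minimal sketch: fetching HDFS's stored checksum for a file and using
// it to compare two files without reading their data.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifyChecksum {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        FileChecksum original = fs.getFileChecksum(new Path("/user/demo/events.log"));
        FileChecksum copy = fs.getFileChecksum(new Path("/backup/events.log"));

        if (original != null && original.equals(copy)) {
            System.out.println("checksums match: " + original.getAlgorithmName());
        } else {
            System.out.println("checksum mismatch or unavailable");
        }
    }
}
```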


[Figure: The Hadoop Distributed File System (HDFS) architecture]

Conclusion:

By the end of 2015, Cisco estimated that global internet traffic had reached 4.8 billion terabytes, a figure that signals both the Big Data challenge and the Big Data opportunity ahead. At Facebook, there is no user guide for how any of this works, because no website has ever handled this many visitors before. When you have more users than there are cars in the world, one of the biggest problems is storage. The storage in your laptop could fit in your hand, but here something much bigger is needed. This is where the concept of data centers comes into the picture: information is stored on cutting-edge servers and massive memory banks, with data flying between them at nearly the speed of light.

When we type facebook.com, our request goes out over the open internet and lands in one of these data centers, where one of Facebook’s servers returns your profile and all the information associated with it. The data center compiles that information and sends it back over the open internet, and all of this happens in milliseconds. Some people picture the internet as a cloud floating in the sky, but it is not; it is a real, physical thing. The internet is real physical buildings interconnected by miles and miles of fiber, and all of these buildings can talk to each other and share data back and forth. These buildings are going to multiply day after day, so we should get ready to face the new technologies emerging in the near future.

