Hadoop and HDFS interview questions and answers



What is Big Data?
    Big Data is a term that describes data (structured, semi-structured and unstructured) so huge and complex that it becomes very tedious to capture, store, process, retrieve and analyze it using traditional database and software techniques.

What are the 5 V's (characteristics) of Big Data?
Big Data has five main characteristics:
· Volume – the amount of data generated by organizations or individuals.
· Velocity – the frequency at which data is generated, captured and shared.
· Variety – the different types of data, i.e. structured, semi-structured and unstructured data such as text, video, audio, sensor data, log files etc.
· Veracity – the messiness or trustworthiness of the data.
· Value – our ability to turn data into value.

What is Hadoop?
Hadoop is a software framework or platform that allows for distributed storage and distributed processing of very large data sets on clusters of computers.
Hadoop = HDFS + MapReduce

What are the features of Hadoop?
Fault Tolerance – By default, 3 replicas of each block are stored across the cluster, and this number can be changed as per requirement. So if any node goes down, its data can easily be recovered from other nodes. Failures of nodes or tasks are recovered automatically by the framework.
Reliability – Due to the replication of data in the cluster, data is stored reliably on the cluster of machines despite machine failures. If your machine goes down, your data will still be stored reliably.
High Availability – High availability (HA) refers to the capability of a Hadoop system to continue functioning regardless of multiple system failures. If a machine or a piece of hardware crashes, the data can be accessed via another path.
Scalability – Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. New hardware can easily be added to the nodes, and Hadoop provides horizontal scalability, meaning new nodes can be added on the fly without any downtime.
Economic – Hadoop is not very expensive, as it runs on a cluster of commodity hardware; no specialized machines are needed. It also provides huge cost savings, since it is easy to add more nodes on the fly: if requirements increase, you can add nodes without downtime and without much pre-planning.
Data Locality – Hadoop works on the data locality principle, which states: move the computation to the data instead of the data to the computation. When a client submits an algorithm, the algorithm is moved to the data in the cluster, rather than the data being brought to the location where the algorithm was submitted and processed there.
Flexibility – Hadoop manages data whether structured, semi-structured or unstructured, encoded or formatted, or any other type of data.

What is the basic difference between traditional RDBMS and Hadoop?
    Hadoop Core doesn't support real-time data processing (OLTP); it is designed for large-scale batch-processing workloads (OLAP). An RDBMS, by contrast, is designed for OLTP (real-time data processing), not batch processing.
    Hadoop is an approach for storing huge amounts of data in a distributed file system and processing it, whereas an RDBMS is used in transactional systems to report and archive data.
    The Hadoop framework works well with both structured and unstructured data, and it supports a variety of data formats such as XML, JSON and text-based flat files. An RDBMS, however, only works well when an entity-relationship model (ER model) is defined precisely, i.e. it works well with structured data; otherwise the database schema or structure can grow unmanaged.

What is the difference between Hadoop 1 and Hadoop 2?
· High availability – In Hadoop 1.x, the “Namenode” is a Single Point of Failure (SPOF) because there is only one Namenode. In Hadoop 2.x, there are Active and Passive (standby) “Namenodes”; if the active Namenode fails, the passive “Namenode” takes charge, so High Availability can be achieved.
· Processing models – Hadoop 1.x supports the MapReduce (MR) processing model only. Hadoop 2.x allows working in MR as well as other distributed computing models like Spark, HBase coprocessors etc.
· Resource management – In Hadoop 1.x, MR does both processing and cluster resource management. In Hadoop 2.x, YARN (Yet Another Resource Negotiator) does the cluster resource management, and processing is done using different processing models.
· Namespaces – Hadoop 1.x has a single Namenode to manage the entire namespace. In Hadoop 2.x, multiple Namenode servers can manage multiple namespaces (HDFS Federation).

What are the core components of Hadoop?
    Core components of Hadoop are HDFS and MapReduce.
 HDFS is basically used to store large datasets.
 MapReduce is used to process such large datasets. 

What is HDFS?
    HDFS (Hadoop Distributed File System) is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters and is designed to be highly fault-tolerant, with high throughput.

How do you define “block” in HDFS? What is the block size in Hadoop 1 and Hadoop 2? Can it be changed?
   A “block” is the minimum amount of data that can be read or written. It is a storage entity of HDFS. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.
In Hadoop 1, default block size is 64MB
In Hadoop 2, default block size is 128MB
    Yes, the block size can be changed. The dfs.blocksize parameter (dfs.block.size in older releases) can be set in the hdfs-site.xml file to configure the size of a block in a Hadoop environment.
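A minimal hdfs-site.xml entry setting the default block size to 256 MB might look like the following (the value is in bytes; 256 MB is only an illustrative choice):

    <property>
      <name>dfs.blocksize</name>
      <value>268435456</value>  <!-- 256 MB = 256 * 1024 * 1024 bytes -->
    </property>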

What is Block Scanner in HDFS?
The block scanner maintains the integrity of the data blocks. It runs periodically on every Datanode to verify whether the data blocks stored there are correct or not. When a corrupted block is found:
  1. The Datanode reports the corrupted block to the Namenode.
  2. The Namenode schedules the creation of new replicas using the good replicas.
  3. Once the replication factor (the number of uncorrupted replicas) reaches the required level, the corrupted blocks are deleted.
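Block health can also be inspected manually with the hdfs fsck utility; a typical invocation (the path is a placeholder) is:

    hdfs fsck /user/data -files -blocks -locations
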
How do you copy a file into HDFS with a block size different from the existing block size configuration?
The block size can be overridden for a single copy by passing the desired value on the command line with the generic -D option, which takes precedence over the configured dfs.blocksize for the files written by that command.
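For example, the following writes the file with 256 MB blocks regardless of the cluster default (the file name and destination path are placeholders):

    hdfs dfs -D dfs.blocksize=268435456 -put localfile.txt /user/data/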


What is a Daemon?
Daemon is a process or service that runs in the background. In general, we use this word in UNIX environment. The equivalent of Daemon in Windows is “services”.

Various Hadoop Daemons and their roles in a Hadoop cluster.
Namenode: The Namenode is the master node, responsible for storing the metadata of all the files and directories. It has information about the blocks that make up a file and where those blocks are located in the cluster. The NameNode uses two files for the namespace:
     fsimage file – keeps track of the latest checkpoint of the namespace.
     edits file – a log of the changes that have been made to the namespace since the last checkpoint.
Datanode: The Datanode is the slave node that contains the actual data. It periodically reports information about the blocks it contains to the Namenode.
Secondary Namenode: The Secondary Namenode periodically merges the edits log into the fsimage so that the edits log doesn't grow too large in size. It also keeps a copy of the merged image, which can be used in case of Namenode failure.
In a YARN cluster, there are two types of hosts: the ResourceManager master and the NodeManager workers. The main daemons and roles are:
Resource Manager: The ResourceManager is the central authority that manages resources and schedules applications running on top of YARN. It is the master daemon that communicates with the client, tracks resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
Node Manager: A NodeManager runs on the slave machines and is responsible for launching the application's containers, monitoring their resource usage (i.e. CPU, memory, disk, network) and reporting this to the ResourceManager. It is a worker daemon that launches and tracks processes spawned on the worker hosts.
Application Master: The Application Master is responsible for the execution of a single application. It requests containers from the ResourceManager's scheduler and executes specific programs (e.g., the main of a Java class) on the obtained containers.
Job History Server: It maintains information about MapReduce jobs after the Application Master terminates.
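On a running cluster node, a quick way to check which daemons are up is the JDK's jps utility, which lists the Java processes on that host. The output below is purely illustrative:

    $ jps
    2451 NameNode
    2584 DataNode
    2892 ResourceManager
    3021 Jps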

What is a Rack?
A rack is a storage area where datanodes are put together: a physical collection of datanodes stored at a single location, typically sharing the same network switch. Different racks can be physically located at different places, and there can be multiple racks in a single location.

On what basis is data stored on a rack?
When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the Namenode, which returns a list of datanodes for every block, indicating where each block and its replicas should be stored. The key rule followed when placing replicas is: “for every block of data, two copies will exist in one rack, and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.


What if rack 2 and a datanode fail?
If both rack 2 and the datanode in rack 1 holding the remaining replica fail, there is no way to recover that data. To avoid such situations, the data needs to be replicated more times instead of only thrice. This is done by increasing the replication factor, which is set to 3 by default.
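The replication factor can be raised cluster-wide through the dfs.replication property in hdfs-site.xml, or per file from the command line; for example (the path and factor are illustrative):

    hdfs dfs -setrep -w 5 /user/data/important-file.txt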

What is Metadata?
Metadata is the information about the data stored in the datanodes, such as the location of the file, the size of the file and so on.

Are the Namenode and Resource Manager on the same host?
No. In a production environment, the Namenode runs on one host and the Resource Manager runs on a separate host.

Why do we use HDFS for applications having large data sets and not when there are a lot of small files?
HDFS is more suitable for a large amount of data in a single file than for the same data spread across multiple small files. This is because the “Namenode” is a very expensive, high-performance system that holds the metadata for every file and block in memory, and it is not prudent to occupy its memory with the unnecessary amounts of metadata generated by a multitude of small files. When a large amount of data is kept in a single file, the “Namenode” occupies less space. Hence, for optimized performance, HDFS favors large data sets over multiple small files.
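As a rough worked example, assuming the commonly cited figure of about 150 bytes of Namenode memory per namespace object (file, directory or block): 10 million files of 1 MB each (about 10 TB in total) create roughly 20 million objects (one file object plus one block object each), i.e. around 3 GB of Namenode memory, whereas the same 10 TB stored in large files of 128 MB blocks needs only on the order of 80,000 block objects, i.e. a few tens of megabytes of metadata.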

Explain what Speculative Execution is.
If a node appears to be running a task slower than expected, the master node can redundantly execute another instance of the same task on another node. The task which finishes first is accepted and the other one is killed. This process is called “speculative execution”.
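In Hadoop 2 (MRv2), speculative execution can be toggled per task type in mapred-site.xml; the properties below are shown at their usual defaults (a minimal sketch):

    <property>
      <name>mapreduce.map.speculative</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.reduce.speculative</name>
      <value>true</value>
    </property>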

Explain what a heartbeat is in HDFS.
A heartbeat is a signal sent periodically by a Datanode to the Namenode (and by a Task Tracker to the Job Tracker in Hadoop 1) to indicate that it is alive. If the Namenode or Job Tracker does not receive the heartbeat, it concludes that there is some issue with the Datanode or Task Tracker.
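The Datanode heartbeat interval is configurable in hdfs-site.xml; the property below is shown at its usual default of 3 seconds (a minimal sketch):

    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>  <!-- seconds between Datanode heartbeats -->
    </property>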

What are Compute and Storage nodes?
Compute Node: This is the computer or machine where your actual business logic is executed.
Storage Node: This is the computer or machine where the file system resides to store the data being processed.
In most cases, the compute node and the storage node are the same machine.

Explain the difference between NAS and HDFS?
NAS runs on a single machine, so there is no data redundancy, whereas HDFS runs on a cluster of different machines, where data redundancy is provided by the replication protocol.
NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines.
In NAS, data is stored independently of the computation, so Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce because in HDFS the computations are moved to the data.

Explain the process of inter-cluster data copying.
HDFS provides a distributed data copying facility through DistCp, which copies data from a source to a destination. When the copying takes place between two different Hadoop clusters, it is referred to as inter-cluster data copying. DistCp requires both the source and the destination to run the same or a compatible version of Hadoop.
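A typical inter-cluster invocation looks like the following (the namenode host names, port and paths are placeholders):

    hadoop distcp hdfs://namenode1:8020/source/path hdfs://namenode2:8020/dest/path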

What happens when two clients try to access the same file on the HDFS?
HDFS supports exclusive writes only. HDFS works on the principle of 'Write Once, Read Many'.
    When the first client contacts the “Namenode” to open the file for writing, the “Namenode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “Namenode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.

What is high availability in Hadoop?

In Hadoop 1.x, the single NameNode was a Single Point of Failure (SPOF). Hadoop 2.0 overcomes this shortcoming by providing support for multiple NameNodes: the Hadoop 2.0 High Availability feature brings an extra NameNode (a passive standby NameNode) into the Hadoop architecture, configured for automatic failover. The goal of the HA NameNode architecture is to support deploying two NameNodes in an active/passive configuration. The active NameNode performs all the client operations, including serving the read and write requests, while the standby NameNode maintains its state in order to ensure a fast failover in the event the active NameNode goes down.
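A minimal hdfs-site.xml sketch of an HA NameNode pair is shown below; the nameservice ID “mycluster” and the host names are placeholders, and a real deployment additionally needs shared edits storage (e.g. JournalNodes) and failover controllers:

    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>master1:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>master2:8020</value>
    </property>
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>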

Explain the indexing process in HDFS.
The indexing process in HDFS depends on the block size. Rather than maintaining a traditional index, HDFS stores the last part of the data, which points to the address where the next part of the data chunk is stored.

What is the use of SSH (Secure Shell or Secure Socket Shell) in Hadoop?
SSH provides secure access to a remote host and is a more secure alternative to rlogin and telnet. Hadoop uses SSH because its control scripts log in to the cluster nodes to start and stop the daemons, which is why password-less SSH is typically set up between the master and the slave nodes.

What are the modes Hadoop can run in?
Hadoop can run in three different modes-
Local/Standalone Mode
       i. This is the single-process mode of Hadoop, which is the default mode, in which no daemons are running.
       ii. This mode is useful for testing and debugging.
Pseudo Distributed Mode
        i. This mode is a simulation of fully distributed mode on a single machine: all the daemons of Hadoop run as separate processes on the same host.
        ii. This mode is useful for development.
Fully Distributed Mode
        i. This mode requires two or more systems as cluster.
        ii. Name Node, Data Node and all the processes run on different machines in the cluster.
        iii. This mode is useful for the production environment.
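For instance, moving from standalone to pseudo-distributed mode mainly means pointing fs.defaultFS at a local HDFS instance in core-site.xml (the port shown is the conventional one; treat this as a sketch):

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>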

If there are 10 HDFS blocks to be copied from one machine to another, but the other machine can copy only 7.5 blocks, is there a possibility for the blocks to be broken down during replication?
No, the blocks cannot be broken down; it is the responsibility of the master node to calculate the space required and allocate the blocks accordingly. The master node monitors the number of blocks that are in use and keeps track of the available space.
