Hadoop and HDFS interview questions and answers



What is Big Data?
    Big Data is a term that describes data (structured, semi-structured and unstructured) so huge and complex that it becomes very tedious to capture, store, process, retrieve and analyze it using traditional database and software techniques.

What are the 5 V's (characteristics) of Big Data?
Big Data has five main characteristics:
· Volume – the amount of data generated by organizations or individuals.
· Velocity – the frequency at which data is generated, captured and shared.
· Variety – the different types of data, i.e. structured, semi-structured and unstructured data such as text, video, audio, sensor data, log files etc.
· Veracity – the messiness or trustworthiness of the data.
· Value – our ability to turn data into value.

What is Hadoop?
Hadoop is a software framework or platform that allows for distributed storage and distributed processing of very large data sets on clusters of computers.
Hadoop = HDFS + MapReduce

What are the features of Hadoop?
Fault Tolerance – By default, 3 replicas of each block are stored across the cluster, and this number can be changed as per requirement. So if any node goes down, its data can easily be recovered from other nodes. Failures of nodes or tasks are recovered automatically by the framework.
Reliability – Due to the replication of data in the cluster, data is stored reliably on the cluster of machines despite machine failures. If your machine goes down, your data will still be stored reliably.
High Availability – High availability (HA) refers to the capability of a Hadoop system to continue functioning regardless of multiple system failures. If a machine or a piece of hardware crashes, the data can be accessed via another path.
Scalability – Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel. New hardware can easily be added to the nodes, and Hadoop provides horizontal scalability, meaning new nodes can be added on the fly without any downtime.
Economic – Hadoop is not very expensive, as it runs on a cluster of commodity hardware; no specialized machines are needed. It also provides huge cost savings, since it is easy to add more nodes on the fly: if requirements increase, you can add nodes without downtime and without much pre-planning.
Data Locality – Hadoop works on the data locality principle, which states: move the computation to the data instead of the data to the computation. When a client submits an algorithm, the algorithm is moved to the data in the cluster, rather than the data being brought to the location where the algorithm was submitted and processed there.
Flexibility – Hadoop manages data whether structured, semi-structured or unstructured, encoded or formatted, or any other type of data.

What is the basic difference between traditional RDBMS and Hadoop?
    Hadoop Core doesn't support real-time data processing (OLTP); it is designed for large-scale batch-processing workloads (OLAP). An RDBMS, by contrast, is designed for OLTP (real-time data processing), not batch processing.
    Hadoop is an approach for storing huge amounts of data in a distributed file system and processing it, whereas an RDBMS is used in transactional systems to report and archive data.
    The Hadoop framework works well with both structured and unstructured data, and it supports a variety of data formats such as XML, JSON and text-based flat files. An RDBMS, however, only works well when an entity-relationship model (ER model) is defined precisely, i.e. it works well with structured data; otherwise the database schema or structure can grow unmanaged.

What is the difference between Hadoop 1 and Hadoop 2?
· High availability – In Hadoop 1.x, the “Namenode” is a Single Point of Failure (SPOF) because there is only one Namenode. In Hadoop 2.x, there are Active and Passive (standby) “Namenodes”; if the active Namenode fails, the passive “Namenode” takes charge, so High Availability can be achieved.
· Processing models – Hadoop 1.x supports the MapReduce (MR) processing model only. Hadoop 2.x allows working in MR as well as other distributed computing models like Spark, HBase coprocessors etc.
· Resource management – In Hadoop 1.x, MR does both processing and cluster resource management. In Hadoop 2.x, YARN (Yet Another Resource Negotiator) does the cluster resource management, and processing is done using different processing models.
· Namespaces – Hadoop 1.x has a single Namenode to manage the entire namespace. In Hadoop 2.x, multiple Namenode servers can manage multiple namespaces (HDFS Federation).

What are the core components of Hadoop?
    Core components of Hadoop are HDFS and MapReduce.
 HDFS is basically used to store large datasets.
 MapReduce is used to process such large datasets. 

What is HDFS?
    HDFS (Hadoop Distributed File System) is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters and is designed to be highly fault-tolerant, with high throughput.

How do you define “block” in HDFS? What is the block size in Hadoop 1 and Hadoop 2? Can it be changed?
   A “block” is the minimum amount of data that can be read or written. It is a storage entity of HDFS. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.
In Hadoop 1, default block size is 64MB
In Hadoop 2, default block size is 128MB
    Yes, the block size can be changed. The dfs.blocksize parameter (dfs.block.size in older releases) can be set in the hdfs-site.xml file to configure the size of a block in a Hadoop environment.
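A minimal hdfs-site.xml entry setting the default block size to 256 MB might look like the following (the value is in bytes; 256 MB is only an illustrative choice):

    <property>
      <name>dfs.blocksize</name>
      <value>268435456</value>  <!-- 256 MB = 256 * 1024 * 1024 bytes -->
    </property>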

What is Block Scanner in HDFS?
The block scanner maintains the integrity of the data blocks. It runs periodically on every Datanode to verify whether the data blocks stored there are correct or not. When a corrupted block is found:
  1. The Datanode reports the corrupted block to the Namenode.
  2. The Namenode schedules the creation of new replicas using the good replicas.
  3. Once the replication factor (the number of uncorrupted replicas) reaches the required level, the corrupted blocks are deleted.
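Block health can also be inspected manually with the hdfs fsck utility; a typical invocation (the path is a placeholder) is:

    hdfs fsck /user/data -files -blocks -locations
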
How do you copy a file into HDFS with a block size different from the existing block size configuration?
The block size can be overridden for a single copy by passing the desired value on the command line with the generic -D option, which takes precedence over the configured dfs.blocksize for the files written by that command.
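For example, the following writes the file with 256 MB blocks regardless of the cluster default (the file name and destination path are placeholders):

    hdfs dfs -D dfs.blocksize=268435456 -put localfile.txt /user/data/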


What is a Daemon?
Daemon is a process or service that runs in the background. In general, we use this word in UNIX environment. The equivalent of Daemon in Windows is “services”.

Various Hadoop Daemons and their roles in a Hadoop cluster.
Namenode: The Namenode is the master node, responsible for storing the metadata of all the files and directories. It has information about the blocks that make up a file and where those blocks are located in the cluster. The NameNode uses two files for the namespace:
     fsimage file – keeps track of the latest checkpoint of the namespace.
     edits file – a log of the changes that have been made to the namespace since the last checkpoint.
Datanode: The Datanode is the slave node that contains the actual data. It periodically reports information about the blocks it contains to the Namenode.
Secondary Namenode: The Secondary Namenode periodically merges the edits log into the fsimage so that the edits log doesn't grow too large in size. It also keeps a copy of the merged image, which can be used in case of Namenode failure.
In a YARN cluster, there are two types of hosts: the ResourceManager master and the NodeManager workers. The main daemons and roles are:
Resource Manager: The ResourceManager is the central authority that manages resources and schedules applications running on top of YARN. It is the master daemon that communicates with the client, tracks resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
Node Manager: A NodeManager runs on the slave machines and is responsible for launching the application's containers, monitoring their resource usage (i.e. CPU, memory, disk, network) and reporting this to the ResourceManager. It is a worker daemon that launches and tracks processes spawned on the worker hosts.
Application Master: The Application Master is responsible for the execution of a single application. It requests containers from the ResourceManager's scheduler and executes specific programs (e.g., the main of a Java class) on the obtained containers.
Job History Server: It maintains information about MapReduce jobs after the Application Master terminates.
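On a running cluster node, a quick way to check which daemons are up is the JDK's jps utility, which lists the Java processes on that host. The output below is purely illustrative:

    $ jps
    2451 NameNode
    2584 DataNode
    2892 ResourceManager
    3021 Jps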

What is a Rack?
A rack is a storage area where datanodes are put together: a physical collection of datanodes stored at a single location, typically sharing the same network switch. Different racks can be physically located at different places, and there can be multiple racks in a single location.

On what basis is data stored on a rack?
When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the Namenode, which returns a list of datanodes for every block, indicating where each block and its replicas should be stored. The key rule followed when placing replicas is: “for every block of data, two copies will exist in one rack, and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.


What if rack 2 and a datanode fail?
If both rack 2 and the datanode in rack 1 holding the remaining replica fail, there is no way to recover that data. To avoid such situations, the data needs to be replicated more times instead of only thrice. This is done by increasing the replication factor, which is set to 3 by default.
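The replication factor can be raised cluster-wide through the dfs.replication property in hdfs-site.xml, or per file from the command line; for example (the path and factor are illustrative):

    hdfs dfs -setrep -w 5 /user/data/important-file.txt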

What is Metadata?
Metadata is the information about the data stored in the datanodes, such as the location of the file, the size of the file and so on.

Are the Namenode and Resource Manager on the same host?
No. In a production environment, the Namenode runs on one host and the Resource Manager runs on a separate host.

Why do we use HDFS for applications having large data sets and not when there are a lot of small files?
HDFS is more suitable for a large amount of data in a single file than for the same data spread across multiple small files. This is because the “Namenode” is a very expensive, high-performance system that holds the metadata for every file and block in memory, and it is not prudent to occupy its memory with the unnecessary amounts of metadata generated by a multitude of small files. When a large amount of data is kept in a single file, the “Namenode” occupies less space. Hence, for optimized performance, HDFS favors large data sets over multiple small files.
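As a rough worked example, assuming the commonly cited figure of about 150 bytes of Namenode memory per namespace object (file, directory or block): 10 million files of 1 MB each (about 10 TB in total) create roughly 20 million objects (one file object plus one block object each), i.e. around 3 GB of Namenode memory, whereas the same 10 TB stored in large files of 128 MB blocks needs only on the order of 80,000 block objects, i.e. a few tens of megabytes of metadata.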

Explain what Speculative Execution is.
If a node appears to be running a task slower than expected, the master node can redundantly execute another instance of the same task on another node. The task which finishes first is accepted and the other one is killed. This process is called “speculative execution”.
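In Hadoop 2 (MRv2), speculative execution can be toggled per task type in mapred-site.xml; the properties below are shown at their usual defaults (a minimal sketch):

    <property>
      <name>mapreduce.map.speculative</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.reduce.speculative</name>
      <value>true</value>
    </property>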

Explain what a heartbeat is in HDFS.
A heartbeat is a signal sent periodically by a Datanode to the Namenode (and by a Task Tracker to the Job Tracker in Hadoop 1) to indicate that it is alive. If the Namenode or Job Tracker does not receive the heartbeat, it concludes that there is some issue with the Datanode or Task Tracker.
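The Datanode heartbeat interval is configurable in hdfs-site.xml; the property below is shown at its usual default of 3 seconds (a minimal sketch):

    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>  <!-- seconds between Datanode heartbeats -->
    </property>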

What are Compute and Storage nodes?
Compute Node: This is the computer or machine where your actual business logic is executed.
Storage Node: This is the computer or machine where the file system resides to store the data being processed.
In most cases, the compute node and the storage node are the same machine.

Explain the difference between NAS and HDFS?
NAS runs on a single machine, so there is no data redundancy, whereas HDFS runs on a cluster of different machines, where data redundancy is provided by the replication protocol.
NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines.
In NAS, data is stored independently of the computation, so Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce because in HDFS the computations are moved to the data.

Explain the process of inter-cluster data copying.
HDFS provides a distributed data copying facility through DistCp, which copies data from a source to a destination. When the copying takes place between two different Hadoop clusters, it is referred to as inter-cluster data copying. DistCp requires both the source and the destination to run the same or a compatible version of Hadoop.
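A typical inter-cluster invocation looks like the following (the namenode host names, port and paths are placeholders):

    hadoop distcp hdfs://namenode1:8020/source/path hdfs://namenode2:8020/dest/path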

What happens when two clients try to access the same file on the HDFS?
HDFS supports exclusive writes only. HDFS works on the principle of 'Write Once, Read Many'.
    When the first client contacts the “Namenode” to open the file for writing, the “Namenode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “Namenode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.

What is high availability in Hadoop?

In Hadoop 1.x, the single NameNode was a Single Point of Failure (SPOF). Hadoop 2.0 overcomes this shortcoming by providing support for multiple NameNodes: the Hadoop 2.0 High Availability feature brings an extra NameNode (a passive standby NameNode) into the Hadoop architecture, configured for automatic failover. The goal of the HA NameNode architecture is to support deploying two NameNodes in an active/passive configuration. The active NameNode performs all the client operations, including serving the read and write requests, while the standby NameNode maintains its state in order to ensure a fast failover in the event the active NameNode goes down.
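A minimal hdfs-site.xml sketch of an HA NameNode pair is shown below; the nameservice ID “mycluster” and the host names are placeholders, and a real deployment additionally needs shared edits storage (e.g. JournalNodes) and failover controllers:

    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>master1:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>master2:8020</value>
    </property>
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>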

Explain the indexing process in HDFS.
The indexing process in HDFS depends on the block size. Rather than maintaining a traditional index, HDFS stores the last part of the data, which points to the address where the next part of the data chunk is stored.

What is the use of SSH (Secure Shell or Secure Socket Shell) in Hadoop?
SSH provides secure access to a remote host and is a more secure alternative to rlogin and telnet. Hadoop uses SSH because its control scripts log in to the cluster nodes to start and stop the daemons, which is why password-less SSH is typically set up between the master and the slave nodes.

What are the modes Hadoop can run in?
Hadoop can run in three different modes-
Local/Standalone Mode
       i. This is the single-process mode of Hadoop, which is the default mode, in which no daemons are running.
       ii. This mode is useful for testing and debugging.
Pseudo Distributed Mode
        i. This mode is a simulation of fully distributed mode on a single machine: all the daemons of Hadoop run as separate processes on the same host.
        ii. This mode is useful for development.
Fully Distributed Mode
        i. This mode requires two or more systems as cluster.
        ii. Name Node, Data Node and all the processes run on different machines in the cluster.
        iii. This mode is useful for the production environment.
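For instance, moving from standalone to pseudo-distributed mode mainly means pointing fs.defaultFS at a local HDFS instance in core-site.xml (the port shown is the conventional one; treat this as a sketch):

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>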

If there are 10 HDFS blocks to be copied from one machine to another, but the other machine can copy only 7.5 blocks, is there a possibility for the blocks to be broken down during replication?
No, the blocks cannot be broken down; it is the responsibility of the master node to calculate the space required and allocate the blocks accordingly. The master node monitors the number of blocks that are in use and keeps track of the available space.
