Spark Interview Questions and Answers

What is Spark?
Spark is a cluster computing framework designed to be fast and general purpose.
Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming.
Spark is designed to be highly accessible, offering simple APIs in Scala, Python, Java, and SQL, and ships with rich built-in libraries.
Spark can run on Hadoop clusters and can access diverse data sources including HDFS, HBase, Cassandra, MongoDB, and others.

Explain key features of Spark.
  • Spark integrates with Hadoop and can work with files stored in HDFS.
  • Spark provides an interactive language shell, since it ships with an interpreter for Scala (the language in which Spark is written).
  • Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across the computing nodes in a cluster.
  • Spark supports multiple analytics tools for interactive query analysis, real-time analysis, and graph processing.

Difference between MapReduce and Spark.

| Property                      | MapReduce        | Spark                         |
|-------------------------------|------------------|-------------------------------|
| Data storage (caching)        | Hard disk        | In-memory                     |
| Processing speed              | Good             | Excellent (up to 100x faster) |
| Interactive job performance   | Average          | Excellent                     |
| Hadoop independence           | No               | Yes                           |
| Machine learning applications | Average          | Excellent                     |
| Usage                         | Batch processing | Real-time processing          |
| Written in                    | Java             | Scala                         |

What is Spark Core?
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, and interacting with storage systems.
Spark Core is also home to the API that defines resilient distributed datasets (RDDs), Spark's main programming abstraction.

Spark stack
[Diagram: the Spark stack, with Spark Core at the base; Spark SQL, Spark Streaming, MLlib, and GraphX layered on top; and a cluster manager (Standalone, Mesos, or YARN) underneath.]
Cluster Managers in Spark.
Spark depends on a cluster manager to launch executors and, in certain cases, to launch the driver.
The Spark framework supports three major types of cluster managers:
     • Standalone Scheduler: a basic cluster manager for setting up a cluster
     • Mesos: a general-purpose cluster manager that can also run Hadoop MapReduce and other applications
     • YARN: the component responsible for resource management in Hadoop

Core Spark Concepts
Every Spark application consists of a driver program that launches various parallel operations on a cluster.
The driver program contains your application's main function, defines distributed datasets on the cluster, and then applies operations to them.
Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.
To run operations, the driver program typically manages a number of nodes called executors.
Spark connects to the cluster to analyze data in parallel.

What does a Spark Engine do?
The Spark engine is responsible for scheduling, distributing, and monitoring the data application across the cluster.

What is SparkContext?
A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
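
A minimal sketch of creating one (the app name and master URL are illustrative values):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("InterviewPrep")   // name shown in the cluster UI
      .setMaster("local[2]")         // run locally with 2 worker threads
    val sc = new SparkContext(conf)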

What is RDD (Resilient distributed dataset)?
An RDD in Spark is an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.  

How to create RDDs?
Spark provides two ways to create RDDs:
     • Loading an external dataset
     • Parallelizing a collection in a driver program
One way to create an RDD is to load data from external storage:
            e.g. val lines = sc.textFile("/path/to/README.md")

Another way is to take an existing collection in your program and pass it to SparkContext's parallelize() method:
            e.g. val lines = sc.parallelize(List("Spark", "It is very fast"))

What are RDD operations?
RDDs support two types of operations:
     • Transformations: construct a new RDD from a previous one.
     • Actions: compute a result based on an RDD, returning it to the driver program or saving it to external storage.

What are Transformation operators?
Transformations are operations on RDDs that return a new RDD. Transformations are lazily evaluated, meaning Spark will not begin to execute until it sees an action. map() and filter() are examples of transformations: map() applies the function passed to it to each element of the RDD, producing a new RDD, while filter() returns a new RDD containing only the elements that pass the given predicate.
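
A small sketch (the file path is illustrative) showing that no work happens until the action at the end:

    // Both of these are transformations: they only build up the lineage.
    val lines      = sc.textFile("/path/to/README.md")
    val sparkLines = lines.filter(line => line.contains("Spark"))
    val lengths    = sparkLines.map(line => line.length)
    // Only this action triggers actual execution.
    println(lengths.count())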

What are Action operators?
Actions are operators that return a final value to the driver program or write data to an external storage system. Actions force evaluation of the transformations required for the RDD they are called on, since they must actually produce output.
reduce() is an action that repeatedly applies the function passed to it until a single value is left. take(n) returns the first n elements of the RDD to the driver (the local node), while collect() retrieves the entire RDD.
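
For example:

    val nums = sc.parallelize(List(1, 2, 3, 4))
    // reduce() folds the elements together until one value is left.
    val sum = nums.reduce((a, b) => a + b)   // 10
    // take(n) brings the first n elements back to the driver.
    val firstTwo = nums.take(2)              // Array(1, 2)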

What is RDD Lineage graph?
Spark keeps track of the set of dependencies between different RDDs, called the lineage graph. Spark does not replicate data in memory, so if any data is lost, it is rebuilt using the RDD lineage, a process that reconstructs lost data partitions. The key point is that an RDD always remembers how to build itself from other datasets.

Define Partitions.
As the name suggests, a partition is a smaller, logical division of data, similar to a 'split' in MapReduce. Partitioning is the process of deriving logical units of data to speed up processing. Every RDD in Spark is partitioned.
Spark's partitioning is available on all RDDs of key/value pairs, and causes the system to group elements based on a function of each key.
e.g. rdd.partitionBy(new HashPartitioner(100))  // in Scala, partitionBy takes a Partitioner such as org.apache.spark.HashPartitioner

What is Spark Driver?
“Spark Driver” is the program that runs on the master node and declares transformations and actions on data RDDs. In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark master.
The driver also delivers RDD graphs to the master, where the standalone cluster manager runs.

What is Spark Executor?
When the SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store data on the worker nodes. The final tasks from the SparkContext are transferred to the executors for execution.

What is worker node?
Worker node refers to any node that can run application code in a cluster.

What is Hive on Spark?
Hive on Spark provides Hive with the ability to use Apache Spark as its execution engine.
set hive.execution.engine=spark;
The main task in implementing the Spark execution engine for Hive lies in query planning, where Hive operator plans from the semantic analyzer are translated into a task plan that Spark can execute. It also includes query execution, where the generated Spark plan is actually executed on the Spark cluster.

What are Spark’s Ecosystems?
• Spark SQL for working with structured data
• Spark Streaming for processing live data streams
• GraphX for building and computing on graphs
• MLlib for machine learning
• SparkR for running R programs on the Spark engine

What is Spark SQL?
Spark SQL is Spark's package for working with structured data. It allows querying data via SQL, as well as via the Hive variant of SQL called Hive Query Language (HQL). It supports many data sources, including Hive tables, Parquet, and JSON.
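
A short sketch, assuming the Spark 1.x SQLContext API (the file path and field name are illustrative):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val users = sqlContext.read.json("/path/to/users.json")
    users.registerTempTable("users")               // expose the data as a SQL table
    val names = sqlContext.sql("SELECT name FROM users")
    names.show()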

What is Spark Streaming?
Spark Streaming is the Spark component that enables processing of live streams of data. Data streams include log files generated by production web servers, or queues of messages containing status updates posted by the users of a web service.
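
A minimal sketch: a word count over 10-second batches read from a socket (the host and port are illustrative; something like "nc -lk 9999" can feed it):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // block until the stream is stopped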

What is MLlib?
Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering and collaborative filtering, as well as supporting functionality such as model evaluation and data import.

What is GraphX?
GraphX is a library for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations. GraphX also provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).

What is PageRank?
PageRank is an iterative algorithm that can be used to rank web pages and involves many joins. It is one of the core graph algorithms in GraphX: PageRank measures the importance of each vertex in a graph, where an edge from u to v represents an endorsement of v's importance by u. In simple terms, a user who is followed by many others on Instagram will rank highly on that platform.
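
A short GraphX sketch, assuming an edge-list file of "srcId dstId" pairs (the path is illustrative):

    import org.apache.spark.graphx.GraphLoader

    val graph = GraphLoader.edgeListFile(sc, "/path/to/followers.txt")
    // Iterate until the ranks converge within the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices
    ranks.take(5).foreach(println)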

What is Yarn?
As in Hadoop, YARN is one of the key features supported by Spark, providing a central resource management platform for scalable operation across the cluster. Running Spark on YARN requires a binary distribution of Spark built with YARN support.

What are the common mistakes developers make when running Spark applications?
Developers often make the mistake of:
     • Hitting the web service several times by using multiple clusters.
     • Running everything on the local node instead of distributing the work.
     • Being careless with memory, since Spark relies heavily on memory for processing.

What is the difference between persist() and cache()?
persist() allows the user to specify the storage level, whereas cache() uses the default storage level; cache() is equivalent to persist(StorageLevel.MEMORY_ONLY).
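
For example (the path is illustrative):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("/path/to/README.md")
    // persist() lets the caller pick the storage level explicitly:
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    val words = lines.flatMap(_.split(" "))
    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
    words.cache()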

What is a Parquet file?
Parquet is a columnar file format supported by many other data processing systems. Spark SQL can perform both read and write operations on Parquet files, and it is considered one of the best formats for big data analytics so far.

What is the advantage of a Parquet file?
A Parquet file is a columnar format file that helps (see the I/O sketch below):
     • Limit I/O operations
     • Consume less space
     • Fetch only the required columns
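
A sketch of Parquet I/O, assuming the Spark 1.x SQLContext/DataFrame API (paths and column names are illustrative):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("/path/to/users.json")
    df.write.parquet("/path/to/users.parquet")      // columnar write
    // Reading back fetches only the columns the query actually touches.
    val users = sqlContext.read.parquet("/path/to/users.parquet")
    users.select("name").show()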

What are the various data sources available in Spark SQL?
     • Parquet files
     • JSON datasets
     • Hive tables

How does Spark use Hadoop?
Spark has its own cluster management for computation and mainly uses Hadoop for storage.

How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger clean-ups by setting the parameter spark.cleaner.ttl, or by dividing long-running jobs into batches and writing the intermediate results to disk.

What are the benefits of using Spark with Apache Mesos?
It provides scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.

When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?
No. Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

Which Spark library allows reliable file sharing at memory speed across different cluster frameworks?
Tachyon

What do you understand by Pair RDD?
Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as pair RDDs. Pair RDDs allow users to operate on each key in parallel. They have a reduceByKey() method that aggregates data for each key, and a join() method that combines different RDDs based on elements having the same key.
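
For example (the data is illustrative):

    val sales  = sc.parallelize(List(("apples", 3), ("pears", 2), ("apples", 4)))
    val totals = sales.reduceByKey(_ + _)    // ("apples", 7), ("pears", 2)

    val prices = sc.parallelize(List(("apples", 1.5), ("pears", 2.0)))
    val joined = totals.join(prices)         // ("apples", (7, 1.5)), ("pears", (2, 2.0))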

How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function.
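
For example (values are illustrative):

    val rdd   = sc.parallelize(List((1, "a"), (2, "b"), (3, "c")))
    val other = sc.parallelize(List((2, "x")))
    // Drops every pair whose key also appears in `other`.
    rdd.subtractByKey(other).collect()   // (1, "a") and (3, "c") remain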

What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediate data from various shuffle operations; however, it is often suggested that users call the persist() method on any RDD they plan to reuse. Spark has various persistence levels for storing RDDs on disk, in memory, or as a combination of both, with different replication levels. The various storage/persistence levels in Spark are (see the sketch after the list):
     • MEMORY_ONLY
     • MEMORY_ONLY_SER
     • MEMORY_AND_DISK
     • MEMORY_AND_DISK_SER
     • DISK_ONLY
     • OFF_HEAP
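
A minimal sketch of choosing a level (the computation is a stand-in):

    import org.apache.spark.storage.StorageLevel

    val result = sc.parallelize(1 to 1000000).map(_ * 2)   // stand-in for an expensive computation
    result.persist(StorageLevel.MEMORY_AND_DISK_SER)       // serialized in memory, spilling to disk
    println(result.count())   // the first action computes and persists the partitions
    println(result.count())   // later actions reuse the persisted data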

How does Spark handle monitoring and logging in standalone mode?
Spark has a web-based user interface for monitoring the cluster in standalone mode, showing cluster and job statistics. The log output for each job is written to the work directory of the worker (slave) nodes.

Does Apache Spark provide checkpointing?
Lineage graphs are always useful for recovering RDDs after a failure, but this is generally time-consuming when the RDDs have long lineage chains. Spark provides an API for checkpointing (and, relatedly, replicated storage levels for persist()); however, the decision of which data to checkpoint is left to the user. Checkpoints are most useful when lineage graphs are long and have wide dependencies.
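
A minimal sketch (the directory is illustrative; on a real cluster it should be a reliable store such as HDFS):

    sc.setCheckpointDir("/path/to/checkpoints")
    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint()   // mark the RDD for checkpointing
    rdd.count()        // the first action materializes and saves the checkpoint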

How can you launch Spark jobs inside Hadoop MapReduce?
Using SIMR (Spark In MapReduce), users can run any Spark job inside MapReduce without requiring any admin rights.

How does Spark use Akka?
Spark uses Akka mainly for scheduling: all the workers request a task from the master after registering, and the master simply assigns the task. Spark uses Akka for messaging between the workers and the masters.

How can you achieve high availability in Spark?
     • Implementing single-node recovery with the local file system
     • Using standby masters with Apache ZooKeeper

Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
The data storage model in Apache Spark is based on RDDs, which achieve fault tolerance through lineage. An RDD always carries the information on how it was built from other datasets, so if any partition of an RDD is lost due to failure, lineage is used to rebuild only that particular lost partition.

What do you understand by SchemaRDD?
An RDD that consists of Row objects (wrappers around basic string or integer arrays) together with schema information about the type of data in each column.

What are Shared Variables?
When a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

What is an “Accumulator”?
“Accumulators” provide a simple syntax for aggregating values from the worker nodes back to the driver program. They also serve as a simple debugging aid: similar to Hadoop counters, accumulators can count the number of “events” occurring in a program.
Accumulators are variables that can only be added to, through an associative operation. Spark natively supports accumulators of numeric value types and standard mutable collections.
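
The classic sketch counts blank lines while doing other work (the file path is illustrative; sc.accumulator is the Spark 1.x API):

    val file = sc.textFile("/path/to/README.md")
    val blankLines = sc.accumulator(0)        // create an Int accumulator
    val words = file.flatMap { line =>
      if (line.isEmpty) blankLines += 1       // tasks add to it; only the driver reads it
      line.split(" ")
    }
    words.count()                             // the action triggers the adds
    println("Blank lines: " + blankLines.value)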

What are “Broadcast variables”?
“Broadcast variables” allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
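
A small sketch (the lookup table is illustrative):

    // Ship the lookup table to each node once, instead of with every task.
    val countryNames = Map("US" -> "United States", "IN" -> "India")
    val bcNames = sc.broadcast(countryNames)

    val codes = sc.parallelize(List("US", "IN", "US"))
    val names = codes.map(code => bcNames.value.getOrElse(code, "unknown"))
    names.collect().foreach(println)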

What is sbt (simple build tool) in Spark?
sbt is a newer build tool most often used for Scala projects. sbt assumes a project layout similar to Maven's. sbt build files are written in a configuration language where we assign values to specific keys in order to define the build for our project.
