2023–2024 Cognitive Class – Spark Fundamentals I Answers
Module 1: Introduction to Spark
1. What gives Spark its speed advantage for complex applications?
- Spark extends the MapReduce model
- Various libraries provide Spark with additional functionality
- Spark can cover a wide range of workloads under one system
- Spark makes extensive use of in-memory computations
- All of the above
2. For what purpose would an Engineer use Spark? Select all that apply.
- Analyzing data to obtain insights
- Programming with Spark’s API
- Transforming data into a usable form for analysis
- Developing a data processing system
- Tuning an application for a business use case
3. Which of the following statements are true of the Resilient Distributed Dataset (RDD)? Select all that apply.
- There are three types of RDD operations
- RDDs allow Spark to reconstruct transformations
- RDDs only add a small amount of code due to tight integration
- RDD action operations do not return a value
- RDD is a distributed collection of elements parallelized across the cluster
Module 2: Resilient Distributed Dataset and DataFrames
1. Which of the following methods can be used to create a Resilient Distributed Dataset (RDD)? Select all that apply.
- Creating a directed acyclic graph (DAG)
- Parallelizing an existing Spark collection
- Referencing a Hadoop-supported dataset
- Using data that resides in Spark
- Transforming an existing RDD to form a new one
2. What happens when an action is executed?
- Executors prepare the data for operation in parallel
- The driver sends code to be executed on each block
- A cache is created for storing partial results in memory
- Data is partitioned into different blocks across the cluster
- All of the above
3. Which of the following statements is true of RDD persistence? Select all that apply.
- Persistence through caching provides fault tolerance
- Future actions can be performed significantly faster
- Each partition is replicated on two cluster nodes
- RDD persistence always improves space efficiency
- By default, objects that are too big for memory are stored on the disk
Module 3: Spark application programming
1. What is SparkContext?
- An object that represents the connection to a Spark cluster
- A tool for linking to nodes
- A tool that provides fault tolerance
- The built-in shell for the Spark engine
- A programming language for applications
2. Which of the following methods can be used to pass functions to Spark? Select all that apply.
- Transformations and actions
- Passing by reference
- Static methods in a global singleton
- Import statements
- Anonymous function syntax
3. Which of the following is a main component of a Spark application’s source code?
- SparkContext object
- Transformations and actions
- Business Logic
- Import statements
- All of the above
Module 4: Introduction to the Spark libraries
1. Which of the following is NOT an example of a Spark library?
- Hive
- MLlib
- Spark Streaming
- Spark SQL
- GraphX
2. From which of the following sources can Spark Streaming receive data? Select all that apply.
- Kafka
- JSON
- Parquet
- HDFS
- Hive
3. In Spark Streaming, processing begins immediately when an element of the application is executed. True or false?
- True
- False
Module 5: Spark configuration, monitoring and tuning
1. Which of the following is a main component of a Spark cluster? Select all that apply.
- Driver Program
- SparkContext
- Cluster Manager
- Worker node
- Cache
2. What are the main locations for Spark configuration? Select all that apply.
- The SparkConf object
- The Spark Shell
- Executor Processes
- Environment variables
- Logging properties
3. Which of the following techniques can improve Spark performance? Select all that apply.
- Scheduler Configuration
- Memory Tuning
- Data Serialization
- Using Broadcast variables
- Using nested structures
Spark Fundamentals I Final Exam Answers
1. Which of the following is a type of Spark RDD operation? Select all that apply.
- Parallelization
- Action
- Persistence
- Transformation
- Evaluation
2. Spark must be installed and run on top of a Hadoop cluster. True or false?
- True
- False
3. Which of the following operations will work improperly when using a Combiner?
- Count
- Maximum
- Minimum
- Average
- All of the above operations will work properly
4. Spark supports which of the following libraries?
- GraphX
- Spark Streaming
- MLlib
- Spark SQL
- All of the above
5. Spark supports which of the following programming languages?
- C++ and Python
- Scala, Java, C++, Python, Perl
- Scala, Perl, Java
- Scala, Python, Java, R
- Java and Scala
6. A transformation is evaluated immediately. True or false?
- True
- False
7. Which storage level does the cache() function use?
- MEMORY_AND_DISK_SER
- MEMORY_AND_DISK
- MEMORY_ONLY_SER
- MEMORY_ONLY
8. Which of the following statements does NOT describe accumulators?
- They can only be read by the driver
- Programmers can extend them beyond numeric types
- They implement counters and sums
- They can only be added through an associative operation
- They are read-only
9. You must explicitly initialize the SparkContext when creating a Spark application. True or false?
- True
- False
10. The “local” parameter can be used to specify the number of cores to use for the application. True or false?
- True
- False
11. Spark applications can ONLY be packaged using one, specific build tool. True or false?
- True
- False
12. Which of the following parameters of the “spark-submit” script determine where the application will run?
- --class
- --master
- --deploy-mode
- --conf
- None of the above
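For reference, a typical `spark-submit` invocation looks like the fragment below (the class name, master URL, and jar are placeholders, not values from this course):

```shell
# --master selects where the application runs (a Spark, YARN, or
# Mesos cluster URL, or local[N] for in-process testing);
# --deploy-mode chooses whether the driver runs on a worker node
# (cluster) or on the submitting machine (client).
spark-submit \
  --class org.example.MyApp \
  --master spark://host:7077 \
  --deploy-mode cluster \
  --conf spark.executor.memory=2g \
  my-app.jar
```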
13. Which of the following is NOT supported as a cluster manager?
- YARN
- Helix
- Mesos
- Spark
- All of the above are supported
14. Spark SQL allows relational queries to be expressed in which of the following?
- HiveQL only
- Scala, SQL, and HiveQL
- Scala and SQL
- Scala and HiveQL
- SQL only
15. Spark Streaming processes live streaming data in real-time. True or false?
- True
- False
16. The MLlib library contains which of the following algorithms?
- Dimensionality Reduction
- Regression
- Classification
- Clustering
- All of the above
17. What is the purpose of the GraphX library?
- To create a visual representation of the data
- To generate data-parallel models
- To create a visual representation of a directed acyclic graph (DAG)
- To perform graph-parallel computations
- To convert from data-parallel to graph-parallel algorithms
18. Which list describes the correct order of precedence for Spark configuration, from highest to lowest?
- Properties set on SparkConf, values in spark-defaults.conf, flags passed to spark-submit
- Flags passed to spark-submit, values in spark-defaults.conf, properties set on SparkConf
- Values in spark-defaults.conf, properties set on SparkConf, flags passed to spark-submit
- Values in spark-defaults.conf, flags passed to spark-submit, properties set on SparkConf
- Properties set on SparkConf, flags passed to spark-submit, values in spark-defaults.conf
19. Spark monitoring can be performed with external tools. True or false?
- True
- False
20. Which serialization libraries are supported in Spark? Select all that apply.
- Apache Avro
- Java Serialization
- Protocol Buffers
- Kryo Serialization
- TPL