2023 Cognitive Class – Data Science with Scala Answers (Updated)
Module 1: Basic Statistics and Data Types
1. You import MLlib’s vectors from:
- org.apache.spark.mllib.TF
- org.apache.spark.mllib.numpy
- org.apache.spark.mllib.linalg
- org.apache.spark.mllib.pandas
2. Select the types of distributed matrices:
- RowMatrix
- IndexedRowMatrix
- CoordinateMatrix
3. How would you calculate the mean of the following?
val observations: RDD[Vector] = sc.parallelize(Array(
Vectors.dense(1.0, 2.0),
Vectors.dense(4.0, 5.0),
Vectors.dense(7.0, 8.0)))
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
- summary.normL1
- summary.numNonzeros
- summary.mean
- summary.normL2
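To make concrete what a column-wise mean over these three observations looks like, here is a minimal plain-Scala sketch that does not use Spark. The names ColumnMeanSketch and columnMeans are hypothetical; the code only mirrors the arithmetic of a per-column mean and is not MLlib's implementation.

```scala
// Plain-Scala sketch (no Spark): column-wise mean of equally sized row vectors.
// ColumnMeanSketch and columnMeans are illustrative names, not Spark APIs.
object ColumnMeanSketch {
  def columnMeans(rows: Seq[Vector[Double]]): Vector[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val dim = rows.head.length
    require(rows.forall(_.length == dim), "all rows must have the same length")
    val n = rows.length.toDouble
    // For each column index j, sum the j-th entry of every row and divide by the row count.
    (0 until dim).map(j => rows.map(_(j)).sum / n).toVector
  }
}
```

For the three vectors in the question, (1.0, 2.0), (4.0, 5.0) and (7.0, 8.0), the column means are (4.0, 5.0).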
4. What task do the following lines of code perform?
import org.apache.spark.mllib.random.RandomRDDs._
val million = poissonRDD(sc, mean=1.0, size=1000000L, numPartitions=10)
- Calculate the variance
- Calculate the mean
- Generate random samples
5. MLlib uses the compressed sparse column (CSC) format for sparse matrices; as such, it only keeps the non-zero entries?
- True
- False
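The compressed sparse column layout can be sketched in plain Scala, without Spark. The names CscSketch, Csc and fromDense are hypothetical, and the code only illustrates the three arrays a CSC matrix keeps (column pointers, row indices, non-zero values); it is not MLlib's SparseMatrix.

```scala
// Plain-Scala sketch (no Spark) of the compressed sparse column (CSC) layout.
// CscSketch, Csc and fromDense are illustrative names, not Spark APIs.
object CscSketch {
  // values:     the non-zero entries, stored column by column
  // rowIndices: the row of each stored value
  // colPtrs:    values for column j live at positions colPtrs(j) until colPtrs(j + 1),
  //             so colPtrs has numCols + 1 entries
  final case class Csc(
      numRows: Int,
      numCols: Int,
      colPtrs: Vector[Int],
      rowIndices: Vector[Int],
      values: Vector[Double])

  def fromDense(dense: Vector[Vector[Double]]): Csc = {
    val numRows = dense.length
    val numCols = if (numRows == 0) 0 else dense.head.length
    // Walk the matrix column by column and keep only the non-zero entries.
    val entries = for {
      j <- 0 until numCols
      i <- 0 until numRows
      if dense(i)(j) != 0.0
    } yield (i, j, dense(i)(j))
    // Cumulative count of non-zeros per column gives the column pointers.
    val counts = (0 until numCols).map(j => entries.count(_._2 == j))
    val colPtrs = counts.scanLeft(0)(_ + _).toVector
    Csc(numRows, numCols, colPtrs, entries.map(_._1).toVector, entries.map(_._3).toVector)
  }
}
```

For the 3 x 2 matrix with rows (9.0, 0.0), (0.0, 8.0), (0.0, 6.0), this stores only three values, 9.0, 8.0 and 6.0, with row indices 0, 1, 2 and column pointers 0, 1, 3.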
Module 2: Preparing Data
1. For a DataFrame object, the method describe calculates the:
- count
- mean
- standard deviation
- max
- min
- all of the above
2. Which line of code drops the rows that contain null values? Select the best answer.
- val dfnan = df.withColumn("nanUniform", halfTonNaN(df("uniform")))
- dfnan.na.replace("uniform", Map(Double.NaN -> 0.0))
- dfnan.na.drop(minNonNulls = 3)
- dfnan.na.fill(0.0)
3. What task do the following lines of code perform?
val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)
val model1 = lr.fit(training)
- Perform one-hot encoding
- Train a linear regression model
- Train a logistic regression model
- Perform PCA on the data
4. The StandardScalerModel transforms the data such that:
- each feature has a max value of 1
- each feature is orthogonal
- each feature has unit standard deviation and zero mean
- each feature has a min value of -1
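Standardizing a single feature to zero mean and unit standard deviation can be sketched in plain Scala, without Spark. The names StandardizeSketch and standardize are hypothetical, and the sketch assumes both centering and scaling are wanted and that the sample standard deviation (dividing by n - 1) is used.

```scala
// Plain-Scala sketch (no Spark): standardize one feature column.
// StandardizeSketch and standardize are illustrative names, not Spark APIs.
// Assumption: the sample standard deviation (n - 1 in the denominator) is used.
object StandardizeSketch {
  def standardize(xs: Vector[Double]): Vector[Double] = {
    require(xs.length > 1, "need at least two values to estimate a standard deviation")
    val mean = xs.sum / xs.length
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (xs.length - 1)
    val std = math.sqrt(variance)
    // A constant column has zero spread; map it to zeros instead of dividing by zero.
    if (std == 0.0) xs.map(_ => 0.0)
    else xs.map(x => (x - mean) / std)
  }
}
```

For the values 1.0, 4.0, 7.0 the mean is 4.0 and the sample standard deviation is 3.0, so the standardized column is -1.0, 0.0, 1.0.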
Module 3: Feature Engineering
1. Spark ML works with?
- tensors
- vectors
- dataframes
- lists
2. The function IndexToString() performs one-hot encoding?
- True
- False
3. Principal Component Analysis is primarily used for:
- to convert categorical variables to integers
- to predict discrete values
- dimensionality reduction
4. One important step prior to using PCA is:
- normalizing your data
- making sure every feature is not correlated
- taking the log for your data
- subtracting the mean
Module 4: Fitting a Model
1. You can use decision trees for:
- regression
- classification
- classification and regression
- data normalization
2. The following line of code: val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
- split the data into training and testing data
- train the model
- use 70% of the data for testing
- use 30% of the data for training
- make a prediction
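The idea behind a weighted random split can be sketched in plain Scala, without Spark. The names RandomSplitSketch and randomSplit are hypothetical, and the sketch assumes each element is assigned independently by a random draw, so the resulting sizes are only approximately proportional to the weights.

```scala
import scala.util.Random

// Plain-Scala sketch (no Spark): split a collection into two parts at random.
// RandomSplitSketch and randomSplit are illustrative names, not Spark APIs.
object RandomSplitSketch {
  def randomSplit[A](data: Seq[A], trainWeight: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainWeight >= 0.0 && trainWeight <= 1.0, "trainWeight must be in [0, 1]")
    val rng = new Random(seed)
    // Draw once per element, then partition on the stored flag so each element
    // lands in exactly one of the two parts.
    val tagged = data.map(x => (x, rng.nextDouble() < trainWeight))
    val (train, test) = tagged.partition(_._2)
    (train.map(_._1), test.map(_._1))
  }
}
```

With trainWeight = 0.7, roughly 70% of the elements end up in the first part (training) and the remaining 30% or so in the second part (testing); the exact counts vary with the seed.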
3. On the Random Forest Classifier, .setNumTrees():
- sets the max depth of trees
- sets the minimum number of classes before a split
- sets the number of trees
4. Elastic net regularization uses:
- L0-norm
- L1-norm
- L2-norm
- a convex combination of the L1 norm and L2 norm
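The elastic net penalty can be written out in a few lines of plain Scala, without Spark. The names ElasticNetSketch and penalty are hypothetical; regParam and alpha stand for the overall regularization strength and the L1/L2 mixing weight, and the form below (with the factor 1/2 on the squared L2 term) is the one commonly given in the Spark ML documentation.

```scala
// Plain-Scala sketch (no Spark): the elastic net penalty on a weight vector w.
// ElasticNetSketch and penalty are illustrative names, not Spark APIs.
//   penalty = regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
// alpha = 1 gives a pure L1 (lasso) penalty, alpha = 0 a pure L2 (ridge) penalty.
object ElasticNetSketch {
  def penalty(w: Vector[Double], regParam: Double, alpha: Double): Double = {
    require(alpha >= 0.0 && alpha <= 1.0, "alpha must be in [0, 1]")
    val l1 = w.map(x => math.abs(x)).sum
    val l2Squared = w.map(x => x * x).sum
    regParam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2Squared)
  }
}
```

For w = (3.0, -4.0) and regParam = 1.0, the penalty is 7.0 at alpha = 1.0, 12.5 at alpha = 0.0, and 9.75 at alpha = 0.5.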
Module 5: Pipeline and Grid Search
1. What task does the following code perform: withColumn("paperscore", data("A2") * 4 + data("A") * 3)?
- add 4 columns to A2
- add 3 columns to A1
- add 4 to each element in column A2
- assign a higher weight to A2 and A journals
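The arithmetic inside that withColumn expression can be sketched per row in plain Scala, without Spark. The names PaperScoreSketch, JournalCounts and paperScore are hypothetical, and the sketch assumes the columns A2 and A hold integer counts.

```scala
// Plain-Scala sketch (no Spark): the per-row arithmetic of the derived column.
// PaperScoreSketch, JournalCounts and paperScore are illustrative names.
// Assumption: a2 and a are integer counts taken from the columns A2 and A.
object PaperScoreSketch {
  final case class JournalCounts(a2: Int, a: Int)

  // Each A2 entry counts 4 points and each A entry counts 3 points.
  def paperScore(row: JournalCounts): Int = row.a2 * 4 + row.a * 3
}
```

A row with a2 = 2 and a = 1 gets a score of 2 * 4 + 1 * 3 = 11; the existing columns are left unchanged and a new value is derived from them.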
2. In an estimator:
- there is no need to call the method fit
- the fit function is called
- only the transform function is called
3. Which is not a valid type of Evaluator in MLlib?
- RegressionEvaluator
- MultiClassClassificationEvaluator
- MultiLabelClassificationEvaluator
- BinaryClassificationEvaluator
- All are valid
4. In the following lines of code, the last transform in the pipeline is a:
val rf = new RandomForestClassifier().setFeaturesCol("assembled").setLabelCol("status").setSeed(42)
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(value_band_indexer, category_indexer, label_indexer, assembler, rf))
- principal component analysis
- Vector Assembler
- String Indexer
- Random Forest Classifier
Final Exam Answers
1. What is not true about labeled points?
- They associate sparse vectors with a corresponding label/response
- They associate dense vectors with a corresponding label/response
- They are used in unsupervised machine learning algorithms
- All are true
- None are true
2. Which is true about column pointers in sparse matrices?
- By themselves, they do not represent the specific physical location of a value in the matrix
- They never repeat values
- They have the same number of values as the number of columns
- All are true
- None are true
3. What is the name of the most basic type of distributed matrix?
- CoordinateMatrix
- IndexedRowMatrix
- SparseMatrix
- SimpleMatrix
- RowMatrix
4. A perfect correlation is represented by what value?
- 3
- 1
- -1
- 100
- 0
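The range of the Pearson correlation coefficient can be checked with a small plain-Scala sketch that does not use Spark. The names PearsonSketch and pearson are hypothetical; the code implements the textbook formula, covariance divided by the product of the standard deviations.

```scala
// Plain-Scala sketch (no Spark): Pearson correlation of two equally long series.
// PearsonSketch and pearson are illustrative names, not Spark APIs.
object PearsonSketch {
  def pearson(xs: Vector[Double], ys: Vector[Double]): Double = {
    require(xs.length == ys.length && xs.length > 1, "need two series of equal length > 1")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    // Sums of centered products; the common 1/n factor cancels in the ratio.
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
    cov / math.sqrt(varX * varY)
  }
}
```

A series and an exact positive multiple of it, such as (1, 2, 3) and (2, 4, 6), correlate at 1; reversing the second series gives -1; the coefficient never leaves the interval [-1, 1].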
5. A MinMaxScaler is a transformer which:
- Rescales each feature to a specific range
- Takes no parameters
- Makes zero values remain untransformed
- All are true
- None are true
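Min-max rescaling of one feature can be sketched in plain Scala, without Spark. The names MinMaxSketch and rescale and the parameters lo and hi are hypothetical; the formula follows the one usually given for min-max scaling, with a constant column mapped to the midpoint of the target range.

```scala
// Plain-Scala sketch (no Spark): rescale one feature column to the range [lo, hi].
// MinMaxSketch, rescale, lo and hi are illustrative names, not Spark APIs.
//   rescaled = (x - eMin) / (eMax - eMin) * (hi - lo) + lo
object MinMaxSketch {
  def rescale(xs: Vector[Double], lo: Double = 0.0, hi: Double = 1.0): Vector[Double] = {
    require(xs.nonEmpty, "need at least one value")
    val eMin = xs.min
    val eMax = xs.max
    // A constant column has no spread; map it to the midpoint of the target range.
    if (eMax == eMin) xs.map(_ => 0.5 * (lo + hi))
    else xs.map(x => (x - eMin) / (eMax - eMin) * (hi - lo) + lo)
  }
}
```

With the default range, 1.0, 4.0, 7.0 becomes 0.0, 0.5, 1.0. With the range [-1, 1], the inputs 0.0, 5.0, 10.0 become -1.0, 0.0, 1.0, so a zero input is not preserved in general.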
6. Which is not a supported Random Data Generation distribution?
- Poisson
- Uniform
- Exponential
- Delta
- Normal
7. Sampling without replacement means:
- The expected number of times each element is chosen is randomized
- The expected size of the sample is a fraction of the RDD’s size
- The expected number of times each element is chosen
- The expected size of the sample is unknown
- The expected size of the sample is the same as the RDD’s size
8. What are the supported types of hypothesis testing?
- Pearson’s Chi-Squared Test for goodness of fit
- Pearson’s Chi-Squared Test for independence
- Kolmogorov-Smirnov test for equality of distribution
- All are supported
- None are supported
9. For Kernel Density Estimation, which kernel is supported by Spark?
- KDEMultivariate
- KDEUnivariate
- Gaussian
- KernelDensity
- All are supported
10. Which DataFrames statistics method computes the pairwise frequency table of the given columns?
- freqItems()
- cov()
- crosstab()
- pairwiseFreq()
- corr()
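A pairwise frequency table (contingency table) of two columns can be sketched in plain Scala, without Spark. The names CrosstabSketch and crosstab are hypothetical; the sketch simply counts how often each pair of values occurs together and returns the counts as a map rather than as a table.

```scala
// Plain-Scala sketch (no Spark): count how often each (first, second) value pair occurs.
// CrosstabSketch and crosstab are illustrative names, not Spark APIs.
object CrosstabSketch {
  def crosstab[A, B](pairs: Seq[(A, B)]): Map[(A, B), Int] =
    // Group identical pairs together and keep the size of each group.
    pairs.groupBy(identity).map { case (pair, occurrences) => pair -> occurrences.size }
}
```

For the pairs (a, x), (a, x), (b, y), the result maps (a, x) to 2 and (b, y) to 1; pairs that never occur are simply absent from the map.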
11. Which is not true about the fill method for DataFrame NA functions?
- It is used for replacing NaN values
- It is used for replacing nil values
- It is used for replacing null values
- All are true
- None are true
12. Which transformer listed below is used for Natural Language processing?
- StandardScaler
- OneHotEncoder
- ElementwiseProduct
- Normalizer
- None are used for Natural Language processing
13. Which is true about the Mahalanobis Distance?
- It is a scale-variant distance
- It does not take into account the correlations of the dataset
- It is measured along each Principal Component axis
- It is a multi-dimensional generalization of measuring how many standard deviations a point is away from the median
- It has units of distance
14. Which is true about OneHotEncoder?
- It must be told which column to create for its output
- It creates a Sparse Vector
- It must be told which column is its input
- All are true
- None are true
15. Principal Component Analysis is:
- Never used for feature engineering
- Used for supervised machine learning
- A dimension reduction technique
- All are true
- None are true
16. MLlib’s implementation of decision trees:
- Supports only multiclass classification
- Does not support regressions
- Partitions data by rows, allowing distributed training
- Supports only continuous features
- None are true
17. Which is not a tunable of SparkML decision trees?
- maxBins
- maxMemoryInMB
- minInstancesPerNode
- minDepth
- minInfoGain
18. Which is true about Random Forests?
- They support non-categorical features
- They combine many decision trees in order to reduce the risk of overfitting
- They do not support regression
- They only support binary classification
- None are true
19. When comparing Random Forests with Gradient-Boosted Trees, what must you consider?
- How the number of trees affects the outcome
- Depth of Trees
- Parallelization abilities
- All of these
- None of these
20. Which is not a valid type of Evaluator in MLlib?
- MultiLabelClassificationEvaluator
- RegressionEvaluator
- BinaryClassificationEvaluator
- MultiClassClassificationEvaluator
- All are valid