2023 – 2024 Cognitive Class – Data Science with Scala Answers

### Module 1: Basic Statistics and Data Types

**1. From which package do you import MLlib's vectors?**

- org.apache.spark.mllib.TF
- org.apache.spark.mllib.numpy
- **org.apache.spark.mllib.linalg**
- org.apache.spark.mllib.pandas

**2. Select the types of distributed matrices:**

- **RowMatrix**
- **IndexedRowMatrix**
- **CoordinateMatrix**

**3. How would you calculate the mean of the following?**

    val observations: RDD[Vector] = sc.parallelize(Array(
      Vectors.dense(1.0, 2.0),
      Vectors.dense(4.0, 5.0),
      Vectors.dense(7.0, 8.0)))
    val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)

- summary.normL1
- summary.numNonzeros
- **summary.mean**
- summary.normL2
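
The `mean` field reports the column-wise average of the observations. The arithmetic can be sketched in plain Scala without Spark; `columnMeans` is a hypothetical helper written for this illustration and is not part of MLlib.

```scala
// Column-wise mean of a small dataset, mirroring what summary.mean reports.
// `columnMeans` is an illustrative helper, not an MLlib API.
def columnMeans(rows: Seq[Array[Double]]): Array[Double] = {
  require(rows.nonEmpty, "need at least one row")
  val n = rows.length.toDouble
  // transpose groups the values of each column together
  rows.map(_.toSeq).transpose.map(_.sum / n).toArray
}

// The same three observations used in the question
val observations = Seq(
  Array(1.0, 2.0),
  Array(4.0, 5.0),
  Array(7.0, 8.0)
)

val means = columnMeans(observations)
println(means.mkString("[", ", ", "]")) // [4.0, 5.0]
```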

**4. What task do the following lines of code perform?**

    import org.apache.spark.mllib.random.RandomRDDs._
    val million = poissonRDD(sc, mean=1.0, size=1000000L, numPartitions=10)

- Calculate the variance
- calculate the mean
- **generate random samples**
- Calculate the variance

**5. MLlib uses the compressed sparse column (CSC) format for sparse matrices; as such, it keeps only the non-zero entries.**

- **True**
- False
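
As a rough illustration of the CSC layout, the sketch below stores a 3x3 matrix as three arrays: column pointers, row indices, and non-zero values. The `Csc` class is a hypothetical stand-in written in plain Scala, not MLlib's `SparseMatrix`, although the three arrays play the same roles.

```scala
// Compressed sparse column (CSC) layout for the 3x3 matrix
//   1.0 0.0 4.0
//   0.0 3.0 5.0
//   2.0 0.0 6.0
// `Csc` is an illustrative class, not an MLlib type.
final case class Csc(
    numRows: Int,
    numCols: Int,
    colPtrs: Array[Int],    // numCols + 1 offsets into rowIndices/values
    rowIndices: Array[Int], // row of each stored entry
    values: Array[Double]   // non-zero entries, column by column
) {
  def apply(i: Int, j: Int): Double = {
    // entries of column j occupy positions colPtrs(j) until colPtrs(j + 1)
    val range = colPtrs(j) until colPtrs(j + 1)
    range.find(k => rowIndices(k) == i).map(k => values(k)).getOrElse(0.0)
  }
}

val m = Csc(
  numRows = 3,
  numCols = 3,
  colPtrs = Array(0, 2, 3, 6),
  rowIndices = Array(0, 2, 1, 0, 1, 2),
  values = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
)

println(m(2, 0)) // stored entry
println(m(0, 1)) // zero that is not stored
```

Only six values are stored for nine cells, and a column pointer on its own says where a column's entries start in the arrays, not where a value sits in the matrix.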

### Module 2: Preparing Data

**1. For a DataFrame object, the method describe calculates the:**

- count
- mean
- standard deviation
- max
- min
- **all of the above**

**2. Which line of code drops the rows that contain null values? Select the best answer.**

- val dfnan = df.withColumn("nanUniform", halfToNaN(df("uniform")))
- dfnan.na.replace("uniform", Map(Double.NaN -> 0.0))
- **dfnan.na.drop(minNonNulls = 3)**
- dfnan.na.fill(0.0)

**3. What task do the following lines of code perform?**

    val lr = new LogisticRegression()
    lr.setMaxIter(10).setRegParam(0.01)
    val model1 = lr.fit(training)

- perform one hot encoding
- Train a linear regression model
- **Train a Logistic regression model**
- Perform PCA on the data

**4. The StandardScalerModel transforms the data such that:**

- each feature has a max value of 1
- each feature is orthogonal
- **each feature has unit standard deviation and zero mean**
- each feature has a min value of -1
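
The transformation can be sketched for a single feature in plain Scala. This is a minimal illustration that assumes both centering and scaling are enabled (in Spark's `StandardScaler`, centering is controlled by `withMean`, which is off by default); `standardize` is a hypothetical helper, not a Spark API.

```scala
// Standardize one feature to zero mean and unit standard deviation.
// `standardize` is an illustrative helper, not a Spark API.
def standardize(xs: Seq[Double]): Seq[Double] = {
  require(xs.length > 1, "need at least two values")
  val n = xs.length
  val mean = xs.sum / n
  // corrected sample standard deviation (divides by n - 1)
  val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (n - 1))
  xs.map(x => (x - mean) / std)
}

val scaled = standardize(Seq(1.0, 4.0, 7.0))
println(scaled) // List(-1.0, 0.0, 1.0)
```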

### Module 3: Feature Engineering

**1. Spark ML works with?**

- tensors
- vectors
- **dataframes**
- lists

**2. The function IndexToString() performs one-hot encoding.**

- True
- **False**

**3. Principal Component Analysis is primarily used for:**

- to convert categorical variables to integers
- to predict discrete values
- **dimensionality reduction**

**4. One important step prior to using PCA is:**

- **normalizing your data**
- making sure every feature is not correlated
- taking the log for your data
- subtracting the mean

### Module 4: Fitting a Model

**1. You can use decision trees for:**

- regression
- classification
- **classification and regression**
- data normalization

**2. What does the following line of code do: val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))**

- **split the data into training and testing data**
- train the model
- use 70% of the data for testing
- use 30% of the data for training
- make a prediction
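
The idea of a weighted random split can be sketched in plain Scala. This is an illustrative stand-in and not Spark's implementation: the hypothetical `randomSplit` helper below sends each element to the first part with probability `trainFraction`, so the resulting sizes are only approximately 70% and 30%.

```scala
import scala.util.Random

// Randomly split a collection into two disjoint parts.
// `randomSplit` here is an illustrative helper, not the Spark method.
def randomSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  require(trainFraction >= 0.0 && trainFraction <= 1.0, "fraction must be in [0, 1]")
  val rng = new Random(seed)
  // each element goes to the first part with probability trainFraction
  data.partition(_ => rng.nextDouble() < trainFraction)
}

val (trainingData, testData) = randomSplit(1 to 1000, 0.7, seed = 42L)
println(s"training: ${trainingData.size}, test: ${testData.size}")
```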

**3. In the Random Forest Classifier, .setNumTrees():**

- sets the max depth of trees
- sets the minimum number of classes before a split
- **sets the number of trees**

**4. Elastic net regularization uses:**

- L0-norm
- L1-norm
- L2-norm
- **a convex combination of the L1 norm and L2 norm**
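
A minimal sketch of such a penalty in plain Scala follows. It assumes the common parameterization `lambda * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)`, where `lambda` is the overall regularization strength and `alpha` in [0, 1] is the mixing weight; the function and parameter names are chosen for illustration only.

```scala
// Elastic net penalty as a convex combination of the L1 and squared L2 terms.
// Names (`elasticNetPenalty`, `lambda`, `alpha`) are illustrative assumptions.
def elasticNetPenalty(w: Seq[Double], lambda: Double, alpha: Double): Double = {
  require(alpha >= 0.0 && alpha <= 1.0, "alpha must be in [0, 1]")
  val l1 = w.map(x => math.abs(x)).sum
  val l2Squared = w.map(x => x * x).sum
  lambda * (alpha * l1 + (1.0 - alpha) / 2.0 * l2Squared)
}

val weights = Seq(3.0, -4.0)
val lassoOnly = elasticNetPenalty(weights, lambda = 0.1, alpha = 1.0) // pure L1
val ridgeOnly = elasticNetPenalty(weights, lambda = 0.1, alpha = 0.0) // pure L2
val mixed     = elasticNetPenalty(weights, lambda = 0.1, alpha = 0.5) // blend
println(s"L1 only: $lassoOnly, L2 only: $ridgeOnly, mixed: $mixed")
```

Setting `alpha` to 1 recovers a lasso-style penalty and setting it to 0 recovers a ridge-style penalty.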

### Module 5: Pipeline and Grid Search

**1. What task does the following code perform: withColumn("paperscore", data("A2") * 4 + data("A") * 3)?**

- add 4 columns to A2
- add 3 columns to A1
- add 4 to each element in column A2
- **assign a higher weight to A2 and A journals**

**2. In an estimator:**

- there is no need to call the method fit
- **the fit function is called**
- only the transform function is called

**3. Which is not a valid type of Evaluator in MLlib?**

- RegressionEvaluator
- MultiClassClassificationEvaluator
- **MultiLabelClassificationEvaluator**
- BinaryClassificationEvaluator
- All are valid

**4. In the following lines of code, the last transform in the pipeline is a:**

    val rf = new RandomForestClassifier().setFeaturesCol("assembled").setLabelCol("status").setSeed(42)
    import org.apache.spark.ml.Pipeline
    val pipeline = new Pipeline().setStages(Array(value_band_indexer,category_indexer,label_indexer,assembler,rf))

- principal component analysis
- Vector Assembler
- String Indexer
- Vector Assembler
- **Random Forest Classifier**

### Final Exam Answers

**1. What is not true about labeled points?**

- They associate sparse vectors with a corresponding label/response
- They associate dense vectors with a corresponding label/response
- **They are used in unsupervised machine learning algorithms**
- All are true
- None are true

**2. Which is true about column pointers in sparse matrices?**

- **By themselves, they do not represent the specific physical location of a value in the matrix**
- They never repeat values
- They have the same number of values as the number of columns
- All are true
- None are true

**3. What is the name of the most basic type of distributed matrix?**

- CoordinateMatrix
- IndexedRowMatrix
- SparseMatrix
- SimpleMatrix
- **RowMatrix**

**4. A perfect correlation is represented by what value?**

- 3
- **1**
- -1
- 100
- 0
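
For intuition, the Pearson correlation coefficient can be computed by hand. The sketch below is plain Scala with a hypothetical `pearson` helper (not a Spark API); two series that move together exactly linearly give a value of 1, up to floating-point rounding.

```scala
// Pearson correlation coefficient of two equally sized series.
// `pearson` is an illustrative helper, not a Spark API.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.length > 1, "need two series of equal length > 1")
  val mx = xs.sum / xs.length
  val my = ys.sum / ys.length
  // sum of products of deviations, and the root sum of squared deviations
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
  val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
  cov / (sx * sy)
}

val x = Seq(1.0, 2.0, 3.0)
val doubled = Seq(2.0, 4.0, 6.0) // y = 2x, exact positive linear relationship
val r = pearson(x, doubled)
println(r) // approximately 1.0
```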

**5. A MinMaxScaler is a transformer which:**

- **Rescales each feature to a specific range**
- Takes no parameters
- Makes zero values remain untransformed
- All are true
- None are true
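
The rescaling can be sketched for one feature in plain Scala. The hypothetical `minMaxScale` helper below assumes the usual linear formula, mapping the observed minimum to `lo` and the observed maximum to `hi`; it is an illustration, not Spark's `MinMaxScaler`.

```scala
// Linearly rescale one feature to the range [lo, hi].
// `minMaxScale`, `lo`, and `hi` are illustrative names, not a Spark API.
def minMaxScale(xs: Seq[Double], lo: Double = 0.0, hi: Double = 1.0): Seq[Double] = {
  require(xs.nonEmpty, "need at least one value")
  val xMin = xs.min
  val xMax = xs.max
  if (xMax == xMin) xs.map(_ => 0.5 * (lo + hi)) // constant feature: use the midpoint
  else xs.map(x => (x - xMin) / (xMax - xMin) * (hi - lo) + lo)
}

val rescaled = minMaxScale(Seq(1.0, 4.0, 7.0))
println(rescaled) // List(0.0, 0.5, 1.0)

// A zero input generally does not stay zero after rescaling
val shifted = minMaxScale(Seq(-2.0, 0.0, 2.0))
println(shifted) // List(0.0, 0.5, 1.0)
```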

**6. Which is not a supported Random Data Generation distribution?**

- Poisson
- Uniform
- Exponential
- **Delta**
- Normal

**7. Sampling without replacement means:**

- The expected number of times each element is chosen is randomized
- **The expected size of the sample is a fraction of the RDD's size**
- The expected number of times each element is chosen
- The expected size of the sample is unknown
- The expected size of the sample is the same as the RDD's size

**8. What are the supported types of hypothesis testing?**

- Pearson’s Chi-Squared Test for goodness of fit
- Pearson’s Chi-Squared Test for independence
- Kolmogorov-Smirnov test for equality of distribution
- **All are supported**
- None are supported

**9. For Kernel Density Estimation, which kernel is supported by Spark?**

- KDEMultivariate
- KDEUnivariate
- **Gaussian**
- KernelDensity
- All are supported

**10. Which DataFrames statistics method computes the pairwise frequency table of the given columns?**

- freqItems()
- cov()
- **crosstab()**
- pairwiseFreq()
- corr()

**11. Which is not true about the fill method for DataFrame NA functions?**

- It is used for replacing NaN values
- It is used for replacing nil values
- It is used for replacing null values
- All are true
- None are true

**12. Which transformer listed below is used for Natural Language processing?**

- StandardScaler
- OneHotEncoder
- ElementwiseProduct
- Normalizer
- **None are used for Natural Language processing**

**13. Which is true about the Mahalanobis Distance?**

- It is a scale-variant distance
- It does not take into account the correlations of the dataset
- **It is measured along each Principal Component axis**
- It is a multi-dimensional generalization of measuring how many standard deviations a point is away from the median
- It has units of distance

**14. Which is true about OneHotEncoder?**

- It must be told which column to create for its output
- It creates a Sparse Vector
- It must be told which column is its input
- **All are true**
- None are true

**15. Principal Component Analysis is:**

- Never used for feature engineering
- Used for supervised machine learning
- **A dimension reduction technique**
- All are true
- None are true

**16. MLlib’s implementation of decision trees:**

- Supports only multiclass classification
- Does not support regressions
- **Partitions data by rows, allowing distributed training**
- Supports only continuous features
- None are true

**17. Which is not a tunable of SparkML decision trees?**

- maxBins
- maxMemoryInMB
- minInstancesPerNode
- **minDepth**
- minInfoGain

**18. Which is true about Random Forests?**

- They support non-categorical features
- **They combine many decision trees in order to reduce the risk of overfitting**
- They do not support regression
- They only support binary classification
- None are true

**19. When comparing Random Forests versus Gradient-Boosted Trees, what must you consider?**

- How the number of trees affects the outcome
- Depth of Trees
- Parallelization abilities
- **All of these**
- None of these

**20. Which is not a valid type of Evaluator in MLlib?**

- **MultiLabelClassificationEvaluator**
- RegressionEvaluator
- BinaryClassificationEvaluator
- MultiClassClassificationEvaluator
- All are valid