Machine Learning Scenario-Based Questions

Here, we will be talking about some popular Data Science and Machine Learning scenario-based questions that you should cover while preparing for interviews. We have tried to select the scenario-based machine learning interview questions that will help our readers the most.

Let’s start,

Question 1: Assume that you have to achieve 96% accuracy in the classification model for cancer detection. What could be the conditions because of which the model will fail? Are there any solutions?

Answer: Cancer detection datasets are usually highly imbalanced, and a model trained naively on an imbalanced dataset can be misleading.
A model with 96% accuracy might simply be predicting the majority class correctly while misclassifying the remaining 4% minority class, which here would be the people who actually have cancer.
So, in order to evaluate the class-wise performance of the classifier, we can use the True Positive Rate (Sensitivity), True Negative Rate (Specificity), and the F-measure.
If the minority-class performance is poor, we can use:
1. Under-sampling or over-sampling to balance the data.
2. Probability calibration and adjustment of the prediction threshold.
3. Class weights, so that the minority class carries more weight (see the sketch below).
4. Anomaly detection.
Also, we can try the following ways,
1. Add more data to the dataset
2. Feature Engineering/Feature Selection
3. Algorithm Tuning
4. Cross Validation, etc
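
As a rough illustration, here is a minimal Python sketch (assuming scikit-learn; the imbalanced dataset is synthetic, not a real cancer dataset) of class weighting plus class-wise evaluation:

# Minimal sketch: class weighting + class-wise metrics on an imbalanced set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# 96% negative / 4% positive, mimicking the imbalance described above
X, y = make_classification(n_samples=5000, weights=[0.96, 0.04], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)

# class_weight="balanced" makes the minority (cancer) class "heavier"
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Recall of class 1 is the sensitivity, recall of class 0 is the specificity
print(confusion_matrix(y_test, clf.predict(X_test)))
print(classification_report(y_test, clf.predict(X_test)))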

Question 2: You are given a table of ‘tweets’. The table shows tweets by each user over a specific period. Calculate the 7-day rolling average of tweets by each user for every day.

Column_Name    Datatype
tweet_id       integer
msg            string
user_id        string
tweet_date     datetime

Answer:

SELECT user_id,
       tweet_date,
       AVG(tweet_count) OVER (
           PARTITION BY user_id
           ORDER BY tweet_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_avg_7_days
FROM (
    SELECT user_id, tweet_date, COUNT(tweet_id) AS tweet_count
    FROM tweets
    GROUP BY user_id, tweet_date
) AS daily_tweets
ORDER BY user_id, tweet_date;
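
Note that the frame ROWS BETWEEN 6 PRECEDING AND CURRENT ROW assumes the inner query returns exactly one row per user per day; if a user has days with no tweets, a date-based window (for example, RANGE over the date or a join against a calendar table) would be needed to get a true 7-day average.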

Question 3: How is rotation important in PCA?

Answer: Principal Component Analysis is an unsupervised learning technique that reduces the dimensionality of data. Its aim is to select the fewest components that explain the maximum variance in the data set.
Rotation is done to make the extracted components (or factors, in factor analysis) easier to interpret. It maximizes the difference between the variance captured by the components, which makes them easier to analyze. Rotation preserves the correlation between the variables; it only changes the coordinate axes. If the components are not rotated, we would have to keep more components to explain the same amount of variance in the data set.
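
A minimal sketch (assuming scikit-learn; the Iris data is used purely as a stand-in) of how the variance explained by each component is inspected. Note that plain PCA in scikit-learn does not perform varimax-style rotation, which belongs to factor analysis:

# Minimal sketch: inspect how much variance each principal component explains.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# Fraction of total variance captured by each component
print(pca.explained_variance_ratio_)
# Cumulative variance helps decide how many components to keep
print(pca.explained_variance_ratio_.cumsum())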

Question 4: Why is Naive Bayes called ‘naive’?

Answer: Bayesian classifiers are statistical classifiers that can predict class membership probabilities. Naive Bayes assumes that all the features in the data set are independent of each other. This is rarely true in real-world scenarios, yet the classifier still works well even when the problems are complex and large.
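
As a rough sketch (assuming scikit-learn; the dataset is synthetic), the classifier below applies exactly this "naive" conditional-independence assumption and still scores reasonably well:

# Minimal sketch: Gaussian Naive Bayes treats every feature as
# conditionally independent given the class label.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
nb = GaussianNB()
print(cross_val_score(nb, X, y, cv=5).mean())  # accuracy despite the naive assumption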

Question 5: How can you reduce the dimensionality of the dataset when there are memory constraints? Assume that the dataset is for classification problems.

Answer: Handling high-dimensional data under memory constraints is tricky and challenging.
1. We can create smaller, randomly sampled datasets out of the large dataset and do the computations on them.
2. We can reduce the dimensionality by separating the numerical and categorical variables and removing those that are correlated. For numerical variables we can use correlation; for categorical variables, the chi-square test.
3. PCA can also be used to identify the components that explain the most variance.
4. Building a linear model using Stochastic Gradient Descent is also helpful (see the sketch after this list).
5. Furthermore, with the help of business understanding, we can estimate which predictors are likely to affect the response variable.
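
A minimal out-of-core sketch (assuming scikit-learn and pandas; "large_dataset.csv" and the "target" column are hypothetical, and the remaining columns are assumed to be numeric features):

# Minimal sketch: train a linear model with SGD chunk by chunk instead of
# loading the whole dataset into memory.
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()          # linear classifier trained with stochastic gradient descent
classes = [0, 1]               # assumed binary target

for chunk in pd.read_csv("large_dataset.csv", chunksize=10_000):
    y = chunk.pop("target")    # hypothetical target column name
    clf.partial_fit(chunk.values, y.values, classes=classes)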

Question 6: Assume you’re given a dataset with missing values in it that vary by 1 standard deviation from the median. How much of the data will remain unaffected? Why?

Answer: In a normal distribution, about 68% of the data lies within one standard deviation of the mean (which coincides with the median and mode for a normal distribution). Since the missing values lie within one standard deviation of the median, up to about 68% of the data can be affected, so roughly 32% of the data will remain unaffected.
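
A quick numerical check of the 68% figure (assuming SciPy; purely illustrative):

# Minimal check: area within +/- 1 standard deviation of a normal distribution.
from scipy.stats import norm

within_one_sd = norm.cdf(1) - norm.cdf(-1)
print(within_one_sd)       # ~0.6827, i.e. about 68% can be affected
print(1 - within_one_sd)   # ~0.32, i.e. about 32% remains unaffected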

Question 7: Imagine you are working on a time series data set and you want to build a high-accuracy model. You started by building a decision tree model and a time series regression model. Now you see that the time series regression model got higher accuracy than the decision tree model. What will you conclude?

Answer: A time series regression model captures linear relationships, whereas a decision tree captures non-linear interactions. Since time series regression is essentially a regression model, it finds linear relationships effectively. Hence, if the time series regression model wins, we can conclude that the data mostly follows linear relationships and a linear model is the better fit here. The sketch below illustrates this on a purely linear trend.
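
A minimal sketch (assuming scikit-learn and NumPy; the data is synthetic and only illustrative) showing that on a purely linear trend a linear model extrapolates correctly while a decision tree cannot:

# Minimal sketch: linear model vs. decision tree on a linear trend.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

t = np.arange(100).reshape(-1, 1)   # time index
y = 3.0 * t.ravel() + 5.0           # linear trend

linear = LinearRegression().fit(t, y)
tree = DecisionTreeRegressor().fit(t, y)

future = np.array([[150]])          # a time step beyond the training range
print(linear.predict(future))       # ~455, follows the trend
print(tree.predict(future))         # stuck near the last training value (~302)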

Question 8: What are low bias and high variance? How can you tackle this problem?

Answer: Bias is the difference between the average prediction of our model and the actual values. When the model's predicted values are close to the actual values, we have low bias. Variance is the variability of the model's prediction for a given data point; it tells us how spread out the predictions are. High variance results in high error rates on test data.
If the model has a large number of parameters, it is likely to have high variance and low bias, and it will suffer from overfitting.
To prevent overfitting, L1 or L2 regularization can be used. We can also use early stopping, and in the case of neural networks we can use dropout. If we are dealing with linear regression, we can try Ridge and Lasso regression (see the sketch below).
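
A minimal sketch (assuming scikit-learn; the regression data is synthetic) of L2 (Ridge) and L1 (Lasso) regularization used to rein in variance:

# Minimal sketch: Ridge and Lasso regularization to reduce variance.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0)):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, score)  # larger alpha -> stronger penalty, lower variance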

Question 9: Should we remove highly correlated variables before Principal Component Analysis?

Answer: In the presence of correlated variables, the variance explained by a particular component gets inflated. For example, consider a data set with three variables, of which two are correlated. A PCA of this data set would attribute roughly twice as much variance to the first component as a PCA of uncorrelated variables, because adding correlated variables makes PCA focus more on the direction they share.
Hence, removing them first is the better choice (a small demonstration is given below).
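
A minimal demonstration (assuming NumPy and scikit-learn; the data is randomly generated) of how a near-duplicate variable inflates the variance attributed to the first component:

# Minimal sketch: PCA with and without a highly correlated variable.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
a, b = rng.normal(size=500), rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)      # c is almost a copy of a (highly correlated)

X_uncorr = StandardScaler().fit_transform(np.column_stack([a, b]))
X_corr = StandardScaler().fit_transform(np.column_stack([a, b, c]))

print(PCA().fit(X_uncorr).explained_variance_ratio_)  # roughly [0.5, 0.5]
print(PCA().fit(X_corr).explained_variance_ratio_)    # first component dominates (~0.66)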

Question 10: What are prior probability, likelihood, and marginal likelihood in the context of the Naive Bayes algorithm?

Answer: The prior probability is simply the proportion of the dependent (binary) variable in the data set; it is the guess you would make about the class without any further information. In a spam filter, for example, it is the proportion of messages that are spam.
The likelihood is the probability of classifying a given observation as 1 in the presence of some other variable; for example, the probability that the word 'FREE' is used in a spam message.
The marginal likelihood is the probability that the word 'FREE' is used in any message.
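
A tiny worked example (the spam-filter numbers below are hypothetical, purely to show how the three quantities combine via Bayes' rule):

# Minimal worked example: prior, likelihood, and marginal likelihood.
p_spam = 0.30               # prior: 30% of all messages are spam
p_free_given_spam = 0.40    # likelihood: "FREE" appears in 40% of spam messages
p_free = 0.15               # marginal likelihood: "FREE" appears in 15% of all messages

# Posterior: probability a message is spam given that it contains "FREE"
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(p_spam_given_free)    # 0.8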

Question 11: The rise in global average temperature led to a decrease in the number of pirates around the world. Does that mean that a decrease in the number of pirates caused climate change?

Answer: No. There can be several factors behind climate change, so we cannot infer that the decrease in the number of pirates caused it. There may be a correlation between the global average temperature and the number of pirates, but correlation does not imply causation.
This is a classic question about causation versus correlation.

Question 12: Label encoding doesn’t increase the dimensionality of the dataset, but One hot encoding does. How?

Answer: One-hot encoding creates a new variable for each level present in a categorical variable. For example, suppose we have a variable called animal with 4 levels: cat, dog, lion, and tiger. One-hot encoding will then create four new variables: animal.cat, animal.dog, animal.lion, and animal.tiger. Label encoding, on the other hand, assigns each level an integer code (0, 1, 2, 3 here) within a single column, so it does not increase the dimensionality of the data.
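
A minimal sketch (assuming pandas and scikit-learn) showing the difference in dimensionality:

# Minimal sketch: label encoding keeps one column, one-hot encoding adds one
# column per category.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"animal": ["cat", "dog", "lion", "tiger"]})

label_encoded = LabelEncoder().fit_transform(df["animal"])
print(label_encoded)            # [0 1 2 3] -> still a single column

one_hot = pd.get_dummies(df["animal"], prefix="animal")
print(one_hot.shape)            # (4, 4) -> four new columns
print(list(one_hot.columns))    # ['animal_cat', 'animal_dog', 'animal_lion', 'animal_tiger']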

These were a few real-life scenario-based interview questions from ML and Data Science. You can also find ML and Data Science projects for your final year on this site.

Thank You!



Author: Ayush Purawr