Knowledge Dump: Machine Learning Scoring & Evaluation Metrics

I’ve decided to start a small series of knowledge dumps, where I post my notes from study sessions I have in my free time. Below you’ll find my thought process when learning new information, particularly around machine learning. I find myself asking the same questions over and over, so I write my answers down. Let me know if this has helped you as well!

NOTE: Oftentimes I copy and paste directly from other websites, so if you find an entire paragraph on something, it likely came from another source.

Model evaluation procedures

[Video + GitHub] Cross-validation for parameter tuning, model selection, and feature selection

Comparing cross-validation to train/test split

Advantages of cross-validation:

  • More accurate estimate of out-of-sample accuracy
  • More “efficient” use of data (every observation is used for both training and testing)

Advantages of train/test split:

  • Runs roughly K times faster than K-fold cross-validation, because the model is fit once instead of K times
  • Simpler to examine the detailed results of the testing process
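
To make the comparison concrete, here is a minimal sketch that scores the same model both ways. The dataset (iris), the model (K-nearest neighbors with 5 neighbors), the 25% test size, and the 10 folds are all my own choices for illustration, not something the notes above prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative choices: iris data and a 5-neighbor KNN classifier.
X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Train/test split: one fit and one score. The result depends on which
# rows happen to land in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
split_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

# 10-fold cross-validation: ten fits and ten scores. Every row is used
# for testing exactly once and for training nine times.
cv_scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
cv_accuracy = cv_scores.mean()

print(f"train/test split accuracy: {split_accuracy:.3f}")
print(f"10-fold CV accuracy: {cv_accuracy:.3f} (std {cv_scores.std():.3f})")
```

Changing `random_state` moves the single split score around, while the cross-validated mean does not depend on one particular split. That is the "more accurate estimate" trade-off in practice, paid for with ten fits instead of one.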




Which evaluation metrics should I use?


Scikit-learn ships with a large set of evaluation and scoring metrics that work out of the box. Start here.

Model evaluation metrics for regression

Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.

Three common evaluation metrics for regression problems:

  • Mean Absolute Error (MAE) is the mean of the absolute value of the errors.
    • Easiest to understand, because it’s the average error.
  • Mean Squared Error (MSE) is the mean of the squared errors.
    • Often preferred over MAE, because squaring “punishes” larger errors.
  • Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors.
    • Often preferred over MSE, because RMSE is interpretable in the “y” (base) units. A common recommendation is to use RMSE as the primary metric for interpreting your model.
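
A small worked example of the three definitions, written in plain Python so each formula is visible. The observed and predicted values are made up for illustration. In practice, `sklearn.metrics.mean_absolute_error` and `sklearn.metrics.mean_squared_error` compute the first two for you.

```python
import math

# Hypothetical observed values and predictions, for illustration only.
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]

# Per-observation errors: 10, 0, -20, -10
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)  # mean of absolute errors
mse = sum(e ** 2 for e in errors) / len(errors)  # mean of squared errors
rmse = math.sqrt(mse)                            # back in the units of y

print(f"MAE:  {mae:.2f}")   # 10.00
print(f"MSE:  {mse:.2f}")   # 150.00
print(f"RMSE: {rmse:.2f}")  # 12.25
```

RMSE comes out larger than MAE here because the single error of 20 is weighted more heavily once it is squared.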


[How to measure metrics in scikit-learn]


What is recall?

What is R-squared?

[More info]

  • AKA: coefficient of determination
  • Fraction of total variation in Y that is captured by the model
    • How well does your line follow the variation happening? (dots)
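  • Usually ranges from 0 to 1
    • 0: none of the variance is captured
    • 1: all of it is captured
    • It can drop below 0 when the model predicts worse than simply using the mean of Y, which can happen on data the model was not fit on
  • A high R-squared simply means your curve fits your training data well; the model may still be a poor predictor on new data
  • R-squared can still be computed for nonlinear regression, but it is not a reliable measure of fit there, because the split of total variance into explained and unexplained parts only holds for linear least-squares models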
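
As a sketch of the definition, R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` and the data below are hypothetical and only for illustration; `sklearn.metrics.r2_score` computes the same quantity.

```python
def r_squared(y_true, y_pred):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical observations: mean is 3, total sum of squares is 10.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = r_squared(y_true, y_true)                      # all variance captured
mean_only = r_squared(y_true, [3.0] * 5)                 # always predict the mean
reversed_pred = r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(perfect)        # 1.0
print(mean_only)      # 0.0
print(reversed_pred)  # -3.0
```

The first two cases are the 1 and 0 endpoints. The third shows that predictions worse than the plain mean push the score below 0, which a least-squares line will not do on its own training data but which can show up on held-out data.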

The regression model on the left accounts for 38.0% of the variance while the one on the right accounts for 87.4%. The more variance that is accounted for by the regression model the closer the data points will fall to the fitted regression line. Theoretically, if a model could explain 100% of the variance, the fitted values would always equal the observed values and, therefore, all the data points would fall on the fitted regression line.


How do we interpret the TV coefficient (0.0466)?

[('TV',         0.046564567874150281),
  ('Radio',     0.17915812245088836),
  ('Newspaper', 0.0034504647111804347)]
  • For a given amount of Radio and Newspaper ad spending, a “unit” increase in TV ad spending is associated with a 0.0466 “unit” increase in Sales.
  • Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional $1,000 spent on TV ads is associated with an increase in sales of 46.6 items.

Important notes:

  • This is a statement of association, not causation.
  • If an increase in TV ad spending was associated with a decrease in sales, the TV coefficient ($\beta_1$) would be negative.
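
The arithmetic behind the "46.6 items" reading can be sketched as below. The coefficients are copied from the output above. The unit scaling is my assumption, inferred from the "$1,000 gives 46.6 items" sentence: ad spend recorded in thousands of dollars and Sales in thousands of items. The helper name `sales_change_in_items` is hypothetical.

```python
# Coefficients copied from the fitted model output above.
coefficients = {
    "TV": 0.046564567874150281,
    "Radio": 0.17915812245088836,
    "Newspaper": 0.0034504647111804347,
}

# Assumption: one "unit" of Sales is 1,000 items and one "unit" of ad
# spend is $1,000 (inferred from the interpretation in the notes).
ITEMS_PER_SALES_UNIT = 1000

def sales_change_in_items(channel, extra_spend_units):
    """Items of extra sales associated with extra spend on one channel,
    holding spend on the other channels fixed."""
    return coefficients[channel] * extra_spend_units * ITEMS_PER_SALES_UNIT

for channel in coefficients:
    change = sales_change_in_items(channel, 1)
    print(f"{channel}: {change:.1f} items per extra $1,000")
# TV: 46.6, Radio: 179.2, Newspaper: 3.5
```

The same reading applies to each coefficient: it is the associated change in Sales for one extra unit of that channel, with the other two held constant.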

Metrics computed from a confusion matrix

[How to evaluate a classifier in scikit-learn]

A confusion matrix gives you a more complete picture of how your classifier is performing and lets you compute various classification metrics, which help guide model selection. It is also useful for multi-class problems.

Sensitivity: When the actual value is positive, how often is the prediction correct?

  • How “sensitive” is the classifier to detecting positive instances?
  • AKA: “True positive rate” or “Recall”

Specificity: When the actual value is negative, how often is the prediction correct?

  • How “specific” (selective) is the classifier in predicting positive instances?

False Positive Rate: When the actual value is negative, how often is the prediction incorrect?

Precision: When a positive value is predicted, how often is the prediction correct?

  • How “precise” is the classifier when predicting positive instances?
  • # relevant found / # found
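
The four definitions above can be sketched from the four cells of a binary confusion matrix. The labels and predictions below are made up for illustration. With scikit-learn, `confusion_matrix(y_true, y_pred).ravel()` returns the same four counts in the order tn, fp, fn, tp for a binary problem.

```python
# Hypothetical labels and predictions: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives:  3
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives: 1
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives: 1
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives:  5

sensitivity = tp / (tp + fn)          # actual positives predicted correctly
specificity = tn / (tn + fp)          # actual negatives predicted correctly
false_positive_rate = fp / (fp + tn)  # actual negatives predicted incorrectly
precision = tp / (tp + fp)            # predicted positives that are correct

print(f"sensitivity (recall): {sensitivity:.3f}")          # 0.750
print(f"specificity:          {specificity:.3f}")          # 0.833
print(f"false positive rate:  {false_positive_rate:.3f}")  # 0.167
print(f"precision:            {precision:.3f}")            # 0.750
```

Note that specificity and false positive rate always add up to 1, since both are computed over the actual negatives.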

How do you choose which metrics to focus on?

Depends on business objective.

  • Spam Filter: Optimize for precision or specificity because false negatives are more acceptable than false positives.
  • Fraudulent transactions: Optimize for sensitivity, because false positives are more acceptable than false negatives.

Changing the threshold from the default value of 0.5 affects sensitivity and specificity. Lowering the threshold increases sensitivity but lowers specificity.
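
A minimal sketch of the threshold effect, using made-up predicted probabilities. The helper `sensitivity_specificity` is hypothetical. With a real scikit-learn classifier you would typically take the positive-class column of `predict_proba` as `y_prob`.

```python
# Hypothetical labels and predicted probabilities of the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

def sensitivity_specificity(y_true, y_prob, threshold):
    """Classify as positive when probability >= threshold, then score."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), tn / (tn + fp)

sens_default, spec_default = sensitivity_specificity(y_true, y_prob, 0.5)
sens_low, spec_low = sensitivity_specificity(y_true, y_prob, 0.3)

print(f"threshold 0.5: sensitivity={sens_default:.3f}, specificity={spec_default:.3f}")
print(f"threshold 0.3: sensitivity={sens_low:.3f}, specificity={spec_low:.3f}")
# threshold 0.5: sensitivity=0.500, specificity=0.833
# threshold 0.3: sensitivity=1.000, specificity=0.500
```

Dropping the threshold from 0.5 to 0.3 catches every positive in this toy data, at the cost of flagging half of the negatives.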

Metrics to assist with binary classification

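What is an ROC curve?

The most commonly used way to visualize the performance of a binary classifier. It plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) across all classification thresholds.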

Can help choose a threshold that balances sensitivity / specificity for your context.

What is AUC?

[The Area under the ROC curve]

Perhaps the best way to summarize binary classifier performance in a single number.

The fraction of the ROC plot that is under the curve. It represents the likelihood that your classifier will assign a higher predicted probability to a randomly chosen positive observation than to a randomly chosen negative observation.
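
The probability reading of AUC can be checked directly by comparing every positive/negative pair. This is a sketch with made-up scores, and `pairwise_auc` is a hypothetical helper. On the same inputs, `sklearn.metrics.roc_auc_score(y_true, y_prob)` should give the same number.

```python
from itertools import product

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

pos_scores = [p for t, p in zip(y_true, y_prob) if t == 1]
neg_scores = [p for t, p in zip(y_true, y_prob) if t == 0]

def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive gets the
    higher score. Ties count as half a win."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

auc = pairwise_auc(pos_scores, neg_scores)
print(f"AUC: {auc:.3f}")  # 0.875 (21 of 24 pairs ranked correctly)
```

Because AUC only looks at how observations are ranked, it does not depend on any single threshold, which is why it works as a one-number summary.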

What does the Metrics function do in Keras?


Jason Brownlee – How to Use Metrics for Deep Learning with Keras in Python


Why score a model?

  • Hyperparameter selection
  • Feature selection

