Dataset Strategy and Evaluation in ML

Mona
Jun 1, 2019

This article describes how a dataset should be used to train a model, how to interpret the model’s evaluation results, and discusses model fit.

The main task of machine learning is to explore and construct algorithms that can learn from historical data and make predictions on new input data.

What is a Dataset?

A dataset is a set of rows that can be split into training and test segments. “Training data” and “test data” refer to subsets of the data you wish to work with.

Training data gives an algorithm the opportunity to learn and infer a general solution that it can apply to new data, much as humans draw on past experience to handle new situations.

Training and test data should always be kept separate: while you train your algorithm, it should remain ignorant of the test data.

The training set is used to train the model; after training, the model’s accuracy is checked on data it has not seen.

If the training data set contains elements from the test data set, it is contaminated.

Case 1 — when train and test data sets are merged -

In this case, it is advised to split the whole dataset into train, cross-validation and test sets with a 60:20:20 ratio (train : cross-validation : test).

The idea is to use train data to build the model and use cross validation data to test the validity of the model and parameters.

Never expose the model to the test data until the prediction stage. So basically, you should be using the train and cross-validation data to build the model.
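
A minimal sketch of this 60:20:20 split using scikit-learn; the iris data here is only a stand-in for your own dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # toy data standing in for your own dataset

# Split off 20% of the data as the held-out test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Split the remaining 80% into 60% train and 20% cross-validation
# (0.25 of the remaining 80% equals 20% of the whole dataset).
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)
```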

Case 2 — when training and test data sets are separate —

In this case, you should split the training data into training and cross-validation data sets. Alternatively, you could perform k-fold cross-validation on the training set.
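
As a sketch, k-fold cross-validation on the training set might look like this with scikit-learn; the classifier and data below are placeholders chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_train, y_train = load_iris(return_X_y=True)  # stand-in for your training data

# 5-fold cross-validation: the training data is split into 5 folds and each fold
# takes a turn as the validation set while the model is fit on the other 4 folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("mean cross-validation score:", scores.mean())
```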

In most cases, the split is done randomly. However, when the data is time-dependent, the split cannot be random; it must respect the chronological order.
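
For the time-dependent case, one option (shown here as a sketch using scikit-learn’s TimeSeriesSplit on toy data) is an ordered split where each validation fold always comes after its training fold:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # samples assumed to be ordered by time
tscv = TimeSeriesSplit(n_splits=3)

# Each split trains on the past and validates on the period immediately after it,
# so validation data is never older than the training data.
for train_idx, val_idx in tscv.split(X):
    print("train:", train_idx, "validate:", val_idx)
```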

The training data set is used to build the model. It contains both the target and the predictor variables. Because the model has already seen this data during training (and its parameters were optimised on it), performance measured on it looks good.

The test set is used to evaluate how well the model performs on data outside the training set (data the model has not processed). The model should not be tuned to minimise error on the test set; any tuning belongs to the cross-validation stage.

Evaluation

For a data-driven solution, we need to define an evaluation function, which measures how well the model is learning.

Evaluate Model: Classification models

For evaluating classification models, we check the following:

Accuracy — measures the goodness of a classification model as the proportion of true results to total cases.

Accuracy = (number of correct predictions) / (total number of predictions made)

Precision — the proportion of positive predictions that are actually correct.

Precision = True Positives / (True Positives + False Positives)

Recall — the fraction of actual positive cases that the model correctly returns.

Recall = True Positives / (True Positives + False Negatives)

F1-score — the harmonic mean of precision and recall, ranging between 0 and 1, where the ideal F1-score value is 1.

F1 = 2 * (1 / ((1 / precision) + (1 / recall)))

F1 score tries to find the balance between precision and recall.
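
All four metrics can be computed directly with scikit-learn; the labels below are made-up values purely for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```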

Model Fit

Overfitting -

The concept of overfitting refers to creating a model that doesn’t generalize beyond your training data.

In other words, if your model overfits your data, that means it has learned your data too well — it has essentially memorized it.

A model that’s just “memorized” your data is one that is going to perform poorly on new, unobserved data.
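
One way to see this in practice (a sketch with synthetic data and an unconstrained decision tree, both chosen only for illustration) is to compare training and test accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training data almost perfectly.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically close to 1.0
print("test accuracy :", tree.score(X_test, y_test))    # noticeably lower
```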

Underfitting -

Underfitting occurs when the model performs poorly even on the training data.

It could be because the model is too simple, i.e. the input features are not expressive enough to describe the target variable well.

An underfitting model also does not predict the targets in the test data set very accurately.

Ideal Case -

As the number of training samples increases, the model’s performance on the test samples becomes close to its performance on the training samples.
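
A sketch of this idea using scikit-learn’s learning_curve; the toy data and simple classifier stand in for your own model and dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)  # toy data standing in for your own dataset

# Train and validation scores for increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0])

# In the ideal case the two scores approach each other as the training size grows.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:3d}  train={tr:.2f}  validation={va:.2f}")
```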
