Code-Snippet

The Forensic Science Glass Identification Problem

For this code-snippet we use a data set from the USA Forensic Science Service. It describes 6 types of glass, defined in terms of their oxide content (e.g. Na, Fe, K, etc.). You can download the original data here. It is an unbalanced, low dimensional, multi-class data set, and as such not very interesting, but it hides a small surprise (a problem often overlooked by data scientists).

We use the HDcls machine learning engine for this job; it is built for high dimensional problems but works well in low dimensions too. By the way, you need R and the HDcls R-package; more information is on the Software Page. You can install HDcls on Linux (tested on Ubuntu 18) by doing:

# install procedure on Linux 64-bit x86
install.packages("http://roasted.space/?package=HDclsL64x86", repos = NULL)

Then load the HDcls library in R:

library(HDcls)

I have prepared an RData version of the original data, which you may download here (you can, of course, also load and format the original data yourself; a sketch of this is given below). Load the glass data:

load("glass.RData")

First, take a look at the sample labels. As you can see, the data set is unbalanced, a problem we shall not address any further in this code-snippet.

table(y)

## y
##     building_windows_float_processed building_windows_non_float_processed 
##                                   70                                   76 
##      vehicle_windows_float_processed                           containers 
##                                   17                                   13 
##                            tableware                            headlamps 
##                                    9                                   29 

Next, look at the features of the data. Can you spot the bug?

head(x)

##   identifier refractive index sodium magnesium aluminum silcon potassium
## 1          1          1.52101  13.64      4.49     1.10  71.78      0.06
## 2          2          1.51761  13.89      3.60     1.36  72.73      0.48
## 3          3          1.51618  13.53      3.55     1.54  72.99      0.39
## 4          4          1.51766  13.21      3.69     1.29  72.61      0.57
## 5          5          1.51742  13.27      3.62     1.24  73.08      0.55
## 6          6          1.51596  12.79      3.61     1.62  72.97      0.64
##   calcium barium iron
## 1    8.75      0 0.00
## 2    7.83      0 0.00
## 3    7.78      0 0.00
## 4    8.22      0 0.00
## 5    8.07      0 0.00
## 6    8.07      0 0.26

Be naive and run the HDcls algorithm using the data as given above. A minor technicality: the data comes as a data.frame, but the algorithm takes a matrix or sparse matrix, so we need to cast the data as a matrix. That's easy:

x <- as.matrix(x)

We are now ready to cross-validate. If you do not know what cross-validation is, you need to read up on data science and statistics. In short, it is the canonical way of estimating the expected generalization error of a predictive algorithm, and in most simple cases it gives a reliable error estimate. The sketch below illustrates the idea.
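
As a reminder of what k-fold cross-validation does, here is a minimal sketch. It uses a simple 1-nearest-neighbour classifier from the class package (not HDcls) purely to keep the example self-contained; the fold assignment and error averaging are the point.

library(class)  # ships with R, provides knn()

set.seed(1)
k     <- 9
folds <- sample(rep(seq_len(k), length.out = nrow(x)))  # random fold labels
err   <- numeric(k)

for (i in seq_len(k)) {
  test   <- folds == i
  # train on the other folds, predict the held-out fold
  pred   <- knn(train = x[!test, , drop = FALSE],
                test  = x[ test, , drop = FALSE],
                cl    = y[!test], k = 1)
  err[i] <- mean(pred != y[test])
}

mean(err)  # estimate of the expected classification error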

Run the algorithm:

v <- HDcls::cv(
  x = x, 
  y = y, 
  fold = 9, 
  alpha = 0, 
  lambda = 0.02
)

## Running HDcls 9 fold cross validation (dense design matrix)
## 
##  Samples:  Features:  Classes:  Groups:  Parameters: 
##        214         11         6        0           66
## 

Take a look at the results:

v

## 
## Call:
## HDcls::cv(x = x, y = y, lambda = 0.02, fold = 9, alpha = 0)
## 
## Models:
## 
##  Index:  Lambda:  Features:  Parameters:  Error: 
##        1    1.000      1.556        9.333   0.645
##        5    0.521          2           12   0.383
##       10    0.231      6.333           38   0.210
##       15    0.102      7.889       47.333   0.140
##       20    0.045     10.556       63.333   0.075
##       25    0.020         11           66   0.047
## 
## Best model:
## 
##  Index:  Lambda:  Features:  Parameters:  Error: 
##       25     0.02         11           66   0.047
## 

The best model performs with an expected classification error of around 5%. This is an impressively low error rate, and as such everything looks alright. But did you spot the bug?

The bug is the first column, the identification number. It should not have any predictive power; however, since the data was built in a systematic way (the samples appear grouped by class, so the row number effectively encodes the label), it has! A quick check of this is sketched below. Remove the first column:

x <- x[,-1]
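
To convince yourself that the identifier really tracks the class, a quick base-R check is enough. This is just an illustrative sketch using the labels y loaded above; it plots the class (as an integer) against the sample index, which is exactly what the identifier column encodes.

# the samples are stored grouped by class, so the row index (i.e. the
# identifier) essentially reveals the label
plot(seq_along(y), as.integer(y),
     xlab = "identifier", ylab = "class (as integer)")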

And rerun the algorithm:

v <- HDcls::cv(
  x = x, 
  y = y, 
  fold = 9, 
  alpha = 0, 
  lambda = 0.01
)

## Running HDcls 9 fold cross validation (dense design matrix)
## 
##  Samples:  Features:  Classes:  Groups:  Parameters: 
##        214         10         6        0           60
## 

Take a look at the results:

v

## 
## Call:
## HDcls::cv(x = x, y = y, lambda = 0.01, fold = 9, alpha = 0)
## 
## Models:
## 
##  Index:  Lambda:  Features:  Parameters:  Error: 
##        1    1.000      1.333            8    0.64
##        5    0.464          5           30    0.44
##       10    0.178      7.444       44.667    0.40
##       15    0.068      9.111       54.667    0.36
##       20    0.026         10           60    0.36
##       25    0.010         10           60    0.37
## 
## Best model:
## 
##  Index:  Lambda:  Features:  Parameters:  Error: 
##       16    0.056      9.111       54.667    0.35
## 

The best model now performs with an expected classification error of around 35%. This is a much higher error rate than with the ID column in the data set. A lesson to learn: do not include identification features in the design matrix when running predictive algorithms.

Glass Identification Using a One-Hot Approach

A remaining question is how low an error rate we can obtain without making the identifier mistake. An efficient machine learning approach is a fully one-hot encoded design matrix fitted using a high dimensional machine learning engine.

In this section we take a look at the Glass Identification problem using a one-hot design matrix approach. Consider a design matrix consisting of a mix of factors, integers and numeric features. We may turn a factor into a sparse matrix of dimension (number of samples) x (number of levels), containing only 0s and 1s, by placing a 1 at each sample in the column of its level; this is called one-hot encoding. In addition, we may turn an integer or numeric feature into a factor by cutting the feature's support into an appropriate number of intervals (called breaks). Doing this for the glass data we get a fully one-hot encoded design matrix (a sketch of the construction is given below). I have prepared such a design matrix for this data, let's load it:

load("glass-onehot-no-id.RData")

Below is a picture of part of the full one-hot matrix.

[Figure: part of the one-hot matrix; a gray square indicates a 1.]

Next, fit a sequence of models to the high dimensional one-hot design matrix using:

v <- HDcls::cv(
  x = x, 
  y = y, 
  fold = 9, 
  alpha = 0, 
  lambda = 0.01,
  standardize = FALSE
)

## Running HDcls 9 fold cross validation (sparse design matrix)
## 
##  Samples:  Features:  Classes:  Groups:  Parameters: 
##        214         56         6        0          336
## 

And, the results:

v

## 
## Call:
## HDcls::cv(x = x, y = y, lambda = 0.01, fold = 9, alpha = 0, standardize = FALSE)
## 
## Models:
## 
##  Index:  Lambda:  Features:  Parameters:  Error: 
##        1    1.000      1.556        9.333    0.64
##        5    0.464      9.111       54.667    0.47
##       10    0.178     27.444      164.667    0.30
##       15    0.068     46.556      279.333    0.29
##       20    0.026     51.222      307.333    0.30
##       25    0.010         53          318    0.28
## 
## Best model:
## 
##  Index:  Lambda:  Features:  Parameters:  Error: 
##       11     0.15     32.333          194    0.28
## 

Hence, we get a classification error rate of around 30%. This rate can probably be tuned down slightly, but it seems unlikely that we can reach an error rate of around 5%, or anywhere near it.

Look at the confusion matrix:

table(predicted = classes(v)[,best_model(v)], true = as.integer(y))

##                                       true
## predicted                               1  2  3  4  5  6
##   building_windows_float_processed     58  9 11  0  0  1
##   building_windows_non_float_processed 11 65  5 10  1  4
##   containers                            0  1  0  1  1  0
##   headlamps                             0  1  0  2  1 24
##   tableware                             0  0  0  0  6  0
##   vehicle_windows_float_processed       1  0  1  0  0  0

It can be seen that the predictor hardly works, and it would not be of much use in practice. There are several problems (a per-class error check is sketched after this list):

  1. Confusion between the float processed and non-float processed window classes.
  2. The containers, headlamps and vehicle_windows_float_processed classes do not work at all.
  3. Generally imprecise predictions.
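
To quantify the per-class behaviour, a rough check can be made directly from the predictions. This sketch assumes that classes(v) returns the predicted class labels as strings matching the levels of y (as the confusion matrix above suggests); the factor() call is only there to keep rows and columns of the table in the same order.

# per-class error rates for the best model
pred <- factor(classes(v)[, best_model(v)], levels = levels(y))
cm   <- table(predicted = pred, true = y)
round(1 - diag(cm) / colSums(cm), 2)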