The Broken Glass Pitfall

An overlooked pitfall in data science is the incorporation of numerical identifiers in the design matrix used for cross-validation or sub-sample validation. It seems innocent, but it affects the reported error rate, sometimes substantially. Take, for example, the Forensic Science Glass Identification data set, whose first entries look like this:

##   identifier refractive index sodium magnesium aluminum silicon potassium
## 1          1          1.52101  13.64      4.49     1.10  71.78      0.06
## 2          2          1.51761  13.89      3.60     1.36  72.73      0.48
## 3          3          1.51618  13.53      3.55     1.54  72.99      0.39
## 4          4          1.51766  13.21      3.69     1.29  72.61      0.57
## 5          5          1.51742  13.27      3.62     1.24  73.08      0.55
## 6          6          1.51596  12.79      3.61     1.62  72.97      0.64
##   calcium barium iron
## 1    8.75      0 0.00
## 2    7.83      0 0.00
## 3    7.78      0 0.00
## 4    8.22      0 0.00
## 5    8.07      0 0.00
## 6    8.07      0 0.26

You can download the original data here. Note the first column: it contains a numerical identifier. Furthermore, the data contain six classes; the class distribution is:

##     building_windows_float_processed building_windows_non_float_processed 
##                                   70                                   76 
##      vehicle_windows_float_processed                           containers 
##                                   17                                   13 
##                            tableware                            headlamps 
##                                    9                                   29 

This looks relatively standard, and if the data are used directly with a flexible machine learning algorithm, an error rate of less than 1% is achievable. So it seems we can classify the glass type almost perfectly from its oxide content. However, this error rate is false! The first column, the identifier, should not have any predictive power, so we should be able to remove it without affecting the error rate. If we remove the identifier, the same procedure attains a classification error rate of 35%. What is worse, the predictor messes up predictions in every class, as may be seen by inspecting the confusion matrix.
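The mechanism behind the inflated error rate is that the Glass data are stored sorted by class, so the sequential identifier perfectly encodes the class label, and any split-based learner exploits it. A minimal sketch of the effect, assuming scikit-learn is available and using a toy data set (not the real Glass data) whose rows are likewise ordered by class and indexed sequentially:

```python
# Toy reconstruction of the pitfall: rows are sorted by class, the first
# column is a sequential identifier, and the remaining features are pure noise.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_per_class, n_classes = 40, 3
y = np.repeat(np.arange(n_classes), n_per_class)        # rows ordered by class
noise = rng.normal(size=(y.size, 5))                    # uninformative features
ids = np.arange(1, y.size + 1).reshape(-1, 1)           # sequential identifier

cv = KFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

with_id = cross_val_score(clf, np.hstack([ids, noise]), y, cv=cv).mean()
without_id = cross_val_score(clf, noise, y, cv=cv).mean()
print(f"accuracy with identifier:    {with_id:.2f}")    # near-perfect
print(f"accuracy without identifier: {without_id:.2f}") # near chance (~0.33)
```

Because the training folds always contain identifiers adjacent to those in the held-out fold, a threshold split on the identifier column recovers the class almost perfectly, while dropping the column leaves only noise and chance-level accuracy.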

[Figure missing: confusion matrix of the six glass classes]

CONCLUSION: incorporation of a numerical identifier may result in an unrealistically low error rate. Details of the code may be found in the Glass Identification code-snippet.