## C# port of Gorodkin’s generalised MCC algorithm (RkCC)

18/07/2014

This is something from my MSc project that I thought would be useful to share!

Matthews Correlation Coefficient (MCC) is a smart way of measuring the overall accuracy of a classification algorithm. Say you have some data and you want to classify it into two categories, A and B. In classification you first ‘train’ a classifier and then test it using a separate set of data. In both sets you already know the class each data item falls into, so you can check the classifier’s answers. You could simply record the accuracy by noting the percentage of data items that were classified correctly. The trouble with this approach is that if you had (for instance) 90 items of class A and 10 of class B in your test set, the classifier could return an accuracy of 0.9 (90%) even though every one of the class B items was incorrectly classified! Writing accuracy out properly as (True Positives + True Negatives) / Total doesn’t help, as it is the same calculation and still returns 0.9. MCC is a clever method of measuring accuracy that takes the disparities in class size into account better. In this case it would return 0, and if only 1 class B item was classified correctly it would return 0.3.
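To make the 90/10 example concrete, here is a minimal sketch of the standard binary MCC formula next to plain accuracy. The class and method names are made up for illustration, class A is assumed to be the "positive" class, and returning 0 when the denominator is 0 is a common convention that I am assuming here rather than something taken from Gorodkin's code.

```csharp
using System;

public static class BinaryMccExample
{
    // Plain accuracy: correctly classified items divided by all items.
    public static double Accuracy(long tp, long tn, long fp, long fn)
    {
        return (double)(tp + tn) / (tp + tn + fp + fn);
    }

    // Standard binary MCC from the four confusion-matrix counts.
    // Assumption: return 0 when the denominator is 0.
    public static double BinaryMcc(long tp, long tn, long fp, long fn)
    {
        double denominator = Math.Sqrt((double)(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        if (denominator == 0) return 0;
        return (tp * tn - fp * fn) / denominator;
    }

    public static void Main()
    {
        // 90 items of class A, 10 of class B, everything labelled A.
        Console.WriteLine(Accuracy(90, 0, 10, 0));   // 0.9
        Console.WriteLine(BinaryMcc(90, 0, 10, 0));  // 0

        // Same data, but one class B item is now classified correctly.
        Console.WriteLine(Accuracy(90, 1, 9, 0));    // 0.91
        Console.WriteLine(BinaryMcc(90, 1, 9, 0));   // about 0.30
    }
}
```

Accuracy barely moves between the two cases, while MCC goes from 0 to roughly 0.3.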

Unfortunately MCC is only useful for binary classification problems such as the one above. As soon as you add a third class C, or more, you can’t use it. A generalised version of MCC was therefore created by Jan Gorodkin, the mathematical details of which are in the paper on his website (http://rk.kvl.dk/). He also supplied some code for it in AWK, but I was using C# so I needed to port it. The source code below is my translation, and it seems to get the same results as his!
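For reference, this is the quantity the C# code in this post computes, written out as a formula. The notation is my own shorthand rather than a quote from the paper: $C$ is the $K \times K$ confusion matrix, $N$ is the total number of samples, $C_{k\cdot}$ is row $k$ and $C_{\cdot l}$ is column $l$, with each product being a dot product.

$$
R_K = \frac{N \operatorname{tr}(C) - \sum_{k,l} C_{k\cdot} \cdot C_{\cdot l}}
{\sqrt{N^2 - \sum_{k,l} C_{k\cdot} \cdot C_{l\cdot}} \; \sqrt{N^2 - \sum_{k,l} C_{\cdot k} \cdot C_{\cdot l}}}
$$

The three sums correspond to `rowcol_sumprod`, `rowrow_sumprod` and `colcol_sumprod` in the code, and the three bracketed terms to `cov_xy`, `cov_xx` and `cov_yy`.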

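To use it you need to pass a List of integer arrays to the CalculateMCC() method. Each array is one row of the confusion matrix: the items that really belong to one class, counted by the class they were classified as. The list should be in a format like this (with a two-class example call underneath):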

{ Array of class A results[Classed as A, Classed as B, Classed as C],

Array of class B results[Classed as A, Classed as B, Classed as C],

Array of class C results[Classed as A, Classed as B, Classed as C],

…. }

double MCC = MCCCalculator.CalculateMCC(new List<int[]>() { new int[2] {90, 0}, new int[2]{9, 1} });
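If you have per-item labels rather than ready-made counts, the list format above could be built along these lines. This is only a sketch: `ConfusionMatrixBuilder` and `Build` are hypothetical names that are not part of the original class, and the classes are assumed to be numbered 0, 1, 2 and so on.

```csharp
using System;
using System.Collections.Generic;

public static class ConfusionMatrixBuilder
{
    // Hypothetical helper, not part of the original MCCCalculator class.
    // actual[i] and predicted[i] are class indices in the range 0..classCount-1.
    public static List<int[]> Build(int[] actual, int[] predicted, int classCount)
    {
        if (actual.Length != predicted.Length)
            throw new ArgumentException("actual and predicted must be the same length");

        var matrix = new List<int[]>();
        for (int i = 0; i < classCount; i++)
            matrix.Add(new int[classCount]);

        // Row = true class, column = class the item was classified as.
        for (int i = 0; i < actual.Length; i++)
            matrix[actual[i]][predicted[i]]++;

        return matrix;
    }

    public static void Main()
    {
        int[] actual    = { 0, 0, 0, 1, 1, 2, 2, 2 };
        int[] predicted = { 0, 0, 1, 1, 1, 2, 0, 2 };

        List<int[]> matrix = Build(actual, predicted, 3);
        foreach (int[] row in matrix)
            Console.WriteLine(string.Join(", ", row));
        // prints:
        // 2, 1, 0
        // 0, 2, 0
        // 1, 0, 2
    }
}
```

The resulting list can then be passed straight to CalculateMCC().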

The class:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class MCCCalculator
    {
        /// <summary>
        /// Return the generic MCC value based on Gorodkin (2004)
        /// See http://rk.kvl.dk/
        /// </summary>
        /// <param name="confusionMatrix">Square matrix: one array per true class, one entry per predicted class</param>
        /// <returns>The MCC value (1 if the denominator is 0, as in the original)</returns>
        public static double CalculateMCC(List<int[]> confusionMatrix)
        {
            double MCC = 0;

            // calc total data samples (long, so the squares below cannot overflow an int)
            long totalSamples = 0;
            for (int i = 0; i < confusionMatrix.Count; i++)
            {
                totalSamples += confusionMatrix[i].Sum();
            }

            // calc trace (sum of true positives)
            long trace = 0;
            for (int i = 0; i < confusionMatrix.Count; i++)
            {
                trace += confusionMatrix[i][i];
            }

            // sum row -> column dotproduct
            long rowcol_sumprod = 0;
            for (int row = 0; row < confusionMatrix.Count; row++)
            {
                int[] rowArray = getRow(confusionMatrix, row);
                for (int col = 0; col < confusionMatrix[0].Length; col++)
                {
                    int[] colArray = getCol(confusionMatrix, col);
                    rowcol_sumprod += dotProduct(rowArray, colArray);
                }
            }

            // sum row -> row dotproduct
            long rowrow_sumprod = 0;
            for (int row = 0; row < confusionMatrix.Count; row++)
            {
                int[] rowArray = getRow(confusionMatrix, row);
                for (int row2 = 0; row2 < confusionMatrix.Count; row2++)
                {
                    int[] rowArray2 = getRow(confusionMatrix, row2);
                    rowrow_sumprod += dotProduct(rowArray, rowArray2);
                }
            }

            // sum col -> col dotproduct
            long colcol_sumprod = 0;
            for (int col = 0; col < confusionMatrix[0].Length; col++)
            {
                int[] colArray = getCol(confusionMatrix, col);
                for (int col2 = 0; col2 < confusionMatrix[0].Length; col2++)
                {
                    int[] colArray2 = getCol(confusionMatrix, col2);
                    colcol_sumprod += dotProduct(colArray, colArray2);
                }
            }

            long cov_xy = (totalSamples * trace) - rowcol_sumprod;
            long cov_xx = (totalSamples * totalSamples) - rowrow_sumprod;
            long cov_yy = (totalSamples * totalSamples) - colcol_sumprod;

            double denominator = Math.Sqrt((double)cov_xx * (double)cov_yy);

            // Carried over from the original AWK: a zero denominator returns 1
            if (denominator == 0)
                MCC = 1;
            else
                MCC = cov_xy / denominator;

            return MCC;
        }

        /// <summary>
        /// Return the specified row from the confusion matrix
        /// </summary>
        /// <param name="confusionMatrix"></param>
        /// <param name="row"></param>
        /// <returns></returns>
        private static int[] getRow(List<int[]> confusionMatrix, int row)
        {
            int[] rowArray = new int[confusionMatrix[row].Length];
            for (int i = 0; i < confusionMatrix[row].Length; i++)
            {
                rowArray[i] = confusionMatrix[row][i];
            }
            return rowArray;
        }

        /// <summary>
        /// Return the specified column from the confusion matrix
        /// </summary>
        /// <param name="confusionMatrix"></param>
        /// <param name="col"></param>
        /// <returns></returns>
        private static int[] getCol(List<int[]> confusionMatrix, int col)
        {
            int[] colArray = new int[confusionMatrix.Count];
            for (int i = 0; i < confusionMatrix.Count; i++)
            {
                colArray[i] = confusionMatrix[i][col];
            }
            return colArray;
        }

        /// <summary>
        /// Return the dotproduct of the two arrays.
        /// </summary>
        /// <param name="array1"></param>
        /// <param name="array2"></param>
        /// <returns></returns>
        private static long dotProduct(int[] array1, int[] array2)
        {
            long sum = 0;
            for (int i = 0; i < array1.Length; i++)
            {
                sum += (long)array1[i] * array2[i];
            }
            return sum;
        }
    }

There is one possible buggette in the original that I’ve deliberately carried over into this version, as I wasn’t sure whether it was intentional or not. When the denominator is 0 but there is at least one sample, I’m pretty sure it should return 0 rather than 1. However, I’ll go with the experts for the moment! Just something to consider if you use it.
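If you would rather choose the zero-denominator behaviour yourself, something like the following sketch might do. It is not Gorodkin's code: `MccZeroDenominatorSketch`, `Calculate` and the `valueWhenUndefined` parameter are hypothetical names. It also assumes a square matrix and works from row and column totals, which should give the same three sums as the nested dot-product loops in the class above, because summing row k · column l over all k and l equals summing (column total m) × (row total m) over m.

```csharp
using System;
using System.Collections.Generic;

public static class MccZeroDenominatorSketch
{
    // Hypothetical variant: the caller decides what to return when the
    // denominator is 0 (for example 0 instead of the original's 1).
    public static double Calculate(List<int[]> confusionMatrix, double valueWhenUndefined)
    {
        int k = confusionMatrix.Count;   // assumes a square k x k matrix
        long total = 0, trace = 0;
        long[] rowSums = new long[k];
        long[] colSums = new long[k];

        for (int r = 0; r < k; r++)
        {
            for (int c = 0; c < k; c++)
            {
                int v = confusionMatrix[r][c];
                rowSums[r] += v;
                colSums[c] += v;
                total += v;
            }
            trace += confusionMatrix[r][r];
        }

        long sumRowCol = 0, sumRowSq = 0, sumColSq = 0;
        for (int m = 0; m < k; m++)
        {
            sumRowCol += rowSums[m] * colSums[m];
            sumRowSq += rowSums[m] * rowSums[m];
            sumColSq += colSums[m] * colSums[m];
        }

        long covXY = total * trace - sumRowCol;
        long covXX = total * total - sumColSq;
        long covYY = total * total - sumRowSq;

        double denominator = Math.Sqrt((double)covXX * (double)covYY);
        if (denominator == 0) return valueWhenUndefined;
        return covXY / denominator;
    }

    public static void Main()
    {
        // Every class B item misclassified: the denominator is 0.
        var allWrong = new List<int[]> { new[] { 90, 0 }, new[] { 10, 0 } };
        Console.WriteLine(Calculate(allWrong, 0.0));  // 0

        // One class B item classified correctly.
        var oneRight = new List<int[]> { new[] { 90, 0 }, new[] { 9, 1 } };
        Console.WriteLine(Calculate(oneRight, 0.0));  // about 0.30
    }
}
```

Passing 1.0 as the second argument reproduces the behaviour carried over from the original.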