Imagine a case where you have a large set of data, where each row represents an item of data and each column an attribute of that item. It is more and more common nowadays to want to be able to automatically classify this data into categories by looking at their attributes and trying to find a pattern. A sub-discipline of AI called machine learning helps you do that. A classifier such as a neural network or a support vector machine will be trained on the dataset with the aim of creating a pattern recogniser. However in all but the most trivial cases you will usually have to first figure out which of the data attributes (or â€˜featuresâ€™) you want to use in order to train the classifier. This is because using many hundreds of features can a) take a long time to train, b) include lots of redundant features (i.e. their pattern duplicates that of other features, so doesnâ€™t add any new information), c) be irrelevant (maybe that feature never changes at all), and d) actually confuse the training by providing information that might appear to be usable to the human eye but is actually just noise (think of the SETI program!).

There are many algorithms for finding the features that might be useful, but this article is about the Fast Correlation-Based Filter (FCBF) feature selection technique first introduced by Yu and Liu * . FCBF is often used in bioinformatics research, which is a commonly used domain for machine learning. Note that FCBF will only work reliably with discrete data, not continuous. This means that the values of each feature need to be put in â€˜binsâ€™. If your data is continuous in nature then there are binning algorithms (aka discretisation filters) that you can use that transform the values of each feature into the label of the appropriate bin. These are outside the scope of this post however the free Weka software contains at least one.

I will show how FCBF can be implemented in C#. Note that if you know C# then you will certainly see places where this can be more efficient (Iâ€™ve highlighted some in the text), but this is deliberate in order to be more readable. This first post will concentrate on the concept of information entropy and symmetrical uncertainty, which are used by FCBF in order to provide a ranking of the information content of the features. The next article will show how FCBF then uses the ranking to eliminate features.

For this example I am going to assume that a record (i.e. item of data) is called FeatureSequence, and there is one property containing the features for this item called EncodedSequence which is a List of strings. The first element of this list is always the class (category) label for this item of data. If you know that you are always going to have a fixed number of features then you could of course create a separate property for each, but that might become very big and inflexible very quickly!

public class FeatureSequence{
Â Â Â Â // Remember that the first item is the class label
public List<string> EncodedSequence {get;set;}
}

You need to read in your data and populate an instance of the above class for each data item, to create a list of FeatureSequences. Iâ€™m assuming that you already have enough C# knowledge to be able to do this!

Now comes a little bit of maths. Don’t worry, this is just to illustrate – the code comes in a minute! Like most maths notation, it is really just a shorthand for the many steps that we need to do, and whereas papers and Wikipedia will often miss out the crucial bit of maths information you need, I won’t :-). We will need to calculate the symmetrical uncertainty (SU) value for each feature. The mathematical formula for this is below, where H(X) is the information entropy (i.e. amount of information) of a feature X and H(X,Y) is the joint information entropy of feature X and feature Y:

Of course X or Y can also be the classification label of the data too, which is why I’ve implemented class label as simply another feature in EncodedSequence in the code above. As an aside, the numerator is also the formula for mutual information (i.e. the amount that knowing one feature gives information about the other). SU is a little bit of a misnomer (in plain English) because SU actually increases the more certain you can be that feature X helps predict Y and vice-versa, i.e. as you become less ‘uncertain’ SU increases!

These are the formula for information entropy H(X) and joint information entropy H(X,Y). p(x) is the probability function i.e. the probability that value x will appear in the feature being examined, and p(x, y) is the joint probability function, i.e. the probability that the value x will appear in the first feature being examined when the value in second feature is y. Hopefully you understand that, but as a quick example if there are 10 items in your feature, and a particular value appears 3 times, then the probability of that value is 3/10 or 0.3.

So to start with we need to calculate entropy. First of all declare a class called FREQUENCY that will hold a list of all the values for a feature and the number of times that value appears.

public class FREQUENCY
{
public string FeatureValue;
public string Conditioner; // for joint probability.
public double ValueCount;
} ;

The following function is used to calculate entropy. We simply loop through each data item, recording the number of times a particular value appears in a list of FREQUENCY. Then we loop through that frequency list, performing the entropy calculation H(X) above on each item and adding them together. In the code I use logarithm to the base 2. This results in an entropy measure in units of measure bits. Natural log is also commonly used, in which case the unit of measure is nats.

// Somewhere else in your class declare and populate a list of FeatureSequences containing your data:
private List<FeatureSequence> _allDataItems;
/// <summary>
/// Find the Shannon entropy of the passed feature
/// </summary>
/// <param name="feature">Column number of variable (start at 0)</param>
/// <returns>Entropy as a double</returns>
public double Entropy(int feature)
{
double entropy = 0;
List<FREQUENCY> frequencyList = new List<FREQUENCY>();
// First count the number of occurances of each value
// Go through each feature list (i.e. row) on the dataset
for (int i = 0; i < _allDataItems.Count; i++)
{
// If the frequency list already has a place for this value then increment its count
FREQUENCY freq = frequencyList.Where(x => x.FeatureValue == _allDataItems[i].EncodedSequence[feature]).FirstOrDefault();
if (freq != null)
{
freq.ValueCount++;
}
// else add a new item to the frequency list
else
{
FREQUENCY newFreq = new FREQUENCY();
newFreq.FeatureValue = _allDataItems[i].EncodedSequence[feature];
newFreq.ValueCount = 1;
frequencyList.Add(newFreq);
}
}
// For each item on the frequency list...
for (int i = 0; i < frequencyList.Count; i++)
{
// Calculate the probability
double probability = (double)frequencyList[i].ValueCount / (double)frequencyList.Sum(x => x.ValueCount);
// increase the entropy value
entropy += (probability * Math.Log(probability, 2));
// Note: can also use entropy += (probability * Math.Log((1/probability), 2));
}
return entropy*-1;
}

The joint entropy function is very similar. Instead of passing 1 feature you pass 2. Then the list of FREQUENCY records the number of times a combination of the first and second feature occurs. Otherwise it is just the same.

/// <summary>
/// Return the joint entropy of 2 features.
/// </summary>
/// <param name="firstFeature">Column number of first variable (start at 0)</param>
/// <param name="secondFeature">Column number of second variable (start at 0)</param>
/// <returns>Joint entropy as a double</returns>
public double JointEntropy(int firstFeature, int secondFeature)
{
double jointEntropy = 0;
List<FREQUENCY> frequencyList = new List<FREQUENCY>();
// First count the number of occurances of each value of feature 1 for each value of feature 2
// Go through each feature list (i.e. row) on the dataset
for (int i = 0; i < _allDataItems.Count; i++)
{
// If the frequency list already has a place for this value then increment its count
FREQUENCY freq = frequencyList.Where(x => x.FeatureValue == _allDataItems[i].EncodedSequence[firstFeature] &&
x.Conditioner == _allDataItems[i].EncodedSequence[secondFeature]).FirstOrDefault();
if (freq != null)
{
freq.ValueCount++;
}
// else add a new item to the frequency list
else
{
FREQUENCY newFreq = new FREQUENCY();
newFreq.FeatureValue = _allDataItems[i].EncodedSequence[firstFeature];
newFreq.Conditioner = _allDataItems[i].EncodedSequence[secondFeature];
newFreq.ValueCount = 1;
frequencyList.Add(newFreq);
}
}
double total = frequencyList.Sum(x => x.ValueCount);
// For each item on the frequency list...
for (int i = 0; i < frequencyList.Count; i++)
{
// Calculate the probability
double jointProbability = (double)frequencyList[i].ValueCount / (double)frequencyList.Sum(x => x.ValueCount);
// increase the entropy value
jointEntropy += jointProbability * Math.Log(jointProbability,2);
}
return jointEntropy * -1;
}

Finally a function to produce symmetrical uncertainty.

/// <summary>
/// Returns the symmetrical uncertainty of the first variable with the second
/// </summary>
/// <param name="firstFeature">Column number of first variable (start at 0)</param>
/// <param name="secondFeature">Column number of second variable (start at 0)</param>
/// <returns>Symmetrical uncertainty as a double</returns>
public double SymmetricalUncertainty(int firstFeature, int secondFeature)
{
double firstEntropy = Entropy(firstFeature);
double secondEntropy = Entropy(secondFeature);
return 2 * ((firstEntropy + secondEntropy - JointEntropy(firstFeature, secondFeature)) /
(firstEntropy + secondEntropy)
);
}

You can probably see that ideally it would be more efficient to merge the two entropy functions, and calculate both of them at the same time, but that might have looked a bit confusing to the learner!

The next post will focus on how SU is used within FCBF.

** Yu, Lei and Liu, Huan,(2004) Efficient feature selection via analysis of relevance and redundancy. The Journal of Machine Learning Research (5) pp 1205-1224.*