# Fast Correlation-Based Filter in C# Part 1: Information Entropy and Symmetrical Uncertainty

09/09/2014

Imagine a case where you have a large set of data, where each row represents an item of data and each column an attribute of that item. It is increasingly common to want to classify this data automatically into categories by looking at its attributes and trying to find a pattern. A sub-discipline of AI called machine learning helps you do that: a classifier such as a neural network or a support vector machine is trained on the dataset with the aim of creating a pattern recogniser. However, in all but the most trivial cases you will usually have to first figure out which of the data attributes (or ‘features’) you want to use to train the classifier. This is because using many hundreds of features can a) take a long time to train, b) include lots of redundant features (i.e. their pattern duplicates that of other features, so adds no new information), c) include irrelevant features (maybe a feature never changes at all), and d) actually confuse the training by providing information that might appear usable to the human eye but is actually just noise (think of the SETI program!).

There are many algorithms for finding the features that might be useful, but this article is about the Fast Correlation-Based Filter (FCBF) feature selection technique first introduced by Yu and Liu*. FCBF is often used in bioinformatics research, a domain where machine learning is commonly applied. Note that FCBF will only work reliably with discrete data, not continuous. This means that the values of each feature need to be put into ‘bins’. If your data is continuous in nature then there are binning algorithms (aka discretisation filters) that transform the values of each feature into the label of the appropriate bin. These are outside the scope of this post; however, the free Weka software contains at least one.

I will show how FCBF can be implemented in C#. Note that if you know C# then you will certainly see places where this could be made more efficient (I’ve highlighted some in the text), but this is deliberate in order to keep the code readable. This first post concentrates on the concepts of information entropy and symmetrical uncertainty, which FCBF uses to rank the information content of the features. The next article will show how FCBF then uses the ranking to eliminate features.

For this example I am going to assume that a record (i.e. item of data) is called FeatureSequence, and there is one property containing the features for this item called EncodedSequence which is a List of strings. The first element of this list is always the class (category) label for this item of data. If you know that you are always going to have a fixed number of features then you could of course create a separate property for each, but that might become very big and inflexible very quickly!

```csharp
public class FeatureSequence
{
    // Remember that the first item is the class label
    public List<string> EncodedSequence { get; set; }
}
```

You need to read in your data and populate an instance of the above class for each data item, to create a list of FeatureSequences. I’m assuming that you already have enough C# knowledge to be able to do this!

Now comes a little bit of maths. Don’t worry, this is just to illustrate – the code comes in a minute! Like most maths notation, it is really just shorthand for the many steps we need to do, and whereas papers and Wikipedia will often miss out the crucial bit of maths information you need, I won’t :-). We will need to calculate the symmetrical uncertainty (SU) value for each feature. The mathematical formula for this is below, where H(X) is the information entropy (i.e. amount of information) of feature X and H(X,Y) is the joint information entropy of features X and Y:
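In plain-text form, the standard definition is:

```
SU(X, Y) = 2 * ( H(X) + H(Y) − H(X,Y) ) / ( H(X) + H(Y) )
```

SU is normalised to the range [0, 1]: 1 means knowing either feature completely predicts the other, and 0 means the two are independent.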

Of course X or Y can also be the classification label of the data, which is why I’ve implemented the class label as simply another feature in EncodedSequence in the code above. As an aside, the numerator is also the formula for mutual information (i.e. the amount that knowing one feature tells you about the other). SU is a little bit of a misnomer (in plain English) because SU actually increases the more certain you can be that feature X helps predict Y and vice versa, i.e. as you become less ‘uncertain’, SU increases!

These are the formulas for information entropy H(X) and joint information entropy H(X,Y). p(x) is the probability function, i.e. the probability that value x will appear in the feature being examined, and p(x,y) is the joint probability function, i.e. the probability that value x will appear in the first feature being examined when the value in the second feature is y. Hopefully you understand that, but as a quick example: if there are 10 items in your feature, and a particular value appears 3 times, then the probability of that value is 3/10, or 0.3.
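Written in plain text, the two definitions are:

```
H(X)   = − Σ_x p(x) · log₂ p(x)
H(X,Y) = − Σ_x Σ_y p(x,y) · log₂ p(x,y)
```

Continuing the quick example, suppose (purely for illustration) the 10 items split into three values appearing 3, 3 and 4 times: then H(X) = −(0.3·log₂ 0.3 + 0.3·log₂ 0.3 + 0.4·log₂ 0.4) ≈ 1.571 bits.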

So to start with we need to calculate entropy. First of all declare a class called FREQUENCY; each instance will hold one value seen in a feature and the number of times that value appears.

```csharp
public class FREQUENCY
{
    public string FeatureValue;
    public string Conditioner;  // only used for joint probability
    public double ValueCount;
}
```

The following function is used to calculate entropy. We simply loop through each data item, recording the number of times each particular value appears, in a list of FREQUENCY. Then we loop through that frequency list, performing the entropy calculation H(X) above on each item and summing the results. In the code I use logarithm to base 2, which gives an entropy measured in bits. The natural log is also commonly used, in which case the unit of measure is nats.

```csharp
// Somewhere else in your class, declare and populate a list of
// FeatureSequences containing your data:
private List<FeatureSequence> _allDataItems;

/// <summary>
/// Find the Shannon entropy of the passed feature
/// </summary>
/// <param name="feature">Column number of the feature (starting at 0)</param>
/// <returns>Entropy as a double</returns>
public double Entropy(int feature)
{
    double entropy = 0;
    List<FREQUENCY> frequencyList = new List<FREQUENCY>();

    // First count the number of occurrences of each value.
    // Go through each feature list (i.e. row) in the dataset.
    for (int i = 0; i < _allDataItems.Count; i++)
    {
        // If the frequency list already has an entry for this value then increment its count
        FREQUENCY freq = frequencyList
            .Where(x => x.FeatureValue == _allDataItems[i].EncodedSequence[feature])
            .FirstOrDefault();
        if (freq != null)
        {
            freq.ValueCount++;
        }
        // else add a new item to the frequency list
        else
        {
            FREQUENCY newFreq = new FREQUENCY();
            newFreq.FeatureValue = _allDataItems[i].EncodedSequence[feature];
            newFreq.ValueCount = 1;
            frequencyList.Add(newFreq);
        }
    }

    // For each item on the frequency list...
    for (int i = 0; i < frequencyList.Count; i++)
    {
        // Calculate the probability (recalculating the total inside the loop
        // is deliberately simple rather than efficient)
        double probability = frequencyList[i].ValueCount / frequencyList.Sum(x => x.ValueCount);
        // Increase the entropy value
        entropy += probability * Math.Log(probability, 2);
        // Note: you could instead use entropy += probability * Math.Log(1 / probability, 2)
        // and drop the negation on the return line below
    }

    return entropy * -1;
}
```

The joint entropy function is very similar. Instead of passing one feature you pass two, and the list of FREQUENCY records the number of times each combination of first-feature and second-feature values occurs. Otherwise it is just the same.

```csharp
/// <summary>
/// Return the joint entropy of 2 features.
/// </summary>
/// <param name="firstFeature">Column number of first feature (starting at 0)</param>
/// <param name="secondFeature">Column number of second feature (starting at 0)</param>
/// <returns>Joint entropy as a double</returns>
public double JointEntropy(int firstFeature, int secondFeature)
{
    double jointEntropy = 0;
    List<FREQUENCY> frequencyList = new List<FREQUENCY>();

    // First count the number of occurrences of each value of feature 1
    // paired with each value of feature 2.
    // Go through each feature list (i.e. row) in the dataset.
    for (int i = 0; i < _allDataItems.Count; i++)
    {
        // If the frequency list already has an entry for this pair of values then increment its count
        FREQUENCY freq = frequencyList
            .Where(x => x.FeatureValue == _allDataItems[i].EncodedSequence[firstFeature]
                     && x.Conditioner == _allDataItems[i].EncodedSequence[secondFeature])
            .FirstOrDefault();
        if (freq != null)
        {
            freq.ValueCount++;
        }
        // else add a new item to the frequency list
        else
        {
            FREQUENCY newFreq = new FREQUENCY();
            newFreq.FeatureValue = _allDataItems[i].EncodedSequence[firstFeature];
            newFreq.Conditioner = _allDataItems[i].EncodedSequence[secondFeature];
            newFreq.ValueCount = 1;
            frequencyList.Add(newFreq);
        }
    }

    // Calculate the total once, outside the loop
    double total = frequencyList.Sum(x => x.ValueCount);

    // For each item on the frequency list...
    for (int i = 0; i < frequencyList.Count; i++)
    {
        // Calculate the joint probability
        double jointProbability = frequencyList[i].ValueCount / total;
        // Increase the entropy value
        jointEntropy += jointProbability * Math.Log(jointProbability, 2);
    }

    return jointEntropy * -1;
}
```

Finally a function to produce symmetrical uncertainty.

```csharp
/// <summary>
/// Returns the symmetrical uncertainty of the first feature with the second
/// </summary>
/// <param name="firstFeature">Column number of first feature (starting at 0)</param>
/// <param name="secondFeature">Column number of second feature (starting at 0)</param>
/// <returns>Symmetrical uncertainty as a double</returns>
public double SymmetricalUncertainty(int firstFeature, int secondFeature)
{
    double firstEntropy = Entropy(firstFeature);
    double secondEntropy = Entropy(secondFeature);

    return 2 * ((firstEntropy + secondEntropy - JointEntropy(firstFeature, secondFeature))
                / (firstEntropy + secondEntropy));
}
```
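As a sanity check on the three functions above, the same calculations can be reproduced in a few lines of standalone Python (a sketch for verification only, not part of the C# project), using a toy dataset where one feature perfectly predicts the class and another is independent of it:

```python
from collections import Counter
from math import log2

def entropy(values):
    # H(X) = -sum over x of p(x) * log2 p(x)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def joint_entropy(xs, ys):
    # H(X,Y): treat each (x, y) pair as a single combined value
    return entropy(list(zip(xs, ys)))

def symmetrical_uncertainty(xs, ys):
    # SU = 2 * (H(X) + H(Y) - H(X,Y)) / (H(X) + H(Y))
    hx, hy = entropy(xs), entropy(ys)
    return 2 * (hx + hy - joint_entropy(xs, ys)) / (hx + hy)

cls = ["yes", "yes", "no", "no"]
f1  = ["a",   "a",   "b",  "b"]   # mirrors the class exactly
f2  = ["a",   "b",   "a",  "b"]   # carries no information about the class

print(symmetrical_uncertainty(f1, cls))  # 1.0
print(symmetrical_uncertainty(f2, cls))  # 0.0
```

A perfectly predictive feature scores 1 and an independent one scores 0, which is why FCBF can use SU both to rank features against the class label and to detect redundancy between pairs of features.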

You can probably see that ideally it would be more efficient to merge the two entropy functions, and calculate both of them at the same time, but that might have looked a bit confusing to the learner!

The next post will focus on how SU is used within FCBF.

* Yu, L. and Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5, pp. 1205–1224.


Thanx a lot!

I see that you used the SU formulation with joint entropy in the numerator, rather than conditional entropy. Why so?

It seems that using joint entropy requires fewer computational operations than conditional entropy: in conditional entropy there is an additional multiplication by p(y_j): H(X|Y) = −Σ_j p(y_j) · Σ_i p(x_i|y_j) · log₂ p(x_i|y_j).

PS: but anyway, it’s possible to use conditional entropy too: just aggregate the frequencies of x_i per occurrence of y_j. But again, it’s not worth it.

Oh, btw! Is that correct that FCBF do not care about intraclass uncorrelation in any way?

Only intraclass correlation maximization and intrafeature redundancy minimization?

And it is not supposed multinomial multivariate distributions?

wrong word here, i meant interclass, not intra >>> ” Is that correct that FCBF do not care about intraclass uncorrelation in any way?”

>>> “And it is not supposed multinomial multivariate distributions?”

– I wonder what combination of features (sensors) FCBF would choose in a case like this:

Suppose we have 2 classes and 6 features, and consider two 3D combinations of those features: the 1st combination forms two 3D spheres (no overlaps, statistically looks like good distributions), while the 2nd 3D combination gives uniformly distributed noise in 3D space.

In the second case each p(Fij) → avg (because it is uniformly distributed).

In the first case each p(Fij) → k·avg, where k > 1 (frequencies more tightly distributed, suppose binomially in the case of two 3D spheres).

So, in the 1st case the entropy information gain of each Fij will be smaller, as the probability by frequency of each Fij is greater than in the 2nd case (H(p(x)) → 0 if p(x) → 1), and FCBF will choose the 2nd combination? Or is there some “devil in the details” of the formulations?

Sorry, that question sounds like it was a bit beyond the scope of my MSc, I don’t know/remember your terminology I’m afraid. I remember that it took me ages just to get my head around the algorithm as my maths education has never included any stats, so I had to go right back to the basics of ‘what is probability’! That was the main reason I wrote this article actually, it took so much work to get this far I thought I should record it so I wouldn’t need to repeat it.

I needed to implement it in C# as I needed to use it for a project, but I think I just accepted it as a standard algorithm, I didn’t have time to find out why it did what it did, just how it did it. I think I tested it against the output of the Weka module that the authors of the paper had written, to make sure I got everything correct.

I remember that it was still a fairly simple feature selection algorithm – there were other similar ones that used mutual information rather than symmetrical uncertainty. I think it would only work if the data had one classification variable.

I just can’t find any more readable implementation than yours. I just want to understand how FCBF deals with data that has more than one class label.

I’ll try to simulate in Excel the case with 1 feature and 1 class: simulate the intraclass distribution of that feature dropping from high variance to low, and analyse the SU of that feature. If SU rises – then OK, it means that the feature rises in statistical correlation to the class. But if SU drops – it means that FCBF measures uncorrelation of the feature to the class and interprets it as a rise in information of that feature with respect to the class, which is statistically not good.

Approved. It works well with respect to statistical correlation to the class (intraclass maximisation!): the lower the variance of a feature per class, the bigger SU becomes. The secret is in SU: it simply becomes equal to H(p(ClassFreq)) / (H(p(ClassFreq)) + H(p(FeatureFreq))), so it works like this: if the variance of the feature becomes lower, the denominator drops.

So cool =) Thanks again for your code implementation, it’s very helpful!

PS: FCBF also does INTERclass separability maximisation. I checked this in a step-by-step simulation with calculation checks: the SU numerator drops faster than the denominator when the centroids of two different classes’ distributions move towards each other.

And it seems to me that the intraclass and interclass cases work with a multinomial distribution within each feature too, because entropy changes faster when variance drops. Amazing.

The only restriction of FCBF is that it calculates the distribution structure per feature, not per any set of features (there are cases where, in multidimensional space, the distributions form groups with lower variance around the resulting medians – the 2D “xor” case for example, or the 2D spiral).

So for complex multidimensional distributions I would advise:

1. use FCBF only to throw out redundant features (use a low threshold)

2. now we have a smaller feature set without redundancy

3. run another, heavier search filter based on multidimensional distances

I hope I do not bother.