Fast Correlation-Based Filter in C#: Part 2

In a previous post I started this article about the Fast Correlation-Based Filter (FCBF). That post was quite long, as it set up the algorithms used to calculate symmetrical uncertainty (SU), which is the ‘decision’ engine behind FCBF.

Now we have to actually use SU on our entire dataset, to discover which features are considered the most important (at least as far as SU is concerned!). Don’t worry, this post isn’t quite as long. 🙂

The basic premise is this: we first need to calculate SU for each feature with respect to the class label of the data items. In this scenario the first ‘feature’ in the EncodedSequence property (which is just a List of strings) is used as the class label. So the calculation is SU(feature, 0), where feature is every feature other than the class label itself, of course.

The features are then ranked in descending SU order. An arbitrary cutoff threshold can be passed (usually just set to 0 initially), and any feature whose SU falls below that cutoff is eliminated.

Then comes the part where redundant features are removed. FCBF marks feature B as essentially less useful than feature A if the SU between A and B is greater than or equal to the SU between B and the class label. In practice FCBF first selects the most highly ranked feature (A) and calculates its SU with the next most highly ranked feature (B). If that value is greater than or equal to B’s SU with the class label then B is eliminated. For example, if SU(B, class) = 0.5 and SU(A, B) = 0.6, B is considered redundant given A and is removed. FCBF then performs the same comparison between A and every remaining feature. Once it reaches the end of the list it moves to the next non-eliminated feature and starts the process again. By the end of this process the majority of features will usually have been eliminated; the ones that are left are considered to be the useful ones and are selected.

The code for this is shown below. Initially we create a class called UNCERTAINTY to hold the SU information about each feature.


class UNCERTAINTY
{
    public UNCERTAINTY(int _feature, double _su)
    {
        Feature = _feature;
        SymmetricalUncertainty = _su;
        Remove = false;
        AlreadySeen = false;
    }

    public int Feature;
    public double SymmetricalUncertainty;
    public bool Remove;
    public bool AlreadySeen;
}

The FCBF function below simply returns a list of feature numbers, which are the selected features. Note that this assumes that you are still using the variable _allDataItems to hold your data, and that SelectedFeatures is a class-level List&lt;int&gt; used to hold the result.

/// <summary>
/// Get the best features
/// </summary>
/// <param name="threshold">FCBF threshold (0-1)</param>
/// <returns>List of selected feature numbers, a subset of the features in the data passed into the constructor</returns>
public List<int> FCBF(double threshold)
{
    List<UNCERTAINTY> featuresFound = new List<UNCERTAINTY>();

    // Calculate the symmetrical uncertainty between each feature and the class (the class is 'feature' 0).
    for (int featureCol = 1; featureCol < _allDataItems[0].EncodedSequence.Count; featureCol++)
    {
        // If the symmetrical uncertainty of this feature with the class is greater than the threshold then add it to the list.
        double SU = SymmetricalUncertainty(featureCol, 0);
        if (SU > threshold)
        {
            UNCERTAINTY u = new UNCERTAINTY(featureCol, SU);
            featuresFound.Add(u);
        }
    }

    // Order the features above the threshold by descending SU
    featuresFound = featuresFound.OrderByDescending(x => x.SymmetricalUncertainty).ToList();

    while (true)
    {
        // Take the highest-ranked feature that has not yet been used as a reference or removed.
        UNCERTAINTY uElement = featuresFound.Where(x => x.Remove == false && x.AlreadySeen == false).FirstOrDefault();
        if (uElement == null)
            break;

        featuresFound[featuresFound.IndexOf(uElement)].AlreadySeen = true;

        // Compare it with every lower-ranked feature that is still in the running.
        for (int i = featuresFound.IndexOf(uElement) + 1; i < featuresFound.Count; i++)
        {
            if (featuresFound[i].Remove == true) // Has been removed from list so ignore
                continue;

            double SU = SymmetricalUncertainty(featuresFound[i].Feature, uElement.Feature);

            // The lower-ranked feature is redundant if its SU with the reference feature
            // is at least as large as its SU with the class.
            if (SU >= featuresFound[i].SymmetricalUncertainty)
            {
                featuresFound[i].Remove = true;
            }
        }
    }

    // Keep only the surviving features, in ascending column order.
    SelectedFeatures = featuresFound.Where(x => x.Remove == false).OrderBy(x => x.Feature).Select(x => x.Feature).ToList();

    return SelectedFeatures;
}
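
As a quick illustration, here is a minimal usage sketch. It assumes the methods from these two posts live together in a class that is given the data when it is constructed; the class name FeatureSelector and its constructor are just placeholders, not part of the code above.

// Minimal usage sketch: FeatureSelector is a hypothetical wrapper class that
// populates _allDataItems from the list passed to its constructor.
FeatureSelector selector = new FeatureSelector(allDataItems);

// A threshold of 0 keeps every feature with a non-zero SU before the redundancy step.
List<int> selected = selector.FCBF(0.0);

foreach (int featureColumn in selected)
{
    Console.WriteLine("Selected feature column: " + featureColumn);
}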

I hope someone will find this useful!

Fast Correlation-Based Filter in C# Part 1: Information Entropy and Symmetrical Uncertainty

Imagine a case where you have a large set of data, where each row represents an item of data and each column an attribute of that item. It is increasingly common to want to automatically classify this data into categories by looking at the attributes and trying to find a pattern. A sub-discipline of AI called machine learning helps you do that. A classifier such as a neural network or a support vector machine will be trained on the dataset with the aim of creating a pattern recogniser. However, in all but the most trivial cases you will usually have to first figure out which of the data attributes (or ‘features’) you want to use to train the classifier. This is because using many hundreds of features can a) take a long time to train, b) include lots of redundant features (i.e. their pattern duplicates that of other features, so they add no new information), c) include irrelevant features (maybe a feature never changes at all), and d) actually confuse the training by providing information that might appear usable to the human eye but is actually just noise (think of the SETI program!).

There are many algorithms for finding the features that might be useful, but this article is about the Fast Correlation-Based Filter (FCBF) feature selection technique first introduced by Yu and Liu * . FCBF is often used in bioinformatics research, a domain where machine learning is commonly applied. Note that FCBF will only work reliably with discrete data, not continuous. This means that the values of each feature need to be put into ‘bins’. If your data is continuous in nature then there are binning algorithms (aka discretisation filters) that you can use to transform the values of each feature into the label of the appropriate bin. These are outside the scope of this post; however, the free Weka software contains at least one.

I will show how FCBF can be implemented in C#. Note that if you know C# you will certainly see places where this could be made more efficient (I’ve highlighted some in the text); this is deliberate, to keep the code readable. This first post concentrates on the concepts of information entropy and symmetrical uncertainty, which FCBF uses to rank the information content of the features. The next article will show how FCBF then uses that ranking to eliminate features.

For this example I am going to assume that a record (i.e. an item of data) is represented by a class called FeatureSequence, with a single property called EncodedSequence, a List of strings, containing the features for that item. The first element of this list is always the class (category) label for the item of data. If you know that you are always going to have a fixed number of features then you could of course create a separate property for each, but that might become very big and inflexible very quickly!


public class FeatureSequence
{
    // Remember that the first item is the class label
    public List<string> EncodedSequence { get; set; }
}

You need to read in your data and populate an instance of the above class for each data item, to create a list of FeatureSequences. I’m assuming that you already have enough C# knowledge to be able to do this!
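
If it helps, here is a minimal sketch of one way to do that. The file name, the comma delimiter and the assumption that the class label sits in the first column are all just illustrative.

// Load each line of a delimited file into a FeatureSequence record.
// Assumes one data item per line, values separated by commas, class label first.
// (Uses System.IO, System.Linq and System.Collections.Generic.)
List<FeatureSequence> allDataItems = File.ReadLines("data.csv")
    .Select(line => new FeatureSequence
    {
        EncodedSequence = line.Split(',').ToList()
    })
    .ToList();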

Now comes a little bit of maths. Don’t worry, this is just to illustrate – the code comes in a minute! Like most maths notation, it is really just a shorthand for the many steps that we need to do, and whereas papers and Wikipedia will often miss out the crucial bit of maths information you need, I won’t :-). We will need to calculate the symmetrical uncertainty (SU) value for each feature. The mathematical formula for this is below, where H(X) is the information entropy (i.e. amount of information) of a feature X and H(X,Y) is the joint information entropy of feature X and feature Y:

SU(X,Y)=2(\frac{H(X)+H(Y)-H(X,Y)}{H(X)+H(Y)})

Of course X or Y can also be the classification label of the data, which is why I’ve implemented the class label as simply another feature in EncodedSequence in the code above. As an aside, the numerator is also the formula for mutual information (i.e. the amount that knowing one feature tells you about the other). SU is a little bit of a misnomer (in plain English), because SU actually increases the more certain you can be that feature X helps predict Y and vice versa; i.e. as you become less ‘uncertain’, SU increases!

These are the formulae for information entropy H(X) and joint information entropy H(X,Y). p(x) is the probability function, i.e. the probability that value x will appear in the feature being examined, and p(x,y) is the joint probability function, i.e. the probability that a data item has value x in the first feature and value y in the second feature. Hopefully you understand that, but as a quick example, if there are 10 items in your feature and a particular value appears 3 times, then the probability of that value is 3/10, or 0.3.

H(X)=-\sum_x p(x)log_2 p(x)
H(X,Y)=-\sum_x \sum_y p(x,y)log_2 p(x,y)
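
As a quick worked example, continuing the 3-out-of-10 case above and assuming the remaining 7 items all share a single other value:

H(X)=-(0.3 \log_2 0.3 + 0.7 \log_2 0.7) \approx 0.881

so knowing this feature’s value carries about 0.881 bits of information.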

So to start with we need to calculate entropy. First of all declare a class called FREQUENCY; a list of these will record each distinct value of a feature and the number of times that value appears.

public class FREQUENCY
{
     public string FeatureValue;
     public string Conditioner; // for joint probability.
     public double ValueCount;
} ;

The following function is used to calculate entropy. We simply loop through each data item, recording the number of times each value appears in a list of FREQUENCY. Then we loop through that frequency list, performing the entropy calculation H(X) above on each item and adding the results together. In the code I use logarithm to base 2, which gives an entropy measured in bits. Natural log is also commonly used, in which case the unit is nats.

// Somewhere else in your class declare and populate a list of FeatureSequences containing your data:
private List<FeatureSequence> _allDataItems; 

 /// <summary>
 /// Find the Shannon entropy of the passed feature
 /// </summary>
 /// <param name="feature">Column number of variable (start at 0)</param>
 /// <returns>Entropy as a double</returns>
 public double Entropy(int feature)
 {
      double entropy = 0;
      List<FREQUENCY> frequencyList = new List<FREQUENCY>();
            
      // First count the number of occurrences of each value
      // Go through each feature list (i.e. row) on the dataset
      for (int i = 0; i < _allDataItems.Count; i++)
      {
               
          // If the frequency list already has a place for this value then increment its count
          FREQUENCY freq = frequencyList.Where(x => x.FeatureValue == _allDataItems[i].EncodedSequence[feature]).FirstOrDefault();
          if (freq != null)
          {
               freq.ValueCount++;
          }
          // else add a new item to the frequency list
          else
          {
               FREQUENCY newFreq = new FREQUENCY();
               newFreq.FeatureValue = _allDataItems[i].EncodedSequence[feature];
               newFreq.ValueCount = 1;
               frequencyList.Add(newFreq);
          }
      }

      // For each item on the frequency list...

      for (int i = 0; i < frequencyList.Count; i++)
      {
          // Calculate the probability
          double probability = (double)frequencyList[i].ValueCount / (double)frequencyList.Sum(x => x.ValueCount); 

                
          // increase the entropy value
          entropy += (probability * Math.Log(probability, 2));
          // Note: equivalently you could accumulate probability * Math.Log(1 / probability, 2)
          // here and then skip the negation in the return statement below.
      }

      return entropy*-1;
 }

The joint entropy function is very similar. Instead of passing one feature you pass two, and the list of FREQUENCY records the number of times each combination of first-feature and second-feature values occurs. Otherwise it is just the same.

/// <summary>
/// Return the joint entropy of 2 features.
/// </summary>
/// <param name="firstFeature">Column number of first variable (start at 0)</param>
/// <param name="secondFeature">Column number of second variable (start at 0)</param>
/// <returns>Joint entropy as a double</returns>
public double JointEntropy(int firstFeature, int secondFeature)
{
     double jointEntropy = 0;
     List<FREQUENCY> frequencyList = new List<FREQUENCY>();
            
     // First count the number of occurrences of each value of feature 1 for each value of feature 2
     // Go through each feature list (i.e. row) on the dataset
     for (int i = 0; i < _allDataItems.Count; i++)
     {
          // If the frequency list already has a place for this value then increment its count
          FREQUENCY freq = frequencyList.Where(x => x.FeatureValue == _allDataItems[i].EncodedSequence[firstFeature] &&
                                               x.Conditioner == _allDataItems[i].EncodedSequence[secondFeature]).FirstOrDefault();
           if (freq != null)
           {
                freq.ValueCount++;
           }
           // else add a new item to the frequency list
           else
           {
               FREQUENCY newFreq = new FREQUENCY();
               newFreq.FeatureValue = _allDataItems[i].EncodedSequence[firstFeature];
               newFreq.Conditioner = _allDataItems[i].EncodedSequence[secondFeature];
               newFreq.ValueCount = 1;
               frequencyList.Add(newFreq);
           }
       }

       double total = frequencyList.Sum(x => x.ValueCount);

       // For each item on the frequency list...
       for (int i = 0; i < frequencyList.Count; i++)
       {
           // Calculate the probability
           double jointProbability = frequencyList[i].ValueCount / total;
           // increase the entropy value
           jointEntropy += jointProbability * Math.Log(jointProbability,2);
       }

       return jointEntropy * -1;
}

Finally a function to produce symmetrical uncertainty.

/// <summary>
/// Returns the symmetrical uncertainty of the first variable with the second
/// </summary>
/// <param name="firstFeature">Column number of first variable (start at 0)</param>
/// <param name="secondFeature">Column number of second variable (start at 0)</param>
/// <returns>Symmetrical uncertainty as a double</returns>
public double SymmetricalUncertainty(int firstFeature, int secondFeature)
{
      double firstEntropy = Entropy(firstFeature);
      double secondEntropy = Entropy(secondFeature);
 
      return 2 * ((firstEntropy + secondEntropy - JointEntropy(firstFeature, secondFeature)) / 
                 (firstEntropy + secondEntropy)
                 );
}
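
For example, with the class label stored as ‘feature’ 0 as described earlier, the relevance of the feature in column 3 to the class could be obtained like this (the column number is just an illustration):

// SU between feature column 3 and the class label (column 0). Values nearer 1 mean
// the feature tells you a lot about the class; values nearer 0 mean very little.
double suWithClass = SymmetricalUncertainty(3, 0);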

You can probably see that ideally it would be more efficient to merge the two entropy functions and calculate everything in a single pass over the data, but that might have looked a bit confusing to the learner! A rough sketch of that idea is shown below.
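
For the curious, this is one way that merged version might look: a single loop builds the frequency counts for both features and for the pair of features at the same time, and the three entropies are then computed from those counts. It is just a sketch of the idea, not the implementation used in these posts.

// One-pass sketch: count the values of X, the values of Y and the (X,Y) pairs together,
// then compute H(X), H(Y) and H(X,Y) from the same counts.
// (Uses System.Linq and System.Collections.Generic.)
public double SymmetricalUncertaintyOnePass(int firstFeature, int secondFeature)
{
    Dictionary<string, double> countX = new Dictionary<string, double>();
    Dictionary<string, double> countY = new Dictionary<string, double>();
    Dictionary<string, double> countXY = new Dictionary<string, double>();

    foreach (FeatureSequence item in _allDataItems)
    {
        string x = item.EncodedSequence[firstFeature];
        string y = item.EncodedSequence[secondFeature];
        string xy = x + "\t" + y; // composite key for the pair of values

        countX[x] = countX.ContainsKey(x) ? countX[x] + 1 : 1;
        countY[y] = countY.ContainsKey(y) ? countY[y] + 1 : 1;
        countXY[xy] = countXY.ContainsKey(xy) ? countXY[xy] + 1 : 1;
    }

    double n = _allDataItems.Count;
    double hX = -countX.Values.Sum(c => (c / n) * Math.Log(c / n, 2));
    double hY = -countY.Values.Sum(c => (c / n) * Math.Log(c / n, 2));
    double hXY = -countXY.Values.Sum(c => (c / n) * Math.Log(c / n, 2));

    return 2 * ((hX + hY - hXY) / (hX + hY));
}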

The next post will focus on how SU is used within FCBF.

* Yu, Lei and Liu, Huan (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5, pp. 1205–1224.