Fast Correlation-Based Filter in C#: Part 2

In a previous post I started this article about the Fast Correlation-Based Filter (FCBF). That post was quite long, as it set up the algorithms used to calculate symmetrical uncertainty (SU), which is the ‘decision’ engine behind FCBF.

Now we have to actually use SU on our entire dataset, to discover which features are considered the most important (at least as far as SU is concerned!). Don’t worry, this post isn’t quite as long. 🙂

The basic premise is this: first we calculate SU for each feature with respect to the class label of the data items. In this scenario the first ‘feature’ in the EncodedSequence property (which is just a List of strings) serves as the class label, so the calculation is SU(feature, 0), where feature ranges over every column other than the class label itself.
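As a reminder of what is being computed here, SU is usually defined (following the original FCBF paper by Yu and Liu) as information gain normalised by the two entropies. The block below is that standard definition, written on the assumption that the SymmetricalUncertainty function from Part 1 implements it; X stands for a feature column and Y for the class label (column 0).

```latex
\[
SU(X, Y) = 2 \cdot \frac{IG(X \mid Y)}{H(X) + H(Y)},
\qquad
IG(X \mid Y) = H(X) - H(X \mid Y)
\]
```

Under this definition SU lies between 0 (the two variables are independent) and 1 (either one completely predicts the other), and it is symmetric, so SU(feature, 0) and SU(0, feature) give the same value.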

The features are then ranked in descending order of SU. An arbitrary cutoff threshold can be passed (usually just set to 0 initially), and any feature whose SU does not exceed that cutoff is eliminated.
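To make the ranking step concrete, here is a minimal sketch that works on SU values that have already been computed. The class name SuRanking, the method RankAboveThreshold and the suWithClass parameter are hypothetical names used only for illustration; they are not part of the code in this post.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SuRanking
{
    // suWithClass[f] is assumed to hold the precomputed SU between feature f and the class label.
    // Returns the indices of features whose SU is strictly greater than the threshold,
    // ordered from highest SU to lowest.
    public static List<int> RankAboveThreshold(IReadOnlyList<double> suWithClass, double threshold)
    {
        return Enumerable.Range(0, suWithClass.Count)
            .Where(f => suWithClass[f] > threshold)
            .OrderByDescending(f => suWithClass[f])
            .ToList();
    }
}
```

Because the comparison is strict, a threshold of 0 already discards any feature whose SU with the class is exactly 0, that is, a feature that carries no information about the class at all.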

Then comes the part where redundant features are removed. FCBF marks feature B as redundant with respect to feature A if the SU between A and B is greater than or equal to the SU between the class label and B. In practice FCBF first selects the most highly ranked feature (A) and calculates its SU with the next most highly ranked feature (B). If that value is greater than or equal to B’s SU with the class label, B is eliminated. FCBF then performs the same comparison between A and every remaining lower-ranked feature. Once it reaches the end of the list, it moves to the next non-eliminated feature, treats that as the new A, and starts the process again. By the end, the majority of features will usually have been eliminated. The ones that are left are considered the useful ones and are selected.
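The elimination pass is easier to follow on a tiny example. The following is a minimal sketch, assuming the pairwise SU values have already been computed into a table; FcbfSketch, Select and ToyTable are hypothetical names, and the numbers in the table are invented purely for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class FcbfSketch
{
    // Invented SU values. Index 0 is the class label, indices 1-4 are features.
    // ToyTable[i, j] is the SU between columns i and j (the table is symmetric).
    public static readonly double[,] ToyTable =
    {
        { 1.0, 0.8,  0.6,  0.5, 0.1  },
        { 0.8, 1.0,  0.7,  0.2, 0.05 },
        { 0.6, 0.7,  1.0,  0.4, 0.02 },
        { 0.5, 0.2,  0.4,  1.0, 0.3  },
        { 0.1, 0.05, 0.02, 0.3, 1.0  }
    };

    public static List<int> Select(double[,] su, double threshold)
    {
        int columns = su.GetLength(0);

        // Keep features whose SU with the class exceeds the threshold, best first.
        List<int> ranked = Enumerable.Range(1, columns - 1)
            .Where(f => su[f, 0] > threshold)
            .OrderByDescending(f => su[f, 0])
            .ToList();

        var removed = new HashSet<int>();
        for (int i = 0; i < ranked.Count; i++)
        {
            int a = ranked[i];
            if (removed.Contains(a))
                continue; // an eliminated feature never acts as the reference feature

            for (int j = i + 1; j < ranked.Count; j++)
            {
                int b = ranked[j];
                if (removed.Contains(b))
                    continue;

                // B is redundant if it is at least as correlated with A as with the class.
                if (su[a, b] >= su[b, 0])
                    removed.Add(b);
            }
        }

        // Survivors, reported in ascending feature order.
        return ranked.Where(f => !removed.Contains(f)).OrderBy(f => f).ToList();
    }
}
```

With a threshold of 0, FcbfSketch.Select(FcbfSketch.ToyTable, 0.0) ranks the features as 1, 2, 3, 4. Feature 1 eliminates feature 2 (0.7 >= 0.6) but not features 3 or 4. Feature 2 is skipped because it has been eliminated, and feature 3 then eliminates feature 4 (0.3 >= 0.1), leaving features 1 and 3.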

The code for this is shown below. Initially we create a class called UNCERTAINTY to hold the SU information about each feature.


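class UNCERTAINTY
{
      public UNCERTAINTY(int _feature, double _su)
      {
          Feature = _feature;
          SymmetricalUncertainty = _su;
          Remove = false;
          AlreadySeen = false;
      }
      public int Feature;                   // Column index of the feature in EncodedSequence
      public double SymmetricalUncertainty; // SU between this feature and the class label
      public bool Remove;                   // True once the feature has been judged redundant
      public bool AlreadySeen;              // True once the feature has been used as the reference feature
}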

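The FCBF function below simply returns a list of feature numbers, which are the selected features. Note that this assumes that you are still using the variable _allDataItems to hold your data. It also relies on LINQ (using System.Linq) and stores its result in a SelectedFeatures member of the containing class.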

        /// <summary>
        /// Select the best features using FCBF
        /// </summary>
        /// <param name="threshold">SU cutoff (0-1); features whose SU with the class does not exceed it are discarded</param>
        /// <returns>List of selected feature indices (positions in EncodedSequence), in ascending order</returns>
        public List<int> FCBF(double threshold)
        {      
            List<UNCERTAINTY> featuresFound = new List<UNCERTAINTY>();
 
            // Calculate the symmetric uncertainty between each feature and the class (the class is 'feature' 0).
            for (int featureCol = 1; featureCol < _allDataItems[0].EncodedSequence.Count; featureCol++)
            {
                // If symmetrical uncertainty of this feature with the class is greater than threshold then add it to list.
                double SU = SymmetricalUncertainty(featureCol, 0);
                if (SU > threshold)
                {
                    UNCERTAINTY u = new UNCERTAINTY(featureCol, SU);
                    featuresFound.Add(u);
                }
            }

            // Order the features above the threshold by descending SU
            featuresFound = featuresFound.OrderByDescending(x => x.SymmetricalUncertainty).ToList();

            while (true)
            {
                // Next highest-ranked feature that is neither eliminated nor already used as the reference
                UNCERTAINTY uElement = featuresFound.FirstOrDefault(x => !x.Remove && !x.AlreadySeen);
                if (uElement == null)
                    break;

                uElement.AlreadySeen = true;

                // Compare it against every lower-ranked feature
                for (int i = featuresFound.IndexOf(uElement) + 1; i < featuresFound.Count; i++)
                {
                    if (featuresFound[i].Remove == true) // Has been removed from list so ignore
                        continue;

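                    // SU between the candidate feature and the current reference feature
                    double SU = SymmetricalUncertainty(featuresFound[i].Feature, uElement.Feature);

                    // Redundant if that is at least as large as the candidate's SU with the class
                    if (SU >= featuresFound[i].SymmetricalUncertainty)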
                    {
                        featuresFound[i].Remove = true;
                    }
                }
            }

            // Collect the surviving feature indices in ascending order
            SelectedFeatures = featuresFound.Where(x => x.Remove == false).OrderBy(x => x.Feature).Select(x => x.Feature).ToList();
        
            return SelectedFeatures;
        }

I hope someone will find this useful!
