Fast Correlation-Based Filter in C#: Part 2

In a previous post I started this article about Fast Correlation Based Filter (FCBF). That was quite long, setting up the algorithms used to calculate symmetrical uncertainty (SU) that is the ‘decision’ engine behind FCBF.

Now we have to actually use SU on our entire dataset, to discover which features are considered the most important (at least as far as SU is concerned!). Don’t worry, this post isn’t quite as long. 🙂

The basic premise is this: We first of all need to calculate SU for each feature, with respect to the class label of the data items. In this scenario we used the first ‘feature’ in the EncodedSequence property (which is just a List of strings) to be the class label. So the calculation is SU(feature, 0) where feature is all features other than the class label itself of course.

The features are then ranked in descending SU order. An arbitrary cutoff threshold can be passed (usually just set to 0 initially), and any features that have an SU that falls under that cutoff is eliminated.

Then comes the part where redundant features are removed. FCBF marks feature B as essentially less useful than feature A if the SU between A and B is greater or equal to that between the class label and feature B. So in practice FCBF first selects the most highly ranked feature (A) and then calculates SU with the next most highly ranked (B). If it is greater or equal to B’s SU with the class label then B gets eliminated. FCBF then moves on to perform the same comparison with every feature. Once it gets to the end of the list it then moves to the next non-eliminated feature and starts the process again. By the end of this process it would usually be the case that the majority of features will have been eliminated. The ones that are left are considered to be the useful ones and are selected.

The code for this is shown below. Initially we create a class called UNCERTAINTY to hold the SU information about each feature.


class UNCERTAINTY
{
public UNCERTAINTY(int _feature, double _su)
{
Feature = _feature;
SymmetricalUncertainty = _su;
Remove = false;
}
public int Feature;
public double SymmetricalUncertainty;
public bool Remove;

};


The FCBF function below simply returns a list of feature numbers, which are the selected numbers. Note that this assumes that you are still using the variable _allDataItems to hold your data.

       /// <summary>
/// </summary>
/// <param name="threshold">FCBF threshold (0-1)</param>
/// <returns>List of rows containing the variables, which is a subset of the set passed into the constructor</returns>
public List<int> FCBF(double threshold)
{
List<UNCERTAINTY> featuresFound = new List<UNCERTAINTY>();

// Calculate the symmetric uncertainty between each feature and the class (the class is 'feature' 0).
for (int featureCol = 1; featureCol < _allDataItems[0].EncodedSequence.Count; featureCol++)
{
// If symmetrical uncertainty of this feature with the class is greater than threshold then add it to list.
double SU = SymmetricalUncertainty(featureCol, 0);
if (SU > threshold)
{
UNCERTAINTY u = new UNCERTAINTY(featureCol, SU);
}
}

// Order the features above the threshold by descending SU
featuresFound = featuresFound.OrderByDescending(x => x.SymmetricalUncertainty).ToList();

while (true)
{
UNCERTAINTY uElement = featuresFound.Where(x => x.Remove == false && x.AlreadySeen == false).FirstOrDefault();
if (uElement == null)
break;

for (int i = featuresFound.IndexOf(uElement) + 1; i < featuresFound.Count; i++)
{
if (featuresFound[i].Remove == true) // Has been removed from list so ignore
continue;

double SU = SymmetricalUncertainty(featuresFound[i].Feature, uElement.Feature);

if (SU >= featuresFound[i].SymmetricalUncertainty)
{
featuresFound[i].Remove = true;
}
}
}

featuresFound = featuresFound.OrderBy(x => x.Feature).ToList();
SelectedFeatures = featuresFound.Where(x => x.Remove == false).OrderBy(x => x.Feature).Select(x => x.Feature).ToList();

return SelectedFeatures;
}



I hope someone will find this useful!

Fast Correlation-Based Filter in C# Part 1: Information Entropy and Symmetrical Uncertainty

Imagine a case where you have a large set of data, where each row represents an item of data and each column an attribute of that item. It is more and more common nowadays to want to be able to automatically classify this data into categories by looking at their attributes and trying to find a pattern. A sub-discipline of AI called machine learning helps you do that. A classifier such as a neural network or a support vector machine will be trained on the dataset with the aim of creating a pattern recogniser. However in all but the most trivial cases you will usually have to first figure out which of the data attributes (or ‘features’) you want to use in order to train the classifier. This is because using many hundreds of features can a) take a long time to train, b) include lots of redundant features (i.e. their pattern duplicates that of other features, so doesn’t add any new information), c) be irrelevant (maybe that feature never changes at all), and d) actually confuse the training by providing information that might appear to be usable to the human eye but is actually just noise (think of the SETI program!).

There are many algorithms for finding the features that might be useful, but this article is about the Fast Correlation-Based Filter (FCBF) feature selection technique first introduced by Yu and Liu * . FCBF is often used in bioinformatics research, which is a commonly used domain for machine learning. Note that FCBF will only work reliably with discrete data, not continuous. This means that the values of each feature need to be put in ‘bins’. If your data is continuous in nature then there are binning algorithms (aka discretisation filters) that you can use that transform the values of each feature into the label of the appropriate bin. These are outside the scope of this post however the free Weka software contains at least one.

I will show how FCBF can be implemented in C#. Note that if you know C# then you will certainly see places where this can be more efficient (I’ve highlighted some in the text), but this is deliberate in order to be more readable. This first post will concentrate on the concept of information entropy and symmetrical uncertainty, which are used by FCBF in order to provide a ranking of the information content of the features. The next article will show how FCBF then uses the ranking to eliminate features.

For this example I am going to assume that a record (i.e. item of data) is called FeatureSequence, and there is one property containing the features for this item called EncodedSequence which is a List of strings. The first element of this list is always the class (category) label for this item of data. If you know that you are always going to have a fixed number of features then you could of course create a separate property for each, but that might become very big and inflexible very quickly!


public class FeatureSequence{

// Remember that the first item is the class label
public List<string> EncodedSequence {get;set;}
}



You need to read in your data and populate an instance of the above class for each data item, to create a list of FeatureSequences. I’m assuming that you already have enough C# knowledge to be able to do this!

Now comes a little bit of maths. Don’t worry, this is just to illustrate – the code comes in a minute! Like most maths notation, it is really just a shorthand for the many steps that we need to do, and whereas papers and Wikipedia will often miss out the crucial bit of maths information you need, I won’t :-). We will need to calculate the symmetrical uncertainty (SU) value for each feature. The mathematical formula for this is below, where H(X) is the information entropy (i.e. amount of information) of a feature X and H(X,Y) is the joint information entropy of feature X and feature Y:

$SU(X,Y)=2(\frac{H(X)+H(Y)-H(X,Y)}{H(X)+H(Y)})$

Of course X or Y can also be the classification label of the data too, which is why I’ve implemented class label as simply another feature in EncodedSequence in the code above. As an aside, the numerator is also the formula for mutual information (i.e. the amount that knowing one feature gives information about the other). SU is a little bit of a misnomer (in plain English) because SU actually increases the more certain you can be that feature X helps predict Y and vice-versa, i.e. as you become less ‘uncertain’ SU increases!

These are the formula for information entropy H(X) and joint information entropy H(X,Y). p(x) is the probability function i.e. the probability that value x will appear in the feature being examined, and p(x, y) is the joint probability function, i.e. the probability that the value x will appear in the first feature being examined when the value in second feature is y. Hopefully you understand that, but as a quick example if there are 10 items in your feature, and a particular value appears 3 times, then the probability of that value is 3/10 or 0.3.

$H(X)=-\sum_x p(x)log_2 p(x)$
$H(X,Y)=-\sum_x \sum_y p(x,y)log_2 p(x,y)$

So to start with we need to calculate entropy. First of all declare a class called FREQUENCY that will hold a list of all the values for a feature and the number of times that value appears.

public class FREQUENCY
{
public string FeatureValue;
public string Conditioner; // for joint probability.
public double ValueCount;
} ;



The following function is used to calculate entropy. We simply loop through each data item, recording the number of times a particular value appears in a list of FREQUENCY. Then we loop through that frequency list, performing the entropy calculation H(X) above on each item and adding them together. In the code I use logarithm to the base 2. This results in an entropy measure in units of measure bits. Natural log is also commonly used, in which case the unit of measure is nats.

// Somewhere else in your class declare and populate a list of FeatureSequences containing your data:
private List<FeatureSequence> _allDataItems;

/// <summary>
/// Find the Shannon entropy of the passed feature
/// </summary>
/// <param name="feature">Column number of variable (start at 0)</param>
/// <returns>Entropy as a double</returns>
public double Entropy(int feature)
{
double entropy = 0;
List<FREQUENCY> frequencyList = new List<FREQUENCY>();

// First count the number of occurances of each value
// Go through each feature list (i.e. row) on the dataset
for (int i = 0; i < _allDataItems.Count; i++)
{

// If the frequency list already has a place for this value then increment its count
FREQUENCY freq = frequencyList.Where(x => x.FeatureValue == _allDataItems[i].EncodedSequence[feature]).FirstOrDefault();
if (freq != null)
{
freq.ValueCount++;
}
// else add a new item to the frequency list
else
{
FREQUENCY newFreq = new FREQUENCY();
newFreq.FeatureValue = _allDataItems[i].EncodedSequence[feature];
newFreq.ValueCount = 1;
}
}

// For each item on the frequency list...

for (int i = 0; i < frequencyList.Count; i++)
{
// Calculate the probability
double probability = (double)frequencyList[i].ValueCount / (double)frequencyList.Sum(x => x.ValueCount);

// increase the entropy value
entropy += (probability * Math.Log(probability, 2));
// Note: can also use entropy += (probability * Math.Log((1/probability), 2));
}

return entropy*-1;
}


The joint entropy function is very similar. Instead of passing 1 feature you pass 2. Then the list of FREQUENCY records the number of times a combination of the first and second feature occurs. Otherwise it is just the same.

/// <summary>
/// Return the joint entropy of 2 features.
/// </summary>
/// <param name="firstFeature">Column number of first variable (start at 0)</param>
/// <param name="secondFeature">Column number of second variable (start at 0)</param>
/// <returns>Joint entropy as a double</returns>
public double JointEntropy(int firstFeature, int secondFeature)
{
double jointEntropy = 0;
List<FREQUENCY> frequencyList = new List<FREQUENCY>();

// First count the number of occurances of each value of feature 1 for each value of feature 2
// Go through each feature list (i.e. row) on the dataset
for (int i = 0; i < _allDataItems.Count; i++)
{
// If the frequency list already has a place for this value then increment its count
FREQUENCY freq = frequencyList.Where(x => x.FeatureValue == _allDataItems[i].EncodedSequence[firstFeature] &&
x.Conditioner == _allDataItems[i].EncodedSequence[secondFeature]).FirstOrDefault();
if (freq != null)
{
freq.ValueCount++;
}
// else add a new item to the frequency list
else
{
FREQUENCY newFreq = new FREQUENCY();
newFreq.FeatureValue = _allDataItems[i].EncodedSequence[firstFeature];
newFreq.Conditioner = _allDataItems[i].EncodedSequence[secondFeature];
newFreq.ValueCount = 1;
}
}

double total = frequencyList.Sum(x => x.ValueCount);

// For each item on the frequency list...
for (int i = 0; i < frequencyList.Count; i++)
{
// Calculate the probability
double jointProbability = (double)frequencyList[i].ValueCount / (double)frequencyList.Sum(x => x.ValueCount);
// increase the entropy value
jointEntropy += jointProbability * Math.Log(jointProbability,2);
}

return jointEntropy * -1;
}



Finally a function to produce symmetrical uncertainty.

/// <summary>
/// Returns the symmetrical uncertainty of the first variable with the second
/// </summary>
/// <param name="firstFeature">Column number of first variable (start at 0)</param>
/// <param name="secondFeature">Column number of second variable (start at 0)</param>
/// <returns>Symmetrical uncertainty as a double</returns>
public double SymmetricalUncertainty(int firstFeature, int secondFeature)
{
double firstEntropy = Entropy(firstFeature);
double secondEntropy = Entropy(secondFeature);

return 2 * ((firstEntropy + secondEntropy - JointEntropy(firstFeature, secondFeature)) /
(firstEntropy + secondEntropy)
);
}


You can probably see that ideally it would be more efficient to merge the two entropy functions, and calculate both of them at the same time, but that might have looked a bit confusing to the learner!

The next post will focus on how SU is used within FCBF.

* Yu, Lei and Liu, Huan,(2004) Efficient feature selection via analysis of relevance and redundancy. The Journal of Machine Learning Research (5) pp 1205-1224.

Entity Framework 4.1 with SQL Server 2000

Just a quick post: We normally use NHibernate rather than Entity Framework in my team, for a variety of reasons, not the least that Sharp uses it.

However I needed to develop a quick WCF web service for testing, and MS do provide such easy tools for it that I couldn’t help myself. 🙂 Bit of a mistake as it turns out that Microsoft’s EF 4 doesn’t actually support SQL Server 2000, which we use. I was just about to grit my teeth and try and set up NHibernate for WCF when I eventually found the below link which shows a useful workround. Thought I’d post the link as it took me a while to find. Thanks to Jason Skowronek.

http://www.skonet.com/Articles_Archive/How_To_Use_Entity_Framework_4_With_Visual_Studio_2010_and_SQL_Server_2000.aspx

MVC – Getting User Roles From Multiple Sources (register and resolve arrays of dependencis using the fluent api)

In the last post I talked about encapsulating the authorisation and user credentials (i.e IPrincipal) logic in a separate, re-usable DLL (known as AuthServices).

Part 1 of that post discussed the DLL contents in detail. Part 2 focused upon using that DLL in an MVC application.

This post follows up on “Part 2” specifically adding a user’s roles to the UserData objet prior to serialisation, on-route to becoming the auth cookie.

The code sample that I gave for using the AuthServices DLL is repeated below:

_authenticationService.Authenticate(userNameString, password);
if (_authenticationService.Result.Authenticated)
{
_currentUserData = _authenticationService.Result.CurrentUserData;
_currentUserData.Roles = GetRolesForUser(userName);//THE SUBJECT OF THIS POST
}


This post explores in more detail what lies behind the GetRolesForUser(string userName) method and how this can utilise more than one provider of roles.

Why bother?

The organisation in which I work is big. We have thousands of staff and many thousands more students. Applications written for an organisation of this size usually need to access different sources of data to provide a user’s roles relevant to the specific application. In projects like this you can’t easily use the standard RoleProvider stuff in a config file.

The Answer – Multiple Role Providers

To implement multiple roles I inject (in an IOC stylee) into my login service the interface IRolesMgr. IRolesMgr exposes the following:

public interface IRolesMgr
{
}


The interesting part is in the constructor of the implementation.

public class RolesMgr: IRolesMgr
{
private IRoleMgrComponent[] _roleManagers;

public RolesMgr(IRoleMgrComponent[] roleManagers)
{
_roleManagers = roleManagers;
}

{
List roles = new List();
foreach (var mgr in _roleManagers)
{
}
}
}


As can be seen the constructor has a dependency on an array of IRoleMgrComponents each of which will contain the access logic needed to identify a user’s roles from a different source.
Each IRoleMgrComponent implements the following:

public interface IRoleMgrComponent
{
}


That’s all very good, but the trick is getting the IOC container, in this case, Castle Windsor, to instantiate the appropriate implementations of IRoleMgrComponent in the roleManagers array.

This is done in the Component Registrar class used to register components to Castle by using the “ServiceOveride” method (more details of this can be seen on Mike Hadlow’s post – which was the original inspiration for the castle registration code – thanks Mike!). In effect this allows you to register and resolve arrays of dependencis using the fluent api.

container.Register(
Component.For()
.ImplementedBy()
.ServiceOverrides(
ServiceOverride.ForKey("roleManagers").Eq(
"HumsAuthDbRoleManager",
"StudentAuthRoleMgr"
)
),
AllTypes
.FromAssemblyNamed("ApplicationServices")
.BasedOn()
.WithService.FromInterface(typeof(IRoleMgrComponent))
.Configure(c => c.Named(c.Implementation.Name))
);


Once you have overriden this specific component you can continue to register all the other components in that assembly in the normal way, say:

container.Register(
AllTypes.Pick()
.FromAssemblyNamed("ApplicationServices")
.WithService.FirstInterface());


Conclusion

Implementing RolesMgr(IRoleMgrComponent[]) in conjunction with the AuthServices assembly talked about in the last post, means that for any given new project I will only have to write minimal code yet benefit from User Credential information that:
a) Has the execution logic encapsulated in an (extensible) AuthServices library
b) Can uitilise dependency injection to configure how where and how the roles data comes from in the credential information.

Creating a resusable Authorisation/ User Principal library for MVC projects

Having been using SharpArchitecture/ MVC for a number of recent projects, I’ve finally made the effort to extract some common service features into a separate reusable project (with unit tests of course!). This is very much inkeeping with DRY principles and should shorten the time to (re)-implement common processes in new projects.

The project – AuthUserServices provides several useful interface/implementations that can be used in any (MVC/ or Non-MVC) .NET project. It aint rocket science but provides a number of neat and unit tested resources around Authorisation and GenericPrinicpal.

Part 1 – The AuthServices DLL

The IAuthenticationService

This interface exposes two methods:

void Authenticate(string userName, string password)// Authenticates against a user directory
void FindUser(string userName)// Ascertains whether a user is contained in a user directory


And an AuthenticationResult propterty…

AuthenticationResult Result { get; }


… which is a struct containing result information supplied from “Authenticate” or “FindUser”.

The current implementation(s) of this interface are for LDAP and CAS. Our MVC projects implement dependency injection using Castle Windsor. It is therefore fairly trivial to choose the IAutheniticationService implementation in the component register (specific examples to follow).

AuthenticationResult

The AuthenticationResult struct exposes the following properties:

public bool Authenticated {get;}// bool describing whether a user is authenticated
public string ConnectionMessage {get;}// any message that comes from connecting with/ or communicating with a user directory
public string SearchString {get;} // the string used to search the user directory, i.e. the username
public IUserData UserData {get;}// An object containing user details gleaned from searching the user directory


IUserData

IUserData exposes the following properties:

string Email { get; }
string FName { get; }
string SomeOtherProperty { get; }
List Roles { get; set; }
string SName { get; }
string ToString();


In the AuthServices project there is only one implementation of this interface (cunningly) called UserData. I implemented this as an interface so that it is possible to provide other implementations in the future. For example the

ToString()

implementation is used to serialise the object for use in a cookie. It is possible that a developer may with to implement a different serialisation algorithm.

UserData

UserData object data is built from a directory search or authentication attempt. It has two constuctors.
The first is used to create the UserData from properties gleaned from the user directory

public UserData(string email, string fName, string sName, string someOtherProperty)


The second is used to deserialise a userDataString contained in the UserData property in an auth cookie (in effect the reverse of the

ToString()

implementation.

public UserData(string userDataString)


The UserData class encapsulates the logic involved in managing user data across the Authorisation –> Cookie serialisation/deserialisation –> Building the User Principal process.

The ICurrentUserCookieMgr interface doesn’t expose a huge amount of functionality.

void CreateUserCookie(string userName, IUserData userData);
bool HttpOnly { get; set; }


The current implementation implements FormsAuthentication. It is the only part of the service that doesn’t really have adequate unit tests due to it’s (necessary) reliance upon HttpContext and the Response therein. Still the implemenation encapsulates this functionaly and can easily be stubbed/ mocked in application unit tests.

IHumsPrincipal

The final item of the service is the IHumsPrincipal interface. This exposes the the UserData and UserName properties as well as the IIdentity and property and IsInRole method available in the standard MS IPrincipal interface.

IIdentity Identity { get; }
bool IsInRole(string role);
IUserData UserData { get; }


Part 2 – Example Uses in an (MVC) app

Using AuthServices in the Authorisation logic of an application

I use the AuthServices implementation in a LoginService within an ApplicationServices project.
The LoginService has the following private members

private ICurrentUserMgr _userManager;
private IAuthenticationService _authenticationService;


You can then manage the whole cookie building/ roles setting process with only a few lines of code

_authenticationService.Authenticate(userNameString, password);
if (_authenticationService.Result.Authenticated)
{
_currentUserData = _authenticationService.Result.CurrentUserData;
}


Accessing User Info

Post authentication, developeres can use the implementations of IHumsPrincipal and IUserData in the Application_OnAuthenticateRequest in the Global.asax file of their apps to expose UserData properties to the application. e.g.

FormsIdentity id = (FormsIdentity)HttpContext.Current.User.Identity;
FormsAuthenticationTicket ticket = id.Ticket;
string userId = id.Ticket.Name;
IUserData uData = new UserData(id.Ticket.UserData);
HttpContext.Current.User = new HumsPrincipal(id.Name, uData, id);


You could then expose the IHumsPrincipal members in the application (such as a controller or application by casting the CurrentPrincipal to the IHumsPrincipal as follows:

IHumsPrincipal p = Thread.CurrentPrincipal as IHumsPrincipal;


Better still would be to encapsulate this cast behind an “ICurrentUserManager” interface to remove the dependency within controller/app service methods upon the Thread and therefore make such methods easy to unit test. For example:

public class CurrentUserMgr: ICurrentUserMgr
{
IHumsPrincipal _principal
{
get
{
try
{
}
catch
{
return null;
}
}
}

public CurrentUserMgr()
{

}

public bool IsStudent()
{
if (_principal == null)
return false;
else return _principal.IsInRole("student");
}

{
if (_principal == null)
return false;
}

public bool IsLoggedIn()
{
try
{
return _principal.Identity.IsAuthenticated;
}
catch
{
return false;
}
}
}


Castle Wiring of AuthServices DLL in the Component Register

Finally, in an MVC, project you have to register AuthServices library in an MVC project.

container.Register(
AllTypes.Pick()
.FromAssemblyNamed("Humanities.Reusable.Components.AuthUserServices")
.WithService.FirstInterface());


Automating JavaScript Testing with QUnit

I haven’t had a time to look at this yet, but keen to get some more testing goodness and order into my JS/JQuery.

Automating JavaScript Testing with QUnit

Generating PDFs and returning them from a controller method in MVC

The PDFSharp library for .Net is an excellent and relatively easy way of generating PDFs, but how to return them from a controller method in MVC?

The answer is pretty straightforward – use the File() helper, code example returning after a post action shown below.

Generate the PDF in a service class that implements the below interface, using PDFSharp, and return the PDF as a MemoryStream (it’s easy to create the stream using PDFSharp). Note that the stream has to be closed before returning.


public interface IMyPDFService
{
MemoryStream GeneratePDF();
}


Within controller:

private IMyPDFService _pdfService;

[HttpPost]
public ActionResult ReturnPDF(MyViewModel thisView)
{
MemoryStream reportStream = _pdfService.GeneratePDF();
return File( reportStream, "application/pdf") ;

}


This was just a brief post, as I didn’t have much time. Instructions on how to create PDFs using PDFSharp are available at their website, but I will add in a brief example function if anyone is interested.