And it seems to me that the intraclass and interclass cases work with a multinomial distribution within each feature too, because entropy changes faster when variance drops. Amazing.

The only restriction of FCBF is that it computes the distribution structure per feature, not over any set of features (there are cases where, in multidimensional space, the distributions form groups with lower variance around the emergent medians: the 2D “xor” case, for example, or the 2D spiral).
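The xor case is easy to check numerically. A minimal sketch, assuming FCBF’s per-feature score is the usual symmetrical uncertainty SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)) on discrete values (the helper below is my own, not the author’s implementation):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a probability vector, ignoring zeros."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) for discrete arrays."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mi = entropy(px) + entropy(py) - entropy(joint.ravel())
    return 2 * mi / (entropy(px) + entropy(py))

# 2D xor: each feature alone is independent of the class,
# but the pair of features determines it exactly.
rng = np.random.default_rng(0)
f1 = rng.integers(0, 2, 10_000)
f2 = rng.integers(0, 2, 10_000)
cls = f1 ^ f2
print(symmetrical_uncertainty(f1, cls))           # ~0: useless per feature
print(symmetrical_uncertainty(f1 * 2 + f2, cls))  # ~0.67: informative jointly
```

So a per-feature filter scores both xor features near zero, even though together they carry the class perfectly.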

So for complex multidimensional distributions I would advise:

1. use FCBF only to throw out redundant features (use a low threshold)

2. now we have a smaller feature set without redundancy

3. run another, heavier search filter based on multidimensional distances
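Step 3 could be sketched like this; the separation criterion below (mean inter-class distance over mean intra-class distance) is just one possible distance-based score, chosen for illustration, and the exhaustive subset search assumes the feature set is already small after the FCBF pruning in step 1:

```python
import itertools
import numpy as np

def class_separation(X, y, subset):
    """Score a feature subset by mean inter-class distance divided by
    mean intra-class distance (higher = better separated classes)."""
    Z = X[:, subset]
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    same = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
    diff = y[:, None] != y[None, :]
    return d[diff].mean() / d[same].mean()

def distance_search(X, y, k):
    """Heavier multidimensional search: exhaustively score every
    k-feature subset of the (already FCBF-pruned) feature set."""
    best = max(itertools.combinations(range(X.shape[1]), k),
               key=lambda s: class_separation(X, y, s))
    return list(best)
```

Usage: after step 1 has dropped the redundant columns, call `distance_search(X_pruned, y, k)` for the subset sizes you can afford; unlike the per-feature SU score, this sees joint structure such as xor or spirals.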

I hope I’m not bothering you.

So cool =) Thanks again for your code implementation, it’s very helpful!

I’ll try to simulate in Excel the case with 1 feature and 1 class: simulate the intraclass distribution of that feature dropping from maximum variance to low variance, and analyze the SU of that feature. If it rises, then OK: it means the feature’s statistical correlation to the class is rising. But if SU drops, it means FCBF measures the feature as uncorrelated with the class and interprets this as a rise in that feature’s information with respect to the class, which is statistically not good.
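The same experiment is quick to run outside Excel. A minimal sketch, assuming the continuous feature is discretized into equal-width bins before SU is computed (FCBF operates on discrete values; the binning choice here is mine):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a probability vector, ignoring zeros."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def su(x, y, bins=20):
    """Symmetrical uncertainty between a binned feature and a binary class."""
    joint, _, _ = np.histogram2d(x, y, bins=[bins, 2])
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mi = entropy(px) + entropy(py) - entropy(joint.ravel())
    return 2 * mi / (entropy(px) + entropy(py))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 5_000)
for sigma in (3.0, 1.0, 0.3, 0.1):      # intraclass variance dropping
    x = y + rng.normal(0, sigma, y.size)  # class means 0 and 1
    print(sigma, su(x, y))
```

On this setup SU rises as the intraclass variance drops, i.e. it behaves the way the “ok” branch above hopes.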

I implemented it in C# because I had to use it for a project, but I think I just accepted it as a standard algorithm; I didn’t have time to find out why it did what it did, just how it did it. I think I tested it against the output of the Weka module that the authors of the paper had written, to make sure I got everything correct.

I remember that it was still a fairly simple feature selection algorithm – there were other similar ones that used mutual information rather than symmetrical uncertainty. I think it would only work if the data had one classification variable.

– I wonder which combination of features (sensors) FCBF would choose in a case like this:

Suppose we have 2 classes and 6 features, and consider two 3D combinations of those features: the 1st combination forms two 3D spheres (no overlap; statistically it looks like a good distribution), while the 2nd 3D combination gives uniformly distributed noise in 3D space.

In the second case each p(Fij) → avg (because of the uniform distribution).

In the first case each p(Fij) → k·avg, where k > 1 (the frequencies are more closely distributed; suppose binomially, in the case of two 3D spheres).

So, in the 1st case the entropy / information gain of each Fij will be smaller, since the probability (by frequency) of each Fij is greater than in the 2nd case (H(p(x)) → 0 if p(x) → 1), and FCBF will pick the 2nd combination? Or is there some “devil in the details” of the formulation?
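One detail that matters here: SU divides the information gain by H(X) + H(Y), so a feature with lower marginal entropy is not automatically penalized; what counts is how much of the class entropy it removes. A quick numeric sketch (my own helpers, with a 1D projection of the two separated spheres standing in for the 1st combination and class-independent uniform noise for the 2nd, each binned before scoring):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a probability vector, ignoring zeros."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def su(x, y, bins=20):
    """Symmetrical uncertainty between a binned feature and a binary class."""
    joint, _, _ = np.histogram2d(x, y, bins=[bins, 2])
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mi = entropy(px) + entropy(py) - entropy(joint.ravel())
    return 2 * mi / (entropy(px) + entropy(py))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 5_000)
# 1st combination: a 1D projection of two well-separated spheres
sphere = np.where(y == 0, -2.0, 2.0) + rng.normal(0, 0.5, y.size)
# 2nd combination: uniform noise, identical for both classes
noise = rng.uniform(-3, 3, y.size)
print(su(sphere, y))  # high: the feature tracks the class
print(su(noise, y))   # ~0: no class information despite higher entropy
```

At least in this 1D projection the separated-spheres feature wins by a wide margin, suggesting the normalization in SU is exactly the “devil in the details” being asked about.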

PS: anyway, it’s possible to use conditional entropy too: just aggregate the frequencies of x_i where y_j does not occur. But again, it’s not worth it.