Overview:
Feature Clustering with NPclassify
|
NPclassify is feature classification scheme which attempts to have fewer assumptions about such things as the the size and number of classes. So for instance, as in figure 1, we can see that there seem to be three classes. However, a priori, we do not know this. Thus, NPclassify must be able to group these points in a meaningful manner that allows us to skip the need to know how many classes we have or how big they are. |
|
|
|
Figure 1: Given some input of unknown feature vector clusters, we want to group them together so that we can say we have n number of classes and we know which class each feature vector belongs to. In this example, it looks like feature vectors clump into three classes (groups). What we want to find is the image on the right which tells us that we indeed have three clumps that form their own classes. |
|
To cluster features we developed our own clustering algorithm. We did this because most clustering algorithms make strong assumptions about hard parameters that we believe makes them less flexible. For instance the popular EM (Expectation Maximization) (Dempster, 1977) algorithm requires an assumption about the number of classes which are to be clustered. The method we use is a hybrid of minimum spanning tree clustering with a density function and global statistics which help us cluster based upon soft parameters (figure 2). The clustering method works in several steps. The first step is to find the density and distance between each point in the feature space. For each point with feature vector x density d is defined as:
This is the inverse sum of all distance between this point and each other point. Points which lie in very crowded regions or at the center of clusters will have a very high value for d while points in sparse regions or on the edge of clusters will have a very low value for d. Additionally, since this value is taken over all values, it is an absolute measure and not subject quite so much to local effects. The distance itself is calculated at the same time using Euclidean distance defined as: |
|
|
|
Figure 2: (1) An input set of features is plotted in feature space. (2-3) a density function is plotted based upon the sum inverse Euclidean distance between points. A point which is more dense is shown as darker in this figure. (4) Points are connected to points which are the closest point with a higher density. (5-6) Soft parameters for the relative size of edges determine which edges should separate classes. Thus, a very long edge that is several standard deviations longer than an average edge is cut in this example. |
|
The next step is to connect up points (nodes) into a hierarchy using the following rule: Connect each point to another point as a parent iff the parent will have a higher density d and there is no other point closer which has a higher density than this point. What this means is that you connect a point to another point which is the closest point with a higher density than this point. The density metric is not used as a distance substitution but is intended to ensure that points are linked in the direction of higher density. This creates a kind of a gradient descent situation whereby links in clusters move towards higher density, but are constrained by distance into local clusters.
The next step is to decide how to break the tree into smaller but meaningful clusters. We achieve this with soft as well as some hard parameters. The parameters generally fall into three groups. The first group is the standard deviation of edge lengths. The second parameter is the size of a cluster that will be formed in the number of members by standard deviation of descendant number. This means you cannot be a cluster unless you have so many members compared with the normal number of descendants from any given node. The third set of parameters are hard parameters for length size and cluster members. |
|
|
|
Figure 3: When we plot our points for the feature vectors for distance and density we can see that points in local clusters are connected in a hill climbing like manner. When the top most node in a local cluster is reached, the distance to the next node tends to also be a peak in another cluster. As such, the connecting edge is very long. In this example, the middle edge has been removed to form two new classes since it is very long. |
|
For the standard deviation
metric, we sum the distances l from all children nodes to each parent
node. We then compute the mean Euclidean distance E(l) and the
standard deviation E[(l-E(l))2].
The length to cut is defined as:
That is, if an edge length between a child and a parent
is longer than constant C times the standard deviation, cut that edge and
create two new separate classes. The third constraint is that edge lengths longer than a certain size must always be cut, or clusters must always have at least a minimum number of members. These numbers tend to be set somewhat high, but are used as a heuristic of sorts in cases where it seems obvious that the rule should always be applied. Additionally, we can iterate through and create locally weighted effects by doing a second pruning, but only considering statistics within the new classes from the first pass. We found that a second pass seemed very helpful but that a third pass was extraneous. |
|
|
|
Figure 4: From an actual run of NPclassify on 2d input data we can see how it links together the points in the image which are simulated feature vectors. The bounding boxes are to illustrate the ground truth for where classes have been created. The green lines are edges that link points in the same class while the red line is a long link that has been cut. |
|
|