User:Therustyone
Data Clustering using the Information Bottleneck
This application of the bottleneck method to non-Gaussian sampled data is described in [1]. The concept, as treated there, is not without complication, as there are two independent phases in the exercise: first, estimation of the unknown parent probability densities from which the data samples are drawn; second, the use of these densities within the information-theoretic framework of the bottleneck.
Density Estimation
Since the bottleneck method is framed in probabilistic rather than statistical terms, we first need to estimate the underlying probability density at the sample points X. This is a well-known problem with a number of solutions [2]. In the present method, probability densities at the sample points are found by use of a Markov transition matrix method, which has some mathematical synergy with the bottleneck method itself.
Define an arbitrary monotonically increasing distance metric <math>f</math> between all sample pairs and define the distance matrix <math>d_{i,j} = f\big( | x_i - x_j | \big)</math>. Then compute transition probabilities between sample pairs, <math>P_{i,j} = \exp(-\lambda d_{i,j})</math>, for some <math>\lambda > 0</math>. Treating samples as states, and <math>P</math> (with each row normalised to sum to one) as a Markov state transition probability matrix, the row vector of probabilities of the 'states' after <math>t</math> steps, conditioned on the initial state <math>p(0)</math>, is <math>p(t) = p(0) P^t</math>. We are here interested only in the equilibrium probability vector <math>p(\infty)</math>, which is given, in the usual way, by the dominant left eigenvector of matrix <math>P</math> and is independent of the initialising vector <math>p(0)</math>. This Markov transition method establishes a probability at the sample points which is claimed to be proportional to the probability density there.
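The density estimation step can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure of [1]: the squared Euclidean distance, the row normalisation of the transition weights exp(-lam * d), the use of power iteration to find the equilibrium vector, and the names `stationary_distribution`, `lam` and `iters` are all choices made here for illustration.

```python
import math

def stationary_distribution(points, lam=3.0, iters=2000):
    """Estimate a probability at each sample point from a Markov chain.

    Assumptions (not from the source): squared Euclidean distance,
    row-normalised transition weights, power iteration for the eigenvector.
    """
    n = len(points)
    # Unnormalised transition weights exp(-lam * d_ij).
    P = [[math.exp(-lam * sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
          for j in range(n)] for i in range(n)]
    # Normalise each row so that it is a probability distribution.
    P = [[w / sum(row) for w in row] for row in P]
    # Power iteration on the row vector: p <- p P.
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p, P

# Three samples close together and one isolated sample.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0)]
p_inf, P = stationary_distribution(points)
```

In this toy data set the isolated sample ends up with a smaller equilibrium probability than the tightly grouped ones, which is the behaviour expected of a density estimate.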
Clusters
In the following, the reference vector <math>Y</math> contains sample categories and the joint probability <math>p(X,Y)</math> is assumed known. A cluster <math>\tilde x_k</math> is defined by its probability distribution over the data samples, <math>p(\tilde x_k | x_i)</math>. In [1] Tishby et al. present the following iterative set of equations to determine the clusters:
<math>\begin{cases}
p(\tilde x|x)=Kp(\tilde x) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x) \,|| \, p(y| \tilde x)\Big ] \Big)\\
p(y| \tilde x)=\textstyle \sum_x p(y|x)p( \tilde x | x) p(x) \big / p(\tilde x) \\
p(\tilde x) = \textstyle \sum_x p(\tilde x | x) p(x) \\
\end{cases}</math>
The function of each line of the iteration is expanded as follows.
Line 1: This is a matrix valued set of conditional probabilities
<math>A_{i,j} = p(\tilde x_i | x_j )=Kp(\tilde x_i) \exp \Big( -\beta\,D^{KL} \Big[ p(y|x_j) \,|| \, p(y| \tilde x_i)\Big ] \Big)</math>
The Kullback-Leibler distance (divergence) <math>D^{KL}</math> between the <math>Y</math> vectors generated by the sample data <math>x</math> and those generated by its reduced information proxy <math>\tilde x</math> is applied to assess the fidelity of the compressed vector with respect to the categorical data Y, in accordance with the fundamental bottleneck equation. <math>D^{KL}(a\,||\,b)</math> is the Kullback-Leibler distance between distributions <math>a, b</math>:
<math>D^{KL}(a\,||\,b) = \sum_i p(a_i) \log \Big( \frac{p(a_i)}{p(b_i)} \Big)</math>
and <math>K</math> is a scalar normalization. The weighting by the negative exponent of the distance means that prior cluster probabilities are downweighted in line 1 when the Kullback-Leibler distance is large; thus successful clusters grow in probability while unsuccessful ones decay.
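The downweighting described for line 1 can be illustrated numerically. The sketch below assumes discrete distributions held as Python lists. The helper name `kl_divergence`, the small floor `eps` that guards against a zero denominator, and the example probabilities are illustrative assumptions and do not come from [1].

```python
import math

def kl_divergence(a, b, eps=1e-12):
    """D^KL(a || b) = sum_i a_i * log(a_i / b_i); terms with a_i = 0 contribute zero."""
    return sum(ai * math.log(ai / max(bi, eps)) for ai, bi in zip(a, b) if ai > 0)

beta = 2.5                      # illustrative value of the trade-off parameter
p_y_given_x = [0.5, 0.5]        # hypothetical p(y | x) for one sample

# Compare a cluster that matches the sample with one that does not.
for p_y_given_cluster in ([0.5, 0.5], [0.9, 0.1]):
    d = kl_divergence(p_y_given_x, p_y_given_cluster)
    weight = math.exp(-beta * d)   # factor multiplying the prior cluster probability
    print(round(d, 4), round(weight, 4))
```

The matching cluster keeps a weight of one, while the mismatched cluster receives a weight below one, so its share of the sample shrinks after normalisation.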
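Line 2: This is a second matrix-valued set of conditional probabilities
<math>p(y_i | \tilde x_j) = \sum_k p(y_i | x_k) p(\tilde x_j | x_k) p(x_k) \big/ p(\tilde x_j)</math>
The steps in deriving this are as follows. We have, by definition (and because <math>y</math> depends on <math>\tilde x</math> only through <math>x</math>)
<math>p(y_i | \tilde x_j) = \int_x p(y_i | x) p(x | \tilde x_j)\,dx = \int_x p(y_i | x) p(x, \tilde x_j)\,dx \big/ p(\tilde x_j) = \int_x p(y_i | x) p(\tilde x_j | x) p(x)\,dx \big/ p(\tilde x_j)</math>
where the Bayes identities <math>p(a,b) = p(a|b)p(b) = p(b|a)p(a)</math> are used. Finally the integral is rewritten as the summation over the sample points k as in the first equation above.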
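Line 3: This line finds the marginal distribution of the clusters <math>\tilde x</math>
<math>p(\tilde x_i) = \sum_j p(\tilde x_i , x_j) = \sum_j p(\tilde x_i | x_j) p(x_j)</math>
This is also derived from standard results.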
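Further inputs to the algorithm are the marginal sample distribution <math>p(x)</math>, which has already been determined by the dominant eigenvector of <math>P</math>, and the matrix-valued Kullback-Leibler distance function <math>D_{i,j}^{KL} = D^{KL} \Big[ p(y|x_j) \,||\, p(y|\tilde x_i) \Big]</math> derived from the sample spacings and transition probabilities.
The matrices <math>p(y_i | \tilde x_j)</math> and <math>p(\tilde x_i | x_j)</math> can be initialised randomly.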
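The three-line iteration can be sketched as follows. This is a minimal illustration rather than the implementation of [1]. It assumes discrete samples, stores p(c | x_i) as `q[c][i]` with `c` indexing clusters, and randomly initialises only that matrix, computing the other two from it. The function names, the fixed iteration count, the seed and the toy input data are hypothetical.

```python
import math
import random

def kl(a, b, eps=1e-12):
    """D^KL(a || b) for discrete distributions given as lists."""
    return sum(ai * math.log(ai / max(bi, eps)) for ai, bi in zip(a, b) if ai > 0)

def cluster_stats(q, p_x, p_y_x):
    """Lines 3 and 2: p(c) and p(y | c) from the current p(c | x)."""
    n, m = len(p_x), len(p_y_x[0])
    p_c = [sum(qc[i] * p_x[i] for i in range(n)) for qc in q]
    p_y_c = [[sum(p_y_x[i][y] * qc[i] * p_x[i] for i in range(n)) / max(pc, 1e-300)
              for y in range(m)] for qc, pc in zip(q, p_c)]
    return p_c, p_y_c

def ib_iterate(p_x, p_y_x, n_clusters, beta, iters=50, seed=0):
    rng = random.Random(seed)
    n = len(p_x)
    # Random initial p(c | x_i), normalised over clusters for each sample.
    q = [[rng.random() + 0.1 for _ in range(n)] for _ in range(n_clusters)]
    for i in range(n):
        z = sum(qc[i] for qc in q)
        for qc in q:
            qc[i] /= z
    for _ in range(iters):
        p_c, p_y_c = cluster_stats(q, p_x, p_y_x)
        # Line 1: p(c | x_i) = K p(c) exp(-beta * D^KL[p(y|x_i) || p(y|c)]).
        for i in range(n):
            w = [pc * math.exp(-beta * kl(p_y_x[i], pyc)) for pc, pyc in zip(p_c, p_y_c)]
            z = sum(w)  # 1/z plays the role of the normalisation K
            for qc, wc in zip(q, w):
                qc[i] = wc / z
    p_c, p_y_c = cluster_stats(q, p_x, p_y_x)
    return q, p_c, p_y_c

# Toy input: four equally likely samples with hypothetical p(y | x_i) rows.
p_x = [0.25, 0.25, 0.25, 0.25]
p_y_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
q, p_c, p_y_c = ib_iterate(p_x, p_y_x, n_clusters=2, beta=5.0)
```

Each pass recomputes lines 3 and 2 from the current soft assignments and then applies line 1, so the three quantities remain properly normalised throughout.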
Defining Decision Contours
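To categorize a new sample <math>x'</math> external to the training set X, first calculate the probabilities that it belongs to each of the various clusters, that is, the conditional probability <math>p(\tilde x_i | x')</math>. In order to find this, apply the previous distance metric to find the transition probabilities between <math>x'</math> and all samples in <math>X</math>, <math>p(x_i | x') \propto \exp \Big( -\lambda f \big( | x_i - x' | \big) \Big)</math>, normalised to sum to one over the samples. Secondly, apply the last two lines of the 3-line algorithm to get cluster and conditional category probabilities.
<math>p(\tilde x_i | x') = \sum_j p(\tilde x_i | x_j) p(x_j | x')</math>
<math>p(y_i | \tilde x_j) = \sum_k p(y_i | x_k) p(\tilde x_j | x_k) p(x_k | x') \big/ p(\tilde x_j | x')</math>
Finally we have
<math>p(y_i | x') = \sum_j p(y_i | \tilde x_j) p(\tilde x_j | x')</math>
Generally the algorithm converges rapidly, often in tens of iterations. However, parameter <math>\beta</math> must be kept under close supervision since, as it is increased from zero, increasing numbers of features in the category probability space click into focus at certain critical values.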
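The categorisation of a new sample described above can be sketched as follows. The sketch assumes squared Euclidean distance and takes the cluster memberships p(c | x_j) as a given matrix `q[c][j]`. The function name `classify`, the parameter `lam` and the two-sample training set are hypothetical and serve only to illustrate the three sums.

```python
import math

def classify(x_new, points, q, p_y_x, lam=3.0):
    """Return p(y | x') for a new sample x' (a sketch under the stated assumptions)."""
    n, m = len(points), len(p_y_x[0])
    # Transition probabilities p(x_j | x'), normalised over the training samples.
    w = [math.exp(-lam * sum((a - b) ** 2 for a, b in zip(pt, x_new))) for pt in points]
    z = sum(w)
    t = [wj / z for wj in w]
    # Cluster probabilities: p(c | x') = sum_j p(c | x_j) p(x_j | x').
    p_c = [sum(qc[j] * t[j] for j in range(n)) for qc in q]
    # Category given cluster: sum_k p(y|x_k) p(c|x_k) p(x_k|x') / p(c|x').
    p_y_c = [[sum(p_y_x[k][y] * qc[k] * t[k] for k in range(n)) / max(pc, 1e-300)
              for y in range(m)] for qc, pc in zip(q, p_c)]
    # Category given the new sample: p(y | x') = sum_c p(y | c) p(c | x').
    return [sum(p_y_c[c][y] * p_c[c] for c in range(len(q))) for y in range(m)]

# Hypothetical training set: two samples, one per category, two soft clusters.
points = [(-1.0, 0.0), (1.0, 0.0)]
p_y_x = [[1.0, 0.0], [0.0, 1.0]]
q = [[0.9, 0.2], [0.1, 0.8]]          # q[c][j] = p(c | x_j)
probs = classify((-0.9, 0.0), points, q, p_y_x)
```

A new sample placed next to the first training point is assigned almost entirely to that point's category, as would be expected.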
There is some analogy between this algorithm and a neural network with a single hidden layer. The nodes are represented by the clusters <math>\tilde x_j</math>. The first and second layers of network weights are the conditional probabilities <math>p(\tilde x_j | x_i)</math> and <math>p(y_k | \tilde x_j)</math> respectively. However, unlike a standard neural network, the present algorithm always uses probabilities of samples as inputs rather than the sample values themselves, and the non-linear functions are encapsulated in the Kullback-Leibler distances and the transition probabilities rather than in sigmoid functions. Compared to a neural network this algorithm seems to converge much more quickly, and by varying <math>\lambda</math> and <math>\beta</math> various levels of focus on features can be achieved. There are also similarities to some varieties of fuzzy logic algorithms.
For blind classification and clustering, the transient behaviour of <math>p(t)</math> is analysed; this is discussed in more detail in [2]. This extra complication is not necessary for the supervised training described here.
An Example
In the following simple case we investigate clustering in a four quadrant multiplier with random inputs <math>u, v</math> and two categories of output, <math>\pm 1</math>, generated by <math>y = \operatorname{sign}(uv)</math>. This function has the property that there are two spatially separated clusters for each category and so it demonstrates that the method can handle such distributions.
20 samples are taken, uniformly distributed on the square <math>[-1,1]^2</math>. The number of clusters used beyond the number of categories, two in this case, has little effect on performance, and the results are shown for two clusters using parameters <math>\lambda = 3,\, \beta = 2.5</math> and the distance function <math>d_{i,j} = | x_i - x_j |^2</math> where <math>x_i = (u_i, v_i)^T</math>. The figure shows the locations of the twenty samples with '0' representing Y = 1 and 'x' representing Y = -1. The contour at the unity likelihood ratio level, <math>L = \frac{\Pr(1)}{\Pr(-1)} = 1</math>, is shown as a new sample <math>x'</math> is scanned over the square. Theoretically the contour should align with the <math>u = 0</math> and <math>v = 0</math> coordinates, but for such small sample numbers it has instead followed the spurious clusterings of the sample points.
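The data for this example can be generated with a short sketch. It assumes the two inputs are named u and v, that they are drawn uniformly from the square [-1, 1] x [-1, 1], and that the category is the sign of their product. The fixed seed and the variable names are illustrative choices, and the sketch does not reproduce the particular samples plotted in the figure.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable (an assumption)

# 20 samples x_i = (u_i, v_i), uniform on the square [-1, 1] x [-1, 1].
samples = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(20)]

# Category of each sample: +1 where u and v share a sign, otherwise -1.
labels = [1 if u * v > 0 else -1 for u, v in samples]

# p(y | x_i) as rows [Pr(y = +1), Pr(y = -1)]; the labels are noise-free here.
p_y_x = [[1.0, 0.0] if y == 1 else [0.0, 1.0] for y in labels]

# Squared Euclidean distance matrix d_ij = |x_i - x_j|^2.
d = [[(a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 for b in samples] for a in samples]
```

The lists `p_y_x` and `d` are the two ingredients needed by the density estimation and clustering steps described earlier.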
Bibliography
[1] N. Tishby, N. Slonim: “Data clustering by Markovian Relaxation and the Information Bottleneck Method”, Neural Information Processing Systems (NIPS) 2000, pp. 640-646.
[2] B.W. Silverman: “Density Estimation for Statistical Data Analysis”, Chapman and Hall, 1986.

