Multivariate Pólya distribution

The multivariate Pólya distribution, also called the Dirichlet compound multinomial distribution, is a compound probability distribution: a probability vector p is drawn from a Dirichlet distribution with parameter vector α, and a set of discrete samples x is then drawn from the multinomial distribution with probability vector p. The compounding corresponds to a Pólya urn scheme. In document classification, for example, the distribution is used to model word counts for different document types.
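The two-stage construction described above can be illustrated with a minimal sketch in Python, using only the standard library. The function names (`sample_dirichlet`, `sample_polya_counts`, `sample_urn_counts`) are illustrative and do not come from any particular library. The first sampler draws p from the Dirichlet distribution by normalising independent gamma variates and then draws the samples from the multinomial; the second follows the urn view, in which category k is chosen with probability proportional to α_k plus the number of times it has already been drawn. Both are intended to produce count vectors from the same distribution.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector p ~ Dirichlet(alpha)."""
    # Independent Gamma(alpha_k, 1) draws, normalised to sum to 1.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_polya_counts(alpha, n, rng):
    """Compound sampling: p ~ Dirichlet(alpha), then n multinomial draws."""
    p = sample_dirichlet(alpha, rng)
    counts = [0] * len(alpha)
    for k in rng.choices(range(len(alpha)), weights=p, k=n):
        counts[k] += 1
    return counts


def sample_urn_counts(alpha, n, rng):
    """Urn-style sampling: reinforce each category after it is drawn."""
    counts = [0] * len(alpha)
    for _ in range(n):
        # Weight of category k is alpha_k plus its current count.
        weights = [a + c for a, c in zip(alpha, counts)]
        k = rng.choices(range(len(alpha)), weights=weights, k=1)[0]
        counts[k] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only for reproducibility
    print(sample_polya_counts([1.0, 2.0, 3.0], 10, rng))
    print(sample_urn_counts([1.0, 2.0, 3.0], 10, rng))
```

In both samplers the returned list holds one count per category and its entries sum to n.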

The probability of a vector of counts x given the parameter vector α is obtained by integrating out the parameters p of the multinomial distribution:

\textrm{P}(\mathbf{x}\mid\mathbf{\alpha})=\int_{\mathbf{p}}\textrm{P}(\mathbf{x}\mid \mathbf{p})\textrm{P}(\mathbf{p}\mid\mathbf{\alpha})\textrm{d}\mathbf{p}

which results in the following explicit formula:

\textrm{P}(\mathbf{x}\mid\mathbf{\alpha})=\frac{\left(\sum_{k}n_{k}\right)!}
{\prod_{k}\left(n_{k}!\right)}\frac{\Gamma\left(\sum_{k}\alpha_{k}\right)}
{\Gamma\left(\sum_{k}\left(n_{k}+\alpha_{k}\right)\right)}\prod_{k}\frac{\Gamma(n_{k}+\alpha_{k})}{\Gamma(\alpha_{k})}

where Γ is the gamma function, and n_k is the number of times outcome k occurs in x.
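The explicit formula can be evaluated numerically as in the following sketch, which assumes the counts n_k are supplied directly as a list. The function name `polya_log_pmf` is illustrative. The computation is done in log space with the log-gamma function, since the gamma terms overflow quickly for large counts; this is an implementation choice, not part of the definition.

```python
from math import exp, lgamma


def polya_log_pmf(counts, alpha):
    """Log-probability of a count vector under the multivariate Pólya distribution.

    counts[k] is n_k and alpha[k] is alpha_k, both indexed by outcome k.
    """
    n = sum(counts)
    a = sum(alpha)
    # log of (sum_k n_k)! / prod_k (n_k!), using Gamma(m + 1) = m!
    log_coef = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    # log of Gamma(sum_k alpha_k) / Gamma(sum_k (n_k + alpha_k))
    log_ratio = lgamma(a) - lgamma(n + a)
    # log of prod_k Gamma(n_k + alpha_k) / Gamma(alpha_k)
    log_prod = sum(lgamma(c + al) - lgamma(al) for c, al in zip(counts, alpha))
    return log_coef + log_ratio + log_prod


if __name__ == "__main__":
    print(exp(polya_log_pmf([1, 1], [1.0, 1.0])))
```

As a check, with α = (1, 1) and two samples, the three possible count vectors (2, 0), (1, 1) and (0, 2) each have probability 1/3, so the probabilities sum to one.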

The multivariate Pólya distribution is used in automated document classification and clustering, genetics, economics, combat modeling, and quantitative marketing.
