Latent Dirichlet Allocation (LDA) is a topic model for text.
In LDA, each document has a distribution over topics and each topic has a distribution over words.
Each word in a document is generated by first drawing a topic from the document's topic distribution and then drawing the word from that topic's word distribution.
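The generative story above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `generate_corpus`, the fixed document length, and the symmetric hyperparameters `alpha` and `beta` are all assumptions made for the example.

```python
import numpy as np

def generate_corpus(n_docs=5, doc_len=20, n_topics=3, vocab_size=10,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative only)."""
    rng = np.random.default_rng(seed)
    # phi[k] is the word distribution of topic k (one Dirichlet draw per topic).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # theta[d] is the topic distribution of document d (one draw per document).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    docs, topics = [], []
    for d in range(n_docs):
        # Draw a topic for every word position from the document's topic distribution.
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # Draw each word from the word distribution of its assigned topic.
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append(w)
        topics.append(z)
    return docs, topics, theta, phi

docs, topics, theta, phi = generate_corpus()
print(docs[0])    # word ids of the first document
print(topics[0])  # the latent topic behind each word
```

Words are represented as integer ids here; in practice the ids would index into a vocabulary built from the corpus.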
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of
discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each
item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in
turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of
text modeling, the topic probabilities provide an explicit representation of a document. We present
efficient approximate inference techniques based on variational methods and an EM algorithm for
empirical Bayes parameter estimation. We report results in document modeling, text classification,
and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
- A text corpus is a large, structured set of texts. Corpora are used for statistical analysis and hypothesis testing.
- Discrete data is data that takes values from a countable set, such as categories or counts. The values cannot be meaningfully subdivided.
- In statistics, an expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. (cs229.stanford.edu/notes/cs229-notes8.pdf)
- Probabilistic Latent Semantic Indexing (PLSI) (http://www.ece.umd.edu/~smiran/PLSI.pdf)
- Bayesian parameter estimation (BPE) is a technique for estimating the probability density of random variables or vectors, which is then used for decision making or future inference. In summary: treat the unknown parameters as random variables, and assume a prior distribution for them.
- Implicit and Explicit Representations: See https://hal.archives-ouvertes.fr/inria- ... 7/document
- Mixture model: See https://en.wikipedia.org/wiki/Mixture_model
- Bayesian hierarchical modelling: See https://en.wikipedia.org/wiki/Bayesian_ ... l_modeling
- de Finetti's representation theorem
LDA was originally estimated with variational Bayes (Blei+ 2003), but collapsed Gibbs sampling (CGS) is generally regarded as giving more accurate estimates.
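To make the collapsed Gibbs sampling idea concrete, here is a minimal sketch under stated assumptions: documents are lists of integer word ids, the priors `alpha` and `beta` are symmetric, and the function name `collapsed_gibbs_lda` and its parameters are hypothetical. It resamples each word's topic from its conditional distribution given all other assignments, with the topic and word distributions integrated out. It is not tuned for speed and does not check convergence.

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, vocab_size,
                        alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # words in doc d assigned to topic k
    n_kw = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    n_k = np.zeros(n_topics)                 # total words assigned to topic k
    z = []
    # Random initial topic assignment for every word position.
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Conditional p(z = k | all other assignments), up to a constant.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment.
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    # Point estimates of the document-topic and topic-word distributions.
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_kw + beta) / (n_k[:, None] + vocab_size * beta)
    return theta, phi, z

toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 3, 4]]
theta, phi, z = collapsed_gibbs_lda(toy_docs, n_topics=2, vocab_size=6)
print(theta.round(2))
```

The estimates come from the counts at the final sweep only; averaging over several sweeps after a burn-in period is a common refinement that this sketch omits.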