https://khanrc.tistory.com/entry/Latent-Dirichlet-Allocation-LDA?category=561977
================================================================================
LDA
* A generative model that finds the topics of documents
* Assumption
* If you know the probability distribution of topics a document can have,
* and the probability distribution of words that can appear in each topic,
* then you can generate "the document"
* LDA is the reverse of this generative process
* Given a set of "documents",
* you can infer those "probability distributions",
* and finally you can infer the topics of the documents
* The probability of each topic occurring in a document follows a "Dirichlet distribution" (a sampling sketch follows this list)
* That is why the model is called Latent Dirichlet Allocation
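To make the Dirichlet piece concrete, here is a tiny sketch (my own addition, assuming NumPy; the K and alpha values are arbitrary) showing that each draw from a Dirichlet distribution is itself a probability distribution over the K topics:

import numpy as np

rng = np.random.default_rng(0)
K = 2                      # number of topics, e.g. "food" and "cute animal"
alpha = [0.5] * K          # made-up Dirichlet hyperparameters

# Each draw is a length-K probability vector: one document's topic mixture
for _ in range(3):
    theta = rng.dirichlet(alpha)
    print(theta, theta.sum())   # components are non-negative and sum to 1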
================================================================================
Example documents (each sentence is one document)
- I like to eat broccoli and bananas.
- I ate a banana and spinach smoothie for breakfast.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.
================================================================================
Process of finding 2 topics from the documents above (a runnable sketch follows this list)
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)
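The same exercise can be reproduced with an off-the-shelf implementation. Below is a rough sketch (my own addition) using scikit-learn's LatentDirichletAllocation on the five example documents with K = 2; on such a tiny corpus the numbers will not match the illustrative percentages above, but each document gets a topic mixture and each topic gets a ranked word list.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                  # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                    # per-document topic mixtures (rows sum to 1)
print(doc_topic)

# Top words per topic; components_ rows are (unnormalized) topic-word weights
words = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"Topic {t}:", [words[i] for i in top])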
================================================================================
How does LDA find these topics?
- LDA represents each "document" as a probabilistic "mixture of topics"
- LDA assumes the following generative process (sketched in code after this list)
- Choose N, the number of words that will go into the document, from a Poisson distribution
- From a set of K topics, choose the document's topic mixture,
  based on a Dirichlet distribution
  For example, suppose the 2 topics are "food" and "cute animal";
  a document might mix "food" (1/3) and "cute animal" (2/3)
- Generate each word of the document:
- First, pick a topic according to the document's topic mixture (a multinomial choice)
- For example, "food" with probability 1/3 and "cute animal" with probability 2/3
- Then, using the chosen topic (food or cute animal),
  pick a word from that topic's word distribution, e.g. "broccoli" (30%), "bananas" (15%), ...
- Summary:
- Assume this generative model produced the documents
- LDA works in reverse: from the documents, it finds the most probable topics
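As a compact sketch of that generative story (my own illustration; the vocabulary, the topic-word probabilities, and the Poisson rate are all made up):

import numpy as np

rng = np.random.default_rng(0)

# Made-up topic-word distributions for two topics: "food" and "cute animal"
vocab = ["broccoli", "bananas", "breakfast", "smoothie",
         "chinchillas", "kittens", "cute", "hamster"]
phi = np.array([
    [0.30, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05, 0.20],   # topic "food"
    [0.05, 0.05, 0.05, 0.05, 0.20, 0.20, 0.20, 0.20],   # topic "cute animal"
])

# 1. Choose the document length N from a Poisson distribution
N = rng.poisson(lam=8)

# 2. Choose the document's topic mixture from a Dirichlet distribution
theta = rng.dirichlet([1.0, 1.0])       # e.g. roughly (1/3 food, 2/3 cute animal)

# 3. For each word slot: pick a topic, then pick a word from that topic
doc = []
for _ in range(N):
    z = rng.choice(2, p=theta)          # topic index for this word
    doc.append(rng.choice(vocab, p=phi[z]))
print(theta, doc)

Running this a few times shows how the same two topics can produce quite different-looking documents depending on the sampled mixture.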
================================================================================
How to learn
- You have a document set
- You choose K (the number of topics you want to find)
- You want to find the topic mixture of each document
- You want to find the word distribution of each topic
How does LDA do this?
- Use "collapsed Gibbs sampling" (sketched in code at the end of this section)
- Assign a random topic (out of the K topics) to every word in every document
- Now every document has a (random) topic distribution, and every topic has a (random) word distribution
- These initial distributions then need to be refined iteratively
- For each document d,
- for each word w in document d,
- and for each topic t,
- calculate the following 2 quantities
- 1. p(topic_t | document_d)
  the probability of topic t given document d (the proportion of words in d currently assigned to t)
- 2. p(word_w | topic_t)
  the probability of word w given topic t (the proportion of assignments of w to topic t across all documents)
- Multiply them: p(topic_t | document_d) * p(word_w | topic_t)
- Reassign word w to a new topic t chosen in proportion to this value, and repeat until the assignments stabilize
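A rough sketch of that collapsed Gibbs sampling loop (my own simplification, assuming NumPy; the alpha and beta smoothing hyperparameters are standard additions whose values here are arbitrary):

import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    # Start by assigning a random topic to every word in every document
    z = [rng.integers(K, size=len(d)) for d in docs]
    ndt = np.zeros((len(docs), K))   # doc-topic counts: words in d assigned to t
    ntw = np.zeros((K, V))           # topic-word counts: assignments of w to t
    nt = np.zeros(K)                 # total words assigned to each topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove this word's current assignment from the counts
                ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
                # p(topic_t | document_d) * p(word_w | topic_t), with smoothing
                p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
                t = rng.choice(K, p=p / p.sum())
                # Record the new assignment and put the counts back
                z[d][i] = t
                ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
    return z, ndt, ntw

# Example call (hypothetical word ids over a vocabulary of 4 words):
# z, ndt, ntw = gibbs_lda([[0, 1, 1], [2, 3, 2]], V=4, K=2)

After the loop, normalizing the rows of ndt (plus alpha) and ntw (plus beta) gives the per-document topic mixtures and the per-topic word distributions.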