https://khanrc.tistory.com/entry/Latent-Dirichlet-Allocation-LDA?category=561977
================================================================================
LDA
* Generative model that finds the topics of documents
* Assumption
  * If you know the probability distribution over topics that a document can have,
  * and you know the probability distribution over words that can appear in each topic,
  * then you can generate "the document"
* LDA is the reverse of the above assumption
  * Given the "documents",
  * you can infer the "probability distributions",
  * and finally, you can infer the topics of the documents
* The probability of each topic occurring in a document follows a "Dirichlet distribution"
  * That is why it is called Latent Dirichlet Allocation
================================================================================
Example documents (each sentence is one document)
- I like to eat broccoli and bananas.
- I ate a banana and spinach smoothie for breakfast.
- Chinchillas and kittens are cute.
- My sister adopted a kitten yesterday.
- Look at this cute hamster munching on a piece of broccoli.
================================================================================
Process of finding 2 topics from the above documents
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ...
  (at which point, you could interpret topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ...
  (at which point, you could interpret topic B to be about cute animals)
================================================================================
How does LDA find the above topics?
- LDA expresses each "document" as a probabilistic "mixture of topics"
- LDA assumes the following generative process (a minimal sketch appears at the end of this note)
  - Choose N, the number of words the document will contain, from a Poisson distribution
  - From a set of K topics, choose the document's topic mixture from a Dirichlet distribution
    For example, suppose the 2 topics are "food" and "cute animal";
    a document might choose "food" (1/3) and "cute animal" (2/3)
  - Then generate each word of the document
    - First, select a topic from the document's topic mixture (a multinomial distribution)
      - For example, "food" (1/3) and "cute animal" (2/3)
    - Then, from the selected topic (food or cute animal), generate a word such as "broccoli" (30%), "bananas" (15%), ...
  - Summary:
    - You assume this "generative model" produced the documents
    - LDA works backwards from the documents, finding the topic assignments that explain them with the highest probability
================================================================================
How to learn
- You have a document set
- You configure K (the number of topics you want to find)
- You want to find the "topics" of each document
- You want to find the word distribution of each "topic"
How to do it using LDA?
- You use "collapsed Gibbs sampling" (a minimal sketch appears at the end of this note)
  - You assign a random topic (out of the K topics) to each word in each document
  - Now, every document has a topic distribution, and every topic has a word distribution
  - These initial random distributions then need to be improved iteratively
  - For each document d,
    - for each word w in document d,
      - for each topic t,
        - you calculate the following 2 things
        - 1. p(topic_t | document_d): given document d, the probability of topic t occurring
             (in practice, the proportion of words in d currently assigned to topic t)
        - 2. p(word_w | topic_t): given topic t, the probability of word w occurring
             (in practice, the proportion of assignments to topic t that are the word w)
      - calculate p(topic_t | document_d) * p(word_w | topic_t) for every topic t
      - based on these values, reassign word w to a new topic t
  - Repeat until the assignments stabilize
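================================================================================
Appendix: a minimal sketch of the generative process described in "How does LDA find the above topics?". The topic word distributions reuse the post's "food" / "cute animal" example; the "other" filler word, the Dirichlet prior alpha, and the average document length are my own illustrative assumptions, not values from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Word distributions per topic, mirroring the post's example percentages.
# "other" is a filler entry (an assumption) so each distribution sums to 1.
topics = {
    "food":        {"broccoli": 0.30, "bananas": 0.15, "breakfast": 0.10,
                    "munching": 0.10, "other": 0.35},
    "cute animal": {"chinchillas": 0.20, "kittens": 0.20, "cute": 0.20,
                    "hamster": 0.15, "other": 0.25},
}
topic_names = list(topics)
alpha = np.ones(len(topic_names))  # symmetric Dirichlet prior (assumed)

def generate_document(avg_length=8):
    # 1. Choose the document length N from a Poisson distribution.
    n_words = max(1, rng.poisson(avg_length))
    # 2. Choose the document's topic mixture from a Dirichlet distribution,
    #    e.g. ("food": 1/3, "cute animal": 2/3).
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 3a. Pick a topic for this word from the topic mixture.
        topic = topic_names[rng.choice(len(topic_names), p=theta)]
        # 3b. Pick a word from that topic's word distribution.
        vocab = list(topics[topic])
        probs = np.array(list(topics[topic].values()))
        words.append(rng.choice(vocab, p=probs / probs.sum()))
    return theta, words

theta, doc = generate_document()
print("topic mixture:", dict(zip(topic_names, theta.round(2))))
print("document:", " ".join(doc))
```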
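Appendix: a minimal sketch of the collapsed Gibbs sampling loop described in "How to learn", run on the five example documents. The hyperparameters alpha and beta, K = 2, and the number of iterations are illustrative assumptions; a real run would tune them and use many more iterations.

```python
import numpy as np

docs = [
    "i like to eat broccoli and bananas".split(),
    "i ate a banana and spinach smoothie for breakfast".split(),
    "chinchillas and kittens are cute".split(),
    "my sister adopted a kitten yesterday".split(),
    "look at this cute hamster munching on a piece of broccoli".split(),
]
K, alpha, beta, n_iters = 2, 0.1, 0.01, 200  # assumed hyperparameters
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}
V = len(vocab)
rng = np.random.default_rng(0)

# Count tables: document-topic counts, topic-word counts, topic totals.
ndk = np.zeros((len(docs), K))
nkw = np.zeros((K, V))
nk = np.zeros(K)

# Step 1: assign a random topic to every word in every document.
z = []
for d, doc in enumerate(docs):
    z_d = []
    for w in doc:
        t = rng.integers(K)
        z_d.append(t)
        ndk[d, t] += 1
        nkw[t, w2i[w]] += 1
        nk[t] += 1
    z.append(z_d)

# Step 2: repeatedly resample each word's topic.
for _ in range(n_iters):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Remove the current assignment from the counts.
            ndk[d, t] -= 1
            nkw[t, w2i[w]] -= 1
            nk[t] -= 1
            # p(topic_t | document_d) * p(word_w | topic_t) for every topic t.
            p_t_given_d = ndk[d] + alpha
            p_w_given_t = (nkw[:, w2i[w]] + beta) / (nk + V * beta)
            p = p_t_given_d * p_w_given_t
            # Choose a new topic in proportion to these values.
            t = rng.choice(K, p=p / p.sum())
            z[d][i] = t
            ndk[d, t] += 1
            nkw[t, w2i[w]] += 1
            nk[t] += 1

# Estimated topic mixture per document after sampling.
print((ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True))
```

With a seed and enough iterations, documents 1-2 and 3-4 tend to concentrate on different topics, matching the intuition in the example above.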