https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/20/docsim/
/mnt/1T-5e7/mycodehtml/NLP/Document_similarity/Ratsgo/main.html
================================================================================
Similarity is subjective
- Similarity between two entities increases when they share common attributes
- Each attribute is independent and additive
- Suppose "one document is one entity"
  - "One document" is composed of 5 variables
  - The 5 variables are independent of (or uncorrelated with) each other
  - If you plot this document in a "vector space", the "basis" vectors of the variables are perpendicular
- You can extend the number of variables to 6, 7, ...
- The level of abstraction of each attribute should be identical
  - Variable1: car, Variable2: BMW -> not correct (different levels of abstraction)
  - Variable1: car, Variable2: airplane -> correct
- A similarity measure should be rich enough to explain the "conceptual structure"
================================================================================
- If two documents share many common words, they have high similarity
================================================================================
Notation
- $$$x_{ik}$$$: how many times the kth word appears in the ith document
- $$$t_{ik}$$$: 1 if $$$x_{ik}>0$$$, 0 otherwise
================================================================================
[Example term-document matrix omitted: Doc1 contains Term1 and Term3; Doc2 contains Term3, Term4, Term5; Term2 appears in neither document]
================================================================================
- Meaning
- $$$b_{12}=1$$$: count the words which appear in Doc1 but not in Doc2; that is Term1, so $$$b_{12}=1$$$
- $$$c_{12}=2$$$: count the words which do not appear in Doc1 but appear in Doc2; they're Term4 and Term5, so $$$c_{12}=2$$$
- $$$a_{12}=1$$$: count the words which appear in both Doc1 and Doc2; that is Term3, so $$$a_{12}=1$$$
- $$$d_{12}=1$$$: count the words which appear in neither Doc1 nor Doc2; that is Term2, so $$$d_{12}=1$$$
================================================================================
Common features model
$$$s_{ij}=\dfrac{a_{ij}}{a_{ij}+b_{ij}+c_{ij}+d_{ij}}$$$
(words common to Doc_i and Doc_j over the number of entire words)
- $$$d_{ij}$$$: words which appear in neither Doc_i nor Doc_j
- $$$d_{ij}$$$ is a very large value because each document contains only a small fraction of the entire vocabulary
- In the (omitted) example table, the similarity between Doc3 and Doc2 comes out high
================================================================================
Ratio model
Drop $$$d_{ij}$$$ from the common features model:
$$$s_{ij}=\dfrac{a_{ij}}{a_{ij}+b_{ij}+c_{ij}}$$$
================================================================================
Simple matching coefficient
Use $$$d_{ij}$$$ in both the numerator and the denominator of the common features model:
$$$s_{ij}=\dfrac{a_{ij}+d_{ij}}{a_{ij}+b_{ij}+c_{ij}+d_{ij}}$$$
================================================================================
Jaccard similarity
$$$s_{ij}=\dfrac{a_{ij}}{a_{ij}+b_{ij}+c_{ij}}$$$
This is basically the same metric as the "ratio model"
================================================================================
Overlap similarity
$$$s_{ij}=\dfrac{a_{ij}}{\min(a_{ij}+b_{ij},\ a_{ij}+c_{ij})}$$$
================================================================================
Cosine similarity
$$$s_{ij}=\dfrac{a_{ij}}{\sqrt{(a_{ij}+b_{ij})(a_{ij}+c_{ij})}}$$$ (on binary term vectors)
The most commonly used way to calculate similarity
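Below is a minimal Python sketch (not from the original post) that computes the four counts for the Doc1/Doc2 example and evaluates each of the metrics above; the two lists are the binary $$$t_{ik}$$$ indicators over Term1..Term5.

import math

# Binary presence vectors t_ik over Term1..Term5
doc1 = [1, 0, 1, 0, 0]   # Doc1 contains Term1 and Term3
doc2 = [0, 0, 1, 1, 1]   # Doc2 contains Term3, Term4, Term5

a = sum(1 for x, y in zip(doc1, doc2) if x == 1 and y == 1)  # in both docs
b = sum(1 for x, y in zip(doc1, doc2) if x == 1 and y == 0)  # Doc1 only
c = sum(1 for x, y in zip(doc1, doc2) if x == 0 and y == 1)  # Doc2 only
d = sum(1 for x, y in zip(doc1, doc2) if x == 0 and y == 0)  # in neither doc

common_features = a / (a + b + c + d)                # common words / entire words
ratio_model     = a / (a + b + c)                    # drop d from the denominator
smc             = (a + d) / (a + b + c + d)          # simple matching coefficient
jaccard         = a / (a + b + c)                    # same value as the ratio model
overlap         = a / min(a + b, a + c)              # overlap coefficient
cosine          = a / math.sqrt((a + b) * (a + c))   # cosine on binary vectors

print(a, b, c, d)                            # 1 1 2 1
print(round(jaccard, 3), round(cosine, 3))   # 0.25 0.408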
================================================================================
Conversation
Question:
- Can Word2Vec be used to compare "Doc1" and "Doc2"?
- Planned model:
  - Get a new URL from the client
  - Crawl that URL and extract keywords from the news text
  - Search "all news companies" using the extracted keywords
  - Get similar "news texts"
  - Give the similar "news texts" to the client
Answer:
- Word2Vec is not very good for calculating the similarity between "Doc1" and "Doc2"
  - W2V converts a "word" into a "vector"
  - But word "vectors" do not directly become a "document" vector
- So, Doc2Vec is suggested (see the first sketch at the end of this note)
  - Perform Word2Vec
  - Each document has an ID, and the ID of each document is converted into a vector
  - The vectorized documents can be compared by "cosine similarity"
  - Doc2Vec is included in gensim: https://radimrehurek.com/gensim/models/doc2vec.html
- But D2V abstracts the "document ID" too much, so the actual embedding performance is not that good
- So, using LSA can be another good choice (see the second sketch at the end of this note)
  - Create a term-document matrix from the entire corpus and all documents
  - Perform SVD to convert each "document" into a "vector"
  - The "document vectors" can be compared by "cosine similarity"
  - ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/06/pcasvdlsa/
- Additionally, the summary of a news text is mainly included in its title text
- So, graph-based analysis can also be a good approach
  - ratsgo.github.io/natural%20language%20processing/2017/03/13/graphnews/
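A minimal sketch of the Doc2Vec route described in the answer, using gensim's Doc2Vec (gensim >= 4 API, where per-document vectors live in model.dv). The toy corpus and the parameter values are illustrative assumptions, not from the original conversation.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Illustrative toy corpus; in the planned model these would be crawled news texts
corpus = [
    "stock market fell sharply today",
    "market declined on interest rate fears",
    "new smartphone model released this week",
]

# Each document gets an ID (its tag); Doc2Vec learns one vector per tag
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Infer a vector for a new (client-supplied) news text, then rank the
# stored documents by cosine similarity against it
new_vec = model.infer_vector("market fell on rate fears".split())
print(model.dv.most_similar([new_vec], topn=2))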
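And a minimal sketch of the LSA route, using plain numpy: build the term-document matrix, truncate the SVD, and compare the resulting document vectors with cosine similarity. The corpus and the kept rank r = 2 are illustrative assumptions.

import numpy as np

# Illustrative toy corpus (same assumption as the Doc2Vec sketch)
corpus = [
    "stock market fell sharply today",
    "market declined on interest rate fears",
    "new smartphone model released this week",
]
vocab = sorted({w for text in corpus for w in text.split()})

# Term-document matrix X: X[k, i] = count of the kth word in the ith document
X = np.zeros((len(vocab), len(corpus)))
for i, text in enumerate(corpus):
    for w in text.split():
        X[vocab.index(w), i] += 1

# Truncated SVD: keep the top r singular values; the columns of
# diag(S_r) @ Vt_r are the r-dimensional document vectors
U, S, Vt = np.linalg.svd(X, full_matrices=False)
r = 2
doc_vecs = (np.diag(S[:r]) @ Vt[:r]).T   # shape: (n_docs, r)

def cos_sim(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_sim(doc_vecs[0], doc_vecs[1]))  # Doc1 vs Doc2 (both market news)
print(cos_sim(doc_vecs[0], doc_vecs[2]))  # Doc1 vs Doc3 (unrelated topics)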