https://www.youtube.com/watch?v=4f9XC8HHluE&t=24s
================================================================================
n-gram
- a "contiguous sequence" of N tokens
- tokens: words, characters, ...
================================================================================
1-gram (unigram)
Given sentence: fine thank you
1-gram at word level: [fine, thank, you]
1-gram at character level: [f,i,n,e, ,t,h,a,n,k, ,y,o,u]
================================================================================
2-gram (bigram)
Given sentence: fine thank you
2-gram at word level: [fine thank, thank you]
2-gram at character level: [fi,in,ne,e , t,th,ha,an,nk,k , y,yo,ou]
================================================================================
3-gram (trigram)
Given sentence: fine thank you
3-gram at word level: [fine thank you]
3-gram at character level: [fin,ine,ne ,e t, th,tha,han,ank,nk ,k y, yo,you]
================================================================================
Why use n-grams?
- To overcome drawbacks of "Bag of Words" (BOW):
  BOW ignores the order (sequence) of the words in a sentence
- n-grams can be used for next-word prediction
  (see the prediction sketch at the end of these notes)
- n-grams can detect misspellings
  (see the spell-checker sketch at the end of these notes)
- etc.
================================================================================
Sentence: "machine learning is fun and is not boring"
Bag of Words:
  words:     machine  fun  is  learning  and  not  boring
  frequency: 1        1    2   1         1    1    1
Drawback 1: BOW sees the separate words "machine" and "learning"
            rather than the phrase "machine learning"
Drawback 2: the position of "not" is lost; these sentences all have
            the same BOW representation:
  machine learning is not fun and is boring
  = machine learning is fun and is not boring
  = not machine learning is fun and is boring
  ...
================================================================================
2-gram
  words: machine learning | learning is | is fun | fun and | and is | is not | not boring
  freq:  1                  1             1        1         1        1        1
Good catch: "machine learning", "not boring"
- a contextual representation
================================================================================
"Naive" next-word prediction example
Sentence: how are you doing
3-grams: [how are you, are you doing]
Sentence: how are you
3-grams: [how are you]
Sentence: how are they
3-grams: [how are they]
Suppose an NLP model is trained on (memorizes) the 3-grams above.
- Situation: the user inputs "how are"
- Then, what is the next word?
  "how are you" occurs 2 times in the training memory
  "how are they" occurs 1 time in the training memory
- So the predicted next word should be "you"
================================================================================
Naive spell checker using character 2-grams
Vocabulary file: quality quater quit
Bigram-frequency table:
  bigram  count
  qu      3
  ua      2
  al      1
  li      1
  it      2
  ty      1
  at      1
  te      1
  er      1
  ui      1
- Suppose the user types "qwal"
  There is no "qw" in the bigram-frequency table
  There is no "wa" in the bigram-frequency table
  -> "qwal" contains bigrams never seen in the vocabulary,
     so it is likely a misspelling
  "qu" occurs most often and "ua" occurs second-most often,
  so a "qua..." word (e.g., "quality") is a likely correction
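================================================================================
Python sketch: extracting n-grams
A minimal sketch (my own illustration, not from the video) showing how the
word-level and character-level n-grams above can be computed; the helper
name "ngrams" is an assumption.

from typing import List, Sequence

def ngrams(tokens: Sequence, n: int) -> List[tuple]:
    # every contiguous window of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "fine thank you"

# word level: tokens are words
print([" ".join(g) for g in ngrams(sentence.split(), 2)])
# -> ['fine thank', 'thank you']

# character level: tokens are characters (spaces included)
print(["".join(g) for g in ngrams(list(sentence), 3)])
# -> ['fin', 'ine', 'ne ', 'e t', ' th', 'tha', 'han', 'ank', 'nk ', 'k y', ' yo', 'you']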
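================================================================================
Python sketch: BOW cannot separate the "not" sentences, bigrams can
A small check (again my own illustration): the two sentences below get
identical Bag-of-Words counts but different word-bigram counts, which is
exactly Drawback 2 above.

from collections import Counter

def bow(sentence):
    # word frequencies only; order is thrown away
    return Counter(sentence.split())

def word_bigrams(sentence):
    words = sentence.split()
    return Counter(zip(words, words[1:]))

a = "machine learning is fun and is not boring"
b = "machine learning is not fun and is boring"

print(bow(a) == bow(b))                    # True  -> BOW confuses them
print(word_bigrams(a) == word_bigrams(b))  # False -> "not boring" vs "not fun" survive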
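================================================================================
Python sketch: naive next-word prediction from 3-gram counts
A sketch of the counting argument above, assuming the "model" is nothing
more than a table of trigram frequencies; "predict_next" is a hypothetical
helper name.

from collections import Counter

corpus = ["how are you doing", "how are you", "how are they"]

# count every word-level trigram in the training corpus
trigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 2):
        trigram_counts[tuple(words[i:i + 3])] += 1

def predict_next(w1, w2):
    # among trigrams starting with (w1, w2), pick the most frequent third word
    candidates = {t[2]: c for t, c in trigram_counts.items() if t[:2] == (w1, w2)}
    return max(candidates, key=candidates.get) if candidates else None

print(predict_next("how", "are"))  # -> 'you' (count 2 beats 'they' with count 1)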
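================================================================================
Python sketch: naive spell checker with character 2-grams
A sketch of the table-lookup idea above: build the bigram-frequency table
from the toy vocabulary, then flag input bigrams that never occur in it;
"unseen_bigrams" is a hypothetical helper name.

from collections import Counter

vocab = ["quality", "quater", "quit"]

# character-bigram frequency table built from the vocabulary file
bigram_counts = Counter()
for word in vocab:
    for i in range(len(word) - 1):
        bigram_counts[word[i:i + 2]] += 1

def unseen_bigrams(word):
    # bigrams of the input that never appear in the vocabulary table
    return [word[i:i + 2] for i in range(len(word) - 1)
            if word[i:i + 2] not in bigram_counts]

print(bigram_counts.most_common(3))  # [('qu', 3), ('ua', 2), ('it', 2)]
print(unseen_bigrams("qwal"))        # ['qw', 'wa'] -> "qwal" looks misspelled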