Test "spell check algorithm using N-gram" citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.186.3996&rep=rep1&type=pdf along with Levenshtein distance algorithm ================================================================================ Test "spell check algorithm using N-gram" by using 2 words: helro (misspelled one), hello (correct word) u: helro he el lr ro v: hello he el ll lo u_{1,1}=substring(helro,1,1)=h v_{1,1}=substring(hello,1,1)=h u_{5,5}=substring(helro,5,5)=o v_{5,5}=substring(hello,5,5)=o g(h,h)=1 g(o,o)=1 $$$\sum\limits_{i=2}^{4} \sum\limits_{j=-1}^{1} g(u_{i,i+(n-1)},v_{i+j,i+j+(n-1)})$$$ $$$\sum\limits_{i=2}^{4} g(u_{i,i+1},v_{i-1,i}) + g(u_{i,i+1},v_{i,i+1}) + g(u_{i,i+1},v_{i+1,i+2}) $$$ $$$g(u_{2,3},v_{1,2}) + g(u_{3,4},v_{2,3}) + g(u_{4,5},v_{3,4}) \\ + g(u_{2,3},v_{2,3}) + g(u_{3,4},v_{3,4}) + g(u_{4,5},v_{4,5}) \\ + g(u_{2,3},v_{3,4}) + g(u_{3,4},v_{4,5}) + g(u_{4,5},v_{5,6}) $$$ u: helro v: hello g(el,he) + g(lr,el) + g(ro,ll) + g(el,el) + g(lr,ll) + g(ro,lo) + g(el,ll) + g(lr,lo) + g(ro,o ) = 0+0+0 + 1+0+0 + 0+0+0 helro: he, el, lr, ro hello: he, el, ll, lo The number of unique union 2gram: 5 (he, el, lr, ro, ll, lo) $$$S_{2,3}("hello","helro") = \dfrac{(1+1) + (0+0+0 + 1+0+0 + 0+0+0)}{5} = 0.6$$$ ================================================================================ Test spell check algorithm + Levenshtein distance algorithm by using words of Giama (misspelled one), Giana (correct word candidate1), Giina (correct word candidate2), Giani (correct word candidate3) ================================================================================ Typo: Giama Correct candidates: Giana, Giina, Giani (which correction is the best for the misspelled Giama?) ================================================================================ Case1 Correction candidate_1: Giana u: Giama Gi ia am ma v: Giana Gi ia an na $$$u_{1,1}$$$=substring(Giama,1,1)=G $$$v_{1,1}$$$=substring(Giana,1,1)=G $$$u_{5,5}$$$=substring(Giama,5,5)=a $$$v_{5,5}$$$=substring(Giana,5,5)=a g(G,G)=1 g(a,a)=1 $$$\sum\limits_{i=2}^{4} \sum\limits_{j=-1}^{1} g(u_{i,i+(n-1)},v_{i+j,i+j+(n-1)})$$$ $$$\sum\limits_{i=2}^{4} g(u_{i,i+1},v_{i-1,i}) + g(u_{i,i+1},v_{i,i+1}) + g(u_{i,i+1},v_{i+1,i+2}) $$$ $$$g(u_{2,3},v_{1,2}) + g(u_{3,4},v_{2,3}) + g(u_{4,5},v_{3,4}) \\ + g(u_{2,3},v_{2,3}) + g(u_{3,4},v_{3,4}) + g(u_{4,5},v_{4,5}) \\ + g(u_{2,3},v_{3,4}) + g(u_{3,4},v_{4,5}) + g(u_{4,5},v_{5,6}) $$$ u: Giama v: Giana g(ia,Gi) + g(am,ia) + g(ma,an) + g(ia,ia) + g(am,an) + g(ma,na) + g(ia,an) + g(am,na) + g(ma,a ) = 0+0+0 + 1+0+0 + 0+0+0 Giama: Gi, ia, am, ma Giana: Gi, ia, an, na The number of unique union 2gram: 6 ('am', 'ma', 'ia', 'Gi', 'an', 'na') $$$S_{2,3}("Giama","Giana") = \dfrac{(1+1) + (0+0+0 + 1+0+0 + 0+0+0)}{6} = 0.5$$$ ================================================================================ Case2 Correction candidate_2: Giina u: Giama Gi ia am ma v: Giina Gi ii in na $$$u_{1,1}$$$=substring(Giama,1,1)=G $$$v_{1,1}$$$=substring(Giina,1,1)=G $$$u_{5,5}$$$=substring(Giama,5,5)=a $$$v_{5,5}$$$=substring(Giina,5,5)=a g(G,G)=1 g(a,a)=1 $$$\sum\limits_{i=2}^{4} \sum\limits_{j=-1}^{1} g(u_{i,i+(n-1)},v_{i+j,i+j+(n-1)})$$$ $$$\sum\limits_{i=2}^{4} g(u_{i,i+1},v_{i-1,i}) + g(u_{i,i+1},v_{i,i+1}) + g(u_{i,i+1},v_{i+1,i+2}) $$$ $$$g(u_{2,3},v_{1,2}) + g(u_{3,4},v_{2,3}) + g(u_{4,5},v_{3,4}) \\ + g(u_{2,3},v_{2,3}) + g(u_{3,4},v_{3,4}) + g(u_{4,5},v_{4,5}) \\ + g(u_{2,3},v_{3,4}) + g(u_{3,4},v_{4,5}) + g(u_{4,5},v_{5,6}) $$$ u: Giama v: Giina g(ia,Gi) + g(am,ii) + g(ma,in) + g(ia,ia) + g(am,in) + g(ma,na) + g(ia,in) + g(am,na) + g(ma,a ) = 0+0+0 + 1+0+0 + 0+0+0 Giama: Gi ia am ma Giina: Gi ii in na The number of unique union 2gram: 7 ('am', 'ma', 'ia', 'ii', 'Gi', 'in', 'na') $$$S_{2,3}("Giama","Giana") = \dfrac{(1+1) + (0+0+0 + 1+0+0 + 0+0+0)}{7} = 0.43$$$ ================================================================================ Case3 Correction candidate_2: Giani u: Giama Gi ia am ma v: Giani Gi ia an ni $$$u_{1,1}$$$=substring(Giama,1,1)=G $$$v_{1,1}$$$=substring(Giani,1,1)=G $$$u_{5,5}$$$=substring(Giama,5,5)=a $$$v_{5,5}$$$=substring(Giani,5,5)=i g(G,G)=1 g(a,i)=0 $$$\sum\limits_{i=2}^{4} \sum\limits_{j=-1}^{1} g(u_{i,i+(n-1)},v_{i+j,i+j+(n-1)})$$$ $$$\sum\limits_{i=2}^{4} g(u_{i,i+1},v_{i-1,i}) + g(u_{i,i+1},v_{i,i+1}) + g(u_{i,i+1},v_{i+1,i+2}) $$$ $$$g(u_{2,3},v_{1,2}) + g(u_{3,4},v_{2,3}) + g(u_{4,5},v_{3,4}) \\ + g(u_{2,3},v_{2,3}) + g(u_{3,4},v_{3,4}) + g(u_{4,5},v_{4,5}) \\ + g(u_{2,3},v_{3,4}) + g(u_{3,4},v_{4,5}) + g(u_{4,5},v_{5,6}) $$$ u: Giama v: Giani g(ia,Gi) + g(am,ia) + g(ma,an) + g(ia,ia) + g(am,an) + g(ma,ni) + g(ia,an) + g(am,ni) + g(ma,i ) = 0+0+0 + 1+0+0 + 0+0+0 Giama: Gi ia am ma Giani: Gi ia an ni The number of unique union 2gram: 6 ('am', 'ma', 'ia', 'Gi', 'an', 'ni') $$$S_{2,3}("Giama","Giana") = \dfrac{(1+1) + (0+0+0 + 1+0+0 + 0+0+0)}{6} = 0.5$$$ ================================================================================ Use Levenshtein distance (or edit distance as synonym) # u: Giama Gi ia am ma # v: Giana Gi ia an na # u: Giama Gi ia am ma # v: Giani Gi ia an ni one_misspelled_one="Giama" one_word="Giana" score_val=nltk.edit_distance(one_misspelled_one,one_word) # 1 one_misspelled_one="Giama" one_word="Giani" score_val=nltk.edit_distance(one_misspelled_one,one_word) # 2 ================================================================================ Distance: 1 < Distance: 2 Therefore, "Giana" is the best correction word for the misspelled "Giama"