https://ratsgo.github.io/natural%20language%20processing/2017/03/22/lexicon/
/mnt/1T-5e7/mycodehtml/NLP/Lexical_analysis/Ratsgo/main.html
================================================================================
NLP
================================================================================
Linguistics = Phonology + Morphology + Syntax + Semantics
================================================================================
Phonology
- studies speech sounds
================================================================================
Morphology
- studies words and morphemes
================================================================================
Syntax
- studies sentence structure (grammar)
================================================================================
Semantics
- studies meaning and contextual information
================================================================================
NLP in computer
================================================================================
Lexical Analysis step
================================================================================
POS(Part Of Speech) tagging
"Word": "POS"
================================================================================
NER(Named Entity Recognition)
Classify proper nouns (named entities such as persons, locations, organizations)
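A minimal NER sketch with spaCy (the example text is an assumption; the en_core_web_sm model must be downloaded first, e.g. python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii.")

# each recognized entity has a text span and an entity label
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Barack Obama PERSON, Hawaii GPE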
================================================================================
Co-reference
Compares a "previous word or phrase" to the "current word or phrase"
Are they the same entity?
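A hypothetical illustration (plain Python, not the output of any specific library) of what a coreference resolver produces: mentions grouped into clusters that refer to the same entity:

# hypothetical coreference output for the sentence below
text = "John bought a car. He drives it every day."

coref_clusters = {
    "entity_1": ["John", "He"],    # "He" refers back to "John"
    "entity_2": ["a car", "it"],   # "it" refers back to "a car"
}

for entity, mentions in coref_clusters.items():
    print(entity, "->", mentions)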
================================================================================
Basic dependencies
Unlike the previous methods, which analyze a sentence through the grammatical role of each word,
basic dependencies analyze a sentence through the dependency relations between words
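A minimal dependency-parsing sketch with spaCy (assumes en_core_web_sm is installed); each token is linked to the word it depends on:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog chased the cat.")

for token in doc:
    # token.dep_ = dependency label, token.head = the word this token depends on
    print(token.text, token.dep_, "<--", token.head.text)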
================================================================================
================================================================================
Lexical analysis
- Sentence splitting
- Tokenize
- morphological analysis
- POS tagging
================================================================================
Case 1: sentence splitting is required
Case 2: sentence splitting is optional (e.g., topic modeling)
Before sentence splitting: sentence_a. sentence_b. sentence_c?
After sentence splitting: [sentence_a.], [sentence_b.], [sentence_c?]
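A minimal sentence-splitting sketch with NLTK (the example text is an assumption; assumes the punkt tokenizer data has been downloaded, e.g. nltk.download("punkt")):

from nltk.tokenize import sent_tokenize

text = "I bought a car. It was cheap. Do you like it?"
print(sent_tokenize(text))
# ['I bought a car.', 'It was cheap.', 'Do you like it?']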
================================================================================
Tokenize
Morpheme $$$\subset$$$ Word $$$\subset$$$ Token
Tokenize method:
- English: generally just split on whitespace
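A minimal tokenization sketch: plain whitespace splitting vs. NLTK's word_tokenize, which also separates punctuation (assumes the punkt data is available):

from nltk.tokenize import word_tokenize

sentence = "I stopped the cars."
print(sentence.split())          # ['I', 'stopped', 'the', 'cars.']
print(word_tokenize(sentence))   # ['I', 'stopped', 'the', 'cars', '.']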
================================================================================
Morphological analysis (Text normalization)
Tokens --are converted into--> a more general form
This reduces the number of distinct tokens, making the analysis more efficient
- Example
cars ---> car
car ---> car
stopped ---> stop
stop ---> stop
- Do "folding" (lower case)
Hello ---> hello
hello ---> hello
- Do "stemming"
Convert "word" into "short format"
- Do "lemmatization"
Convert "word" into the basic form which has POS information
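A minimal normalization sketch with NLTK showing case folding, stemming, and lemmatization (assumes the wordnet data has been downloaded, e.g. nltk.download("wordnet")):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print("Hello".lower())                        # hello  (case folding)
print(stemmer.stem("stopped"))                # stop   (stemming)
print(stemmer.stem("cars"))                   # car
print(lemmatizer.lemmatize("stopped", "v"))   # stop   (lemmatization, POS = verb)
print(lemmatizer.lemmatize("cars", "n"))      # car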
================================================================================
================================================================================
POS tagging
Assign a "POS" tag to each "token"
- Techniques
- Decision Trees
- Hidden Markov Model
- SVM
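A minimal POS-tagging sketch with NLTK's built-in tagger (an averaged perceptron, not one of the models listed above; assumes the punkt and averaged_perceptron_tagger data are downloaded):

import nltk
from nltk.tokenize import word_tokenize

tokens = word_tokenize("The car stopped at the red light.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('car', 'NN'), ('stopped', 'VBD'), ...]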
================================================================================
NLTK and spaCy perform the steps of "sentence splitting, tokenization, lemmatization, POS tagging"
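A minimal sketch of that full pipeline with spaCy (assumes en_core_web_sm is installed): sentence splitting, tokenization, lemmatization, and POS tagging in one pass:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cars stopped. Hello world!")

for sent in doc.sents:          # sentence splitting
    for token in sent:          # tokenization
        print(token.text, token.lemma_, token.pos_)   # lemma + POS tag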
================================================================================