Linguistics = Phonology + Morphology + Syntax + Semantics
- Phonology: studies speech sounds
- Morphology: studies words and morphemes
- Syntax: studies grammar (sentence structure)
- Semantics: studies meaning and contextual information
NLP on a computer
Lexical Analysis step
POS (part-of-speech) tagging
"Word": "POS"
NER (Named Entity Recognition)
Classify proper nouns (named entities)
Compare the previous word or phrase with the current word or phrase:
do they refer to the same entity?
Basic dependencies
Unlike the previous methods, which analyze a sentence using the grammatical category of each word,
basic dependencies analyze a sentence
using the dependency relations between words
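A minimal sketch of how dependency relations can be stored, assuming a hand-annotated analysis of "The car stopped". The relation labels (nsubj, det) follow common Universal Dependencies naming; the ARCS table and the dependents helper are hypothetical names made up for this note, not the output of any parser.

```python
# Each arc is (head, relation, dependent).
# Hand-annotated for "The car stopped"; purely illustrative.
ARCS = [
    ("stopped", "nsubj", "car"),  # "car" is the subject of "stopped"
    ("car", "det", "The"),        # "The" is the determiner of "car"
]
ROOT = "stopped"  # the word that depends on no other word


def dependents(head):
    """Return (relation, dependent) pairs attached to the given head word."""
    return [(rel, dep) for h, rel, dep in ARCS if h == head]


print(ROOT, dependents(ROOT))
print("car", dependents("car"))
```

Each word points to the word it depends on, so the sentence becomes a tree rooted at the main verb rather than a flat sequence of tagged words.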
Lexical analysis
- Sentence splitting
- Tokenization
- Morphological analysis
- POS tagging
Case1: sentence splitting is required
Case2: sentence splitting is optional (topic modeling case)
Before sentence splitting: sentence_a. sentence_b. sentence_c?
After sentence splitting: [sentence_a.], [sentence_b.], [sentence_c?]
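The before/after example can be reproduced with a naive rule-based splitter. This is only a sketch: it assumes every ".", "!" or "?" followed by whitespace ends a sentence, which fails on abbreviations such as "Dr." and on decimal-like patterns. The function name split_sentences is made up for this note.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the terminator attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("sentence_a. sentence_b. sentence_c?"))
# ['sentence_a.', 'sentence_b.', 'sentence_c?']
```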
Morpheme $$$\subset$$$ Word $$$\subset$$$ Token
Tokenization method:
- English: generally, just split on whitespace
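A short sketch of whitespace tokenization for English, using only the standard library. The second function is an optional variant (an assumption of this note, not part of the method above) that also separates punctuation, since plain whitespace splitting leaves it glued to the preceding word.

```python
import re


def tokenize(sentence):
    """Split on any run of whitespace."""
    return sentence.split()


def tokenize_with_punct(sentence):
    """Variant: word characters form one token, each punctuation mark its own."""
    return re.findall(r"\w+|[^\w\s]", sentence)


print(tokenize("The cars stopped."))
# ['The', 'cars', 'stopped.']
print(tokenize_with_punct("The cars stopped."))
# ['The', 'cars', 'stopped', '.']
```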
Morphological analysis (text normalization)
Tokens --are converted into--> a more general form
This reduces the number of distinct tokens, which makes the analysis more efficient
- Example
cars ---> car
car ---> car
stopped ---> stop
stop ---> stop
- Do "folding" (case folding: convert to lower case)
Hello ---> hello
hello ---> hello
- Do "stemming"
Convert a word into a shortened form by stripping affixes (the result need not be a real word)
- Do "lemmatization"
Convert a word into its base form (lemma), taking its POS into account
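The three normalization steps can be compared side by side in a toy sketch. Everything here is an assumption for illustration: the suffix list, the tiny lemma table, and the function names are made up, and real stemmers (for example the Porter algorithm) and lemmatizers use far richer rules and dictionaries.

```python
# Toy resources; both are illustrative assumptions, not real linguistic data sets.
SUFFIXES = ("ing", "ed", "s")
LEMMAS = {
    ("cars", "NOUN"): "car",
    ("stopped", "VERB"): "stop",
    ("better", "ADJ"): "good",
}


def fold(token):
    """Case folding: map every token to lower case."""
    return token.lower()


def stem(token):
    """Chop the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lemmatize(token, pos):
    """Look up the base form for a (token, POS) pair; fall back to the token."""
    return LEMMAS.get((token, pos), token)


print(fold("Hello"))                 # hello
print(stem("cars"))                  # car
print(stem("stopped"))               # stopp
print(lemmatize("stopped", "VERB"))  # stop
print(lemmatize("better", "ADJ"))    # good
```

Note the difference on "stopped": the crude stemmer returns "stopp", which is not a word, while the lemmatizer returns the dictionary form "stop" because it knows the token is a verb.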
POS tagging
Assign a POS to each token
- Techniques
- Decision Trees
- Hidden Markov Model
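To make the Hidden Markov Model technique concrete, here is a minimal sketch of Viterbi decoding over a toy HMM. All probabilities, the three-tag tag set, and the UNSEEN floor are made-up assumptions for illustration; a real tagger would estimate them from an annotated corpus.

```python
# Toy HMM parameters; every number here is invented for illustration.
TAGS = ("DET", "NOUN", "VERB")
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {  # TRANS[previous_tag][current_tag]
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {  # EMIT[tag][word]
    "DET": {"the": 0.9},
    "NOUN": {"car": 0.5, "stop": 0.2},
    "VERB": {"stop": 0.3, "stopped": 0.5},
}
UNSEEN = 1e-6  # small floor for (tag, word) pairs missing from EMIT


def viterbi(tokens):
    """Return the most probable (token, tag) sequence under the toy HMM."""
    if not tokens:
        return []
    # Each column maps tag -> (probability of the best path ending here, previous tag).
    best = [{t: (START[t] * EMIT[t].get(tokens[0], UNSEEN), None) for t in TAGS}]
    for word in tokens[1:]:
        column = {}
        for t in TAGS:
            prob, prev = max((best[-1][p][0] * TRANS[p][t], p) for p in TAGS)
            column[t] = (prob * EMIT[t].get(word, UNSEEN), prev)
        best.append(column)
    # Backtrace from the most probable final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(zip(tokens, reversed(path)))


print(viterbi(["the", "car", "stopped"]))
# [('the', 'DET'), ('car', 'NOUN'), ('stopped', 'VERB')]
print(viterbi(["the", "stop"]))
# [('the', 'DET'), ('stop', 'NOUN')]
```

In the second call "stop" could be a noun or a verb, and the transition probability from DET to NOUN is what tips the decision toward NOUN.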
NLTK and spaCy perform the steps of sentence splitting, tokenization, lemmatization, and POS tagging