
[Statistical NLP] N-gram models

paka_corn 2023. 12. 26. 04:09

N-gram models  

: a probability distribution over sequences of events

 

·      Models the order of the events

·      Used when the past sequence of events is a good indicator of the next event to occur in the sequence -> to predict the next event in a sequence of events

 

·      Allows us to compute the probability of the next item in a sequence

 

 

·      Or the probability of a complete sequence

 

 

·      Applications of Language Models – Sequence modeling tasks

(1)   Speech recognition

(2)   Machine translation

(3)   Language identification

(4)   Word prediction

(5)   Spelling correction

(6)   Optical character recognition

(7)   Handwriting recognition

(8)   Natural language generation

 

-       Simple approach: Uniform distribution

: each word has an equal probability of following any other word
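
A minimal sketch of this uniform baseline, assuming a toy vocabulary:

```python
# Uniform baseline: every vocabulary word is equally likely to follow
# any history, i.e. P(w | history) = 1 / V regardless of context.
# The toy vocabulary below is an assumption for illustration only.
vocab = ["the", "cat", "sat", "on", "mat"]

def uniform_prob(word, history=None):
    """The history is ignored entirely under the uniform model."""
    return 1.0 / len(vocab) if word in vocab else 0.0

print(uniform_prob("cat", history=["the"]))   # 0.2
```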


-       Unigram model

: takes into account the frequency of words in a training corpus (= bag-of-words); see the sketch below

-       A better approach than these two methods (the uniform and bag-of-words models)

=> N-gram models
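
Before moving on to n-grams, a minimal sketch of the unigram (bag-of-words) baseline, assuming a toy corpus:

```python
from collections import Counter

# Unigram (bag-of-words) baseline: the probability of a word is its
# relative frequency in the training corpus, independent of context.
# The toy corpus is an assumption for illustration only.
corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)
total = sum(counts.values())

def unigram_prob(word):
    return counts[word] / total

print(unigram_prob("the"))   # 3 / 9 ≈ 0.333
```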

 

 

-       N-gram: the probability of a word depends on the previous words (the history)

=> Markov assumption

: we can predict a word from a short history, e.g. for a bigram model P(w1 … wn) ≈ P(w1) P(w2 | w1) … P(wn | wn−1)

 

 

·      N-gram Models by Markov assumption

(1)   Bigram models

-       History = 1

-       P(wn | wn−1)

-       need to build a V² matrix (V = size of the vocabulary)

=> needs to store V² parameters, e.g. with V = 20,000: V² = 4 × 10⁸ = 400 million parameters

 

(2)   Trigram models

-       History = 2

-       P(wn | wn−2 wn−1)

-       need to build a V³ matrix (V = size of the vocabulary)

=> needs to store V³ parameters, e.g. with V = 20,000: V³ = 8 × 10¹² = 8 trillion parameters


 

(3)   4-grams/5-grams….

-       Often impractical to have a history > 2

 

=> Even with the Markov approximation, the models are costly!
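
A quick sanity check of the parameter counts above, assuming the vocabulary size of V = 20,000 implied by the 400 million / 8 trillion figures:

```python
# Rough parameter counts for different n-gram orders, assuming a
# vocabulary of V = 20,000 word types.
V = 20_000
print(f"bigram : V^2 = {V**2:,}")   # 400,000,000
print(f"trigram: V^3 = {V**3:,}")   # 8,000,000,000,000
```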

 

 

·      Build an n-gram model

(1)   Prepare the corpus

-       Find a training corpus (plain text dataset)

-       Clean & tokenize

-       Decide how to deal with out-of-vocabulary words / sentence boundaries / numbers / URLs / markup, etc.

 

(2)   Build the model

-       Count words and fill in the matrix (see the sketch after step (4))

 

(3)   Smooth the model

 

(4)   Use the model
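
A minimal sketch of steps (2) and (4) for a bigram model, assuming a toy tokenized corpus with <s>/</s> boundary markers; smoothing (step 3) is shown separately below:

```python
from collections import defaultdict, Counter

# Count bigrams from a tokenized corpus and use relative frequencies
# as probabilities. The toy corpus and the <s>/</s> boundary markers
# are assumptions for illustration; no smoothing is applied here.
corpus = [
    ["<s>", "the", "cat", "sat", "on", "the", "mat", "</s>"],
    ["<s>", "the", "cat", "slept", "</s>"],
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """P(curr | prev) estimated by maximum likelihood (no smoothing)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(bigram_prob("the", "cat"))   # 2/3
```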

 

 

 

·      Problem & Adjustment

(1)   The product of probabilities may lead to numerical underflow for long sentences

 

Solution) Sum the logs of the probabilities instead of multiplying them
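
A small illustration of why log space helps, using arbitrary made-up probabilities:

```python
import math

# Multiplying many small probabilities underflows to 0.0, while summing
# their logs stays numerically stable. The probability values here are
# arbitrary assumptions for illustration.
probs = [1e-5] * 100           # 100 n-gram probabilities of a long sentence

product = 1.0
for p in probs:
    product *= p
print(product)                 # 0.0  (underflow)

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                # ≈ -1151.29, still usable for comparison
```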

 

 

(2)   If a sequence contains unseen events, its n-gram probability will be 0

-       P(X) = 0, if the n-gram never appears in the training corpus

 

Solution) Smoothing

-       Decrease the probability of previously seen events

 

-       Re-distribute the probability mass to unseen events

=> Robin Hood n-gram models

 

-       Smoothing techniques

(1)   Additive smoothing

=> Add-one (Laplace smoothing) (see the sketch after this list)

=> Add-δ (δ ∈ [0, 1])

(2)   Witten-Bell

(3)   Good-Turing

(4)   Linear interpolation (take a weighted average of different n-gram orders)
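
A minimal sketch of add-one (Laplace) smoothing for a bigram model, assuming a toy corpus; the other techniques are not shown here:

```python
from collections import defaultdict, Counter

# Add-one (Laplace) smoothing: every possible bigram gets one extra
# count, so events never seen in training still receive a small,
# non-zero probability. The toy corpus is an assumption for illustration.
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "cat", "slept", "</s>"]]

vocab = {w for sentence in corpus for w in sentence}
counts = defaultdict(Counter)
for sentence in corpus:
    for prev, curr in zip(sentence, sentence[1:]):
        counts[prev][curr] += 1

def laplace_prob(prev, curr, V=len(vocab)):
    """P(curr | prev) with add-one smoothing."""
    return (counts[prev][curr] + 1) / (sum(counts[prev].values()) + V)

print(laplace_prob("the", "cat"))   # seen bigram:   (2 + 1) / (2 + 6)
print(laplace_prob("cat", "the"))   # unseen bigram: (0 + 1) / (2 + 6) > 0
```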

 

 

·      Use case: Author Identification

-       Texts that resemble each other (same author, same language) share similar character/word sequences

-       Training: build a language model for each author/language from pre-classified documents

-       Testing: apply the language models to the unknown text and choose the language/author with the argmax score (see the sketch below)
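
A minimal sketch of the argmax decision rule, assuming hypothetical, already-trained unigram models for two authors:

```python
import math

# Score an unknown text under each author's language model and pick the
# author with the highest log-probability. `models` and `score` are
# hypothetical stand-ins for real trained models.
models = {
    "author_A": {"the": 0.05, "cat": 0.01, "sat": 0.008},
    "author_B": {"the": 0.04, "cat": 0.002, "sat": 0.001},
}

def score(text, unigram_probs, floor=1e-6):
    """Sum of log unigram probabilities; unseen words get a small floor."""
    return sum(math.log(unigram_probs.get(w, floor)) for w in text)

text = ["the", "cat", "sat"]
best = max(models, key=lambda author: score(text, models[author]))
print(best)   # author_A (higher probabilities for these words)
```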

 

 

·      Problems with n-grams

-       Natural language is not purely linear

-       A short history cannot handle long-distance dependencies

=> Capturing them requires syntactic/semantic/world knowledge
