[Statistical NLP] N-gram models
N-gram models
: a probability distribution over sequences of events
· Models the order of the events
· Used when the past sequence of events is a good indicator of the next event to occur in the sequence -> to predict the next event in a sequence of events
· Allows us to compute the probability of the next item in a sequence
· Or the probability of a complete sequence
· Applications of Language Models – Sequence modeling tasks
(1) Speech recognition
(2) Machine translation
(3) Language identification
(4) Word prediction
(5) Spelling correction
(6) Optical character recognition
(7) Handwriting recognition
(8) Natural language generation
- Simple approach : Uniform distribution
: each word has an equal probability of following any other
- Unigram model
: takes into account the frequency of the words in a training corpus (= bag-of-words)
- An improved approach over these two methods (uniform, bag-of-words models)
=> N-gram models
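As a concrete illustration, here is a minimal Python sketch (the toy corpus and whitespace tokenization are my own assumptions) contrasting the uniform estimate with the unigram (bag-of-words) estimate:

```python
from collections import Counter

# toy corpus (assumption), whitespace-tokenized
corpus = "the cat sat on the mat the cat slept".split()
vocab = set(corpus)
counts = Counter(corpus)
total = len(corpus)

def p_uniform(word):
    # uniform model: every vocabulary word is equally likely
    return 1 / len(vocab)

def p_unigram(word):
    # unigram / bag-of-words model: relative frequency in the corpus
    return counts[word] / total

print(p_uniform("the"), p_unigram("the"))   # 1/6 ≈ 0.167 vs 3/9 ≈ 0.333
```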
- N-gram : the probability of a word depends on the previous words (the history)
=> Markov assumption
: we can predict a word based on a short history
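Written out, the Markov assumption replaces the full chain-rule history with a short one (a standard derivation added here for clarity, not from the original post):

```latex
% Chain rule: exact, but the conditioning history grows with the sentence
P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})

% Bigram (first-order Markov) approximation: keep only the previous word
P(w_1 \dots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```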
· N-gram Models by Markov assumption
(1) Bigram models
- History = 1
- P(wₙ | wₙ₋₁)
- need to build a V² matrix (V = size of the vocabulary)
=> needs to store V² = 4 × 10⁸ = 400 million parameters (e.g., for V = 20,000)
(2) Trigram models
- History = 2
- P(wₙ | wₙ₋₂, wₙ₋₁)
- need to build a V³ matrix (V = size of the vocabulary)
=> needs to store V³ = 8 × 10¹² = 8 trillion parameters (again with V = 20,000)
(3) 4-grams / 5-grams …
- Often impractical to have a history > 2
=> the Markov approximation becomes too costly with a longer history! (see the sketch below)
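A quick back-of-the-envelope check of that cost, assuming V = 20,000 as in the figures above:

```python
V = 20_000  # vocabulary size (assumption, consistent with the numbers above)

for n in range(1, 6):
    # an n-gram table conditions on n-1 words of history -> V**n entries
    print(f"{n}-gram: V^{n} = {V**n:,} parameters")
# 2-gram: 400,000,000 (400 million); 3-gram: 8,000,000,000,000 (8 trillion); ...
```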
· Build an n-gram model
(1) Prepare the corpus
- Find a training corpus (plain text dataset)
- Clean & tokenize
- Decide how to deal with out-of-vocabulary words / sentence boundaries / numbers / URLs / mark-up, etc.
(2) Build the model
- Count words and fill in the matrix (see the sketch after this list)
(3) Smooth the model
(4) Use the model
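A minimal sketch of steps (1), (2) and (4), assuming a toy two-sentence corpus, whitespace tokenization, and <s>/</s> sentence-boundary markers (smoothing, step (3), is sketched in the next section):

```python
from collections import defaultdict, Counter

# (1) Prepare the corpus: toy example (assumption), lowercased and tokenized
sentences = ["the cat sat on the mat", "the dog sat on the rug"]
tokenized = [["<s>"] + s.lower().split() + ["</s>"] for s in sentences]

# (2) Build the model: count bigram occurrences per history word
bigram_counts = defaultdict(Counter)
for tokens in tokenized:
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def p_bigram(curr, prev):
    # unsmoothed maximum-likelihood estimate of P(curr | prev)
    history_total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / history_total if history_total else 0.0

# (4) Use the model
print(p_bigram("cat", "the"))   # 1/4 = 0.25 ("the" is followed by cat, mat, dog, rug)
```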
· Problem & Adjustment
(1) The product of probabilities may lead to numerical underflow for long sentences
Solution) Sum the logs of the probabilities instead of multiplying them (see the sketch after the smoothing techniques list)
(2) If the sentence contains unseen events, the n-gram probability will be 0
- P(X) = 0 if the n-gram never appears in the training corpus
Solution) Smoothing
- Decrease the probability of previously seen events
- Re-distribute the probability mass to unseen events
=> "Robin Hood" n-gram models (take from the rich, give to the poor)
- Smoothing techniques
(1) Additive smoothing
=> Add-one (Laplace smoothing)
=> Add-δ (δ ∈ [0, 1])
(2) Witten-Bell
(3) Good-Turing
(4) Linear interpolation (take a weighted average)
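A minimal sketch of add-one (Laplace) smoothing combined with log-space scoring, reusing the bigram_counts table from the sketch above (the helper names are hypothetical, not from the original post):

```python
import math

def p_laplace(curr, prev, bigram_counts, vocab_size):
    # add-one smoothing: every bigram gets one extra pseudo-count,
    # so unseen events receive a small, non-zero probability
    history_total = sum(bigram_counts[prev].values())
    return (bigram_counts[prev][curr] + 1) / (history_total + vocab_size)

def sentence_log_prob(tokens, bigram_counts, vocab_size):
    # sum log-probabilities instead of multiplying probabilities,
    # which avoids numerical underflow on long sentences
    return sum(
        math.log(p_laplace(curr, prev, bigram_counts, vocab_size))
        for prev, curr in zip(tokens, tokens[1:])
    )
```

Replacing "+ 1" with "+ δ" in the numerator and "vocab_size" with "δ * vocab_size" in the denominator gives the add-δ variant.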
· Use case : Author Identification
- Texts that resemble each other (same author, same language) share similar character/word sequences
- Training : build a language model for each author/language from pre-classified documents
- Testing : apply the language models to the unknown text and choose the language/author with the argmax score (see the sketch below)
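A sketch of that argmax decision, assuming one smoothed bigram model per author has already been trained (the models dict and the sentence_log_prob helper are carried over from the sketches above, not from the original post):

```python
def identify_author(tokens, models, vocab_size):
    # score the unknown text under each author's language model
    # and pick the author with the highest log-probability
    scores = {
        author: sentence_log_prob(tokens, counts, vocab_size)
        for author, counts in models.items()
    }
    return max(scores, key=scores.get)
```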
· Problems with n-grams
- Natural language is not linear
- A short history cannot handle long-distance dependencies
=> handling them requires syntactic/semantic/world knowledge