[Statistical NLP] Bag of Words Model
Bag of Words Model
- Word order (within the sentence) is ignored
- Fast & simple
(ex) Multinomial Naïve Bayes text classification (spam filtering),
Information Retrieval (Google search)
- Representation of a document => a vector of pairs <word, value>
=> Word: any word in the vocabulary (aka term)
=> Value: a number associated with that word in the document
- Different possible weighting schemes
(1) Boolean (0 if the term is absent, 1 if it is present)
(2) Term frequency (tf)
(3) Term frequency-inverse document frequency (tf-idf)
-> a term gets a high tf-idf score if it appears frequently in a document but rarely across the whole collection
-> e.g. tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t
- Bag-of-Words Applications
- Term-Document Matrix: rows are terms, columns are documents, and each cell holds the chosen weight (Boolean, tf, or tf-idf); a minimal sketch follows below
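A minimal sketch of the three weighting schemes on a toy corpus, assuming scikit-learn is available (the documents are invented for illustration; note that scikit-learn builds the transpose, a document-term matrix):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical documents, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# (1) Boolean: 1 if the term occurs in the document, else 0
boolean = CountVectorizer(binary=True)
X_bool = boolean.fit_transform(docs)

# (2) Term frequency: raw count of the term in the document
tf = CountVectorizer()
X_tf = tf.fit_transform(docs)

# (3) tf-idf: high if frequent in this document but rare in the collection
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(tf.get_feature_names_out())  # the vocabulary (terms)
print(X_tf.toarray())              # document-term matrix (documents x terms)
```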
Application 1) Text categorization / classification
=> Uses a supervised ML model
=> Spam filtering, news routing, sentiment analysis (see the sketch below)
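A minimal spam-filtering sketch with Multinomial Naïve Bayes over bag-of-words counts, assuming scikit-learn; the messages and labels are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data (labels invented for illustration)
messages = [
    "win a free prize now",
    "limited offer, claim your free money",
    "meeting rescheduled to monday",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed a Multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["claim your free prize"]))  # expected: ['spam']
```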
Application 2) Text Clustering
=> Uses an unsupervised ML model to compute the similarity between documents
=> [Distance Measure]: similarity between two documents (doc & query)
=> Same technique as used in information retrieval
=> The ‘cosine measure’ is the simplest & most popular!
cos(d, q) = (d · q) / (|d| |q|), which returns a value between 0 and 1 for non-negative term vectors
cos(d, q) = 1 => same term distribution
cos(d, q) = 0 => NO COMMON terms
(a minimal sketch follows below)
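A minimal cosine-measure sketch in plain NumPy (an assumed choice; scikit-learn's cosine_similarity would work equally well), with hypothetical tf vectors:

```python
import numpy as np

def cosine(d, q):
    """Cosine measure between two term vectors: d . q / (|d| |q|)."""
    denom = np.linalg.norm(d) * np.linalg.norm(q)
    return float(d @ q / denom) if denom else 0.0

# Hypothetical tf vectors over the vocabulary [cat, dog, mat, sat]
doc   = np.array([2, 0, 1, 1])   # "cat cat mat sat"
query = np.array([1, 0, 0, 1])   # "cat sat"
other = np.array([0, 3, 0, 0])   # "dog dog dog"

print(cosine(doc, query))        # close to 1: similar term distributions
print(cosine(doc, other))        # 0.0: no common terms
print(cosine(doc, 2 * doc))      # 1.0: same term distribution
```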
- Pros & Cons of the Bag-of-Words model
=> Pros
(1) Simple
(2) Efficient for large collections of documents
(3) The basis of many information retrieval and text categorization systems
=> Cons
(1) Word order is ignored, so part of the meaning of the text is lost
(2) A stronger model where order is considered is needed
=> the ‘n-gram model’! (a preview sketch follows below)
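As a preview of the next post: staying inside the bag-of-words framework but counting word n-grams instead of single words recovers some local order. A minimal sketch, assuming scikit-learn's ngram_range parameter:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams lose order: these two sentences get identical vectors
docs = ["the dog bit the man", "the man bit the dog"]

unigrams = CountVectorizer(ngram_range=(1, 1))
print(unigrams.fit_transform(docs).toarray())   # two identical rows

# Unigrams + bigrams keep some local order, so the rows now differ
bigrams = CountVectorizer(ngram_range=(1, 2))
print(bigrams.fit_transform(docs).toarray())
print(bigrams.get_feature_names_out())          # includes "dog bit", "man bit", ...
```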