[Statistical NLP] Bag-of-Words Model


paka_corn 2023. 12. 26. 04:03

< NLP- Statistical NLP >

Bag-of-Words Model

- Word order within the sentence is ignored

- Fast/simple

(ex) Multinomial Naïve Bayes text classification (spam filtering)

Information retrieval (Google search)

- Representation of a document => vector of <word, value> pairs

=> Word: each word in the vocabulary (aka term)

=> Value: a number associated with the word in the document

- Possible schemes for the value

(1) Boolean (0: term is absent, 1: term is present)

(2) Term frequency (tf)

(3) Term frequency-inverse document frequency (tf-idf)

-> TF-IDF is high when a term appears frequently in a document but rarely across the entire dataset
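The three schemes above can be sketched on a toy corpus. This is a minimal illustration, not a reference implementation: the corpus, the whitespace tokenizer, the function names, and the idf variant log(N / df) are all assumptions chosen for brevity (libraries often use smoothed variants).

```python
import math
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]

# Assumption: simple whitespace tokenization.
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def boolean_vector(tokens):
    """1 if the term is present in the document, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]

def tf_vector(tokens):
    """Raw count of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def idf(term):
    """Assumed variant: log(N / df), with df = number of documents containing the term."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(tokens):
    """Term frequency weighted by inverse document frequency."""
    counts = Counter(tokens)
    return [counts[term] * idf(term) for term in vocab]

first = tokenized[0]
print(vocab)
print(boolean_vector(first))
print(tf_vector(first))
print(tfidf_vector(first))
```

In this corpus "the" occurs in every document, so its idf is log(3/3) = 0 and its tf-idf weight vanishes even though it is the most frequent term in the first document, while "cat" (one document only) keeps a positive weight.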

- Bag of Words Applications

- Term-Document Matrix

Application 1) Text categorization / classification

=> Use a supervised ML model

=> Spam filtering, news routing, sentiment analysis
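As a rough sketch of the spam-filtering case, here is a tiny Multinomial Naïve Bayes classifier over bag-of-words counts. The training sentences, the labels "spam"/"ham", the `predict` function, and the choice of add-one (Laplace) smoothing are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training data, for illustration only.
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

class_docs = Counter()              # number of training documents per class
word_counts = defaultdict(Counter)  # per-class term frequencies (the "bag")
vocab = set()

for text, label in train:
    tokens = text.split()
    class_docs[label] += 1
    word_counts[label].update(tokens)
    vocab.update(tokens)

def predict(text):
    """Return the class with the highest log-probability score."""
    # Assumption: words never seen in training are simply ignored.
    tokens = [t for t in text.split() if t in vocab]
    scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / len(train))  # log prior
        total = sum(word_counts[label].values())
        for t in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen class/word pairs.
            score += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("free money"))        # spam
print(predict("meeting tomorrow"))  # ham
```

Because only counts are used, `predict("money free")` gives the same answer as `predict("free money")`: the classifier never sees word order.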

Application 2) Text Clustering

=> Use an unsupervised ML model that groups documents by their similarity

=> [Distance measure]: similarity between two documents (or a document and a query)

=> Similar technique as used in information retrieval

=> The cosine measure is one of the simplest and most popular choices

Returns a value from 0 to 1 when term weights are non-negative.

cos(d,q) = 1 => same term distribution

cos(d,q) = 0 => no terms in common
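The cosine measure is the dot product of the two vectors divided by the product of their lengths: cos(d,q) = (d · q) / (|d| |q|). A minimal sketch using raw term frequencies follows; the function name `cosine`, the whitespace tokenizer, and the decision to return 0 for an empty text are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(doc, query):
    """Cosine similarity between two texts using term-frequency vectors."""
    d, q = Counter(doc.split()), Counter(query.split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(d[t] * q[t] for t in d.keys() & q.keys())
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_d * norm_q)

print(cosine("the cat sat", "sat the cat"))  # same terms, different order
print(cosine("the cat sat", "a dog ran"))    # no shared terms
print(cosine("the cat sat", "the cat ran"))  # partial overlap
```

The first call is (up to floating-point rounding) 1 because both texts have the same term distribution, the second is 0 because no term is shared, and the third falls in between.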

- Pros and cons of the bag-of-words model

=> Pros

(1) Simple

(2) Efficient for large collections of documents

(3) Basis of many information retrieval and text categorization systems

=> Cons

(1) Word order is ignored, so part of the text's meaning is lost

(2) A stronger model that takes word order into account is needed

=> n-gram model!
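A small sketch of why order matters: two sentences with opposite meanings get identical bags of words, while their bigrams (n-grams with n = 2) differ. The example sentences and helper names are made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Unordered term counts."""
    return Counter(text.split())

def bigrams(text):
    """Counts of adjacent word pairs, which preserve local word order."""
    tokens = text.split()
    return Counter(zip(tokens, tokens[1:]))

a = "dog bites man"
b = "man bites dog"

print(bag_of_words(a) == bag_of_words(b))  # True: the bags are identical
print(bigrams(a) == bigrams(b))            # False: the bigrams differ
print(sorted(bigrams(a)))
print(sorted(bigrams(b)))
```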
