[Statistical NLP] Bag of Words Model

paka_corn 2023. 12. 26. 04:03

< NLP - Statistical NLP >

Bag of Words Model

- Word order within a sentence is ignored

- Fast and simple


(ex) Multinomial Naïve Bayes text classification (spam filtering)

(ex) Information retrieval (Google search)


- Representation of a document => vector of <word, value> pairs

=> Word: each word in the vocabulary (aka term)

=> Value: a number associated with that word in the document



- Different possible schemes

(1) Boolean (0: term is absent, 1: term is present)


(2) Term frequency (tf): how many times the term occurs in the document


(3) Term frequency-inverse document frequency (tf-idf)

-> TF-IDF is high when a term occurs frequently in one document but rarely in the rest of the dataset
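
The three schemes can be compared side by side on a toy corpus. This is only a minimal sketch: the documents, the helper names (`boolean_vector`, `tf_vector`, `tfidf_vector`) and the unsmoothed `log(N / df)` form of idf are assumptions made for illustration. Real libraries often use smoothed or normalized variants.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); tokens are produced by a lowercase whitespace split.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})


def boolean_vector(tokens):
    """Scheme (1): 1 if the term is present in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


def tf_vector(tokens):
    """Scheme (2): raw count of each term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def idf(term):
    """Assumed variant: idf = log(N / df), df = number of docs containing the term."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)


def tfidf_vector(tokens):
    """Scheme (3): term frequency weighted by inverse document frequency."""
    counts = Counter(tokens)
    return [counts[term] * idf(term) for term in vocab]


first = tokenized[0]
for term, b, tf, w in zip(vocab, boolean_vector(first), tf_vector(first), tfidf_vector(first)):
    print(f"{term:>5}  bool={b}  tf={tf}  tfidf={w:.3f}")
```

In this corpus "the" occurs in every document, so its idf is log(3/3) = 0 and its tf-idf weight vanishes even though its tf is the highest, while "mat" (found in only one document) keeps a positive weight.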


- Bag of Words Applications

- Term-Document Matrix
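
A term-document matrix simply stacks the bag-of-words vectors of all documents. The sketch below assumes the common layout with one row per term and one column per document, raw term frequencies as values, and a made-up three-document corpus.

```python
from collections import Counter

# Hypothetical mini corpus.
docs = ["spam spam offer", "meeting at noon", "offer for meeting"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
counts = [Counter(doc) for doc in tokenized]

# Rows are terms, columns are documents; each cell is a raw term frequency.
term_doc = [[c[term] for c in counts] for term in vocab]

for term, row in zip(vocab, term_doc):
    print(f"{term:>8} {row}")
```

Reading a column gives the vector for one document; reading a row shows in which documents a term occurs.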

Application 1) Text categorization / classification

=> Use a supervised ML model

=> Spam filtering, news routing, sentiment analysis
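
As a rough illustration of the spam filtering case, here is a small Multinomial Naïve Bayes classifier over bag-of-words counts. It is a sketch under several assumptions: the four training messages and the `predict` helper are invented, add-one (Laplace) smoothing is used, and words never seen in training are skipped.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training data.
train = [
    ("win cash prize now", "spam"),
    ("cheap prize offer now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting on monday", "ham"),
]

class_docs = Counter()               # number of documents per class
word_counts = defaultdict(Counter)   # per-class term frequencies (bag of words)
for text, label in train:
    class_docs[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}


def predict(text):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / len(train))  # log prior
        for w in text.split():
            if w not in vocab:
                continue  # assumption: ignore words unseen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


print(predict("cash prize offer"))
print(predict("monday meeting agenda"))
```

Word order never enters the computation: only the per-class counts of each word matter, which is exactly the bag-of-words assumption.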


Application 2) Text Clustering

=> Use an unsupervised ML model that groups documents by the similarity between them


=> [Distance Measure]: similarity between two documents (doc & query)

=> Same technique as used in information retrieval


=> The 'cosine measure' is the simplest & most popular!


Returns a value from 0 to 1 (term weights are non-negative).

cos(d,q) = 1 => same term distribution

cos(d,q) = 0 => no terms in common
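
The cosine measure is the dot product of the two vectors divided by the product of their lengths: cos(d,q) = (d · q) / (|d| |q|). A minimal sketch, assuming raw term frequencies as weights and whitespace tokenization (the function name `cosine` is just for illustration):

```python
import math
from collections import Counter


def cosine(d, q):
    """Cosine similarity between two texts using raw term-frequency vectors."""
    a, b = Counter(d.split()), Counter(q.split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0  # empty text: define similarity as 0


print(cosine("a b b", "a a b b b b"))    # same term distribution, different length
print(cosine("cat sat", "dog barked"))   # no shared terms
```

Because the measure depends only on the direction of the vectors, a document and a longer document with the same term proportions still score (up to floating-point rounding) 1.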



- Pros and cons of the Bag-of-Words model

=> Pros

(1) Simple

(2) Efficient for large collections of documents

(3) Basis of many information retrieval and text categorization systems


=> Cons

(1) Word order is ignored, so part of the meaning of the text is lost

(2) Need a stronger model in which order is considered: the 'n-gram model'!
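
A quick way to see the drawback: two sentences with opposite meanings can have identical bags of words, while their bigrams differ. The sentences and the `ngrams` helper below are illustrative assumptions, not part of any particular library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


a = "dog bites man".split()
b = "man bites dog".split()

same_bag = Counter(a) == Counter(b)                           # unigram counts match
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))  # bigram counts do not

print(same_bag, same_bigrams)
print(ngrams(a, 2))
print(ngrams(b, 2))
```

With n = 2 each feature already encodes which word follows which, so some local order is recovered at the cost of a larger vocabulary.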