Code&Data Insights

[Statistical NLP] Bag of Words Model



paka_corn 2023. 12. 26. 04:03

< NLP- Statistical NLP >

Bag of Words Model

- Word order within a sentence is ignored

- Fast and simple

 

(ex) Multinomial Naïve Bayes text classification (spam filtering)

       Information retrieval (Google search)

 

- Representation of a document => a vector of <word, value> pairs

=> Word: every word in the vocabulary (aka term)

=> Value: a number associated with the word in the document

 

 

- Possible schemes for the value

(1) Boolean (0: if the term is absent, 1: if the term is present)

 

(2) Term frequency (tf)

 

(3) Term frequency-inverse document frequency (tf-idf)

-> TF-IDF is high when a term occurs frequently in a document but rarely in the rest of the dataset
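The three schemes above can be sketched in a few lines of plain Python. This is only an illustration: the toy corpus is made up, the helper names (`boolean_vector`, `tf_vector`, `tfidf_vector`) are hypothetical, and the idf variant used here, log(N / df), is one common choice among several.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "free money free offer",
    "meeting at noon",
    "free lunch at noon",
]
tokenized = [d.split() for d in docs]

# The vocabulary is every distinct word (term) in the collection.
vocab = sorted({w for doc in tokenized for w in doc})

def boolean_vector(doc):
    """Scheme (1): 1 if the term is present in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]

def tf_vector(doc):
    """Scheme (2): raw count of each term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def idf(term):
    """Assumed idf variant: log(N / df), df = number of docs containing the term."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(doc):
    """Scheme (3): term frequency weighted by inverse document frequency."""
    counts = Counter(doc)
    return [counts[term] * idf(term) for term in vocab]

for name, fn in [("boolean", boolean_vector), ("tf", tf_vector), ("tf-idf", tfidf_vector)]:
    print(name, [round(v, 3) for v in fn(tokenized[0])])
```

In the first document, "free" occurs twice but also appears in another document, while "money" occurs once and nowhere else, so under this idf variant "money" ends up with the higher tf-idf weight.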

 

- Bag of Words Applications

- Term-Document Matrix

Application 1) Text categorization / classification

=> Use a supervised ML model

=> Spam filtering, news routing, sentiment analysis
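As a rough sketch of how a supervised model can sit on top of bag-of-words counts, here is a tiny multinomial Naïve Bayes spam filter in plain Python. The training sentences and the `fit` / `predict` helpers are hypothetical, and add-one (Laplace) smoothing is an assumption of this sketch rather than something the notes specify.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled training data, invented for this example.
train = [
    ("win free money now", "spam"),
    ("free offer click now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch meeting with the team", "ham"),
]

def fit(examples):
    """Collect per-class document counts and per-class term counts."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        words = text.split()
        class_docs[label] += 1
        term_counts[label].update(words)
        vocab.update(words)
    return class_docs, term_counts, vocab

def predict(text, model):
    """Return the class with the highest log-probability score."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_terms = sum(term_counts[label].values())
        # Log prior: fraction of training documents in this class.
        score = math.log(class_docs[label] / total_docs)
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing so unseen (word, class) pairs do not zero out the score.
            score += math.log(
                (term_counts[label][word] + 1) / (total_terms + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict("free money", model))
print(predict("team meeting at noon", model))
```

Word order plays no role here: the classifier only sees how often each term occurs in each class.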

 

Application 2) Text Clustering

=> Use an unsupervised ML model that groups documents by their similarity

 

=> [Distance Measure]: similarity between two documents (doc & query)

=> The same technique is used in information retrieval

 

=> The 'cosine measure' is the simplest & most popular!

 

Returns a value from 0 to 1 (when term weights are non-negative).

cos(d,q) = 1 => same term distribution

cos(d,q) = 0 => NO COMMON term
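A minimal sketch of the cosine measure on term-frequency vectors, assuming whitespace tokenization and raw counts as weights; the function name `cosine` is just for illustration. It computes the dot product of the two vectors divided by the product of their lengths.

```python
import math
from collections import Counter

def cosine(d, q):
    """Cosine similarity between the term-frequency vectors of two texts."""
    d_tf, q_tf = Counter(d.split()), Counter(q.split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(d_tf[t] * q_tf[t] for t in d_tf.keys() & q_tf.keys())
    norm_d = math.sqrt(sum(v * v for v in d_tf.values()))
    norm_q = math.sqrt(sum(v * v for v in q_tf.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_d * norm_q)

print(cosine("a b", "a b a b"))                  # same term distribution
print(cosine("free money", "meeting noon"))      # no common term
print(cosine("free money", "free lunch"))        # partial overlap
```

Note that "a b" and "a b a b" have different lengths but the same proportions of terms, which is why the measure treats them as identical.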

 

 

- Pros and cons of the Bag-of-Words model

=> Pros

(1) Simple

(2) Efficient for large collections of documents

(3) Basis of many information retrieval and text categorization systems

 

=> Cons

(1) Word order is ignored, so part of the meaning of the text is lost

(2) Need a stronger model where order is considered

=> the 'n-gram model'!
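The loss of word order can be seen on a two-sentence example. In this sketch (helper names `bag_of_words` and `ngrams` are hypothetical, tokenization is a plain whitespace split), two sentences with opposite meanings get the same bag of words, while their bigrams (n = 2) differ.

```python
from collections import Counter

def bag_of_words(text):
    """Term counts only; the position of each word is discarded."""
    return Counter(text.split())

def ngrams(text, n=2):
    """Counts of every run of n consecutive words, which keeps local order."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

a = "dog bites man"
b = "man bites dog"

print(bag_of_words(a) == bag_of_words(b))  # identical bags
print(ngrams(a) == ngrams(b))              # different bigrams
print(sorted(ngrams(a)))
print(sorted(ngrams(b)))
```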

 

 
