Bag of Words is a text-to-vector conversion technique. Let us assume we have four statements (called documents in NLP):
Document 1: This pasta is very tasty and affordable
Document 2: This pasta is not tasty and is affordable
Document 3: This pasta is delicious and cheap
Document 4: Pasta is tasty and pasta tastes good.
The following steps take place in Bag of Words:
Step 1: A dictionary is created with all unique words from the documents (statements). Let us assume that we have n such documents.
Step 2: From these n documents, we assume that we have d unique words. A vector of size d is created (with indices 0 to d-1). Each word is a different dimension in the d-dimensional vector.
Each document will have its own d-dimensional vector. For every word that appears in a document, the corresponding value in the d-dimensional vector will be non-zero; all other values in the vector will be 0. If a word appears 2 times in the document, the corresponding value in the vector will be 2, and so on.
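The two steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the `tokenize` helper and its rule (lowercase, strip surrounding punctuation, split on whitespace) are assumptions made for this example, as are the names `vocabulary`, `word_index`, and `to_vector`.

```python
from collections import Counter

documents = [
    "This pasta is very tasty and affordable",
    "This pasta is not tasty and is affordable",
    "This pasta is delicious and cheap",
    "Pasta is tasty and pasta tastes good.",
]

def tokenize(text):
    # Assumed rule: lowercase, strip surrounding punctuation, split on whitespace.
    return [w.strip(".,!?").lower() for w in text.split()]

# Step 1: dictionary of all unique words, sorted so each word gets a stable index.
vocabulary = sorted({w for doc in documents for w in tokenize(doc)})
word_index = {w: i for i, w in enumerate(vocabulary)}

# Step 2: one d-dimensional count vector per document (d = len(vocabulary)).
def to_vector(text):
    vec = [0] * len(vocabulary)
    for word, count in Counter(tokenize(text)).items():
        vec[word_index[word]] = count
    return vec

vectors = [to_vector(doc) for doc in documents]

print(len(vocabulary))  # d, the number of unique words
for doc, vec in zip(documents, vectors):
    print(vec, "<-", doc)
```

With this tokenization the four documents give d = 12, and the entry for "is" in Document 2 is 2 because the word appears twice there.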
This vector, created for each document, will be very sparse (far more zeros than non-zeros). These vectors are known as Bag of Words. A collection of documents is known as a document corpus.
The purpose of Bag of Words is that similar documents should result in closer vectors (closer in the linear algebra sense, i.e. a smaller distance between them).
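To make "closer vectors" concrete, here is a small sketch that computes the Euclidean distance between the count vectors of the four example documents. Euclidean distance is one possible choice of closeness measure and is an assumption here, as are the helper names `tokenize`, `to_vector`, and `euclidean`.

```python
import math
from collections import Counter

documents = [
    "This pasta is very tasty and affordable",
    "This pasta is not tasty and is affordable",
    "This pasta is delicious and cheap",
    "Pasta is tasty and pasta tastes good.",
]

def tokenize(text):
    # Assumed rule: lowercase, strip surrounding punctuation, split on whitespace.
    return [w.strip(".,!?").lower() for w in text.split()]

vocabulary = sorted({w for doc in documents for w in tokenize(doc)})

def to_vector(text):
    counts = Counter(tokenize(text))
    # A Counter returns 0 for words that are absent from the document.
    return [counts[w] for w in vocabulary]

vectors = [to_vector(doc) for doc in documents]

def euclidean(u, v):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Pairwise distances: a smaller value means the word counts are more alike.
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        dist = euclidean(vectors[i], vectors[j])
        print(f"Document {i + 1} vs Document {j + 1}: {dist:.3f}")
```

Under these assumptions, Documents 1 and 2 come out closest, since they differ only in "very", "not", and one extra "is".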
Limitation of Bag of Words: Completely opposite statements can appear as close vectors. For example, the document “Pasta is tasty” and the document “Pasta is not tasty” will produce similar, and hence close, vectors even though they are completely different in meaning. Also, semantic meaning is not taken into consideration, i.e. Tasty and Delicious will be treated as completely different words, and so will words like Male and Female.
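The limitation can be checked numerically. The sketch below uses the two sentences from the example plus a third, hypothetical sentence ("Pasta is delicious") added only for comparison; Euclidean distance via `math.dist` (Python 3.8+) is an assumed choice of closeness measure.

```python
import math

docs = [
    "Pasta is tasty",
    "Pasta is not tasty",
    "Pasta is delicious",  # hypothetical synonym sentence, added for comparison
]

# Simple whitespace tokenization is enough here (no punctuation in these sentences).
tokens = [d.lower().split() for d in docs]
vocabulary = sorted({w for t in tokens for w in t})
vectors = [[t.count(w) for w in vocabulary] for t in tokens]

d_opposite = math.dist(vectors[0], vectors[1])  # "tasty" vs "not tasty"
d_synonym = math.dist(vectors[0], vectors[2])   # "tasty" vs "delicious"

print(vocabulary)
print("opposite meaning:", d_opposite)
print("similar meaning: ", d_synonym)
```

In this sketch the opposite statement ends up closer to "Pasta is tasty" than the synonym does, because "not" adds only one differing dimension while "tasty" and "delicious" occupy two unrelated dimensions.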
Binary Bag of Words is a variation of Bag of Words where, instead of putting the number of occurrences in the vector, we put either 0 or 1, indicating whether the word exists in the document or not.
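A minimal sketch of the binary variant, again in plain Python. The `tokenize` rule and the name `to_binary_vector` are assumptions for this example; the only change from the count-based version is that membership in the document replaces the number of occurrences.

```python
documents = [
    "This pasta is very tasty and affordable",
    "This pasta is not tasty and is affordable",
    "This pasta is delicious and cheap",
    "Pasta is tasty and pasta tastes good.",
]

def tokenize(text):
    # Assumed rule: lowercase, strip surrounding punctuation, split on whitespace.
    return [w.strip(".,!?").lower() for w in text.split()]

vocabulary = sorted({w for doc in documents for w in tokenize(doc)})

def to_binary_vector(text):
    present = set(tokenize(text))
    # 1 if the word occurs at least once in the document, otherwise 0.
    return [1 if w in present else 0 for w in vocabulary]

binary_vectors = [to_binary_vector(doc) for doc in documents]

for doc, vec in zip(documents, binary_vectors):
    print(vec, "<-", doc)
```

Here the entry for "is" in Document 2 is 1 rather than 2, even though the word appears twice.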
Python Code for Bag of Words implementation
Bag of Words implementation is very easy and straightforward in Python with the scikit-learn package. Suppose you have input text as input['text'], which holds rows of different texts; then BOW can be implemented as:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()  # in scikit-learn
final_counts = count_vect.fit_transform(input['text'].values)
final_counts will be a sparse matrix.
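As a self-contained variant of the snippet above, the sketch below runs `CountVectorizer` on the four example documents passed as a plain Python list (an assumption made here so the example does not need the `input['text']` data). It also shows the `binary=True` option, which gives the Binary Bag of Words variant.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "This pasta is very tasty and affordable",
    "This pasta is not tasty and is affordable",
    "This pasta is delicious and cheap",
    "Pasta is tasty and pasta tastes good.",
]

# Count-based Bag of Words.
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(documents)

print(final_counts.shape)              # (number of documents, number of unique words)
print(sorted(count_vect.vocabulary_))  # learned dictionary, in column order
print(final_counts.toarray())          # dense view, for inspection only

# Sparsity: stored non-zero entries versus total entries.
total = final_counts.shape[0] * final_counts.shape[1]
print(f"{final_counts.nnz} non-zero values out of {total}")

# Binary Bag of Words: every non-zero count is recorded as 1.
binary_vect = CountVectorizer(binary=True)
binary_counts = binary_vect.fit_transform(documents)
print(binary_counts.toarray())
```

Note that `CountVectorizer` lowercases text by default and its default tokenizer ignores single-character tokens, so its dictionary can differ from a naive whitespace split on other inputs. Calling `toarray()` is fine for a tiny example but defeats the purpose of the sparse matrix on a large corpus.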