Converting Text to Features in Natural Language Processing

In this blog, we will discuss methods for converting text to features. It will cover:

  1. One Hot Encoding

  2. Count Vectorizer

  3. N-grams

  4. Hash vectorizer

  5. Term Frequency-Inverse Document Frequency (TF-IDF)

Converting text to features using One Hot Encoding

In [1]:
import pandas as pd
In [2]:
text = "I love my family"
In [3]:
tokenize_words = text.split()
In [4]:
pd.get_dummies(tokenize_words)
Out[4]:
I family love my
0 1 0 0 0
1 0 0 1 0
2 0 0 0 1
3 0 1 0 0

The output has 4 features, as the number of unique words present in the input was 4.
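The same idea can be written out without pandas, as a rough sketch: build a sorted vocabulary of the unique tokens and mark each token's position with a 1 (the variable names here are our own).

# Manual one-hot encoding of the tokens from the cells above
vocab = sorted(set(tokenize_words))                     # ['I', 'family', 'love', 'my']
one_hot = [[int(v == w) for v in vocab] for w in tokenize_words]
print(vocab)
print(one_hot)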

Converting text to features using Count Vectorizer

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

class sklearn.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Convert a collection of text documents to a matrix of token counts.

CountVectorizer implements both tokenization and occurrence counting in a single class. Vectorization is the general process of turning a collection of text documents into numerical feature vectors.

Example 1:

In [6]:
text = ["I love my family and in my family there are ten memebers."]

Create a CountVectorizer() object

In [7]:
vectorizer = CountVectorizer()

Tokenize the text and count token occurrences

In [8]:
vector = vectorizer.fit_transform(text)
vector
Out[8]:
<1x9 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

View feature names

In [9]:
vectorizer.get_feature_names()
Out[9]:
['and', 'are', 'family', 'in', 'love', 'memebers', 'my', 'ten', 'there']

Features with their indices

In [10]:
vectorizer.vocabulary_
Out[10]:
{'love': 4,
 'my': 6,
 'family': 2,
 'and': 0,
 'in': 3,
 'there': 8,
 'are': 1,
 'ten': 7,
 'memebers': 5}
In [11]:
vector.toarray()
Out[11]:
array([[1, 1, 2, 1, 1, 1, 2, 1, 1]])

We can see that the words 'my' and 'family' appear twice.
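To make the counts easier to read, the sparse output can be paired with the feature names in a DataFrame. A small sketch reusing the vectorizer and vector objects from the cells above (newer scikit-learn versions use get_feature_names_out() instead of get_feature_names()):

import pandas as pd

# Each column is a feature name, each value its count in the document
pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names())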

Example 2:

In [12]:
text1 = [
        "I love my family", 
        "In my family there are ten memebers", 
        "All of them are stay together"
]
In [13]:
vectorizer1 = CountVectorizer()
In [14]:
vector1 = vectorizer1.fit_transform(text1)
vector1
Out[14]:
<3x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>
In [15]:
vectorizer1.get_feature_names()
Out[15]:
['all',
 'are',
 'family',
 'in',
 'love',
 'memebers',
 'my',
 'of',
 'stay',
 'ten',
 'them',
 'there',
 'together']
In [16]:
vectorizer1.vocabulary_
Out[16]:
{'love': 4,
 'my': 6,
 'family': 2,
 'in': 3,
 'there': 11,
 'are': 1,
 'ten': 9,
 'memebers': 5,
 'all': 0,
 'of': 7,
 'them': 10,
 'stay': 8,
 'together': 12}
In [17]:
vector1.toarray()
Out[17]:
array([[0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1]])
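Once the vocabulary has been learned with fit_transform, new documents can be transformed with the same vocabulary. A small sketch with an example sentence of our own; words that were not seen during fitting, such as "friends", are simply ignored:

# Transform a new document using the vocabulary learned from text1
new_doc = ["my friends love my family"]
vectorizer1.transform(new_doc).toarray()
# only 'family', 'love' and 'my' are counted; 'friends' has no column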

Converting text to features using N-grams

Bigram features

Bigram Example 1:

In [18]:
text = ["I love my family and in my family there are ten memebers."]

ngram_range: tuple (min_n, max_n), default=(1, 1)

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
In [19]:
vectorizer2 = CountVectorizer(ngram_range=(2, 2))
In [20]:
vector2 = vectorizer2.fit_transform(text)
vector2
Out[20]:
<1x9 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>
In [21]:
vectorizer2.vocabulary_
Out[21]:
{'love my': 5,
 'my family': 6,
 'family and': 2,
 'and in': 0,
 'in my': 4,
 'family there': 3,
 'there are': 8,
 'are ten': 1,
 'ten memebers': 7}
In [22]:
vector2.toarray()
Out[22]:
array([[1, 1, 1, 1, 1, 1, 2, 1, 1]])

We can see "my family" bigram came twice.

Bigram Example 2:

In [23]:
text1 = [
        "I love my family", 
        "In my family there are ten memebers", 
        "All of them are stay together"
]
In [24]:
vectorizer3 = CountVectorizer(ngram_range=(2, 2))
vector3 = vectorizer3.fit_transform(text1)
vector3
Out[24]:
<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>
In [25]:
vectorizer3.vocabulary_
Out[25]:
{'love my': 5,
 'my family': 6,
 'in my': 4,
 'family there': 3,
 'there are': 11,
 'are ten': 2,
 'ten memebers': 9,
 'all of': 0,
 'of them': 7,
 'them are': 10,
 'are stay': 1,
 'stay together': 8}
In [26]:
vector3.toarray()
Out[26]:
array([[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1],
       [1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0]])

Trigram features

Trigram Example 1:

In [27]:
vectorizer4 = CountVectorizer(ngram_range=(3, 3))
In [28]:
vector4 = vectorizer4.fit_transform(text)
vector4
Out[28]:
<1x9 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>
In [29]:
vectorizer4.vocabulary_
Out[29]:
{'love my family': 5,
 'my family and': 6,
 'family and in': 2,
 'and in my': 0,
 'in my family': 4,
 'my family there': 7,
 'family there are': 3,
 'there are ten': 8,
 'are ten memebers': 1}
In [30]:
vector4.toarray()
Out[30]:
array([[1, 1, 1, 1, 1, 1, 1, 1, 1]])
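Every trigram occurs only once in this sentence, so all counts are 1. As described in the ngram_range documentation above, (1, 2) extracts unigrams and bigrams together; a minimal sketch with our own variable names:

# Combine unigrams and bigrams into one vocabulary
vectorizer_uni_bi = CountVectorizer(ngram_range=(1, 2))
vector_uni_bi = vectorizer_uni_bi.fit_transform(text)
print(len(vectorizer_uni_bi.vocabulary_))   # 18 features: 9 unigrams + 9 bigrams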

Converting text to features using Hash vectorizer

In [31]:
from sklearn.feature_extraction.text import HashingVectorizer

class sklearn.feature_extraction.text.HashingVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float64'>)

Convert a collection of text documents to a matrix of token occurrences.

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

I. it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory.

II. it is fast to pickle and un-pickle as it holds no state besides the constructor parameters.

III. it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

I. there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.

II. there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).

III. no IDF weighting as this would render the transformer stateful.
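The statelessness mentioned in point III is easy to verify: transform() works without calling fit(), and two independent instances with the same parameters produce identical vectors. A small sketch with our own variable names:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# No vocabulary is learned, so transform() needs no prior fit(),
# and two separate instances agree on the same input
hv_a = HashingVectorizer(n_features=10)
hv_b = HashingVectorizer(n_features=10)
doc = ["I love my family"]
print(np.array_equal(hv_a.transform(doc).toarray(),
                     hv_b.transform(doc).toarray()))   # True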

Example 1

In [32]:
text = ["I love my family and in my family there are ten memebers."]
In [33]:
vectorizer5 = HashingVectorizer(n_features = 10)
In [34]:
vector5 = vectorizer5.fit_transform(text)
print(vector5.shape)
(1, 10)
In [35]:
vector5.toarray()
Out[35]:
array([[ 0.5547002, -0.2773501,  0.5547002,  0.       ,  0.       ,
        -0.2773501,  0.2773501,  0.2773501,  0.       ,  0.2773501]])

It created a vector of size 10. The values are l2-normalised by default (norm='l2'), and some are negative because alternate_sign=True adds an alternating sign to the hashed features.

Example 2

In [36]:
text1 = [
        "I love my family", 
        "In my family there are ten memebers", 
        "All of them are stay together"
]
In [37]:
vectorizer6 = HashingVectorizer(n_features = 10)
vector6 = vectorizer6.fit_transform(text1)
print(vector6.shape)
(3, 10)
In [38]:
vector6.toarray()
Out[38]:
array([[ 0.57735027, -0.57735027,  0.57735027,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.4472136 ,  0.        ,  0.4472136 ,  0.        ,  0.        ,
         0.        ,  0.4472136 ,  0.4472136 ,  0.        ,  0.4472136 ],
       [ 0.        ,  0.31622777,  0.        ,  0.        ,  0.        ,
         0.        ,  0.9486833 ,  0.        ,  0.        ,  0.        ]])
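With only 10 buckets, distinct tokens can be hashed to the same column, which is why the third document (six words) ends up with only two non-zero entries. A quick sketch using a larger n_features, e.g. 2 ** 18 as suggested in the cons above, which makes collisions much less likely:

# A wider hash space reduces collisions; the output stays sparse
vectorizer_large = HashingVectorizer(n_features=2**18)
print(vectorizer_large.fit_transform(text1).shape)   # (3, 262144)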

Converting text to features using Term Frequency-Inverse Document Frequency (TF-IDF)

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

class sklearn.feature_extraction.text.TfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Convert a collection of raw documents to a matrix of TF-IDF features.

Example 1

In [40]:
text = ["I love my family", "In my family there are ten memebers."]
In [41]:
vectorizer7 = TfidfVectorizer()
vectorizer7.fit_transform(text)
Out[41]:
<2x8 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>
In [42]:
vectorizer7.vocabulary_
Out[42]:
{'love': 3,
 'my': 5,
 'family': 1,
 'in': 2,
 'there': 7,
 'are': 0,
 'ten': 6,
 'memebers': 4}
In [48]:
vectorizer7.idf_
Out[48]:
array([1.40546511, 1.        , 1.40546511, 1.40546511, 1.40546511,
       1.        , 1.40546511, 1.40546511])

If we observe, "my" and "family" words are appearing two times in both the documents. That's why the vector value are 1, which is less than all the other vector representations of the tokens.

Example 2

In [44]:
text1 = [
        "I love my family", 
        "In my family there are ten memebers", 
        "All of them are stay together"
]
In [45]:
vectorizer8 = TfidfVectorizer()
vectorizer8.fit_transform(text1)
Out[45]:
<3x13 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>
In [46]:
vectorizer8.vocabulary_
Out[46]:
{'love': 4,
 'my': 6,
 'family': 2,
 'in': 3,
 'there': 11,
 'are': 1,
 'ten': 9,
 'memebers': 5,
 'all': 0,
 'of': 7,
 'them': 10,
 'stay': 8,
 'together': 12}
In [47]:
vectorizer8.idf_
Out[47]:
array([1.69314718, 1.28768207, 1.28768207, 1.69314718, 1.69314718,
       1.69314718, 1.28768207, 1.69314718, 1.69314718, 1.69314718,
       1.69314718, 1.69314718, 1.69314718])
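To see the final TF-IDF weights (term frequency multiplied by IDF, with each row l2-normalised by default), the matrix can be viewed alongside the feature names, reusing vectorizer8 and text1 from above:

import pandas as pd

# Final TF-IDF matrix with feature names as columns
tfidf1 = vectorizer8.fit_transform(text1)
pd.DataFrame(tfidf1.toarray(), columns=vectorizer8.get_feature_names())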
