Handling Text For Machine Learning

In this blog, we will see how to clean text: removing whitespace, punctuation and capitalization. We will also cover text tokenization, stop word removal, stemming, part-of-speech tagging, binarization and vectorization.

Remove white space

In [1]:
text_data = ["  John is a nice boy  ",
            "He is in class 10th student.",
            "    Today he has pre board exam.      "]
text_data
Out[1]:
['  John is a nice boy  ',
 'He is in class 10th student.',
 '    Today he has pre board exam.      ']

strip whitespace

In [2]:
strip_whitespace = [string.strip() for string in text_data]
strip_whitespace
Out[2]:
['John is a nice boy',
 'He is in class 10th student.',
 'Today he has pre board exam.']

remove period

In [3]:
remove_periods = [string.replace(".", "") for string in strip_whitespace]
remove_periods
Out[3]:
['John is a nice boy',
 'He is in class 10th student',
 'Today he has pre board exam']

remove capitalization

In [4]:
def remove_capital_letter(string):
    return string.lower()
In [5]:
cleaned_data = [remove_capital_letter(string) for string in remove_periods]
cleaned_data
Out[5]:
['john is a nice boy',
 'he is in class 10th student',
 'today he has pre board exam']

Convert to uppercase

In [6]:
def capitalization(string):
    return string.upper()
In [7]:
[capitalization(string) for string in cleaned_data]
Out[7]:
['JOHN IS A NICE BOY',
 'HE IS IN CLASS 10TH STUDENT',
 'TODAY HE HAS PRE BOARD EXAM']

Remove Punctuation

In [8]:
import unicodedata
import sys
In [9]:
text_data = ['Hello!!!', 'How are you????????????????', 'Are you coming to my place, is that 100% confirmed!!!!', 
             'We will have lot of fun, right!!??', 'Bye!!!!!!!!!!']

classmethod fromkeys(iterable[, value])

Create a new dictionary with keys from iterable and values set to value.

fromkeys() is a class method that returns a new dictionary. value defaults to None. All of the values refer to just a single instance, so it generally doesn’t make sense for value to be a mutable object such as an empty list. To get distinct values, use a dict comprehension instead.
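
A quick, hypothetical illustration of fromkeys() before we use it below (the variable names here are ours):

keys = ["a", "b", "c"]
dict.fromkeys(keys)        # {'a': None, 'b': None, 'c': None}
dict.fromkeys(keys, 0)     # {'a': 0, 'b': 0, 'c': 0}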

create a dictionary of punctuation characters

In [10]:
punctuation = dict.fromkeys(
    i for i in range(sys.maxunicode)
    if unicodedata.category(chr(i)).startswith('P')
)

punctuation

for each string, remove all punctuation characters

str.translate(table)
Return a copy of the string in which each character has been mapped through the given translation table. The table must be an object that implements indexing via __getitem__(), typically a mapping or sequence. When indexed by a Unicode ordinal (an integer), the table object can do any of the following: return a Unicode ordinal or a string, to map the character to one or more other characters; return None, to delete the character from the return string; or raise a LookupError exception, to map the character to itself.
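
As a tiny, self-contained illustration of the deletion behaviour (the example string is ours):

table = {ord("!"): None, ord("?"): None}
"Hello!? How are you?".translate(table)   # punctuation characters are dropped
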
In [11]:
[string.translate(punctuation) for string in text_data]
Out[11]:
['Hello',
 'How are you',
 'Are you coming to my place is that 100 confirmed',
 'We will have lot of fun right',
 'Bye']

Parsing and Cleaning HTML

In [12]:
from bs4 import BeautifulSoup
In [13]:
html = """
    <html><head><title>Alice's Adventures in Wonderland Story</title></head>
    <body>
        <p class="title"><b>Alice's Adventures in Wonderland</b></p>

        <p class="story">Alice's Adventures in Wonderland (commonly Alice in Wonderland) is an 1865 English novel by Lewis Carroll.</p>
        <p class="story">A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures.</p>
        <p class="story">It is seen as an example of the literary nonsense genre. </p>
    </body>
    </html>
"""
In [14]:
soup = BeautifulSoup(html, "html.parser")

find title from p tag

In [15]:
soup.find("p", {"class": "title"}).text
Out[15]:
"Alice's Adventures in Wonderland"

find text from p tag, which has class "story"

In [16]:
for s in soup.find_all('p', {"class": "story"}):
    print(s.text)
Alice's Adventures in Wonderland (commonly Alice in Wonderland) is an 1865 English novel by Lewis Carroll.
A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures.
It is seen as an example of the literary nonsense genre. 
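
If we only need the visible text with all of the markup stripped, BeautifulSoup's get_text() returns it in one call (leading and trailing whitespace may still need cleaning):

print(soup.get_text())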

Tokenizing using nltk

What is tokenization in NLP (Natural Language Processing)?

Tokenization is the process of breaking raw text into smaller units, called tokens, typically words or sentences.

word tokenization

In [17]:
from nltk.tokenize import word_tokenize
In [18]:
string = "Alice in Wonderland is an 1865 English novel by Lewis Carroll"
In [19]:
tokenized_words = word_tokenize(string)
tokenized_words
Out[19]:
['Alice',
 'in',
 'Wonderland',
 'is',
 'an',
 '1865',
 'English',
 'novel',
 'by',
 'Lewis',
 'Carroll']

sentence tokenization

In [20]:
from nltk.tokenize import sent_tokenize
In [21]:
sentences = "Alice in Wonderland is an 1865 English novel by Lewis Carroll. \
A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures. \
It is seen as an example of the literary nonsense genre."
In [22]:
sent_tokenize(sentences)
Out[22]:
['Alice in Wonderland is an 1865 English novel by Lewis Carroll.',
 'A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures.',
 'It is seen as an example of the literary nonsense genre.']

Removing Stop Words

What are stop words?

Stop words are extremely common words that appear to add little value when selecting documents matching a user's need, so they are often excluded from the vocabulary entirely. Examples include "a", "an", "the", "is", "from" and "and".

In [23]:
from nltk.corpus import stopwords

We have already tokenized our words, so let us use them.

In [24]:
tokenized_words
Out[24]:
['Alice',
 'in',
 'Wonderland',
 'is',
 'an',
 '1865',
 'English',
 'novel',
 'by',
 'Lewis',
 'Carroll']

get the stopwords of English language

In [25]:
stop_words = stopwords.words('english')
In [26]:
len(stop_words)
Out[26]:
179
In [27]:
type(stop_words)
Out[27]:
list
In [28]:
stop_words[0:10]
Out[28]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

remove stopwords from tokenized_words

In [29]:
[word for word in tokenized_words if word not in stop_words]
Out[29]:
['Alice', 'Wonderland', '1865', 'English', 'novel', 'Lewis', 'Carroll']

We can see that the words "is", "in", "an" and "by" have been removed.
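
Note that the NLTK stop word list is all lowercase, so in practice tokens are usually lowercased before the comparison; a minimal sketch:

[word for word in tokenized_words if word.lower() not in stop_words]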

Stemming Words

What is stemming?

In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base or root form—generally a written word form.

A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish.

The stem need not be a word itself; for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.

nltk.stem package

NLTK Stemmers: Interfaces used to remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove the affixes required for e.g. grammatical role, tense and derivational morphology, leaving only the stem of the word. This is a difficult problem due to irregular words (e.g. common verbs in English), complicated morphological rules, and part-of-speech and sense ambiguities (e.g. ceil- is not the stem of ceiling).

Submodules

    nltk.stem.api module
    nltk.stem.arlstem module
    nltk.stem.arlstem2 module
    nltk.stem.cistem module
    nltk.stem.isri module
    nltk.stem.lancaster module
    nltk.stem.porter module
    nltk.stem.regexp module
    nltk.stem.rslp module
    nltk.stem.snowball module
    nltk.stem.util module
    nltk.stem.wordnet module

The Porter Stemming Algorithm

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.

In [30]:
import nltk
from nltk.stem.porter import PorterStemmer

Create stemmer

In [31]:
porter = PorterStemmer()

apply the stemmer to tokenized_words

In [32]:
tokenized_words =  ['cats', 'catlike', 'catty', 'playing', 'played', 'plays']
In [33]:
[porter.stem(word) for word in tokenized_words]
Out[33]:
['cat', 'catlik', 'catti', 'play', 'play', 'play']
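
The claim above about the argue family can be checked with the same stemmer:

# 'argue', 'argued', 'argues' and 'arguing' all reduce to the non-word stem 'argu'.
[porter.stem(word) for word in ["argue", "argued", "argues", "arguing"]]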

The Snowball stemmers

Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. Several useful stemmers have been implemented in it.

Snowball Stemmer supports the following languages: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.

In [34]:
from nltk.stem import SnowballStemmer
In [35]:
snowball_stemmer = SnowballStemmer("english")
In [36]:
tokenized_words
Out[36]:
['cats', 'catlike', 'catty', 'playing', 'played', 'plays']
In [37]:
[snowball_stemmer.stem(word) for word in tokenized_words]
Out[37]:
['cat', 'catlik', 'catti', 'play', 'play', 'play']
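
Because Snowball provides stemmers for many languages, the same interface works beyond English; for example, a German stemmer (the example words here are ours):

german_stemmer = SnowballStemmer("german")
[german_stemmer.stem(word) for word in ["katze", "katzen"]]   # both reduce to a common stem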

Tagging Part of Speech

nltk.tag package

NLTK Taggers: This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.

A "tag" is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (tag, token). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):

In [38]:
from nltk import pos_tag, word_tokenize
In [39]:
text_data = "John's big idea isn't all that bad."
In [40]:
tagged_text = pos_tag(word_tokenize(text_data)) 
tagged_text
Out[40]:
[('John', 'NNP'),
 ("'s", 'POS'),
 ('big', 'JJ'),
 ('idea', 'NN'),
 ('is', 'VBZ'),
 ("n't", 'RB'),
 ('all', 'PDT'),
 ('that', 'DT'),
 ('bad', 'JJ'),
 ('.', '.')]
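
If a tag is unfamiliar, NLTK's help module can print its definition (it may first require downloading the 'tagsets' resource with nltk.download('tagsets')):

import nltk

nltk.help.upenn_tagset('NNP')   # prints the description of the NNP (proper noun) tag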

tagging multiple sentences

In [41]:
sentences = "Alice in Wonderland is an 1865 English novel by Lewis Carroll. \
A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures. \
It is seen as an example of the literary nonsense genre."
In [42]:
tokenize_sentences = sent_tokenize(sentences)
tokenize_sentences
Out[42]:
['Alice in Wonderland is an 1865 English novel by Lewis Carroll.',
 'A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures.',
 'It is seen as an example of the literary nonsense genre.']
In [43]:
tagged_texts = []

for sentence in tokenize_sentences:
    tagged_sentence = pos_tag(word_tokenize(sentence))
    tagged_texts.append(tagged_sentence)
In [44]:
tagged_texts
Out[44]:
[[('Alice', 'NNP'),
  ('in', 'IN'),
  ('Wonderland', 'NNP'),
  ('is', 'VBZ'),
  ('an', 'DT'),
  ('1865', 'CD'),
  ('English', 'NNP'),
  ('novel', 'NN'),
  ('by', 'IN'),
  ('Lewis', 'NNP'),
  ('Carroll', 'NNP'),
  ('.', '.')],
 [('A', 'DT'),
  ('young', 'JJ'),
  ('girl', 'NN'),
  ('named', 'VBN'),
  ('Alice', 'NNP'),
  ('falls', 'VBZ'),
  ('through', 'IN'),
  ('a', 'DT'),
  ('rabbit', 'NN'),
  ('hole', 'NN'),
  ('into', 'IN'),
  ('a', 'DT'),
  ('fantasy', 'JJ'),
  ('world', 'NN'),
  ('of', 'IN'),
  ('anthropomorphic', 'JJ'),
  ('creatures', 'NNS'),
  ('.', '.')],
 [('It', 'PRP'),
  ('is', 'VBZ'),
  ('seen', 'VBN'),
  ('as', 'IN'),
  ('an', 'DT'),
  ('example', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('literary', 'JJ'),
  ('nonsense', 'NN'),
  ('genre', 'NN'),
  ('.', '.')]]

Binarization

Binarization is the process of transforming data features into binary values (0 or 1).

LabelBinarizer

In [45]:
from sklearn import preprocessing
In [46]:
lb = preprocessing.LabelBinarizer()
In [47]:
string = "Alice in Wonderland is an 1865 English novel by Lewis Carroll"
tokenized_words = word_tokenize(string)
tokenized_words
Out[47]:
['Alice',
 'in',
 'Wonderland',
 'is',
 'an',
 '1865',
 'English',
 'novel',
 'by',
 'Lewis',
 'Carroll']
In [48]:
lb.fit_transform(tokenized_words)
Out[48]:
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]])
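
Each column of the matrix corresponds to one of the learned classes; the classes_ attribute shows the column order:

lb.classes_   # sorted array of the tokens, one per column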

MultiLabelBinarizer

In [49]:
from sklearn.preprocessing import MultiLabelBinarizer
In [50]:
sentences = "Alice in Wonderland is an 1865 English novel by Lewis Carroll. \
A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures. \
It is seen as an example of the literary nonsense genre."
In [51]:
tagged_sentences = []

# Tag each sentence (not each character), then keep only the tags.
for sentence in sent_tokenize(sentences):
    tagged_sentence = nltk.pos_tag(word_tokenize(sentence))
    tagged_sentences.append([tag for word, tag in tagged_sentence])
tagged_sentences
In [52]:
mlb = MultiLabelBinarizer()
mlb.fit_transform(tagged_sentences)
Out[52]:
array([[1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1],
       [1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1]])
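
As with LabelBinarizer, the classes_ attribute tells us which part-of-speech tag each column represents:

mlb.classes_   # one column per tag seen during fit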

Vectorization

The raw text data, being a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors of fixed size rather than raw text documents of variable length.

Vectorization is the general process of turning a collection of text documents into numerical feature vectors. The specific strategy of tokenization, counting and normalization is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position of the words in the document.

CountVectorizer

CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

In [53]:
from sklearn.feature_extraction.text import CountVectorizer
In [54]:
text_data = ['Hello', 'How are you?', 'Are you coming to my place, is that 100% confirmed', 
             'We will have lot of fun, right.', 'Bye']
In [55]:
count = CountVectorizer()
In [56]:
bag_of_words = count.fit_transform(text_data)
bag_of_words
Out[56]:
<5x20 sparse matrix of type '<class 'numpy.int64'>'
	with 22 stored elements in Compressed Sparse Row format>
In [57]:
bag_of_words.toarray()
Out[57]:
array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
In [58]:
count.vocabulary_.keys()
Out[58]:
dict_keys(['hello', 'how', 'are', 'you', 'coming', 'to', 'my', 'place', 'is', 'that', '100', 'confirmed', 'we', 'will', 'have', 'lot', 'of', 'fun', 'right', 'bye'])

CountVectorizer with n-gram

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

In [59]:
cv = CountVectorizer(ngram_range=(2,2))
In [60]:
bag_of_words = cv.fit_transform(text_data)
bag_of_words
Out[60]:
<5x16 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>
In [61]:
bag_of_words.toarray()
Out[61]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1],
       [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
In [62]:
cv.vocabulary_.keys()
Out[62]:
dict_keys(['how are', 'are you', 'you coming', 'coming to', 'to my', 'my place', 'place is', 'is that', 'that 100', '100 confirmed', 'we will', 'will have', 'have lot', 'lot of', 'of fun', 'fun right'])
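
To keep single words and bigrams in the same vocabulary, ngram_range can span both sizes; a small variation on the cell above (the variable name cv_uni_bi is ours):

cv_uni_bi = CountVectorizer(ngram_range=(1, 2))
cv_uni_bi.fit_transform(text_data)
len(cv_uni_bi.vocabulary_)   # number of unigram + bigram features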

Term Frequency–Inverse Document Frequency

Tf–idf term weighting

In a large text corpus, some words appear very frequently (e.g. "the", "a", "is" in English) while carrying very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency.

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [64]:
tfidf = TfidfVectorizer()
In [65]:
transformer = tfidf.fit_transform(text_data)
transformer
Out[65]:
<5x20 sparse matrix of type '<class 'numpy.float64'>'
	with 22 stored elements in Compressed Sparse Row format>
In [66]:
transformer.toarray()
Out[66]:
array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.53177225, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.659118  , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.53177225],
       [0.32788062, 0.26453202, 0.        , 0.32788062, 0.32788062,
        0.        , 0.        , 0.        , 0.        , 0.32788062,
        0.        , 0.32788062, 0.        , 0.32788062, 0.        ,
        0.32788062, 0.32788062, 0.        , 0.        , 0.26453202],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.37796447, 0.37796447, 0.        , 0.        , 0.        ,
        0.37796447, 0.        , 0.37796447, 0.        , 0.37796447,
        0.        , 0.        , 0.37796447, 0.37796447, 0.        ],
       [0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ]])
In [67]:
tfidf.vocabulary_
Out[67]:
{'hello': 7,
 'how': 8,
 'are': 1,
 'you': 19,
 'coming': 3,
 'to': 16,
 'my': 11,
 'place': 13,
 'is': 9,
 'that': 15,
 '100': 0,
 'confirmed': 4,
 'we': 17,
 'will': 18,
 'have': 6,
 'lot': 10,
 'of': 12,
 'fun': 5,
 'right': 14,
 'bye': 2}

Each document is now represented as a vector of tf–idf weights, where larger values indicate words that are more important for that particular document.
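
For reference, with scikit-learn's defaults (smooth_idf=True and l2 normalisation) the weight of a term t in a document is its raw count multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t; each row is then scaled to unit length. A minimal sketch checking the 1.0 entry for 'bye' in the last document:

import numpy as np

n_documents = 5
df_bye = 1                                   # 'bye' occurs in only one document
idf_bye = np.log((1 + n_documents) / (1 + df_bye)) + 1
tfidf_bye = 1 * idf_bye                      # raw count of 'bye' in 'Bye' is 1
# 'Bye' contains only this term, so l2 normalisation scales the row to exactly 1.0,
# matching the last row of the matrix above.
tfidf_bye / np.sqrt(tfidf_bye ** 2)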
