Spelling Correction Of The Text Data In Natural Language Processing

In this blog, we will correct some typos using the TextBlob, Autocorrect, and pyspellchecker modules.

Installation

If you have already installed TextBlob, Autocorrect, and pyspellchecker, you can skip the installation step.

TextBlob

pip install TextBlob

Autocorrect

pip install autocorrect

pyspellchecker

pip install pyspellchecker
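
A quick way to confirm that the three packages are importable is sketched below. It uses only the standard library; `check_installed` is a hypothetical helper name, and the module names are the ones imported later in this post.

```python
import importlib.util

# Import names taken from the import statements used later in this post.
MODULES = ["textblob", "autocorrect", "spellchecker"]

def check_installed(modules=MODULES):
    """Map each module name to True if it can be imported, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

for name, ok in check_installed().items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

Note that pyspellchecker is installed under the name `pyspellchecker` but imported as `spellchecker`.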

Spelling Correction Using TextBlob

In [1]:
from textblob import TextBlob

Example 1:

In [2]:
text = "Machine learnning is a branch of artifecial intelligence and computer sciance."

Notice that I have misspelled "learning", "artificial", and "science". Let us check whether the TextBlob library can correct these mistakes.

Attempt to correct the spelling of a blob

In [3]:
TextBlob(text).correct()
Out[3]:
TextBlob("Machine learning is a branch of artificial intelligence and computer science.")

In this example, it corrected all three spelling errors.

Example 2:

In [4]:
text1 = "John is gud boy and he plays fotball"
In [5]:
TextBlob(text1).correct()
Out[5]:
TextBlob("John is god boy and he plays football")

In this example, it corrected "fotball" to "football" but changed "gud" to "god" instead of "good".
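
The TextBlob documentation credits Peter Norvig's "How to Write a Spelling Corrector" as the basis for `correct()`. The following is a minimal sketch of that idea, not TextBlob's actual code, and the `WORD_COUNTS` table is invented purely for illustration. It suggests why "gud" can become "god": "god" is one edit away, while "good" needs two edits, and candidates at the smaller distance are preferred.

```python
import string

# Hypothetical word frequencies, invented for this illustration only.
WORD_COUNTS = {"god": 500, "good": 800, "boy": 200, "plays": 90, "football": 120}

def edits1(word):
    """Return every string one edit (delete, transpose, replace, insert) from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the words that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}

def correct(word):
    """Pick the most frequent known word at the smallest edit distance."""
    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}
    )
    return max(candidates, key=lambda w: WORD_COUNTS.get(w, 0))

print(correct("gud"))      # god
print(correct("fotball"))  # football
```

Even though "good" has the higher count in this toy table, it is never considered for "gud" because a known word was already found at distance one.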

Example 3:

In this example, the text data is a list. We will convert the list to a pandas DataFrame and then apply a lambda function to correct the spelling.

In [6]:
import pandas as pd

Define text data

In [7]:
text3 = [
    'Natural languuage procesing is a branch of computer sciance and artifecial intelligance.',
    'The Python programing language provides a wide range of tools and libraries for NLP tassks.', 
    'NLTK, an open source colection of libraries for buildding NLP programs.',
    'Spacy and Textblob libraries are also quite popular.',
    'NLP tasks are very interresting.'
]

I have deliberately entered misspelled words in the list so that we can correct them. The misspelled words are:

  1. languuage
  2. procesing
  3. sciance
  4. artifecial
  5. intelligance
  6. tassks
  7. colection
  8. buildding
  9. interresting

Set column width 500

This step is not required if the sentences in the text data are short.

In [8]:
pd.set_option('display.max_colwidth', 500)

Convert into pandas dataframe

In [9]:
df = pd.DataFrame({'text':text3})
print(df)
                                                                                          text
0     Natural languuage procesing is a branch of computer sciance and artifecial intelligance.
1  The Python programing language provides a wide range of tools and libraries for NLP tassks.
2                      NLTK, an open source colection of libraries for buildding NLP programs.
3                                         Spacy and Textblob libraries are also quite popular.
4                                                             NLP tasks are very interresting.

Apply a lambda function to df['text'] and correct the spellings

In [10]:
df['text'].apply(lambda x: str(TextBlob(x).correct()))
Out[10]:
0       Natural language processing is a branch of computer science and artificial intelligence.
1    The Python programming language provides a wide range of tools and libraries for NLP tasks.
2                        NLTK, an open source collection of libraries for building NLP programs.
3                                             Pay and Textblob libraries are also quite popular.
4                                                                NLP tasks are very interesting.
Name: text, dtype: object

Corrected the spellings of:

  1. languuage - language
  2. procesing - processing
  3. sciance - science
  4. artifecial - artificial
  5. intelligance - intelligence
  6. tassks - tasks
  7. colection - collection
  8. buildding - building
  9. interresting - interesting

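All nine misspelled words were corrected. Note, however, that TextBlob also changed the correctly spelled library name "Spacy" to "Pay" in row 3.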
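
Because a corrector can also change words that were already right, as with "Spacy" becoming "Pay" in row 3 of the output above, it can help to list exactly which words changed. The sketch below uses only the standard library; `changed_words` is a hypothetical helper and assumes the corrector keeps the number of whitespace-separated words the same.

```python
def changed_words(original, corrected):
    """Return (before, after) pairs for words that differ between two sentences.

    Assumes both sentences have the same number of whitespace-separated words.
    """
    before, after = original.split(), corrected.split()
    if len(before) != len(after):
        raise ValueError("word counts differ; cannot align the sentences")
    return [(b, a) for b, a in zip(before, after) if b != a]

original = "Spacy and Textblob libraries are also quite popular."
corrected = "Pay and Textblob libraries are also quite popular."
print(changed_words(original, corrected))  # [('Spacy', 'Pay')]
```

Reviewing these pairs before overwriting the original column makes unwanted changes easy to spot.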

Spelling Correction Using Autocorrect

In [11]:
from autocorrect import Speller 

Example 1:

In [12]:
spell = Speller()
In [13]:
text4 = "Machine learnning is a branch of artifecial intelligence and computer sciance."
In [14]:
spell(text4)
Out[14]:
'Machine learning is a branch of artificial intelligence and computer science.'

It corrected the spelling of "learning", "artificial", and "science".

Example 2:

In [15]:
spell1 = Speller()
In [16]:
text5 = [
    'Natural languuage procesing is a branch of computer sciance and artifecial intelligance.',
    'The Python programing language provides a wide range of tools and libraries for NLP tassks.', 
    'NLTK, an open source colection of libraries for buildding NLP programs.',
    'Spacy and Textblob libraries are also quite popular.',
    'NLP tasks are very interresting.'
]
In [17]:
df1 = pd.DataFrame({'text':text5})
df1
Out[17]:
text
0 Natural languuage procesing is a branch of computer sciance and artifecial intelligance.
1 The Python programing language provides a wide range of tools and libraries for NLP tassks.
2 NLTK, an open source colection of libraries for buildding NLP programs.
3 Spacy and Textblob libraries are also quite popular.
4 NLP tasks are very interresting.
In [18]:
df1['text'].apply(lambda x: str(spell1(x)))
Out[18]:
0      Natural language processing is a branch of computer science and artificial intelligence.
1    The Python programming language provides a wide range of tools and libraries for LP tasks.
2                          LT, an open source collection of libraries for building LP programs.
3                                          Space and Textbook libraries are also quite popular.
4                                                                LP tasks are very interesting.
Name: text, dtype: object
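
In the output above, the misspelled words are fixed, but "NLP" became "LP", "NLTK" became "LT", "Spacy" became "Space", and "Textblob" became "Textbook". One possible guard is to skip tokens that appear in a list of protected terms. The sketch below is library-agnostic: `correct_word` is any function that maps a word to its correction, and `PROTECTED`, `correct_text`, and `toy_corrector` are hypothetical names. The toy corrector only mimics the behaviour seen above so the example runs without any third-party package.

```python
import re

# Hypothetical allow-list of terms that must never be "corrected".
PROTECTED = {"nlp", "nltk", "spacy", "textblob"}

def correct_text(text, correct_word, protected=PROTECTED):
    """Apply correct_word to each alphabetic token, skipping protected terms."""
    def fix(match):
        token = match.group(0)
        if token.lower() in protected:
            return token  # leave protected terms untouched
        return correct_word(token)
    return re.sub(r"[A-Za-z]+", fix, text)

# Stand-in corrector that imitates the changes shown in the output above.
TOY_FIXES = {"colection": "collection", "buildding": "building",
             "NLTK": "LT", "NLP": "LP"}

def toy_corrector(word):
    return TOY_FIXES.get(word, word)

sentence = "NLTK, an open source colection of libraries for buildding NLP programs."
print(correct_text(sentence, toy_corrector))
# NLTK, an open source collection of libraries for building NLP programs.
```

With Autocorrect, the call would plausibly be `correct_text(x, spell1)`, since a `Speller` instance is called directly on a string in the cells above.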

Spelling Correction Using pyspellchecker

In [19]:
from spellchecker import SpellChecker
In [20]:
spell2 = SpellChecker()
In [21]:
text6 = "Machine learnning is a branch of artifecial intelligence and computer sciance."
In [22]:
text6 = text6.split()
text6
Out[22]:
['Machine',
 'learnning',
 'is',
 'a',
 'branch',
 'of',
 'artifecial',
 'intelligence',
 'and',
 'computer',
 'sciance.']

Get the misspelled words

In [23]:
misspelled_text6 = spell2.unknown(text6)
misspelled_text6
Out[23]:
{'artifecial', 'learnning', 'sciance.'}
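
Splitting on whitespace leaves punctuation attached to words, which is why 'sciance.' appears with its period in the set above. A minimal sketch of tokenizing with a regular expression before the check is shown below; `tokenize` is a hypothetical helper, and lowercasing the tokens is a choice made here for simplicity.

```python
import re

def tokenize(text):
    """Return lowercase word tokens, dropping punctuation but keeping apostrophes inside words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

text = "Machine learnning is a branch of artifecial intelligence and computer sciance."
print(tokenize(text))
# ['machine', 'learnning', 'is', 'a', 'branch', 'of', 'artifecial',
#  'intelligence', 'and', 'computer', 'sciance']
```

The resulting list can be passed to `spell2.unknown()` in place of the output of `split()`.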

Correct the misspelled words

In [24]:
for word in misspelled_text6:
    # Most likely correction for the word
    print(spell2.correction(word))
    # Set of all 'likely' candidate corrections
    print(spell2.candidates(word))
learning
{'learning'}
science
{'sciencey', 'sciences', 'science'}
artificial
{'artifcial', 'artificial'}
We got the corrected words.
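
The loop above prints corrections one word at a time. To get a corrected sentence back, the corrections can be applied to every word and joined again. The sketch below assumes a `correction` function that may return `None` when it has no suggestion (recent pyspellchecker releases may behave this way, so check your installed version); `rebuild` and `TOY` are hypothetical names, and `TOY.get` stands in for a real corrector.

```python
def rebuild(words, correction):
    """Join words after applying correction; keep the original word when correction returns None."""
    fixed = []
    for word in words:
        suggestion = correction(word)
        fixed.append(suggestion if suggestion is not None else word)
    return " ".join(fixed)

# Stand-in lookup built from the corrections printed above; returns None for other words.
TOY = {"learnning": "learning", "artifecial": "artificial", "sciance.": "science"}

words = "Machine learnning is a branch of artifecial intelligence and computer sciance.".split()
print(rebuild(words, TOY.get))
# Machine learning is a branch of artificial intelligence and computer science
```

The trailing period is lost because the correction for 'sciance.' is 'science', as in the output above.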
