Sentiment Analysis for Hotel Reviews With NLTK and Keras

To find out whether text expresses positive, negative or neutral sentiment.

What is sentiment analysis?

Sentiment analysis is a text analysis technique that detects the emotion expressed in a piece of text, typically positive, negative or neutral. Businesses often apply it to social media data, market research, customer service interactions, product reviews and other feedback to understand how customers feel.

In this document we are going to use hotel reviews and find out the customers' sentiment.

Load Data

I am using a reviews.csv file. You can download it if you want to follow along.

Declare file path

In [4]:
inputFolder = "input/"
filePath = inputFolder + 'reviews.csv'
filePath
Out[4]:
'input/reviews.csv'

Read csv file

read_csv() is an important pandas function which reads a comma-separated values (CSV) file and converts it into a DataFrame.

In [5]:
import pandas as pd
In [6]:
df = pd.read_csv(filePath)
df.head()
Out[6]:
Text Sentiment
0 The rooms are extremely small, practically onl... negative
1 Room safe did not work. negative
2 Mattress very comfortable. positive
3 Very uncomfortable, thin mattress, with plasti... negative
4 No bathroom in room negative
In [7]:
print(df.shape)
print(df.dtypes)
(306, 2)
Text         object
Sentiment    object
dtype: object

Remove missing data

Check for NaN values

DataFrame.isna(): Detect missing values. Returns a boolean same-sized object indicating whether the values are NA. NA values, such as None or numpy.NaN, get mapped to True; everything else gets mapped to False.

In [8]:
df['Sentiment'].isna()
Out[8]:
0      False
1      False
2      False
3      False
4      False
       ...  
301     True
302     True
303     True
304     True
305     True
Name: Sentiment, Length: 306, dtype: bool

sum(): counts the NaN values in the df['Sentiment'] column (each True counts as 1).

In [9]:
df['Sentiment'].isna().sum()
Out[9]:
99

Drop missing rows

The pandas dropna() method drops rows (or columns) that contain null values.

In [10]:
df = df.dropna()
print(df.shape)
print(df.head())
(207, 2)
                                                Text Sentiment
0  The rooms are extremely small, practically onl...  negative
1                            Room safe did not work.  negative
2                         Mattress very comfortable.  positive
3  Very uncomfortable, thin mattress, with plasti...  negative
4                                No bathroom in room  negative

Clean, tokenize and get frequencies of positive reviews

Separate positive reviews from dataframe

In [11]:
#select rows where df['Sentiment'] == 'positive'
positive_reviews = df.loc[df['Sentiment'] == 'positive']
print(positive_reviews.head())
                                                Text Sentiment
2                         Mattress very comfortable.  positive
5                           The bed was soooo comfy.  positive
7                       The bed is very comfortable.  positive
8   Very spacious rooms, quiet and very comfortable.  positive
11                    Air conditioning working fine.  positive

Clean positive reviews

In [12]:
#import regular expression module
import re

The re module provides regular expression matching operations similar to those found in Perl. A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

\w matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as digits and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

[^\w ] therefore matches any character that is not a word character or a space; below, every such character is replaced with a space.
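
As a quick illustration (a minimal sketch on a made-up review string, not one from the dataset), the substitution turns every punctuation character into a space and leaves letters, digits and underscores alone:

import re

sample = "Great stay!!! The staff were friendly & helpful :-)"
# replace every character that is not a word character or a space with a space
cleaned = re.sub(r"[^\w ]", " ", sample.lower())
print(cleaned)
# 'great stay    the staff were friendly   helpful    '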

In [13]:
#lowercase each review and replace every non-word, non-space character with a space
positive_reviews = [re.sub(r"[^\w ]", " ", x.lower()) for x in positive_reviews['Text']]
#join all cleaned positive reviews into one string
positive_reviews = ' '.join(positive_reviews)
In [14]:
positive_reviews[:500]
Out[14]:
'mattress very comfortable  the bed was soooo comfy  the bed is very comfortable  very spacious rooms  quiet and very comfortable  air conditioning working fine  the room had cable tv  a safe  an iron  a hairdryer and free coffee   tea in the downstairs area  the building was under renovation  and hip and clean  comfortable bed and selection of pillows  hot water was abundant and the a c  window unit  was ice cold   excellent  access to fitness facilities  especially the pool was a real value add'

Tokenize positive reviews

In [15]:
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
import collections

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Stop words are the most common words in a language. NLTK supports stop word removal, and the list of stop words lives in its corpus module. To remove stop words from a sentence, split the text into words and drop every word that exists in the stop word list provided by NLTK. NLTK ships stop word lists for a number of languages; the corpus has to be downloaded once before it can be used.

The tok-tok tokenizer is a simple, general tokenizer, where the input has one sentence per line; thus only the final period is tokenized.
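
Note that the stop word corpus has to be downloaded once before the stop word filtering below will work, otherwise stopwords.words() raises a LookupError. A minimal sketch:

import nltk
from nltk.corpus import stopwords

# one-time download of the stop word lists used below
nltk.download('stopwords')

# peek at a few of the English stop words
print(stopwords.words('english')[:10])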

In [16]:
#create the instance of ToktokTokenizer()
tokenizer = ToktokTokenizer()

#tokenize positive reviews
positive_tokens = tokenizer.tokenize(positive_reviews)

#print ten words
print(positive_tokens[:10])
['mattress', 'very', 'comfortable', 'the', 'bed', 'was', 'soooo', 'comfy', 'the', 'bed']

Remove common words

In [17]:
#build the stop word set once and drop those words from the tokens
stop_words = set(stopwords.words())
positive_tokens = [word for word in positive_tokens if word not in stop_words]
positive_tokens[:20]
Out[17]:
['mattress',
 'comfortable',
 'bed',
 'soooo',
 'comfy',
 'bed',
 'comfortable',
 'spacious',
 'rooms',
 'quiet',
 'comfortable',
 'air',
 'conditioning',
 'working',
 'fine',
 'room',
 'cable',
 'tv',
 'safe',
 'iron']

Get word frequencies

In [18]:
word_counts = collections.Counter(positive_tokens)
print(word_counts)
Counter({'room': 19, 'comfortable': 10, 'bed': 9, 'clean': 9, 'spacious': 6, 'rooms': 5, 'facilities': 5, 'great': 5, 'coffee': 4, 'especially': 4, 'nice': 4, 'shower': 4, 'good': 4, 'comfy': 3, 'quiet': 3, 'hot': 3, 'night': 3, 'bathroom': 3, 'large': 3, 'awesome': 3, 'mattress': 2, 'air': 2, 'tv': 2, 'free': 2, 'area': 2, 'window': 2, 'excellent': 2, 'pool': 2, 'best': 2, 'bathtub': 2, 'towels': 2, 'every': 2, 'day': 2, 'super': 2, 'gym': 2, 'floor': 2, 'noise': 2, 'around': 2, 'rooftop': 2, 'well': 2, 'much': 2, 'small': 2, 'flat': 2, 'screen': 2, 'still': 2, 'loved': 2, 'wanted': 2, 'soooo': 1, 'conditioning': 1, 'working': 1, 'fine': 1, 'cable': 1, 'safe': 1, 'iron': 1, 'hairdryer': 1, 'tea': 1, 'downstairs': 1, 'building': 1, 'renovation': 1, 'hip': 1, 'selection': 1, 'pillows': 1, 'water': 1, 'abundant': 1, 'unit': 1, 'ice': 1, 'cold': 1, 'access': 1, 'fitness': 1, 'real': 1, 'value': 1, 'added': 1, 'memory': 1, 'foam': 1, 'agreed': 1, 'sleep': 1, 'ever': 1, 'would': 1, 'lovely': 1, 'expected': 1, 'cleaned': 1, 'location': 1, 'amazing': 1, 'aircon': 1, 'earplugs': 1, 'slept': 1, 'soundly': 1, 'cosy': 1, 'temperature': 1, 'control': 1, 'essential': 1, 'allow': 1, 'workout': 1, 'everything': 1, 'mattresses': 1, 'linens': 1, 'elevators': 1, 'fast': 1, 'never': 1, 'wait': 1, 'long': 1, '15th': 1, 'city': 1, 'bothering': 1, 'whole': 1, 'walking': 1, 'sleeping': 1, 'execellent': 1, 'things': 1, 'need': 1, 'terrace': 1, 'king': 1, 'plenty': 1, 'spread': 1, 'unpack': 1, 'extended': 1, 'stay': 1, 'character': 1, 'lighting': 1, 'beds': 1, 'soundproofing': 1, 'windows': 1, 'showerroom': 1, 'wifi': 1, 'maker': 1, '5': 1, 'star': 1, 'qualiity': 1, 'linen': 1, 'potent': 1, 'pleasant': 1, 'febreeze': 1, 'scent': 1, 'televisions': 1, 'old': 1, '1990': 1, 'tube': 1, 'amenity': 1, 'usually': 1, 'book': 1, 'double': 1, 'thus': 1, 'far': 1, 'comes': 1, 'bit': 1, 'single': 1, 'smaller': 1, 'yet': 1, 'family': 1, '4': 1, 'adults': 1, 'immaculate': 1, 'smelled': 1, 'fresh': 1, 'place': 1, 'size': 1, 'big': 1, 'enough': 1, 'friend': 1, 'huge': 1, 'suitcases': 1, 'fit': 1, 'us': 1, 'move': 1, 'spotless': 1, 'classy': 1, 'interior': 1, 'traffic': 1, 'easy': 1, 'reach': 1, 'decent': 1, 'elegant': 1, 'art': 1, 'deco': 1, 'ambience': 1, 'shop': 1, 'underneath': 1, 'closed': 1, '12am': 1, 'get': 1, 'late': 1, 'jets': 1, 'actually': 1, 'enjoyed': 1, 'knew': 1, 'ahead': 1, 'time': 1, 'try': 1, 'something': 1, 'really': 1, 'different': 1, 'counts': 1, 'tub': 1, 'view': 1, 'intense': 1, 'minimalist': 1, 'cool': 1, 'design': 1, 'nyc': 1, 'main': 1, 'lobby': 1, 'manhattan': 1, 'sized': 1, 'ample': 1, 'closet': 1, 'lifts': 1, 'worked': 1})

Visualize positive reviews

Plot twenty words

In [19]:
import matplotlib.pyplot as plt
In [20]:
plt.figure(figsize=(18, 10))
plt.bar(list(word_counts.keys())[:20], list(word_counts.values())[:20])
plt.show()

Plot word frequency distribution

NLTK has a FreqDist class which gives you the frequency of words within a text. FreqDist runs on an iterable of tokens (or, as here, a mapping of token counts), not on a raw string, so the input must be tokenized first. The plot() method can be called to draw the frequency distribution as a graph for the most common tokens in the text.

In [21]:
import nltk
plt.figure(figsize=(18, 10))
fd = nltk.FreqDist(word_counts)
fd.plot(30, title='Frequency distribution for top 30 words')
plt.show()
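
If you only need the numbers rather than a plot, FreqDist (like collections.Counter) also exposes most_common(); for example:

# the ten most frequent positive-review tokens as (word, count) pairs
print(fd.most_common(10))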

Create word cloud for positive reviews

In [22]:
from wordcloud import WordCloud, STOPWORDS
import numpy as np
from PIL import Image

A word cloud is a visual representation of text data, typically used to depict keyword metadata (tags) on websites, or to visualize free form text. Tags are usually single words, and the importance of each tag is shown with font size or color.

In [23]:
pic_mask = np.array(Image.open("input/rooms.jpg"))

wordcloud = WordCloud(stopwords=STOPWORDS,
                  background_color='white',
                  width=1600, height=800,
                  mask=pic_mask
                 ).generate(positive_reviews)
plt.figure( figsize=(18,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Clean, tokenize and get frequencies of negative reviews

Separate negative reviews from dataframe

In [24]:
negative_reviews = df.loc[df['Sentiment'] == 'negative']
negative_reviews[:5]
Out[24]:
Text Sentiment
0 The rooms are extremely small, practically onl... negative
1 Room safe did not work. negative
3 Very uncomfortable, thin mattress, with plasti... negative
4 No bathroom in room negative
6 someone must have been smoking in the room nex... negative

Clean negative reviews

In [25]:
negative_reviews = [re.sub(r"[^\w ]", " ", x.lower()) for x in negative_reviews['Text']]
negative_reviews = ' '.join(negative_reviews)
In [26]:
negative_reviews[:500]
Out[26]:
'the rooms are extremely small  practically only a bed  room safe did not work  very uncomfortable  thin mattress  with plastic cover that rustles every time you move  no bathroom in room someone must have been smoking in the room next door  for 3 people in a bedroom the sofa bed is a bit unconfortable  lights in the common room were too dim  so if you re the type that likes to let water run a bit before getting wet or it takes a minute to figure out how to make it hot  you re gonna get wet  the '

Tokenize negative reviews

In [27]:
tokenizer = ToktokTokenizer()
negative_tokens = tokenizer.tokenize(negative_reviews)
print(negative_tokens[:10])
['the', 'rooms', 'are', 'extremely', 'small', 'practically', 'only', 'a', 'bed', 'room']

Remove common words

In [28]:
negative_tokens = [word for word in negative_tokens if word not in stop_words]
negative_tokens[:20]
Out[28]:
['rooms',
 'extremely',
 'small',
 'practically',
 'bed',
 'room',
 'safe',
 'work',
 'uncomfortable',
 'thin',
 'mattress',
 'plastic',
 'cover',
 'rustles',
 'every',
 'time',
 'move',
 'bathroom',
 'room',
 'someone']

Get word frequencies

In [29]:
word_counts = collections.Counter(negative_tokens)
print(word_counts)
Counter({'room': 37, 'bed': 15, 'shower': 13, 'rooms': 12, 'small': 12, 'bathroom': 10, 'could': 8, 'noisy': 8, 'cold': 7, 'water': 6, 'terrible': 6, 'clean': 6, 'beds': 6, 'would': 6, 'bit': 5, 'windows': 5, 'walls': 5, 'air': 5, 'loud': 5, 'use': 5, 'poor': 5, 'extremely': 4, 'work': 4, 'people': 4, 'lights': 4, 'outside': 4, 'cleaned': 4, 'extra': 4, 'glass': 4, '2': 4, 'floor': 4, 'way': 4, 'uncomfortable': 3, '3': 3, 'although': 3, 'whole': 3, 'sound': 3, 'tv': 3, 'open': 3, 'little': 3, 'night': 3, 'pressure': 3, 'elevator': 3, 'working': 3, 'bath': 3, 'toilet': 3, 'heating': 3, 'many': 3, 'window': 3, 'dirty': 3, 'elevators': 3, 'long': 3, 'coffee': 3, 'left': 3, 'sheets': 3, 'queen': 3, 'conditioning': 3, 'front': 3, 'safe': 2, 'plastic': 2, 'cover': 2, 'every': 2, 'time': 2, 'getting': 2, 'wet': 2, 'make': 2, 'heat': 2, 'filthy': 2, 'building': 2, 'seem': 2, 'wear': 2, 'needed': 2, 'times': 2, 'really': 2, 'pay': 2, 'feel': 2, 'construction': 2, 'barely': 2, 'inviting': 2, 'stayed': 2, 'shared': 2, 'place': 2, 'amenities': 2, 'minutes': 2, 'slamming': 2, 'guests': 2, 'ac': 2, 'outlets': 2, 'slow': 2, 'garbage': 2, 'great': 2, 'tea': 2, 'turned': 2, 'frosted': 2, 'mind': 2, 'especially': 2, 'wifi': 2, '5': 2, 'area': 2, 'space': 2, 'luggage': 2, 'proof': 2, 'size': 2, '40': 2, 'hard': 2, 'mouse': 2, 'days': 2, 'lock': 2, 'pretty': 2, 'requested': 2, 'towels': 2, 'bathrooms': 2, 'two': 2, 'quality': 2, 'leave': 2, 'practically': 1, 'thin': 1, 'mattress': 1, 'rustles': 1, 'move': 1, 'someone': 1, 'must': 1, 'smoking': 1, 'next': 1, 'bedroom': 1, 'sofa': 1, 'unconfortable': 1, 'common': 1, 'dim': 1, 'type': 1, 'likes': 1, 'let': 1, 'run': 1, 'takes': 1, 'minute': 1, 'figure': 1, 'hot': 1, 'gonna': 1, 'get': 1, 'single': 1, 'glazed': 1, 'escape': 1, 'fair': 1, '6': 1, 'cubbyholes': 1, 'marketed': 1, 'corridors': 1, 'electrical': 1, 'cables': 1, 'smelly': 1, 'repulsive': 1, 'insulation': 1, 'gym': 1, 'basic': 1, 'mattresses': 1, 'springy': 1, 'light': 1, 'comfy': 1, 'unbeatable': 1, 'shows': 1, 'tear': 1, 'thinks': 1, 'didnt': 1, 'well': 1, 'microwave': 1, 'made': 1, 'fluctuated': 1, 'felt': 1, 'draft': 1, 'last': 1, 'fan': 1, 'strong': 1, 'suitcase': 1, 'might': 1, 'challenge': 1, 'tiny': 1, 'walk': 1, '4': 1, 'stories': 1, 'hallways': 1, 'stale': 1, 'across': 1, 'road': 1, 'throughout': 1, 'day': 1, 'bathtub': 1, 'anything': 1, 'else': 1, 'sleep': 1, 'lift': 1, 'pain': 1, 'card': 1, 'access': 1, 'used': 1, 'counter': 1, 'think': 1, 'restaurant': 1, 'looked': 1, 'compared': 1, 'rest': 1, 'highly': 1, 'engineer': 1, 'fixed': 1, 'paper': 1, 'replaced': 1, 'everyday': 1, 'greatest': 1, 'futon': 1, 'sleeper': 1, 'couch': 1, 'cramped': 1, 'cap': 1, 'sanitary': 1, 'bags': 1, 'available': 1, 'freezing': 1, 'seating': 1, 'lovely': 1, 'unbelievably': 1, 'smell': 1, 'restrooms': 1, 'support': 1, 'stiff': 1, '15': 1, 'mostly': 1, 'quiet': 1, 'apart': 1, 'top': 1, 'covered': 1, 'blind': 1, 'detest': 1, 'tub': 1, 'degrees': 1, 'warmer': 1, 'winter': 1, 'rug': 1, 'shoes': 1, 'plug': 1, 'surge': 1, 'protectors': 1, 'give': 1, 'idea': 1, 'shutters': 1, 'go': 1, 'neither': 1, 'hanging': 1, 'side': 1, 'phone': 1, 'lines': 1, 'kitchen': 1, 'isntant': 1, 'packages': 1, 'previous': 1, 'customers': 1, 'roof': 1, 'terrace': 1, 'making': 1, 'facilities': 1, 'bright': 1, 'shining': 1, 'thru': 1, 'panels': 1, 'stains': 1, 'cockroaches': 1, 'functioned': 1, 'picture': 1, 'channel': 1, 'selection': 1, 'quite': 1, 'price': 1, 'travel': 1, 'cot': 1, 'connected': 1, 'years': 1, 'ever': 1, 'showers': 1, 
'renovations': 1, 'came': 1, 'persian': 1, 'decor': 1, 'half': 1, 'hearted': 1, 'lengths': 1, 'turkish': 1, 'bunting': 1, 'lanterns': 1, 'roofed': 1, 'bar': 1, '30': 1, 'ran': 1, 'jiggled': 1, 'handle': 1, 'comfortable': 1, 'wardrobe': 1, 'towel': 1, 'change': 1, 'thus': 1, 'full': 1, 'cm': 1, 'smaller': 1, 'spartan': 1, 'mean': 1, 'oher': 1, 'hrs': 1, 'wallpaper': 1, 'peeking': 1, 'places': 1, 'insufficient': 1, 'reach': 1, 'plugs': 1, 'charge': 1, 'devices': 1, 'cleaning': 1, 'service': 1, 'doors': 1, 'anymore': 1, 'main': 1, 'avenue': 1, 'constructions': 1, 'called': 1, 'received': 1, 'shipment': 1, 'liked': 1, 'style': 1, 'hvac': 1, 'vent': 1, 'unclean': 1, 'negative': 1, 'old': 1, 'dust': 1, 'tissue': 1, 'condition': 1, 'hold': 1, 'told': 1, 'double': 1, 'whomever': 1, 'prior': 1, 'improvement': 1, 'fitness': 1, 'center': 1, '7': 1, 'channels': 1, 'four': 1, 'werent': 1, 'english': 1, 'soulless': 1, 'view': 1, 'kettle': 1, 'maker': 1, 'boil': 1, 'tricky': 1, 'rinsed': 1, 'entrance': 1, 'curb': 1, 'appeal': 1, 'retro': 1, 'hand': 1, 'narrow': 1, 'curtain': 1, '650': 1, 'dollar': 1, 'nasty': 1, 'private': 1, 'adjust': 1, 'temperature': 1, 'works': 1, 'hours': 1, 'morning': 1, 'noise': 1, 'pry': 1, 'tidy': 1, 'blanket': 1, 'heavy': 1, 'sink': 1, 'tap': 1, 'leaked': 1, 'aircon': 1, 'properly': 1, 'like': 1, 'broken': 1, 'tiles': 1, 'still': 1, 'hotter': 1, 'usually': 1, 'rough': 1, 'dimly': 1, 'lit': 1, 'holes': 1, 'advertised': 1, 'booking': 1, 'descriptions': 1, 'bedding': 1, 'changed': 1, 'drapes': 1, 'fit': 1, 'always': 1, 'convenient': 1, 'exercise': 1, 'closest': 1, 'filled': 1, 'big': 1, 'screen': 1, 'tvs': 1, 'covering': 1, 'tight': 1, 'either': 1, 'yr': 1, 'hallway': 1, 'sounds': 1, 'carpets': 1, 'coming': 1, 'unstitched': 1, 'bigger': 1, 'nice': 1, 'even': 1, 'stay': 1, 'ordered': 1, 'prepaid': 1, 'painted': 1, 'sep': 1, '8': 1, '302': 1, 'good': 1, 'walking': 1, 'waited': 1, 'tend': 1, 'crumbs': 1, 'snacks': 1, 'hair': 1, 'got': 1, '1': 1, 'twins': 1, 'advised': 1, 'put': 1, 'together': 1, 'issue': 1, 'far': 1, 'sleeping': 1, 'suite': 1, '4th': 1})

Visualize negative reviews

Plot twenty words

In [30]:
plt.figure(figsize=(18, 10))
plt.bar(list(word_counts.keys())[:20], list(word_counts.values())[:20])
plt.show()

Plot word frequency distribution

In [31]:
plt.figure(figsize=(18, 10))
fd = nltk.FreqDist(word_counts)
fd.plot(30, title='Frequency distribution for top 30 words')
plt.show()

Create word cloud for negative reviews

In [32]:
pic_mask = np.array(Image.open("input/rooms.jpg"))

negative_wordcloud = WordCloud(stopwords=STOPWORDS,
                  background_color='white',
                  width=1600, height=800,
                  mask=pic_mask
                 ).generate(negative_reviews)
plt.figure( figsize=(18,8))
plt.imshow(negative_wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Convert Text into numeric matrix

In [33]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
In [34]:
print(df['Text'].values)
['The rooms are extremely small, practically only a bed.'
 'Room safe did not work.' 'Mattress very comfortable.'
 'Very uncomfortable, thin mattress, with plastic cover that rustles every time you move.'
 'No bathroom in room' 'The bed was soooo comfy.'
 'someone must have been smoking in the room next door.'
 'The bed is very comfortable.'
 'Very spacious rooms, quiet and very comfortable.'
 'For 3 people in a bedroom the sofa bed is a bit unconfortable.'
 'Lights in the common room were too dim.'
 'Air conditioning working fine.'
 "So if you're the type that likes to let water run a bit before getting wet or it takes a minute to figure out how to make it hot, you're gonna get wet."
 'the windows are only single glazed so the heat could escape- although to be fair it was -6 outside!'
 'Terrible, small cubbyholes, which are marketed as rooms.'
 'Corridors filthy\r\nRoom filthy\r\nElectrical cables in room not safe\r\nWhole building smelly\r\nShower repulsive'
 'walls seem to have no sound insulation'
 'The gym was very small and basic'
 'The mattresses were too "springy" and uncomfortable.'
 'very light and the comfy of the bed was unbeatable.'
 'Shows some wear and tear.'
 'Some thinks didnt work well : air, tv , open windows,'
 'Microwave needed!' "Room wasn't cleaned or bed made up"
 'The room had cable TV, a safe, an iron, a hairdryer and free coffee & tea in the downstairs area.'
 'The heat in the room fluctuated -- at times it felt a little draft, and on my last night, the fan was really loud.'
 '- The room was cold.' 'The building was under renovation,'
 'The water pressure was not that strong in the shower.'
 'Room too small to open suitcase.'
 'no elevator might be a challenge for some people'
 'Pay extra.. :( Room was tiny' 'and hip and CLEAN!'
 'Comfortable bed and selection of pillows.' 'The TV was not working.'
 'The walk up 4 stories' "Hallways to room stale and didn't feel clean."
 'there was a building under construction across the road which was extremely noisy throughout the day.'
 'Bathtub not clean.'
 'Hot water was abundant and the A/C (window unit) was ice cold - Excellent!'
 'Access to fitness facilities, especially the pool was a real value added.'
 'Barely room for anything else than sleep.'
 'Lift was a bit of a pain with card access.'
 "Could have used a little more counter room in bath and we didn't think the restaurant looked very inviting compared to the rest of the hotel."
 'The bed was highly uncomfortable, although the engineer fixed it'
 "The memory foam mattress - we all agreed the best night's sleep in a hotel EVER..."
 "Toilet paper wasn't replaced everyday!"
 'The extra beds are not the greatest (a futon and a sleeper couch)'
 'Room was extremely cramped.'
 'Shower cap n sanitary bags not available in the room.'
 'it was freezing cold in my room for the night I stayed.'
 'Outside seating was lovely'
 'The heating was unbelievably noisy during the night.' 'bed, smell.'
 'Bathroom area large would have been nice with a bathtub.'
 'Shared restrooms do not support many rooms and/or many people.'
 'Stiff feel to the place- no amenities.'
 'Lovely room, bed was so comfortable!'
 'Every 15 minutes door were slamming & lights on.'
 'After that time, mostly quiet apart from some door slamming by other guests.'
 'The top of the window was then covered by the dirty blind.'
 'Detest the glass "door" if shower/tub .. with?'
 'this was expected, clean towels and room cleaned every day.'
 'Barely a few degrees warmer than outside during winter, so I had to rug up and wear shoes to use the toilet.'
 'ac was terrible' 'More plug outlets with surge protectors.'
 'Just to give you an idea: the shutters of the windows were not working, did not go neither up or down - just hanging down only one side and the other up....'
 'The phone in my room was not working.'
 'Elevators are slow, very long lines.' 'Location anda facilities'
 'There was also  garbage in our kitchen - there were isntant coffee packages left from previous customers.'
 'Amazing facilities.' 'Only one elevator ,' 'Room was very spacious'
 'The bathroom is dirty.' 'Roof terrace great'
 'No tea or coffee making facilities in the rooms'
 'the room had aircon and we had earplugs and slept soundly.'
 'Also, when the bright bathroom lights are turned on, it lights up the whole hotel room, shining thru the frosted glass panels.'
 'The Bed was SUPER COMFY !' 'Bathroom was extra small,'
 'Rooms are cosy with great temperature control.'
 'Facilities were very clean.'
 "only if you don't mind stains in your bed sheets or cockroaches in your (shared) bathroom."
 'The bathroom functioned o.k.' 'Poor TV picture and channel selection.'
 'No elevator.' 'essential gym to allow for some workout.'
 'Rooms quite small for price, especially if you use a travel cot.'
 'room was not at all inviting.' 'Wifi connected'
 'Everything was very clean.'
 "Windows haven't been cleaned for years (if ever)."
 'Showers could do with some renovations - they seem clean,'
 'Only cold water came out.'
 "the Persian decor is so half-hearted it - I'd take down the 5 lengths of Turkish bunting lanterns above the glass roofed bar or cover the whole area with them - 30 more (!)"
 'The toilet ran until you jiggled the handle.'
 'The bed was terrible and not comfortable at all'
 'Mattresses and linens were all great.'
 'No wardrobe, no space for luggage, no towel change, walls are not sound proof thus very noisy.'
 'The elevators were fast - never we had to wait long for a one.On 15th floor the city noise was not bothering at all and after the whole day walking around we were sleeping execellent.'
 'Has most things you need.'
 'Bed was not a full size 2,40 cm long bed, they were smaller.'
 'very spartan.' 'Room extremely small.'
 'No walls mean oher guests being loud at all hrs.'
 'Very nice rooftop terrace, and gym.'
 'Wallpaper is peeking off in places.'
 'Insufficient and hard to reach plugs to charge devices.'
 "Mouse in the room,\r\nNo Wifi,\r\nNo cleaning service in 3 days,\r\nNo elevators,\r\nDoors to outside of hotel didn't lock anymore,\r\nOn main avenue, with constructions, one of our room was pretty loud."
 'No bath, only a shower, and I requested 2 queen size beds'
 "I called and they just hadn't received the shipment of clean towels."
 'The room was clean.'
 'I don\'t mind the small space,\r\njust would have liked a bit more "style."'
 'King Room was spacious with plenty of room to spread out and unpack, especially awesome for an extended stay.'
 'Hotel was very clean and had character.' 'on HVAC vent.'
 'We just had to place our luggage on the floor.' 'Unclean bathrooms.'
 'Only negative was the old and very loud air conditioning in the rooms.'
 'there was dust and tissue under the bed.'
 'The bed was very comfy & the room spacious.' 'No air conditioning.'
 "Bath and shower were in poor condition and didn''t hold the water."
 'I was told I would have two double beds.'
 'Bathroom and shower were very good!, lighting as well.'
 'Rooms too small.'
 'Our room had left over garbage from whomever stayed there prior.'
 'The floor was not very clean.' 'Very comfortable beds.'
 'The only area of improvement would be the fitness center.'
 'The pool is the best!'
 'good soundproofing to windows), great showerroom with excellent hot shower, good air con, free wifi and coffee maker.'
 '5 star qualiity towels and linen.'
 'we only had 7 channels which four of them werent in english.'
 "It's much more potent and pleasant than\r\nany Febreeze scent.\r\n"
 "Some have small televisions, flat screen, or the old 1990's tube, as well as, some of the rooms do not have that amenity.I usually book a double room, which thus far comes with a window and is a bit more spacious.The single rooms are much smaller, yet comfortable."
 'Room was large for my family of 4 adults.'
 'My room was pretty soulless, although it did have a great view.'
 'The room itself was very clean.'
 'My room was immaculate and smelled so fresh and clean.' 'quiet place.'
 "The room was a nice size, big enough for both my and my friend's huge suitcases to fit on the floor and for us to still move around."
 'The hotel was very clean.'
 'No kettle, had to use a coffee maker to boil water for tea which was tricky as it had to be rinsed through many times.'
 'Rooms were spotless and classy.'
 'The front entrance could use some curb appeal.'
 'Bathroom was way too retro.......shower was very poor... no hand shower, poor water pressure and way too narrow.......It also had a plastic curtain which for a $650 dollar hotel room is way too nasty.'
 'extra small rooms.'
 'Poor beds quality, very very noisy : would have been the same if we would have no windows.'
 'The bathrooms have frosted glass walls,not private at all.'
 "Rooms are cold and you can't adjust the temperature and also heating works only a few hours in the morning."
 'very loud construction noise out front of our window.'
 'The room was interior so no traffic noise.' 'Room was noisy and cold.'
 'The air-con was too noisy to leave on for too long.'
 'still easy to reach.'
 "The front door didn't work so we had to pry it open."
 'clean and tidy.\r\n'
 'The bed is really terrible,especially the blanket is heavy and terrible terrible quality.'
 "Bathroom sink tap leaked and aircon didn't work properly"
 "We didn't like the mouse in the room." 'Decent room'
 'Loved the elegant art deco ambience.' 'Broken tiles in shower.'
 'In the shower it was turned down as cold as it could be and it was still hotter than I would usually have it.'
 'Bed was hard and sheets were rough.'
 'I loved that there was a coffee shop\r\nunderneath, and it closed at 12am every night, so that I can get my late\r\nnight coffee, if I wanted.'
 'Dimly lit.' 'The towels had holes in them.'
 'The hotel  advertised AC as one of its amenities in its booking.com descriptions!'
 'The bed was super comfortable and the bathtub with jets was awesome too.'
 'We actually enjoyed how small our room was because we knew about it ahead of time and wanted to try something really different.'
 'Bedding was changed once,'
 'The same counts for the bathroom, especially the shower was great.'
 'The hotel has a rooftop with a hot tub which has an awesome view.'
 'It was quiet.' 'Rooms are small,'
 'The drapes could have fit the window.' 'It was very not comfortable.'
 'Facilities were great.'
 'Just one bed so in your room you always have to sit on the bed.'
 'So there could have been more convenient outlets.'
 'The exercise room closest to my floor was filled with big screen TVs and sheets covering them.'
 'a bit tight to make people either pay or leave yr room.'
 'Shower was intense.' 'The bed was very comfortable, and room minimalist'
 "Rooms aren't very sound proof from hallway sounds." 'Cool design.'
 'Our room was very spacious, especially for NYC.'
 'The main lobby is nice' 'Carpets are coming unstitched.'
 'It was bigger so that was nice!'
 'Room was not cleaned even once during our stay.'
 'Room not what we ordered or prepaid for.'
 'The room needed to be painted,'
 'Sep 5-8 room 302\r\nRoom was too small, the shower pressure was good.'
 'Very large for Manhattan with good sized bed, flat screen TV & ample closet.'
 '-There was no heating or air conditioning for the bathroom, which left you very cold when walking in or getting out of the shower.'
 'Waited over 40 minutes to use shower.' 'The lifts worked!'
 'Elevators tend to be slow...'
 'Crumbs from snacks were not cleaned up for 3 days!!'
 'the bathroom was a little dirty with hair at the walls'
 'We had requested two queen beds and got a room with 1 queen and 2 twins, we were advised that there were not any other rooms and could put the (2) beds together, which we did and it was no issue as far as sleeping.'
 'Noisy noisy suite on 4th floor.' 'no way to lock the door.']

The Tokenizer class allows you to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, based on tf-idf...

Arguments:
    num_words: the maximum number of words to keep, based on word frequency. Only the most common `num_words-1` words will
           be kept.
    filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the `'` character.
    lower: boolean. Whether to convert the texts to lowercase.
    split: str. Separator for word splitting.
    char_level: if True, every character will be treated as a token.
    oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.
By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens. They will then be indexed or vectorized.

fit_on_texts(texts): Updates internal vocabulary based on a list of texts. In the case where texts contains lists, we assume each entry of the lists to be a token.
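
Before fitting on the real reviews, here is a minimal sketch on two made-up sentences showing what fit_on_texts builds (the word_index dictionary) and what texts_to_sequences returns:

from tensorflow.keras.preprocessing.text import Tokenizer

toy_texts = ['The room was clean', 'The bed was comfortable']
toy_tokenizer = Tokenizer(num_words=50)
toy_tokenizer.fit_on_texts(toy_texts)

# each word gets an integer index, more frequent words first
print(toy_tokenizer.word_index)
# {'the': 1, 'was': 2, 'room': 3, 'clean': 4, 'bed': 5, 'comfortable': 6}

# each text becomes the sequence of those indices
print(toy_tokenizer.texts_to_sequences(toy_texts))
# [[1, 3, 2, 4], [1, 5, 2, 6]]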

In [35]:
max_features = 2000
tokenizer = Tokenizer(num_words = max_features, split = ' ')
tokenizer.fit_on_texts(df['Text'].values)

texts_to_sequences(texts): Transforms each text in texts to a sequence of integers.

In [36]:
X = tokenizer.texts_to_sequences(df['Text'].values)
X = pad_sequences(X)
In [37]:
X
Out[37]:
array([[  0,   0,   0, ...,  29,   5,   9],
       [  0,   0,   0, ...,  84,  10,  85],
       [  0,   0,   0, ..., 108,   7,  30],
       ...,
       [  0,   0,   0, ..., 239,  34, 215],
       [  0,   0,   0, ...,  33, 671,  69],
       [  0,   0,   0, ..., 223,   1,  59]], dtype=int32)
In [38]:
X.ndim
Out[38]:
2
In [39]:
X.shape
Out[39]:
(207, 50)
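
pad_sequences makes every sequence the same length by padding the shorter ones with zeros at the front (the default padding='pre'); since the longest tokenized review here has 50 tokens, X ends up as a (207, 50) matrix. A minimal sketch with made-up sequences:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# three sequences of different lengths, padded with zeros to length 4
print(pad_sequences([[5, 2], [7], [3, 9, 4]], maxlen=4))
# [[0 0 5 2]
#  [0 0 0 7]
#  [0 3 9 4]]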

Convert sentiment column to numeric

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None): Convert categorical variable into dummy/indicator variables.
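
The dummy columns come out in sorted order, so here column 0 is 'negative' and column 1 is 'positive'; the prediction decoding at the end of this document relies on exactly that ordering. A quick illustration on a tiny made-up series:

import pandas as pd

labels = pd.Series(['negative', 'positive', 'negative'])
# columns appear in sorted order: 'negative' first, then 'positive'
print(pd.get_dummies(labels).columns.tolist())
# ['negative', 'positive']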

In [40]:
y = pd.get_dummies(df['Sentiment']).values
y[:10]
Out[40]:
array([[1, 0],
       [1, 0],
       [0, 1],
       [1, 0],
       [1, 0],
       [0, 1],
       [1, 0],
       [0, 1],
       [0, 1],
       [1, 0]], dtype=uint8)

Split data into train and test

In [41]:
from sklearn.model_selection import train_test_split
In [42]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 50)
In [43]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(144, 50)
(144, 2)
(63, 50)
(63, 2)

Create model

In [44]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Dropout

Keras offers an Embedding layer that can be used for neural networks on text data. It requires that the input be integer encoded, so that each word is represented by a unique integer.

When we create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on).


We must specify three arguments:

    input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.
    output_dim: Integer. Dimension of the dense embedding.
    input_length: Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed).
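
As a minimal sketch (with made-up numbers), the layer maps each integer index in a padded batch to a dense vector, so a (batch, 50) input becomes a (batch, 50, output_dim) output:

import numpy as np
from tensorflow.keras.layers import Embedding

# vocabulary of 2000 indices, each mapped to an 8-dimensional vector
emb = Embedding(input_dim=2000, output_dim=8, input_length=50)

# a fake batch of 3 padded sequences of length 50
fake_batch = np.random.randint(0, 2000, size=(3, 50))
print(emb(fake_batch).shape)  # (3, 50, 8)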

In [45]:
model = Sequential()
In [46]:
output_dim = 128
model.add(Embedding(max_features, output_dim, input_length = X.shape[1]))
model.add(Dropout(0.4))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))

Compile model and print summary

In [47]:
model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 50, 128)           256000    
_________________________________________________________________
dropout (Dropout)            (None, 50, 128)           0         
_________________________________________________________________
lstm (LSTM)                  (None, 64)                49408     
_________________________________________________________________
dense (Dense)                (None, 2)                 130       
=================================================================
Total params: 305,538
Trainable params: 305,538
Non-trainable params: 0
_________________________________________________________________
None

Train model

Setting verbose to 0, 1 or 2 controls how the training progress is displayed for each epoch.

verbose=0 will show you nothing (silent)
verbose=1 will show you an animated progress bar
verbose=2 will just print one line per epoch, like this:
    Epoch 1/10
    - 0s - loss: 0.6853 - accuracy: 0.5764
In [48]:
model.fit(X_train, y_train, epochs = 20, batch_size = 25, verbose = 2)
Epoch 1/20
6/6 - 0s - loss: 0.6792 - accuracy: 0.6389
Epoch 2/20
6/6 - 0s - loss: 0.6436 - accuracy: 0.6875
Epoch 3/20
6/6 - 0s - loss: 0.6249 - accuracy: 0.6875
Epoch 4/20
6/6 - 0s - loss: 0.5971 - accuracy: 0.6875
Epoch 5/20
6/6 - 0s - loss: 0.5722 - accuracy: 0.6875
Epoch 6/20
6/6 - 0s - loss: 0.5389 - accuracy: 0.6875
Epoch 7/20
6/6 - 0s - loss: 0.4840 - accuracy: 0.6944
Epoch 8/20
6/6 - 0s - loss: 0.4148 - accuracy: 0.7639
Epoch 9/20
6/6 - 0s - loss: 0.3565 - accuracy: 0.8819
Epoch 10/20
6/6 - 0s - loss: 0.2729 - accuracy: 0.9444
Epoch 11/20
6/6 - 0s - loss: 0.2177 - accuracy: 0.9583
Epoch 12/20
6/6 - 0s - loss: 0.1620 - accuracy: 0.9722
Epoch 13/20
6/6 - 0s - loss: 0.1133 - accuracy: 0.9931
Epoch 14/20
6/6 - 0s - loss: 0.0802 - accuracy: 0.9792
Epoch 15/20
6/6 - 0s - loss: 0.0701 - accuracy: 0.9861
Epoch 16/20
6/6 - 0s - loss: 0.0522 - accuracy: 1.0000
Epoch 17/20
6/6 - 0s - loss: 0.0367 - accuracy: 1.0000
Epoch 18/20
6/6 - 0s - loss: 0.0263 - accuracy: 1.0000
Epoch 19/20
6/6 - 0s - loss: 0.0201 - accuracy: 1.0000
Epoch 20/20
6/6 - 0s - loss: 0.0108 - accuracy: 1.0000
Out[48]:
<tensorflow.python.keras.callbacks.History at 0x7f35e426f290>

The accuracy of the trained model on the training data

In [49]:
# evaluate the model
loss, accuracy = model.evaluate(X_train, y_train)
print('Loss: %f' % (loss*100))
print('Accuracy: %f' % (accuracy*100))
5/5 [==============================] - 0s 3ms/step - loss: 0.0082 - accuracy: 1.0000
Loss: 0.822126
Accuracy: 100.000000
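
The 100% above is measured on the same data the network was trained on, so it is an optimistic figure. Evaluating on the held-out test split created earlier gives a more honest estimate (the exact numbers will vary from run to run):

# evaluate the model on the unseen test split
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print('Test loss: %f' % test_loss)
print('Test accuracy: %f' % (test_accuracy * 100))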

Predict sentiment with new reviews

Create list of new reviews

In [50]:
reviews = ['Excellent service', 'Very uncomfortable thin mattress very bad', 'worst service']
reviews
Out[50]:
['Excellent service',
 'Very uncomfortable thin mattress very bad',
 'worst service']

Change text to numeric sequence

In [51]:
reviews = tokenizer.texts_to_sequences(reviews)
reviews = pad_sequences(reviews, maxlen=50, dtype='int32', value=0)
print(reviews)
[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0 176 481]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   7 109 251 108   7]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0 481]]

Predict

In [52]:
sentiments = model.predict(reviews, batch_size=10, verbose = 1)
sentiments
1/1 [==============================] - 0s 851us/step
Out[52]:
array([[0.37920734, 0.6207926 ],
       [0.17609848, 0.82390153],
       [0.5980496 , 0.4019504 ]], dtype=float32)

Print sentiment text

In [53]:
import numpy as np

Each prediction is a pair of probabilities; index 0 corresponds to "negative" and index 1 to "positive" (the column order produced by get_dummies above), so argmax picks the predicted class.

In [54]:
def printSentiment(sentiment):
    #index 0 = negative, index 1 = positive (matches the get_dummies column order)
    if np.argmax(sentiment) == 0:
        return "negative"
    return "positive"
In [55]:
[printSentiment(sentiment) for sentiment in sentiments]
Out[55]:
['positive', 'positive', 'negative']