Briefly Exploring the Reviews from the Yelp Dataset

Han Zhang

1. Business Understanding

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data, made available for personal, educational, and academic purposes.

The whole dataset includes 6 files: business.json, review.json, photos.json, checkin.json, user.json and tip.json. In this exploration I use only the first two files to extract the review text, the star ratings, the business categories and the corresponding business IDs. Later I will try to add more features, such as the number of "useful" votes a review received, which should make the analysis more interesting.

Based on these data, I am curious about which words people use when leaving a 5-star review for a restaurant, such as an Asian fusion restaurant. I would also like to find out, from the most frequently used words in 1-star reviews, which specific aspects people were not satisfied with.

Once I start the modeling work later, the two options I am considering are detecting whether a review is fake and predicting the likelihood of receiving a 5-star rating in the near future.

Reference

Yelp Challenge Dataset: https://www.yelp.com/dataset/challenge

2. Data Overview

2.1 yelp_academic_dataset_review.json

This file contains the full review text together with the user_id of the reviewer and the business_id of the business being reviewed.

We will use 'text', 'business_id' and 'stars' in this analysis.

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

2.2 yelp_academic_dataset_business.json

Contains business data including location data, attributes, and categories.

In this analysis, we will use the 'categories' and 'business_id'.

{
    // string, 22 character unique string business id
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // string, the business's name
    "name": "Garaje",

    // string, the neighborhood's name
    "neighborhood": "SoMa",

    // string, the full address of the business
    "address": "475 3rd St",

    // string, the city
    "city": "San Francisco",

    ......
    ......
    ......

    // an array of strings of business categories
    "categories": [ "Mexican", "Burgers", "Gastropubs" ],

    // an object of key day to value hours, hours are using a 24hr clock
    "hours": { "Monday": "10:00-21:00", "Tuesday": "10:00-21:00", "Friday": "10:00-21:00", "Wednesday": "10:00-21:00", "Thursday": "10:00-21:00", "Sunday": "11:00-18:00", "Saturday": "10:00-21:00" }
}

3. Data Encoding

3.1 Convert JSON to CSV

  • Extract the attributes 'business_id', 'stars' and 'text' from yelp_academic_dataset_review.json.
In [1]:
import csv
import json
import sys

# extract review text, star rating and business id
# open for writing
outfile_reviews = open("review_with_id.csv",'w',newline='')
sfile_reviews = csv.writer(outfile_reviews, delimiter =",", quoting=csv.QUOTE_MINIMAL)
sfile_reviews.writerow(['business_id','stars', 'text'])

with open('yelp_academic_dataset_review.json') as reviews:
    for line in reviews:
        row = json.loads(line)
        # the text is encoded as UTF-8 because of special characters; note that in
        # Python 3 this stores the bytes literal, so every text starts with b' or b"
        sfile_reviews.writerow([row['business_id'],row['stars'], (row['text']).encode('utf-8')])
outfile_reviews.close()
  • Extract the attributes 'business_id' and 'categories' from yelp_academic_dataset_business.json.
In [2]:
# extract business id and corresponding categories
# open for writing
outfile_categories = open("category_with_id.csv",'w',newline='')
sfile_categories = csv.writer(outfile_categories, delimiter =",", quoting=csv.QUOTE_MINIMAL)
sfile_categories.writerow(['business_id','categories'])
with open('yelp_academic_dataset_business.json') as businesses:
    for line in businesses:
        row = json.loads(line)
        # the category list is written as its Python string representation, e.g. "['Food', 'Coffee & Tea']"
        sfile_categories.writerow([row['business_id'],(row['categories'])])
outfile_categories.close()
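
In Python 3, passing row['text'].encode('utf-8') to csv.writer stores the printed form of a bytes object, which is why the review texts shown in the outputs start with b' or b". A minimal sketch of an alternative, assuming the goal is to keep the text as readable UTF-8: write the string unchanged and let the file handle do the encoding. The helper name export_reviews and the one-record sample are illustrative only and not part of the original notebook.

```python
import csv
import io
import json


def export_reviews(json_lines, out_handle):
    """Write business_id, stars and text from JSON lines to an open CSV handle."""
    writer = csv.writer(out_handle, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    writer.writerow(["business_id", "stars", "text"])
    for line in json_lines:
        row = json.loads(line)
        # Write the str as-is; the handle's encoding takes care of UTF-8.
        writer.writerow([row["business_id"], row["stars"], row["text"]])


# Hypothetical one-record sample standing in for yelp_academic_dataset_review.json.
sample = ['{"business_id": "abc", "stars": 5, "text": "Sehr sch\\u00f6n, gute Lage"}']
buffer = io.StringIO(newline="")
export_reviews(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, the handle would be opened as open("review_with_id.csv", "w", newline="", encoding="utf-8"). Note that the outputs recorded in the rest of this notebook were produced with the bytes-literal version of the file.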

Let's take a look at what the data looks like.

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

df_review_with_id = pd.read_csv('review_with_id.csv') 
df_review_with_id.head()
Out[3]:
   business_id             stars  text
0  uYHaNptLzDLoV_JZ_MuzUA  5      b'My girlfriend and I stayed here for 3 nights...
1  uYHaNptLzDLoV_JZ_MuzUA  3      b"If you need an inexpensive place to stay for...
2  uYHaNptLzDLoV_JZ_MuzUA  3      b"Mittlerweile gibt es in Edinburgh zwei Ableg...
3  uYHaNptLzDLoV_JZ_MuzUA  4      b"Location is everything and this hotel has it...
4  uYHaNptLzDLoV_JZ_MuzUA  5      b'gute lage im stadtzentrum. shoppingmeile und...
In [4]:
df_category_with_id = pd.read_csv('category_with_id.csv') 
df_category_with_id.head()
Out[4]:
   business_id             categories
0  YDf95gJZaq05wvo7hTQbbQ  ['Shopping', 'Shopping Centers']
1  mLwM-h2YhXl2NCgdS84_Bw  ['Food', 'Soul Food', 'Convenience Stores', 'R...
2  v2WhjAB3PIBA8J8VxG3wEg  ['Food', 'Coffee & Tea']
3  CVtCbSB1zUcUWg-9TNGTuQ  ['Professional Services', 'Matchmakers']
4  duHFBe87uNSXImQmvBh87Q  ['Sandwiches', 'Restaurants']
In [5]:
df_category_with_id.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156639 entries, 0 to 156638
Data columns (total 2 columns):
business_id    156639 non-null object
categories     156639 non-null object
dtypes: object(2)
memory usage: 2.4+ MB
In [6]:
df_review_with_id.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4736897 entries, 0 to 4736896
Data columns (total 3 columns):
business_id    object
stars          int64
text           object
dtypes: int64(1), object(2)
memory usage: 108.4+ MB

Great! That's what we want.
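
Both CSV columns shown above hold Python literals stored as plain text: the review text looks like b'...' and the categories look like ['Food', 'Coffee & Tea']. A small sketch, assuming we want real strings and lists back for later filtering (for example, selecting restaurants), could use ast.literal_eval. The helper names and the second half of the sample review are made up for illustration.

```python
import ast


def parse_categories(cell):
    """Turn the text "['Food', 'Coffee & Tea']" back into a list of strings."""
    if not cell:
        return []
    return list(ast.literal_eval(cell))


def decode_review(cell):
    """Turn the text "b'...'" back into a str; leave other text untouched."""
    if cell.startswith(("b'", 'b"')):
        return ast.literal_eval(cell).decode("utf-8")
    return cell


categories = parse_categories("['Food', 'Coffee & Tea']")
review = decode_review(r"b'gute lage im stadtzentrum.\nsehr sch\xc3\xb6n'")
print(categories)
print(review)
```

On the data frames, these helpers could be applied column-wise, for example with df_category_with_id['categories'].apply(parse_categories).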

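Now we need to merge the two CSV files on 'business_id'. Because the merge below is an outer join, a business without any review is kept as well, which explains why the merged frame has one more row than the review table and why 'stars' turns into a float column (the extra row has no star rating).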

In [7]:
df = pd.merge(df_review_with_id,df_category_with_id, on="business_id", how="outer")
In [8]:
df.head()
Out[8]:
   business_id             stars  text                                               categories
0  uYHaNptLzDLoV_JZ_MuzUA  5.0    b'My girlfriend and I stayed here for 3 nights...  ['Hotels', 'Hotels & Travel', 'Event Planning ...
1  uYHaNptLzDLoV_JZ_MuzUA  3.0    b"If you need an inexpensive place to stay for...  ['Hotels', 'Hotels & Travel', 'Event Planning ...
2  uYHaNptLzDLoV_JZ_MuzUA  3.0    b"Mittlerweile gibt es in Edinburgh zwei Ableg...  ['Hotels', 'Hotels & Travel', 'Event Planning ...
3  uYHaNptLzDLoV_JZ_MuzUA  4.0    b"Location is everything and this hotel has it...  ['Hotels', 'Hotels & Travel', 'Event Planning ...
4  uYHaNptLzDLoV_JZ_MuzUA  5.0    b'gute lage im stadtzentrum. shoppingmeile und...  ['Hotels', 'Hotels & Travel', 'Event Planning ...
In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4736898 entries, 0 to 4736897
Data columns (total 4 columns):
business_id    object
stars          float64
text           object
categories     object
dtypes: float64(1), object(3)
memory usage: 180.7+ MB
In [10]:
pd.isnull(df['stars'])
Out[10]:
0          False
1          False
2          False
3          False
4          False
5          False
6          False
7          False
8          False
9          False
10         False
11         False
12         False
13         False
14         False
15         False
16         False
17         False
18         False
19         False
20         False
21         False
22         False
23         False
24         False
25         False
26         False
27         False
28         False
29         False
           ...  
4736868    False
4736869    False
4736870    False
4736871    False
4736872    False
4736873    False
4736874    False
4736875    False
4736876    False
4736877    False
4736878    False
4736879    False
4736880    False
4736881    False
4736882    False
4736883    False
4736884    False
4736885    False
4736886    False
4736887    False
4736888    False
4736889    False
4736890    False
4736891    False
4736892    False
4736893    False
4736894    False
4736895    False
4736896    False
4736897     True
Name: stars, Length: 4736898, dtype: bool
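
Printing the whole boolean Series only reveals the missing value in its last row. A more compact check is to count the nulls, and the merge itself can report where each row came from. The sketch below uses tiny made-up frames (not the Yelp data) to show how an outer join adds a review-less business with stars = NaN, while an inner join would not.

```python
import pandas as pd

# Tiny made-up frames standing in for the review and category tables.
reviews = pd.DataFrame({"business_id": ["a", "a", "b"], "stars": [5, 3, 4]})
categories = pd.DataFrame({"business_id": ["a", "b", "c"],
                           "categories": ["['Hotels']", "['Food']", "['Shopping']"]})

# Outer join: business "c" has no review, so it still appears with stars = NaN.
# indicator=True adds a "_merge" column naming the source of each row.
outer = pd.merge(reviews, categories, on="business_id", how="outer", indicator=True)
missing_stars = int(outer["stars"].isnull().sum())

# Inner join: only businesses present in both tables; stars stays an integer column.
inner = pd.merge(reviews, categories, on="business_id", how="inner")

print(len(outer), missing_stars, outer["stars"].dtype)
print(len(inner), inner["stars"].dtype)
```

For this analysis an inner join (or dropping the rows with missing stars) may be the simpler choice, since a business without reviews contributes no text.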

There is a large amount of data. We will use only the first 499,999 rows (the slice below stops just before index 499999).

In [11]:
# copy the slice so that adding columns later does not act on a view of the full frame
df = df[0:499999].copy()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 499999 entries, 0 to 499998
Data columns (total 4 columns):
business_id    499999 non-null object
stars          499999 non-null float64
text           499999 non-null object
categories     499999 non-null object
dtypes: float64(1), object(3)
memory usage: 19.1+ MB
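
Python slices exclude their end position, so df[0:499999] yields 499,999 rows rather than 500,000. The sketch below, on a small made-up frame, contrasts such a slice with head and with a reproducible random sample. The random sample is only a suggestion, in case the first rows of the file turn out not to be representative of all businesses.

```python
import pandas as pd

# Small made-up frame standing in for the merged review table.
frame = pd.DataFrame({"stars": range(10)})

first_by_slice = frame[0:4]      # positions 0, 1, 2, 3: the end index is excluded
first_by_head = frame.head(5)    # exactly five rows
random_subset = frame.sample(n=5, random_state=0)  # reproducible random subset

print(len(first_by_slice), len(first_by_head), len(random_subset))
```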

3.2 Verify data quality

Remove the stop words so that only the words we care about remain.

In [12]:
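from nltk import word_tokenize
from nltk.corpus import stopwords
# import nltk; nltk.download('stopwords')  # run once if the stop word list is not installed
stop = set(stopwords.words('english'))
# drop stop words; the match is case-sensitive, so capitalized words such as "The" are kept
df['text_without_stopwords'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
# alternative with tokenization and lower-casing:
# [i for i in word_tokenize(sentence.lower()) if i not in stop]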
In [13]:
print(df['text_without_stopwords'][100])
b"Have even received food yet already bad experience. The soda machine completely diet drinks. So I went front let know pointed someone else said tell her. Because job? Then I'm trying tell person I treated rude snapped me. Very poor customer service."
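
The sample above still contains words such as "The", "So" and "I". This is because x.split() keeps case and punctuation, while the NLTK stop word list is lower case. A minimal sketch of a stricter variant, assuming lower-casing and a simple regular-expression tokenizer are acceptable; the tiny STOP_WORDS set is a hypothetical stand-in for stopwords.words('english').

```python
import re

# Small stand-in for stopwords.words('english'); the real NLTK list is much longer.
STOP_WORDS = {"i", "the", "so", "to", "and", "a", "was", "me", "then", "very"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lower-case, tokenize on letters/apostrophes, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


cleaned = remove_stop_words("The staff was rude to me. So I left!")
print(cleaned)
```

The vectorizers used next also lower-case the text and apply their own English stop word list (stop_words='english'), so this step mainly matters when the cleaned text is inspected or reused elsewhere.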
In [14]:
from sklearn.feature_extraction.text import CountVectorizer
n_features = 1000
count_vect = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english') 
summary_text = df['text_without_stopwords']
# Learn the vocabulary dictionary and return term-document matrix.
bag_words = count_vect.fit_transform(summary_text)
In [15]:
print(bag_words.shape) # this is a sparse matrix
print('=========')
print(bag_words[0])
(499999, 1000)
=========
  (0, 822)	1
  (0, 212)	1
  (0, 707)	1
  (0, 329)	1
  (0, 697)	1
  (0, 387)	1
  (0, 552)	1
  (0, 322)	1
  (0, 295)	1
  (0, 338)	1
  (0, 483)	1
  (0, 347)	1
  (0, 402)	2
  (0, 767)	1
  (0, 718)	3
  (0, 276)	1
  (0, 719)	1
  (0, 164)	1
  (0, 790)	1
  (0, 812)	2
  (0, 59)	1
  (0, 665)	1
  (0, 553)	1
  (0, 571)	1
  (0, 413)	2
  (0, 592)	2
  (0, 412)	1
  (0, 177)	1
  (0, 153)	1
  (0, 522)	1
  (0, 714)	2
  (0, 829)	1
  (0, 240)	1
  (0, 936)	2
  (0, 207)	2
  (0, 25)	1
  (0, 495)	1
  (0, 664)	1
  (0, 208)	1
  (0, 398)	3
  (0, 474)	2
  (0, 488)	1
  (0, 558)	1
  (0, 821)	1
In [16]:
print(len(count_vect.vocabulary_))
print(count_vect.vocabulary_)
1000
{'stayed': 821, 'nights': 558, 'loved': 488, 'location': 474, 'hotel': 398, 'decent': 208, 'price': 664, 'makes': 495, 'amazing': 25, 'deal': 207, 'walk': 936, 'door': 240, 'street': 829, 'right': 714, 'minute': 522, 'close': 153, 'corner': 177, 'including': 412, 'opened': 592, 'incredible': 413, 'nthe': 571, 'nice': 553, 'priced': 665, 'bar': 59, 'staff': 812, 'small': 790, 'comfortable': 164, 'rooms': 719, 'excellent': 276, 'room': 718, 'service': 767, 'huge': 402, 'giving': 347, 'lost': 483, 'gave': 338, 'fancy': 295, 'free': 322, 'ni': 552, 'highly': 387, 'recommend': 697, 'friends': 329, 'return': 707, 'definitely': 212, 'staying': 822, 'need': 545, 'place': 638, 'stay': 820, 'night': 557, 'longer': 478, 'better': 75, 'station': 819, 'old': 586, 'town': 888, 'new': 550, 'seeing': 756, 'food': 317, 'shopping': 774, 'walking': 938, 'clean': 149, 'good': 353, 'bed': 69, 'phone': 630, 'given': 346, 'lot': 484, 'attitude': 45, 'husband': 404, 'quite': 684, 'charged': 129, '15': 5, 'double': 241, 'sure': 840, 'felt': 302, 'like': 460, 'money': 530, 'grab': 356, 'kind': 434, 'nif': 556, 'book': 83, 'desk': 217, 'bathroom': 64, 'isn': 420, 'awful': 53, 'know': 438, 'getting': 343, 'es': 269, 'der': 216, 'ist': 423, 'das': 202, 'die': 223, 'wir': 967, 'xc3': 989, 'im': 408, 'und': 906, 'man': 498, 'mit': 525, 'zu': 999, 'ich': 406, 'war': 942, 'auch': 46, 'nicht': 555, 'ein': 258, 'sehr': 758, '50': 11, 'person': 626, 'yelp': 994, 'open': 591, 'hours': 400, 'helpful': 385, 'questions': 680, 'type': 904, 'breakfast': 92, 'style': 834, 'want': 940, 'look': 479, 'don': 239, 'far': 297, 'easy': 251, 'airport': 24, 'turn': 900, 'view': 927, 'got': 354, 'bit': 78, 'just': 431, 'quiet': 683, 'plus': 648, 'floor': 316, 'pricey': 667, 'best': 74, 'personal': 627, 'super': 838, 'beautiful': 68, 'space': 802, 'great': 358, 'rock': 715, 'tiny': 875, 'expected': 280, 'barely': 60, 'review': 709, 'fairly': 292, 'think': 870, 'company': 166, 'won': 970, 'break': 91, 'really': 692, 
'near': 544, 'wanted': 941, 'day': 205, 'time': 873, 'group': 363, 'pretty': 662, 'big': 76, 'weekend': 954, 'middle': 517, 'main': 493, 'actually': 15, 'didn': 222, 'use': 911, 'hard': 375, 'prefer': 660, 'soft': 794, 'come': 162, 'paid': 609, 'say': 743, 'value': 917, 'located': 473, 'chips': 141, 'drinks': 246, 'xc2': 988, 'people': 622, 'friend': 327, 'hair': 369, 'wash': 944, 'cold': 159, 'easily': 250, 'cooked': 174, 'different': 224, 'idea': 407, 'center': 121, 'city': 147, 'modern': 528, 'stuff': 832, 'true': 894, 'perfectly': 625, 'ok': 584, 'nservice': 569, 'friendly': 328, 'available': 48, 'love': 487, 'chain': 123, 'fast': 298, 'attention': 43, 'market': 501, 'venue': 925, 'ahead': 22, 'completely': 170, 'reservation': 703, 'process': 671, 'credit': 189, 'card': 116, 'maybe': 504, 'minutes': 523, 'away': 51, 'warm': 943, 'check': 133, 'area': 33, 'serve': 763, 'buffet': 99, 'morning': 533, 'nmy': 560, 'dark': 201, 'single': 784, 'tea': 859, 'coffee': 158, 'making': 496, 'tv': 902, 'limited': 462, 'fine': 306, 'days': 206, 'hands': 372, 'used': 912, 'general': 340, 'called': 113, 'black': 80, 'worth': 980, 'standard': 814, 'ago': 21, 'saw': 742, 'knew': 437, 'weeks': 955, 'pay': 620, 'normal': 563, 'compared': 167, 'welcoming': 957, 'professional': 674, 'nwe': 576, 'received': 695, 'large': 447, 'despite': 218, 'busy': 107, 'thanks': 867, 'solid': 795, 'works': 976, 'doesn': 235, 'places': 639, 'hit': 388, 'fresh': 324, 'cheese': 136, 'toast': 877, 'fruit': 332, 'quality': 679, 'looking': 481, 'reasonable': 694, 'choice': 143, 'short': 776, 'half': 370, 'hour': 399, 'home': 390, 'try': 897, 'slightly': 788, 'non': 561, 'cool': 175, 'tables': 846, 'feel': 300, 'sat': 735, 'served': 764, 'sort': 799, 'local': 472, 'hand': 371, 'needed': 546, 'eat': 252, 'came': 114, 'paying': 621, 'help': 383, 'going': 351, 'website': 951, 'finish': 307, 'second': 754, 'cup': 194, 'hot': 397, 'sorry': 798, 'decor': 210, 'especially': 270, 'horrible': 395, 'excited': 277, 
'birthday': 77, 'went': 958, 'ordered': 598, 'meal': 505, 'way': 950, 'things': 869, 'share': 771, 'asian': 37, 'chicken': 138, 'salad': 725, 'impressed': 410, 'veggies': 924, 'overall': 604, 'steak': 823, 'fried': 326, 'rice': 713, 'decided': 209, 'nthey': 573, 'understand': 907, 'menu': 514, 'order': 597, 'online': 590, 'picked': 632, 'ready': 690, 'smell': 792, 'car': 115, 'wait': 931, 'delicious': 213, 'spicy': 809, 'line': 463, 'store': 826, 'sauce': 738, 'saying': 744, 've': 919, 'tasted': 855, 'italian': 424, 'said': 724, 'offered': 580, 'chinese': 140, 'restaurants': 706, 'portions': 656, 'prices': 666, 'considering': 171, 'cheaper': 132, 'overpriced': 605, 'forgot': 320, 'tomato': 880, 'dressing': 244, 'wife': 962, 'immediately': 409, 'restaurant': 705, 'manager': 500, '30': 8, 'asked': 39, 'told': 879, 'counter': 180, 'bag': 57, 'set': 770, 'explained': 284, 'lettuce': 456, 'note': 566, 'taken': 849, 'plate': 641, 'noverall': 568, 'customer': 196, 'option': 594, 'surprise': 841, 'spot': 810, 'times': 874, 'salty': 732, 'difficult': 225, 'items': 426, 'add': 16, 'quickly': 682, 'expecting': 281, 'sitting': 786, 'heat': 382, 'worked': 974, 'seating': 752, 'glad': 348, 'wonderful': 971, 'quick': 681, 'reviews': 710, 'flavor': 313, 'tried': 892, 'variety': 918, 'taste': 854, 'beef': 70, 'pieces': 636, 'cut': 198, 'shrimp': 779, 'ate': 41, 'meat': 509, 'dish': 232, 'pork': 654, 'egg': 256, 'roll': 716, 'sour': 801, 'soup': 800, 'chili': 139, 'instead': 418, 'worst': 979, 'enjoyed': 264, 'dishes': 233, 'plates': 642, 'enjoy': 263, 'typical': 905, 'tastes': 856, 'waste': 946, 'garlic': 337, 'sweet': 844, 'write': 983, 'based': 62, 'issue': 421, '25': 7, 'employee': 259, 'past': 616, 'years': 993, 'drink': 245, 'simple': 782, 'appetizer': 29, 'expect': 278, 'customers': 197, 'experience': 283, 'care': 117, 'today': 878, 'thai': 865, 'curry': 195, 'favorite': 299, 'lunch': 491, 'closed': 154, 'daughter': 204, 'fan': 294, 'noodles': 562, 'coming': 165, 'orange': 
596, 'high': 386, 'expectations': 279, 'comes': 163, 'sad': 723, 'bad': 56, 'make': 494, 'wish': 968, 'stars': 816, 'special': 804, 'lack': 443, 'entrees': 267, 'crab': 184, 'filling': 304, 'bland': 81, 'plenty': 647, 'spend': 806, 'save': 741, 'disappointed': 230, 'waiting': 934, 'pick': 631, 'twice': 903, 'wrong': 984, 'fix': 311, 'visit': 928, 'salt': 731, 'vegetarian': 922, 'options': 595, 'bowl': 86, 'looks': 482, 'texture': 864, 'business': 106, 'average': 49, 'kinda': 435, 'offer': 579, 'affordable': 19, 'nthis': 574, 'servers': 766, 'mind': 520, 'leaving': 453, 'foods': 318, 'environment': 268, 'usually': 915, 'entree': 266, 'convenient': 172, 'able': 13, 'drive': 247, 'little': 466, 'house': 401, 'certainly': 122, 'orders': 600, 'guess': 365, 'unfortunately': 908, 'couple': 181, 'tonight': 881, 'bite': 79, 'course': 182, 'started': 818, 'honestly': 393, 'exactly': 275, 'haven': 377, 'establishment': 271, 'fabulous': 288, 'nas': 543, 'bucks': 98, 'expensive': 282, 'patio': 619, 'weird': 956, 'usual': 914, 'family': 293, '20': 6, 'sit': 785, 'oh': 582, 'running': 722, 'okay': 585, 'moved': 535, 'meals': 506, 'couldn': 179, 'wouldn': 981, 'greasy': 357, 'ended': 262, 'eating': 254, 'brought': 96, 'atmosphere': 42, 'spring': 811, 'rolls': 717, 'long': 477, 'korean': 441, 'authentic': 47, 'american': 27, 'leave': 452, 'unless': 910, 'aren': 34, 'left': 454, 'nall': 542, 'soon': 797, 'took': 882, 'charge': 128, 'run': 721, 'ingredients': 416, 'regular': 700, 'spent': 807, 'white': 961, 'saturday': 737, 'extra': 285, 'change': 126, 'needs': 547, 'total': 885, 'real': 691, 'liked': 461, 'let': 455, 'mix': 526, 'item': 425, '99': 12, 'red': 699, 'onions': 589, 'flavors': 315, 'fantastic': 296, 'tasty': 858, 'straight': 828, 'wasn': 945, 'thing': 868, 'filled': 303, 'll': 470, 'probably': 668, 'eaten': 253, 'beat': 67, 'point': 650, 'pleased': 646, 'surprised': 842, 'inside': 417, 'looked': 480, 'japanese': 427, 'fact': 290, 'employees': 260, 'ordering': 599, 
'poor': 653, 'management': 499, 'problem': 669, 'buy': 109, 'dinner': 227, 'apparently': 28, 'spice': 808, 'crazy': 186, 'table': 845, 'baked': 58, 'yum': 997, 'treated': 891, 'taco': 847, 'gone': 352, 'multiple': 537, 'forward': 321, 'smaller': 791, 'server': 765, 'pleasant': 645, 'serving': 769, 'generous': 341, 'forget': 319, 'perfect': 624, 'month': 531, 'dining': 226, 'tell': 861, 'girl': 344, 'happened': 373, 'girls': 345, 'lived': 468, 'heard': 381, 'remember': 701, 'number': 575, 'ask': 38, 'touch': 887, 'beer': 71, 'week': 953, 'son': 796, 'clear': 151, 'sent': 761, 'fixed': 312, 'bring': 93, 'later': 450, 'guy': 366, 'says': 745, 'terrible': 863, 'thank': 866, 'appreciate': 32, 'stop': 824, 'class': 148, 'healthy': 379, 'purchased': 678, 'evening': 273, 'believe': 73, 'cost': 178, 'awesome': 52, 'mexican': 516, 'yeah': 991, 'finally': 305, 'interesting': 419, 'opinion': 593, 'yes': 995, 'water': 949, '12': 4, 'dry': 248, 'simply': 783, 'seat': 750, 'party': 614, 'dirty': 229, 'stopped': 825, 'head': 378, 'vegan': 920, 'scottsdale': 747, 'phoenix': 629, 'doctor': 234, 'baby': 54, 'end': 261, 'taking': 851, 'year': 992, 'reason': 693, 'knows': 440, 'cook': 173, 'reading': 689, 'joint': 429, 'min': 519, 'noticed': 567, 'weren': 959, 'hope': 394, 'absolutely': 14, '100': 2, 'lady': 445, 'changed': 127, 'rating': 687, 'locations': 475, 'walked': 937, 'kept': 432, 'face': 289, 'visiting': 930, 'live': 467, 'satisfied': 736, 'covered': 183, 'entire': 265, 'watching': 948, 'extremely': 286, 'watch': 947, 'seriously': 762, 'chef': 137, 'supposed': 839, 'finished': 308, 'fish': 309, 'nicely': 554, 'choices': 144, '45': 10, 'work': 973, 'job': 428, 'trying': 898, 'rude': 720, 'happy': 374, 'hold': 389, 'low': 490, 'working': 975, 'polite': 651, 'tuna': 899, 'added': 17, 'goes': 350, 'traditional': 889, 'gets': 342, 'addition': 18, 'cash': 119, 'friday': 325, 'loud': 486, 'case': 118, 'kitchen': 436, 'star': 815, 'life': 458, 'medium': 511, 'fun': 333, 'building': 
100, 'suggest': 835, 'yummy': 998, 'packed': 608, 'seated': 751, 'salsa': 730, 'sunday': 837, 'arrived': 35, 'nit': 559, 'previous': 663, 'waiter': 933, 'playing': 644, 'takes': 850, 'la': 442, 'size': 787, 'mean': 507, 'piece': 635, 'burrito': 105, 'onion': 588, 'seafood': 748, 'green': 359, 'checked': 134, 'prepared': 661, 'flavorful': 314, 'mixed': 527, 'kids': 433, 'portion': 655, 'greeted': 360, 'slow': 789, 'lobster': 471, 'corn': 176, 'tip': 876, 'living': 469, 'dollars': 238, 'disappointing': 231, 'unique': 909, 'included': 411, 'means': 508, 'cute': 199, 'beans': 66, 'color': 160, 'ones': 587, 'specials': 805, 'seasoned': 749, 'using': 913, 'trip': 893, 'outside': 602, 'waitress': 935, 'combo': 161, 'miss': 524, 'strip': 830, 'mall': 497, 'vibe': 926, 'talking': 853, 'asking': 40, 'summer': 836, 'sign': 781, 'eye': 287, 'tacos': 848, 'totally': 886, 'sauces': 739, 'homemade': 391, 'mouth': 534, '10': 1, 'feeling': 301, 'west': 960, 'boyfriend': 88, 'dessert': 219, 'cream': 187, 'eggs': 257, 'dip': 228, 'stuffed': 833, 'outstanding': 603, 'world': 977, 'thought': 872, 'attentive': 44, 'chairs': 124, 'shared': 772, 'issues': 422, 'rare': 686, 'lovely': 489, 'pm': 649, 'nfor': 551, 'sandwiches': 734, 'salads': 726, 'sandwich': 733, 'cheap': 131, 'sides': 780, 'fair': 291, 'potatoes': 659, 'oil': 583, 'tasting': 857, 'cocktail': 156, 'list': 464, 'speak': 803, 'french': 323, 'selection': 759, 'wine': 965, 'owner': 606, 'bottle': 84, 'rest': 704, 'brunch': 97, 'lamb': 446, 'bacon': 55, 'deep': 211, 'bunch': 102, 'appetizers': 30, 'cocktails': 157, 'chocolate': 142, 'chance': 125, 'wedding': 952, 'recently': 696, 'incredibly': 414, 'ribs': 712, 'lots': 485, 'choose': 145, 'cake': 112, 'turned': 901, 'mediocre': 510, 'young': 996, 'wow': 982, 'literally': 465, 'gem': 339, 'bartender': 61, 'early': 249, 'start': 817, 'creamy': 188, 'tender': 862, 'bread': 90, 'potato': 658, 'na': 539, 'recommended': 698, 'fries': 330, 'afternoon': 20, 'window': 964, 
'knowledgeable': 439, 'butter': 108, 'waited': 932, 'glass': 349, 'pepper': 623, 'light': 459, 'strong': 831, '40': 9, 'mins': 521, 'toronto': 884, 'blue': 82, 'date': 203, '11': 3, 'le': 451, 'parking': 613, 'pancakes': 611, 'sausage': 740, 'crispy': 190, 'pain': 610, 'crowd': 191, 'gotten': 355, 'pasta': 617, 'desserts': 220, 'purchase': 677, 'sell': 760, 'stand': 813, 'hungry': 403, 'section': 755, 'problems': 670, 'matter': 503, 'shot': 777, 'xa9': 987, 'moving': 536, 'hear': 380, 'complete': 169, 'groupon': 364, 'hate': 376, 'ambiance': 26, 'late': 449, 'notch': 565, 'word': 972, 'trust': 896, 'ice': 405, 'treat': 890, 'chose': 146, 'veggie': 923, 'did': 221, 'crowded': 192, 'obviously': 578, 'ramen': 685, 'xa0': 986, 'bought': 85, 'music': 538, 'woman': 969, 'read': 688, 'event': 274, 'efficient': 255, 'basically': 63, 'worse': 978, 'nthere': 572, 'met': 515, 'salmon': 728, 'mentioned': 513, 'willing': 963, 'truly': 895, 'guys': 367, 'air': 23, 'clearly': 152, 'showed': 778, 'school': 746, 'stores': 827, 'patient': 618, 'plan': 640, 'broth': 95, 'juicy': 430, 'provided': 675, 'downtown': 242, 'normally': 564, 'hostess': 396, 'pizza': 637, 'office': 581, 'months': 532, 'seen': 757, 'pulled': 676, 'talk': 852, 'honest': 392, 'negative': 548, 'gym': 368, 'burger': 103, 'wall': 939, 'shops': 775, 'shop': 773, 'vegas': 921, 'pictures': 633, 'las': 448, 'smile': 793, 'products': 673, 'casino': 120, 'sales': 727, 'thinking': 871, 'art': 36, 'dr': 243, 'mention': 512, 'buying': 110, 'nso': 570, 'valley': 916, 'level': 457, 'ladies': 444, 'game': 335, 'burgers': 104, 'crust': 193, 'bun': 101, 'toppings': 883, 'mac': 492, 'returning': 708, 'future': 334, 'grilled': 362, '00': 0, 'play': 643, 'games': 336, 'cafe': 111, 'checking': 135, 'nwhen': 577, 'milk': 518, 'sushi': 843, 'owners': 607, 'pass': 615, 'neighborhood': 549, 'delivery': 215, 'complaint': 168, 'craving': 185, 'seats': 753, 'mom': 529, 'possible': 657, 'visited': 929, 'product': 672, 'salon': 729, 'wings': 
966, 'delivered': 214, 'avoid': 50, 'dance': 200, 'massage': 502, 'park': 612, 'frozen': 331, 'club': 155, 'grill': 361, 'box': 87, 'dogs': 237, 'charlotte': 130, 'bbq': 65, 'original': 601, 'indian': 415, 'nail': 540, 'team': 860, 'lol': 476, 'helped': 384, 'nails': 541, 'appointment': 31, 'brand': 89, 'repair': 702, 'beers': 72, 'fit': 310, 'dog': 236, 'services': 768, 'rib': 711, 'cleaning': 150, 'pie': 634, 'pool': 652, 'xe3': 990, 'brisket': 94, 'x81': 985, 'et': 272, 'pho': 628}
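
Each line of the sparse output, such as (0, 718) 3, reads as (row, column) followed by the count, and the vocabulary maps each word to its column. The sketch below inverts that mapping to turn the first review's columns back into words. It uses only a handful of entries copied from the outputs above, so it illustrates the lookup rather than reproducing the full matrix.

```python
# A few word -> column entries copied from the vocabulary printed above.
vocabulary = {"room": 718, "hotel": 398, "huge": 402, "staff": 812, "stayed": 821}

# A few column -> count pairs copied from the printed first row of bag_words.
row_zero = {718: 3, 398: 3, 402: 2, 812: 2, 821: 1}

# Invert the mapping so that a column index leads back to its word.
index_to_word = {index: word for word, index in vocabulary.items()}

word_counts = {index_to_word[col]: count for col, count in row_zero.items()}
# Sort by descending count, then alphabetically for ties.
top_words = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
print(top_words)
```

In recent scikit-learn versions, count_vect.get_feature_names_out() returns the words ordered by column index, which avoids building the inverse dictionary by hand. Vocabulary entries such as 'nthe', 'ni' and 'xc3' look like remnants of escaped \n and \xc3 sequences from the bytes literals stored in the CSV.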
In [17]:
## Convert the data into a sparse encoded tf-idf representation.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,max_features=n_features,stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(summary_text)
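
To make the reweighting concrete, here is a pure-Python sketch of the computation that scikit-learn documents for the default TfidfVectorizer settings: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The toy corpus is made up, and the code is an illustration of the formula, not the library's implementation.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized reviews (made up for illustration).
docs = [["great", "food", "great", "service"],
        ["bad", "service"],
        ["great", "location"]]

n_docs = len(docs)
# Document frequency: in how many reviews each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))
# Smoothed inverse document frequency.
idf = {word: math.log((1 + n_docs) / (1 + df)) + 1 for word, df in doc_freq.items()}


def tfidf_row(doc):
    """Return L2-normalized tf-idf weights for one tokenized review."""
    counts = Counter(doc)
    weights = {word: count * idf[word] for word, count in counts.items()}
    norm = math.sqrt(sum(value ** 2 for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}


first_row = tfidf_row(docs[0])
print(first_row)
```

Words that occur in fewer reviews receive a larger idf, so a rare word counts for more than a common one with the same raw count.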
In [18]:
print(tfidf.shape) # this is a sparse matrix
print('=========')
# note: this prints the first row of the count matrix; use tfidf[0] to see the tf-idf weights
print(bag_words[0])
(499999, 1000)
=========
  (0, 822)	1
  (0, 212)	1
  (0, 707)	1
  (0, 329)	1
  (0, 697)	1
  (0, 387)	1
  (0, 552)	1
  (0, 322)	1
  (0, 295)	1
  (0, 338)	1
  (0, 483)	1
  (0, 347)	1
  (0, 402)	2
  (0, 767)	1
  (0, 718)	3
  (0, 276)	1
  (0, 719)	1
  (0, 164)	1
  (0, 790)	1
  (0, 812)	2
  (0, 59)	1
  (0, 665)	1
  (0, 553)	1
  (0, 571)	1
  (0, 413)	2
  (0, 592)	2
  (0, 412)	1
  (0, 177)	1
  (0, 153)	1
  (0, 522)	1
  (0, 714)	2
  (0, 829)	1
  (0, 240)	1
  (0, 936)	2
  (0, 207)	2
  (0, 25)	1
  (0, 495)	1
  (0, 664)	1
  (0, 208)	1
  (0, 398)	3
  (0, 474)	2
  (0, 488)	1
  (0, 558)	1
  (0, 821)	1
In [19]:
print(len(tfidf_vectorizer.vocabulary_))
print(tfidf_vectorizer.vocabulary_)
1000
{'stayed': 821, 'nights': 558, 'loved': 488, 'location': 474, 'hotel': 398, 'decent': 208, 'price': 664, 'makes': 495, 'amazing': 25, 'deal': 207, 'walk': 936, 'door': 240, 'street': 829, 'right': 714, 'minute': 522, 'close': 153, 'corner': 177, 'including': 412, 'opened': 592, 'incredible': 413, 'nthe': 571, 'nice': 553, 'priced': 665, 'bar': 59, 'staff': 812, 'small': 790, 'comfortable': 164, 'rooms': 719, 'excellent': 276, 'room': 718, 'service': 767, 'huge': 402, 'giving': 347, 'lost': 483, 'gave': 338, 'fancy': 295, 'free': 322, 'ni': 552, 'highly': 387, 'recommend': 697, 'friends': 329, 'return': 707, 'definitely': 212, 'staying': 822, 'need': 545, 'place': 638, 'stay': 820, 'night': 557, 'longer': 478, 'better': 75, 'station': 819, 'old': 586, 'town': 888, 'new': 550, 'seeing': 756, 'food': 317, 'shopping': 774, 'walking': 938, 'clean': 149, 'good': 353, 'bed': 69, 'phone': 630, 'given': 346, 'lot': 484, 'attitude': 45, 'husband': 404, 'quite': 684, 'charged': 129, '15': 5, 'double': 241, 'sure': 840, 'felt': 302, 'like': 460, 'money': 530, 'grab': 356, 'kind': 434, 'nif': 556, 'book': 83, 'desk': 217, 'bathroom': 64, 'isn': 420, 'awful': 53, 'know': 438, 'getting': 343, 'es': 269, 'der': 216, 'ist': 423, 'das': 202, 'die': 223, 'wir': 967, 'xc3': 989, 'im': 408, 'und': 906, 'man': 498, 'mit': 525, 'zu': 999, 'ich': 406, 'war': 942, 'auch': 46, 'nicht': 555, 'ein': 258, 'sehr': 758, '50': 11, 'person': 626, 'yelp': 994, 'open': 591, 'hours': 400, 'helpful': 385, 'questions': 680, 'type': 904, 'breakfast': 92, 'style': 834, 'want': 940, 'look': 479, 'don': 239, 'far': 297, 'easy': 251, 'airport': 24, 'turn': 900, 'view': 927, 'got': 354, 'bit': 78, 'just': 431, 'quiet': 683, 'plus': 648, 'floor': 316, 'pricey': 667, 'best': 74, 'personal': 627, 'super': 838, 'beautiful': 68, 'space': 802, 'great': 358, 'rock': 715, 'tiny': 875, 'expected': 280, 'barely': 60, 'review': 709, 'fairly': 292, 'think': 870, 'company': 166, 'won': 970, 'break': 91, 'really': 692, 
'near': 544, 'wanted': 941, 'day': 205, 'time': 873, 'group': 363, 'pretty': 662, 'big': 76, 'weekend': 954, 'middle': 517, 'main': 493, 'actually': 15, 'didn': 222, 'use': 911, 'hard': 375, 'prefer': 660, 'soft': 794, 'come': 162, 'paid': 609, 'say': 743, 'value': 917, 'located': 473, 'chips': 141, 'drinks': 246, 'xc2': 988, 'people': 622, 'friend': 327, 'hair': 369, 'wash': 944, 'cold': 159, 'easily': 250, 'cooked': 174, 'different': 224, 'idea': 407, 'center': 121, 'city': 147, 'modern': 528, 'stuff': 832, 'true': 894, 'perfectly': 625, 'ok': 584, 'nservice': 569, 'friendly': 328, 'available': 48, 'love': 487, 'chain': 123, 'fast': 298, 'attention': 43, 'market': 501, 'venue': 925, 'ahead': 22, 'completely': 170, 'reservation': 703, 'process': 671, 'credit': 189, 'card': 116, 'maybe': 504, 'minutes': 523, 'away': 51, 'warm': 943, 'check': 133, 'area': 33, 'serve': 763, 'buffet': 99, 'morning': 533, 'nmy': 560, 'dark': 201, 'single': 784, 'tea': 859, 'coffee': 158, 'making': 496, 'tv': 902, 'limited': 462, 'fine': 306, 'days': 206, 'hands': 372, 'used': 912, 'general': 340, 'called': 113, 'black': 80, 'worth': 980, 'standard': 814, 'ago': 21, 'saw': 742, 'knew': 437, 'weeks': 955, 'pay': 620, 'normal': 563, 'compared': 167, 'welcoming': 957, 'professional': 674, 'nwe': 576, 'received': 695, 'large': 447, 'despite': 218, 'busy': 107, 'thanks': 867, 'solid': 795, 'works': 976, 'doesn': 235, 'places': 639, 'hit': 388, 'fresh': 324, 'cheese': 136, 'toast': 877, 'fruit': 332, 'quality': 679, 'looking': 481, 'reasonable': 694, 'choice': 143, 'short': 776, 'half': 370, 'hour': 399, 'home': 390, 'try': 897, 'slightly': 788, 'non': 561, 'cool': 175, 'tables': 846, 'feel': 300, 'sat': 735, 'served': 764, 'sort': 799, 'local': 472, 'hand': 371, 'needed': 546, 'eat': 252, 'came': 114, 'paying': 621, 'help': 383, 'going': 351, 'website': 951, 'finish': 307, 'second': 754, 'cup': 194, 'hot': 397, 'sorry': 798, 'decor': 210, 'especially': 270, 'horrible': 395, 'excited': 277, 
'birthday': 77, 'went': 958, 'ordered': 598, 'meal': 505, 'way': 950, 'things': 869, 'share': 771, 'asian': 37, 'chicken': 138, 'salad': 725, 'impressed': 410, 'veggies': 924, 'overall': 604, 'steak': 823, 'fried': 326, 'rice': 713, 'decided': 209, 'nthey': 573, 'understand': 907, 'menu': 514, 'order': 597, 'online': 590, 'picked': 632, 'ready': 690, 'smell': 792, 'car': 115, 'wait': 931, 'delicious': 213, 'spicy': 809, 'line': 463, 'store': 826, 'sauce': 738, 'saying': 744, 've': 919, 'tasted': 855, 'italian': 424, 'said': 724, 'offered': 580, 'chinese': 140, 'restaurants': 706, 'portions': 656, 'prices': 666, 'considering': 171, 'cheaper': 132, 'overpriced': 605, 'forgot': 320, 'tomato': 880, 'dressing': 244, 'wife': 962, 'immediately': 409, 'restaurant': 705, 'manager': 500, '30': 8, 'asked': 39, 'told': 879, 'counter': 180, 'bag': 57, 'set': 770, 'explained': 284, 'lettuce': 456, 'note': 566, 'taken': 849, 'plate': 641, 'noverall': 568, 'customer': 196, 'option': 594, 'surprise': 841, 'spot': 810, 'times': 874, 'salty': 732, 'difficult': 225, 'items': 426, 'add': 16, 'quickly': 682, 'expecting': 281, 'sitting': 786, 'heat': 382, 'worked': 974, 'seating': 752, 'glad': 348, 'wonderful': 971, 'quick': 681, 'reviews': 710, 'flavor': 313, 'tried': 892, 'variety': 918, 'taste': 854, 'beef': 70, 'pieces': 636, 'cut': 198, 'shrimp': 779, 'ate': 41, 'meat': 509, 'dish': 232, 'pork': 654, 'egg': 256, 'roll': 716, 'sour': 801, 'soup': 800, 'chili': 139, 'instead': 418, 'worst': 979, 'enjoyed': 264, 'dishes': 233, 'plates': 642, 'enjoy': 263, 'typical': 905, 'tastes': 856, 'waste': 946, 'garlic': 337, 'sweet': 844, 'write': 983, 'based': 62, 'issue': 421, '25': 7, 'employee': 259, 'past': 616, 'years': 993, 'drink': 245, 'simple': 782, 'appetizer': 29, 'expect': 278, 'customers': 197, 'experience': 283, 'care': 117, 'today': 878, 'thai': 865, 'curry': 195, 'favorite': 299, 'lunch': 491, 'closed': 154, 'daughter': 204, 'fan': 294, 'noodles': 562, 'coming': 165, 'orange': 
596, 'high': 386, 'expectations': 279, 'comes': 163, 'sad': 723, 'bad': 56, 'make': 494, 'wish': 968, 'stars': 816, 'special': 804, 'lack': 443, 'entrees': 267, 'crab': 184, 'filling': 304, 'bland': 81, 'plenty': 647, 'spend': 806, 'save': 741, 'disappointed': 230, 'waiting': 934, 'pick': 631, 'twice': 903, 'wrong': 984, 'fix': 311, 'visit': 928, 'salt': 731, 'vegetarian': 922, 'options': 595, 'bowl': 86, 'looks': 482, 'texture': 864, 'business': 106, 'average': 49, 'kinda': 435, 'offer': 579, 'affordable': 19, 'nthis': 574, 'servers': 766, 'mind': 520, 'leaving': 453, 'foods': 318, 'environment': 268, 'usually': 915, 'entree': 266, 'convenient': 172, 'able': 13, 'drive': 247, 'little': 466, 'house': 401, 'certainly': 122, 'orders': 600, 'guess': 365, 'unfortunately': 908, 'couple': 181, 'tonight': 881, 'bite': 79, 'course': 182, 'started': 818, 'honestly': 393, 'exactly': 275, 'haven': 377, 'establishment': 271, 'fabulous': 288, 'nas': 543, 'bucks': 98, 'expensive': 282, 'patio': 619, 'weird': 956, 'usual': 914, 'family': 293, '20': 6, 'sit': 785, 'oh': 582, 'running': 722, 'okay': 585, 'moved': 535, 'meals': 506, 'couldn': 179, 'wouldn': 981, 'greasy': 357, 'ended': 262, 'eating': 254, 'brought': 96, 'atmosphere': 42, 'spring': 811, 'rolls': 717, 'long': 477, 'korean': 441, 'authentic': 47, 'american': 27, 'leave': 452, 'unless': 910, 'aren': 34, 'left': 454, 'nall': 542, 'soon': 797, 'took': 882, 'charge': 128, 'run': 721, 'ingredients': 416, 'regular': 700, 'spent': 807, 'white': 961, 'saturday': 737, 'extra': 285, 'change': 126, 'needs': 547, 'total': 885, 'real': 691, 'liked': 461, 'let': 455, 'mix': 526, 'item': 425, '99': 12, 'red': 699, 'onions': 589, 'flavors': 315, 'fantastic': 296, 'tasty': 858, 'straight': 828, 'wasn': 945, 'thing': 868, 'filled': 303, 'll': 470, 'probably': 668, 'eaten': 253, 'beat': 67, 'point': 650, 'pleased': 646, 'surprised': 842, 'inside': 417, 'looked': 480, 'japanese': 427, 'fact': 290, 'employees': 260, 'ordering': 599, 
'poor': 653, 'management': 499, 'problem': 669, 'buy': 109, 'dinner': 227, 'apparently': 28, 'spice': 808, 'crazy': 186, 'table': 845, 'baked': 58, 'yum': 997, 'treated': 891, 'taco': 847, 'gone': 352, 'multiple': 537, 'forward': 321, 'smaller': 791, 'server': 765, 'pleasant': 645, 'serving': 769, 'generous': 341, 'forget': 319, 'perfect': 624, 'month': 531, 'dining': 226, 'tell': 861, 'girl': 344, 'happened': 373, 'girls': 345, 'lived': 468, 'heard': 381, 'remember': 701, 'number': 575, 'ask': 38, 'touch': 887, 'beer': 71, 'week': 953, 'son': 796, 'clear': 151, 'sent': 761, 'fixed': 312, 'bring': 93, 'later': 450, 'guy': 366, 'says': 745, 'terrible': 863, 'thank': 866, 'appreciate': 32, 'stop': 824, 'class': 148, 'healthy': 379, 'purchased': 678, 'evening': 273, 'believe': 73, 'cost': 178, 'awesome': 52, 'mexican': 516, 'yeah': 991, 'finally': 305, 'interesting': 419, 'opinion': 593, 'yes': 995, 'water': 949, '12': 4, 'dry': 248, 'simply': 783, 'seat': 750, 'party': 614, 'dirty': 229, 'stopped': 825, 'head': 378, 'vegan': 920, 'scottsdale': 747, 'phoenix': 629, 'doctor': 234, 'baby': 54, 'end': 261, 'taking': 851, 'year': 992, 'reason': 693, 'knows': 440, 'cook': 173, 'reading': 689, 'joint': 429, 'min': 519, 'noticed': 567, 'weren': 959, 'hope': 394, 'absolutely': 14, '100': 2, 'lady': 445, 'changed': 127, 'rating': 687, 'locations': 475, 'walked': 937, 'kept': 432, 'face': 289, 'visiting': 930, 'live': 467, 'satisfied': 736, 'covered': 183, 'entire': 265, 'watching': 948, 'extremely': 286, 'watch': 947, 'seriously': 762, 'chef': 137, 'supposed': 839, 'finished': 308, 'fish': 309, 'nicely': 554, 'choices': 144, '45': 10, 'work': 973, 'job': 428, 'trying': 898, 'rude': 720, 'happy': 374, 'hold': 389, 'low': 490, 'working': 975, 'polite': 651, 'tuna': 899, 'added': 17, 'goes': 350, 'traditional': 889, 'gets': 342, 'addition': 18, 'cash': 119, 'friday': 325, 'loud': 486, 'case': 118, 'kitchen': 436, 'star': 815, 'life': 458, 'medium': 511, 'fun': 333, 'building': 
100, 'suggest': 835, 'yummy': 998, 'packed': 608, 'seated': 751, 'salsa': 730, 'sunday': 837, 'arrived': 35, 'nit': 559, 'previous': 663, 'waiter': 933, 'playing': 644, 'takes': 850, 'la': 442, 'size': 787, 'mean': 507, 'piece': 635, 'burrito': 105, 'onion': 588, 'seafood': 748, 'green': 359, 'checked': 134, 'prepared': 661, 'flavorful': 314, 'mixed': 527, 'kids': 433, 'portion': 655, 'greeted': 360, 'slow': 789, 'lobster': 471, 'corn': 176, 'tip': 876, 'living': 469, 'dollars': 238, 'disappointing': 231, 'unique': 909, 'included': 411, 'means': 508, 'cute': 199, 'beans': 66, 'color': 160, 'ones': 587, 'specials': 805, 'seasoned': 749, 'using': 913, 'trip': 893, 'outside': 602, 'waitress': 935, 'combo': 161, 'miss': 524, 'strip': 830, 'mall': 497, 'vibe': 926, 'talking': 853, 'asking': 40, 'summer': 836, 'sign': 781, 'eye': 287, 'tacos': 848, 'totally': 886, 'sauces': 739, 'homemade': 391, 'mouth': 534, '10': 1, 'feeling': 301, 'west': 960, 'boyfriend': 88, 'dessert': 219, 'cream': 187, 'eggs': 257, 'dip': 228, 'stuffed': 833, 'outstanding': 603, 'world': 977, 'thought': 872, 'attentive': 44, 'chairs': 124, 'shared': 772, 'issues': 422, 'rare': 686, 'lovely': 489, 'pm': 649, 'nfor': 551, 'sandwiches': 734, 'salads': 726, 'sandwich': 733, 'cheap': 131, 'sides': 780, 'fair': 291, 'potatoes': 659, 'oil': 583, 'tasting': 857, 'cocktail': 156, 'list': 464, 'speak': 803, 'french': 323, 'selection': 759, 'wine': 965, 'owner': 606, 'bottle': 84, 'rest': 704, 'brunch': 97, 'lamb': 446, 'bacon': 55, 'deep': 211, 'bunch': 102, 'appetizers': 30, 'cocktails': 157, 'chocolate': 142, 'chance': 125, 'wedding': 952, 'recently': 696, 'incredibly': 414, 'ribs': 712, 'lots': 485, 'choose': 145, 'cake': 112, 'turned': 901, 'mediocre': 510, 'young': 996, 'wow': 982, 'literally': 465, 'gem': 339, 'bartender': 61, 'early': 249, 'start': 817, 'creamy': 188, 'tender': 862, 'bread': 90, 'potato': 658, 'na': 539, 'recommended': 698, 'fries': 330, 'afternoon': 20, 'window': 964, 
'knowledgeable': 439, 'butter': 108, 'waited': 932, 'glass': 349, 'pepper': 623, 'light': 459, 'strong': 831, '40': 9, 'mins': 521, 'toronto': 884, 'blue': 82, 'date': 203, '11': 3, 'le': 451, 'parking': 613, 'pancakes': 611, 'sausage': 740, 'crispy': 190, 'pain': 610, 'crowd': 191, 'gotten': 355, 'pasta': 617, 'desserts': 220, 'purchase': 677, 'sell': 760, 'stand': 813, 'hungry': 403, 'section': 755, 'problems': 670, 'matter': 503, 'shot': 777, 'xa9': 987, 'moving': 536, 'hear': 380, 'complete': 169, 'groupon': 364, 'hate': 376, 'ambiance': 26, 'late': 449, 'notch': 565, 'word': 972, 'trust': 896, 'ice': 405, 'treat': 890, 'chose': 146, 'veggie': 923, 'did': 221, 'crowded': 192, 'obviously': 578, 'ramen': 685, 'xa0': 986, 'bought': 85, 'music': 538, 'woman': 969, 'read': 688, 'event': 274, 'efficient': 255, 'basically': 63, 'worse': 978, 'nthere': 572, 'met': 515, 'salmon': 728, 'mentioned': 513, 'willing': 963, 'truly': 895, 'guys': 367, 'air': 23, 'clearly': 152, 'showed': 778, 'school': 746, 'stores': 827, 'patient': 618, 'plan': 640, 'broth': 95, 'juicy': 430, 'provided': 675, 'downtown': 242, 'normally': 564, 'hostess': 396, 'pizza': 637, 'office': 581, 'months': 532, 'seen': 757, 'pulled': 676, 'talk': 852, 'honest': 392, 'negative': 548, 'gym': 368, 'burger': 103, 'wall': 939, 'shops': 775, 'shop': 773, 'vegas': 921, 'pictures': 633, 'las': 448, 'smile': 793, 'products': 673, 'casino': 120, 'sales': 727, 'thinking': 871, 'art': 36, 'dr': 243, 'mention': 512, 'buying': 110, 'nso': 570, 'valley': 916, 'level': 457, 'ladies': 444, 'game': 335, 'burgers': 104, 'crust': 193, 'bun': 101, 'toppings': 883, 'mac': 492, 'returning': 708, 'future': 334, 'grilled': 362, '00': 0, 'play': 643, 'games': 336, 'cafe': 111, 'checking': 135, 'nwhen': 577, 'milk': 518, 'sushi': 843, 'owners': 607, 'pass': 615, 'neighborhood': 549, 'delivery': 215, 'complaint': 168, 'craving': 185, 'seats': 753, 'mom': 529, 'possible': 657, 'visited': 929, 'product': 672, 'salon': 729, 'wings': 
966, 'delivered': 214, 'avoid': 50, 'dance': 200, 'massage': 502, 'park': 612, 'frozen': 331, 'club': 155, 'grill': 361, 'box': 87, 'dogs': 237, 'charlotte': 130, 'bbq': 65, 'original': 601, 'indian': 415, 'nail': 540, 'team': 860, 'lol': 476, 'helped': 384, 'nails': 541, 'appointment': 31, 'brand': 89, 'repair': 702, 'beers': 72, 'fit': 310, 'dog': 236, 'services': 768, 'rib': 711, 'cleaning': 150, 'pie': 634, 'pool': 652, 'xe3': 990, 'brisket': 94, 'x81': 985, 'et': 272, 'pho': 628}

4. Data Visualization

Visualize statistical summaries of the text data such as word frequencies, document lengths, most relevant words, vocabulary size, etc.

4.1 Word Frequencies

Let's first find out the most common words used in reviews, regardless of category and star rating.

In [20]:
df_cv = pd.DataFrame(data=bag_words.toarray(),columns=count_vect.get_feature_names())
df['length_of_text'] = df['text'].map(lambda x: len(x))
In [21]:
word_fre = df_cv.sum().sort_values()[-15:]/df['length_of_text'].sum()
In [22]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib
%matplotlib inline
import seaborn as sns

plt.style.use('ggplot')
ax = word_fre.plot(kind='barh')
plt.title('Frequency of the 15 Most Common Words', color='black')
Out[22]:
<matplotlib.text.Text at 0x124f52e48>
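The counts above come from converting the whole document-term matrix to a dense DataFrame with `toarray()`, which can use a lot of memory on the full review set. As a minimal sketch of an alternative, the column sums can be taken directly on the sparse matrix. The toy corpus and all `toy_*` names are hypothetical stand-ins for `df['text']`, `count_vect` and `bag_words`, and the share here is computed over total word counts rather than total character length.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for df['text'] (hypothetical data).
toy_reviews = [
    "great food and great service",
    "food was cold and service was slow",
    "great place with friendly staff",
]

toy_vect = CountVectorizer(stop_words="english")
toy_bag = toy_vect.fit_transform(toy_reviews)  # sparse document-term matrix

# Sum each column on the sparse matrix itself; no dense copy is created.
toy_counts = np.asarray(toy_bag.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index; ordering terms by index aligns them with toy_counts.
toy_terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

toy_word_counts = pd.Series(toy_counts, index=toy_terms).sort_values()
toy_word_share = toy_word_counts / toy_word_counts.sum()  # share of all counted words
print(toy_word_share.tail(3))
```

The resulting Series can be passed to `.plot(kind='barh')` in the same way as `word_fre`.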
In [129]:
# just print them out for convenience
print(plt.style.available)
['_classic_test', 'bmh', 'classic', 'dark_background', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'seaborn']

Let's find out which businesses receive the longest reviews from customers.

In [80]:
%matplotlib inline
plt.style.use('seaborn')
df_categoryIndex = df.set_index('categories')
ax = df_categoryIndex['length_of_text'].sort_values()[-15:].plot(kind='barh')
plt.title('15 Longest Reviews')
Out[80]:
<matplotlib.text.Text at 0x172918ba8>

The longest reviews are mostly for Japanese or Sushi restaurants. Let's take a closer look. Are they mostly positive or negative?

However, neither of the two plots above gives us much meaningful information.

So we need to look at specific categories.

4.2 Japanese Sushi Restaurant and Pho

In [136]:
import seaborn as sns 
g = sns.countplot(df.stars[df.categories.str.contains("Japanese', 'Sushi")])
#g.set(xticklabels=[])
plt.xticks(rotation=90) # Number of reviews in each rating level
Out[136]:
(array([0, 1, 2, 3, 4]), <a list of 5 Text xticklabel objects>)
In [137]:
import seaborn as sns 
g = sns.countplot(df.stars[df.categories.str.contains("Pho")])
#g.set(xticklabels=[])
plt.xticks(rotation=90) # Number of reviews in each rating level
Out[137]:
(array([0, 1, 2, 3, 4]), <a list of 5 Text xticklabel objects>)

It seems that Pho restaurants received more negative reviews, but also more positive ones.
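Raw counts are hard to compare when two categories have different numbers of reviews. A rough sketch of one way to put them on the same scale is to compare the share of reviews at each star level instead. The small table and the helper `star_shares` below are hypothetical and only mimic the `categories` and `stars` columns of `df`.

```python
import pandas as pd

# Hypothetical miniature of the review table; categories are stored as strings, as in df.
toy_df = pd.DataFrame({
    "categories": [
        "['Japanese', 'Sushi Bars']", "['Japanese', 'Sushi Bars']",
        "['Vietnamese', 'Pho']", "['Vietnamese', 'Pho']",
        "['Vietnamese', 'Pho']", "['Vietnamese', 'Pho']",
    ],
    "stars": [5, 4, 1, 5, 5, 3],
})

def star_shares(frame, pattern):
    """Share of reviews at each star level for categories containing `pattern`."""
    mask = frame["categories"].str.contains(pattern, regex=False)
    return frame.loc[mask, "stars"].value_counts(normalize=True).sort_index()

# Align both distributions on the same star index; missing levels become 0.
toy_shares = pd.DataFrame({
    "Japanese": star_shares(toy_df, "Japanese"),
    "Pho": star_shares(toy_df, "Pho"),
}).fillna(0.0)
print(toy_shares)
```

Each column sums to 1, so the two categories can be plotted side by side with `toy_shares.plot(kind='bar')`.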

4.3 Number of reviews in each rating level

In [51]:
import seaborn as sns 
g = sns.countplot(df.stars)
plt.xticks(rotation=90) # Number of reviews in each rating level
Out[51]:
(array([0, 1, 2, 3, 4]), <a list of 5 Text xticklabel objects>)

People seem more likely to give a high rating.

Let's plot the length of reviews which are 5-star rated.

In [52]:
g = sns.countplot(df['length_of_text'][df.stars == 5])
g.set(xticklabels=[])
plt.xticks(rotation=90) # One bar per distinct review length (in characters)
Out[52]:
(array([   0,    1,    2, ..., 3636, 3637, 3638]),
 <a list of 3639 Text xticklabel objects>)
In [53]:
df['length_of_text'][df.stars == 5].describe()
Out[53]:
count    206649.000000
mean        516.910834
std         492.641342
min          14.000000
25%         213.000000
50%         364.000000
75%         639.000000
max       22980.000000
Name: length_of_text, dtype: float64

We look at the 5-star reviews to see how much people write after a great experience. Length here is measured in characters: the median is 364, and the middle half of the reviews fall between 213 and 639 characters.
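Because the count plot draws one bar for every distinct length, the overall shape is hard to read. As a small sketch, the lengths can be grouped into a few bins first. The lengths and bin edges below are made-up values chosen only for illustration, not taken from the dataset.

```python
import numpy as np

# Hypothetical review lengths in characters.
toy_lengths = np.array([14, 120, 213, 364, 400, 639, 900, 2500])

# Illustrative bin edges; each bin is [left, right) except the last, which is closed.
bin_edges = np.array([0, 250, 500, 1000, 5000])

bin_counts, _ = np.histogram(toy_lengths, bins=bin_edges)
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], bin_counts):
    print(f"{left:>5}-{right:<5} {count}")
```

On the real data, the same idea is available directly as a histogram, for example with `df['length_of_text'][df.stars == 5].plot(kind='hist', bins=50)`.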

Let's also take a look at the situation for 1-star rated reviews.

In [81]:
g = sns.countplot(df['length_of_text'][df.stars == 1])
g.set(xticklabels=[])
plt.xticks(rotation=90) # One bar per distinct review length (in characters)
Out[81]:
(array([   0,    1,    2, ..., 3884, 3885, 3886]),
 <a list of 3887 Text xticklabel objects>)
In [82]:
df['length_of_text'][df.stars == 1].describe()
Out[82]:
count    69826.000000
mean       782.270687
std        729.553710
min          4.000000
25%        303.000000
50%        558.000000
75%        993.000000
max       5602.000000
Name: length_of_text, dtype: float64

People were really upset and wrote more to describe their bad experience! The median 1-star review is 558 characters long, compared with 364 characters for 5-star reviews.
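The two `describe()` calls can also be folded into a single table with one row per star level. The sketch below uses a tiny made-up frame in place of `df`; only the column names `stars`, `text` and `length_of_text` follow the notebook.

```python
import pandas as pd

# Hypothetical reviews; the text content is irrelevant here, only its length matters.
toy_df = pd.DataFrame({
    "stars": [1, 1, 1, 5, 5, 5],
    "text": ["x" * 300, "x" * 500, "x" * 1000, "x" * 200, "x" * 350, "x" * 600],
})
toy_df["length_of_text"] = toy_df["text"].str.len()  # length in characters

# Quartiles of review length for every star level, one row per level.
toy_summary = (
    toy_df.groupby("stars")["length_of_text"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(toy_summary)
```

Reading down the `0.5` column gives the median length per rating, which makes the comparison between 1-star and 5-star reviews immediate.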

4.4 Most Common Relevant Words

Which words did people use most often when they left a 5-star review for a Modern European restaurant?

In [ ]:
df.head(-100)
In [111]:
from sklearn.feature_extraction.text import CountVectorizer
n_features = 1000
count_vect_5 = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english') 
summary_text_5 = df['text_without_stopwords'][(df.stars == 5) & df.categories.str.contains("Modern European', 'Restaurants")]
# Learn the vocabulary dictionary and return term-document matrix.
bag_words_5 = count_vect_5.fit_transform(summary_text_5)
In [112]:
print(len(count_vect_5.vocabulary_))
print(count_vect_5.vocabulary_)
1000
{'love': 485, 'place': 632, 'went': 963, 'boyfriend': 115, 'celebrate': 154, 'anniversary': 38, 'glad': 361, 'suggested': 856, 'food': 341, 'really': 694, 'good': 366, 'great': 373, 'cocktail': 191, 'list': 469, 'nwe': 557, 'shared': 787, 'hot': 409, 'balls': 70, 'delicious': 246, 'loved': 486, 'beef': 82, 'tartare': 872, 'potatoes': 653, 'cooked': 207, 'duck': 272, 'awesome': 65, 'finish': 330, 'lemon': 461, 'far': 318, 'best': 92, 'dessert': 249, 've': 925, 'tried': 906, 'person': 619, 'big': 95, 'perfect': 616, 'nthe': 553, 'service': 781, 'slow': 807, 'don': 267, 'waiter': 939, 'nice': 539, 'french': 344, 'ni': 538, 'definitely': 243, 'sad': 736, 'kind': 446, 'live': 472, 'lovely': 487, 'evening': 300, 'night': 542, 'table': 866, '17': 5, 'selection': 775, 'wine': 972, 'owner': 590, 'excellent': 302, 'heard': 393, 'mind': 510, 'price': 660, 'point': 644, 'meal': 501, 'choices': 179, 'overall': 588, 'bit': 98, 'pricey': 663, 'bottle': 111, 'south': 820, '50': 11, 'just': 442, 'atmosphere': 59, 'polenta': 645, 'starter': 842, 'chef': 172, 'different': 254, 'day': 236, 'ones': 571, 'saw': 758, 'tables': 867, 'looked': 481, 'meals': 502, 'prices': 662, 'ordered': 581, 'menu': 508, 'order': 580, 'delish': 248, 'work': 982, 'way': 954, 'ok': 567, 'closed': 190, 'water': 953, 'nfood': 536, 'amazing': 34, 'bunch': 131, 'appetizers': 42, 'mains': 491, 'cocktails': 192, 'thought': 888, 'sure': 861, 'like': 465, 'restaurant': 713, 'end': 287, 'cool': 208, 'inspired': 430, 'occasion': 558, 'private': 664, 'room': 728, 'small': 808, 'wedding': 956, 'recently': 699, '14': 3, 'guests': 381, 'incredibly': 426, 'easy': 278, 'took': 896, 'care': 148, 'quickly': 681, 'able': 13, 'come': 198, 'dinner': 257, 'family': 316, 'style': 853, 'included': 423, 'kale': 443, 'citrus': 184, 'salad': 740, 'green': 374, 'beans': 77, 'lamb': 453, 'bolognese': 108, 'braised': 116, 'shank': 785, 'short': 789, 'ribs': 719, 'lots': 484, 'choose': 180, 'busy': 135, 'eating': 281, 'cake': 141, 
'brought': 123, 'got': 368, 'home': 402, 'decadent': 239, 'nit': 543, 'fantastic': 317, 'think': 886, 'better': 93, 'host': 408, 'nmy': 544, 'eat': 279, 'days': 237, 'll': 474, 'certificate': 159, 'generous': 357, 'right': 723, 'money': 517, 'flavor': 335, 'prompt': 669, 'saturday': 752, 'came': 145, '30': 9, 'wasn': 950, 'filled': 326, 'time': 889, 'left': 460, 'crowd': 227, 'decor': 242, 'rustic': 734, 'risotto': 724, 'mushroom': 522, 'goat': 363, 'cheese': 169, 'honey': 405, 'little': 471, 'sized': 801, 'reasonable': 695, 'wished': 975, 'portion': 649, 'finished': 331, 'sausage': 754, 'slice': 803, 'tree': 905, 'trunk': 911, 'allowed': 33, 'spread': 833, 'perfectly': 618, 'dish': 262, 'creamy': 222, 'addictive': 21, 'meat': 504, 'sauce': 753, 'seriously': 776, 'frites': 351, 'incredible': 425, 'yes': 996, 'dishes': 263, 'disappointed': 261, 'literally': 470, 'favorite': 321, 'new': 535, 'given': 360, 'toronto': 898, 'gem': 356, 'friend': 348, 'raved': 688, 'vegetarian': 929, 'absolutely': 15, 'outstanding': 587, 'experience': 308, 'friendly': 349, 'funny': 354, 'staff': 837, 'spectacular': 824, 'brussel': 127, 'sprouts': 834, 'scallops': 760, 'pasta': 606, 'light': 464, 'tender': 880, 'delicate': 245, 'flavors': 337, 'soon': 813, '10': 0, 'nservice': 552, 'noverall': 551, 'appetizer': 41, 'main': 490, 'special': 822, 'tell': 879, 'prepared': 655, 'server': 779, 'extremely': 312, 'helpful': 397, 'selecting': 774, 'recommendations': 702, 'didn': 252, 'feel': 323, 'line': 467, 'wonderful': 977, 'equally': 296, 'arrived': 50, 'early': 274, 'reservation': 710, 'problem': 667, 'remember': 707, 'suggestions': 857, 'knew': 448, 'ingredients': 427, 'ceviche': 160, 'start': 840, 'blown': 103, 'away': 64, 'wife': 968, 'looking': 482, 'forward': 342, 'venison': 931, 'probably': 666, 'tasted': 874, 'wild': 969, 'boar': 106, 'roasted': 725, 'maple': 496, 'panna': 596, 'cotta': 211, 'sticky': 846, 'pudding': 672, 'dining': 256, 'hidden': 398, 'wonderfully': 978, 'entr': 292, 
'xc3': 993, 'xa9': 990, 'environment': 295, 'thing': 884, 'clean': 188, 'location': 478, 'ahead': 29, 'ask': 54, 'confit': 205, 'worth': 986, 'try': 912, 'area': 49, 'date': 235, 'brunch': 126, 'doesn': 265, 'super': 860, 'stuff': 851, 'healthy': 392, 'look': 480, 'consisted': 206, 'spices': 828, 'berries': 90, 'tasty': 877, 'pork': 648, 'idea': 418, 'used': 921, 'hint': 401, 'people': 613, 'drinks': 271, 'value': 923, 'return': 715, 'peanut': 611, 'butter': 136, 'thank': 883, 'later': 457, 'servers': 780, 'attentive': 61, 'accommodating': 17, 'wrong': 989, 'pleased': 641, 'weekend': 958, 'reasonably': 696, 'priced': 661, 'sunday': 859, 'enjoyed': 290, 'entire': 291, 'fabulous': 313, 'byob': 139, 'corkage': 209, 'bottles': 112, 'couple': 214, 'noted': 549, 'complete': 204, 'present': 656, 'maybe': 499, 'visiting': 936, 'case': 150, 'exactly': 301, 'chic': 175, 'treat': 903, 'know': 449, 'phenomenal': 623, 'stuffed': 852, 'finally': 328, 'crab': 219, 'nfor': 537, 'puree': 675, 'perfection': 617, 'steak': 845, 'arugula': 52, 'fresh': 345, 'presentation': 657, 'celebration': 155, 'culinary': 230, 'unique': 918, 'things': 885, 'highly': 400, 'recommend': 700, 'huge': 413, 'set': 783, 'comment': 202, 'normal': 546, 'beer': 83, 'going': 364, 'review': 716, 'want': 947, 'kept': 444, 'secret': 772, 'husband': 416, 'hour': 410, 'seated': 768, 'places': 634, 'served': 778, 'rushed': 731, 'bread': 117, 'comes': 199, 'tart': 871, 'warm': 949, 'sweet': 864, 'flavours': 338, 'appropriate': 47, 'plates': 637, 'presented': 658, 'spots': 832, 'nour': 550, 'spent': 826, 'afternoon': 27, 'sitting': 799, 'friends': 350, 'make': 492, 'walked': 943, 'welcome': 960, 'change': 163, 'interior': 433, 'dark': 234, 'wood': 979, 'quite': 684, 'cozy': 218, 'dressed': 269, 'welcomed': 961, 'eggs': 285, 'tomato': 894, 'entrees': 294, 'walk': 942, 'decided': 241, 'crispy': 226, 'salty': 744, 'totally': 900, 'fish': 332, 'belly': 89, 'greens': 375, 'fixe': 333, '35': 10, 'die': 253, 'free': 343, 
'simply': 796, 'actually': 18, 'getting': 358, 'restaurants': 714, 'dinners': 258, 'quick': 680, 'party': 602, 'groupon': 380, 'usually': 922, 'purchase': 674, 'reservations': 711, 'quiet': 682, 'job': 439, 'explained': 309, 'specials': 823, 'gave': 355, 'items': 437, 'add': 19, 'platter': 638, 'glass': 362, 'tip': 891, 'worked': 983, 'hard': 390, 'bad': 68, 'taste': 873, 'addition': 22, 'fee': 322, 'extra': 311, 'kids': 445, 'toast': 892, 'slices': 805, 'heavy': 394, 'pretty': 659, '12': 2, 'coming': 201, 'wow': 988, 'cuisine': 229, 'parking': 598, 'reading': 691, 'reviews': 717, 'pay': 609, 'gotta': 369, 'vibe': 932, 'advantage': 24, 'xa9e': 992, 'town': 902, 'potato': 652, 'bacon': 67, 'soup': 817, 'added': 20, 'delectable': 244, 'chicken': 176, 'liver': 473, 'tangy': 870, 'cut': 232, 'mayo': 500, 'executed': 303, 'juicy': 441, 'word': 981, 'wanted': 948, 'onion': 572, 'squash': 835, 'plate': 636, 'choice': 178, 'topped': 897, 'provided': 671, 'satisfying': 751, 'bite': 99, 'wait': 938, 'trying': 913, 'downtown': 268, 'surprise': 862, 'meant': 503, 'turned': 914, 'especially': 297, 'past': 605, 'window': 970, 'happy': 388, 'casual': 151, 'isn': 435, 'offers': 564, 'variety': 924, 'soft': 812, 'elevated': 286, 'bone': 109, 'high': 399, 'desserts': 250, 'share': 786, 'partner': 600, 'visit': 935, 'felt': 325, 'based': 74, 'center': 156, 'said': 738, 'romantic': 727, 'setting': 784, 'large': 454, 'space': 821, 'prix': 665, 'impressed': 422, 'entree': 293, 'stay': 844, 'non': 545, 'charred': 166, 'spot': 831, 'crazy': 220, 'questions': 679, 'group': 379, 'open': 574, 'asked': 55, 'sit': 798, 'reserve': 712, 'number': 556, 'lot': 483, 'st': 836, 'week': 957, 'boy': 114, 'salmon': 742, 'shrimp': 791, 'absolute': 14, 'quality': 677, 'close': 189, 'salads': 741, 'offered': 562, 'tastes': 875, 'pecan': 612, 'chocolate': 177, 'truffle': 909, 'star': 838, 'knowledgeable': 450, 'sausalido': 755, 'times': 890, 'mother': 519, 'birthday': 96, 'catering': 152, '100': 1, 
'picky': 626, 'nick': 540, 'making': 494, 'personable': 620, 'options': 579, 'orders': 582, 'attention': 60, 'snack': 811, 'veggies': 930, 'professional': 668, 'cheeses': 171, 'terrific': 882, 'wines': 973, 'cauliflower': 153, 'ravioli': 689, 'baby': 66, 'spinach': 829, 'pieces': 628, 'quinoa': 683, 'beautifully': 80, 'coffee': 193, 'creme': 224, 'pumpkin': 673, 'cheesecake': 170, 'world': 985, 'pittsburgh': 631, 'serving': 782, 'sizes': 802, 'yummy': 998, 'lunch': 489, 'check': 167, 'graduation': 372, 'seasoned': 766, 'innovative': 428, 'stars': 839, 'white': 965, 'bean': 76, 'city': 185, 'beverage': 94, 'comfortable': 200, 'satisfied': 750, 'dip': 259, 'balsamic': 71, 'beets': 85, 'fruit': 352, 'moist': 516, 'truly': 910, 'memorable': 506, 'recommended': 703, 'coworker': 217, 'including': 424, 'treated': 904, 'divine': 264, 'hummus': 414, 'gouda': 370, 'cakes': 142, 'couldn': 212, 'wouldn': 987, '15': 4, '20': 7, 'sounds': 816, 'regular': 706, 'told': 893, 'girlfriend': 359, 'eaten': 280, 'brussels': 128, 'portions': 650, 'bring': 120, 'street': 848, 'asking': 56, 'parties': 599, 'mix': 513, 'feeling': 324, 'ready': 692, 'touch': 901, 'cream': 221, 'smoked': 810, 'complaint': 203, 'mushrooms': 523, 'ricotta': 722, 'personal': 621, 'mashed': 498, 'plus': 643, 'easily': 276, 'leave': 458, 'similar': 794, 'tomatoes': 895, 'range': 686, 'asian': 53, 'needs': 534, 'creative': 223, 'seasonal': 765, 'seafood': 763, 'outside': 586, 'filling': 327, 'rich': 721, 'flavorful': 336, 'dined': 255, 'friday': 346, 'half': 383, 'wish': 974, 'importantly': 421, 'hours': 411, 'low': 488, 'split': 830, 'happier': 387, 'disappoint': 260, 'sub': 854, 'pictures': 627, 'help': 395, 'caramel': 147, 'honestly': 404, 'ended': 288, 'courses': 216, 'immediately': 419, 'needed': 533, 'app': 39, 'seat': 767, 'larger': 455, 'mentioned': 507, 'business': 134, 'starting': 843, 'hands': 386, 'sample': 745, 'calamari': 143, 'mussels': 525, 'fried': 347, 'vegetables': 928, 'broth': 122, 'signature': 
793, 'paired': 593, 'cherry': 174, 'daily': 233, 'seared': 764, 'sounded': 815, 'apple': 43, 'brulee': 125, 'offering': 563, 'dollars': 266, 'received': 697, 'mixed': 514, 'say': 759, 'waitstaff': 941, 'homemade': 403, 'sangria': 747, 'weren': 964, 'apparently': 40, 'red': 704, 'pepper': 614, 'breast': 119, 'asparagus': 57, 'plenty': 642, 'tasting': 876, 'spend': 825, 'let': 462, 'black': 100, 'need': 532, 'pop': 647, 'provide': 670, 'kitchen': 447, 'cold': 194, 'taken': 868, 'save': 756, 'ice': 417, 'shows': 790, 'recommendation': 701, 'grilled': 377, 'peppers': 615, 'rare': 687, 'enjoy': 289, 'minutes': 512, 'sandwich': 746, 'bathroom': 75, 'wall': 945, 'accommodate': 16, 'course': 215, 'apples': 44, 'berry': 91, 'simple': 795, 'combination': 197, 'local': 475, 'european': 299, 'true': 908, 'fine': 329, 'working': 984, 'certainly': 158, 'impeccable': 420, 'started': 841, 'house': 412, 'believe': 88, 'chance': 161, 'bed': 81, 'blue': 104, 'quaint': 676, 'apps': 48, 'makes': 493, 'single': 797, 'chop': 181, 'mouth': 520, 'eggplant': 284, 'note': 548, 'la': 452, 'walking': 944, 'flatbread': 334, 'cinnamon': 183, 'manager': 495, 'opinion': 576, 'strip': 849, 'recent': 698, 'real': 693, 'years': 995, 'personally': 622, 'serve': 777, 'watch': 951, 'watching': 952, 'head': 391, 'takes': 869, 'okay': 568, 'ate': 58, 'parents': 597, 'bar': 72, 'option': 578, 'walls': 946, 'opened': 575, 'odd': 560, 'fact': 315, 'savory': 757, 'decent': 240, 'surprised': 863, 'horseradish': 407, 'passed': 603, 'trip': 907, 'reminded': 708, 'charlotte': 165, 'halycon': 384, 'drink': 270, 'nif': 541, 'beet': 84, 'wellington': 962, 'item': 436, 'locally': 476, 'sourced': 819, 'appreciate': 45, 'chefs': 173, 'ambiance': 35, 'museum': 521, 'mint': 511, 'adventurous': 25, 'octopus': 559, 'halcyon': 382, 'earth': 275, 'delightful': 247, 'scene': 761, 'gorgeous': 367, 'smaller': 809, 'expect': 304, 'bars': 73, 'beautiful': 79, 'balance': 69, 'country': 213, 'cornbread': 210, 'breakfast': 118, 
'board': 107, 'salt': 743, 'appreciated': 46, 'nafter': 526, 'opted': 577, 'nthey': 554, 'interesting': 432, 'mood': 518, 'medium': 505, 'peach': 610, 'expected': 306, 'size': 800, 'face': 314, 'blueberry': 105, 'spiced': 827, 'cup': 231, 'nthis': 555, 'inside': 429, 'art': 51, 'modern': 515, 'nc': 529, 'farm': 320, 'folks': 340, 'classic': 186, 'located': 477, 'uptown': 920, 'begin': 86, 'colors': 195, 'chandeliers': 162, 'outdoor': 585, 'seating': 769, 'view': 933, 'placed': 633, 'butters': 137, 'olive': 570, 'oil': 566, 'onions': 573, 'packed': 591, 'weather': 955, 'egg': 283, 'long': 479, 'brown': 124, 'oh': 565, 'crowded': 228, 'yum': 997, 'expectations': 305, 'beat': 78, 'stone': 847, 'changes': 164, 'plan': 635, 'burger': 132, 'chow': 182, 'aioli': 30, 'requested': 709, 'beignets': 87, 'called': 144, 'eclectic': 282, 'parts': 601, 'windows': 971, 'thinly': 887, 'won': 976, 'rabbit': 685, 'grits': 378, 'rice': 720, 'pick': 624, 'organic': 584, 'exquisite': 310, 'wooden': 980, 'pastry': 607, 'views': 934, 'blew': 101, 'hungry': 415, 'vegetable': 927, 'bison': 97, 'sat': 749, 'terrace': 881, 'polite': 646, 'avocado': 63, 'rib': 718, 'buttery': 138, 'classy': 187, 'upscale': 919, 'wide': 967, 'saddle': 737, 'floor': 339, 'pairings': 595, 'greeted': 376, 'certain': 157, 'patio': 608, 'sugar': 855, 'instead': 431, 'pickled': 625, 'possible': 651, 'ago': 28, 'seeing': 773, 'ample': 37, 'slightly': 806, 'ndessert': 530, 'pig': 629, 'overly': 589, 'root': 729, 'additional': 23, 'round': 730, 'expensive': 307, 'xa9cor': 991, 'notch': 547, 'nand': 527, 'type': 916, 'old': 569, 'refreshing': 705, 'bowl': 113, 'read': 690, 'did': 251, 'bringing': 121, 'typical': 917, 'sort': 814, 'year': 994, 'neat': 531, 'juice': 440, 'whites': 966, 'russian': 733, 'second': 771, 'fun': 353, 'buds': 130, 'marc': 497, 'carved': 149, 'seats': 770, 'sliced': 804, 'inviting': 434, 'pink': 630, 'tea': 878, 'ordinary': 583, 'fare': 319, 'offer': 561, 'music': 524, 'summer': 858, 'helped': 
396, 'checking': 168, 'hamachi': 385, 'passion': 604, 'weeks': 959, 'sign': 792, 'waitress': 940, 'affordable': 26, 'ambience': 36, 'alcoholic': 32, 'pleasant': 640, 'eastern': 277, 'schnitzel': 762, 'europa': 298, 'total': 899, 'vodka': 937, 'russia': 732, '18': 6, 'borscht': 110, 'sour': 818, 'cabbage': 140, 'crepes': 225, 'car': 146, 'rusty': 735, 'playing': 639, 'deal': 238, '25': 8, 'harbord': 389, 'messis': 509, 'stroganoff': 850, 'tverskaya': 915, 'blinchiki': 102, 'dumplings': 273, 'buckwheat': 129, 'praga': 654, 'napoleon': 528, 'paid': 592, 'authentic': 62, 'las': 456, 'vegas': 926, 'goulash': 371, 'golubtsy': 365, 'kvas': 451, 'liquor': 468, 'license': 463, 'tabaka': 865, 'liked': 466, 'hope': 406, 'shashlik': 788, 'rolls': 726, 'alcohol': 31, 'com': 196, 'leaving': 459, 'question': 678, 'pairing': 594, 'sake': 739, 'yunaghi': 999, 'sashimi': 748, 'japanese': 438, 'busier': 133, '80': 12}
In [113]:
count_vect_5.inverse_transform(bag_words_5[0])
pd.options.display.max_columns = 999
df_5 = pd.DataFrame(data=bag_words_5.toarray(),columns=count_vect_5.get_feature_names())
In [114]:
# print out 10 most common words in our data
df_5.sum().sort_values()[-10:]
Out[114]:
like           78
amazing        80
service        96
delicious      98
restaurant    108
place         119
menu          122
good          130
great         176
food          226
dtype: int64
In [115]:
# print out 10 least common words in our data
df_5.sum().sort_values()[:10] # small sample size means the rarest words occur only a few times
Out[115]:
pig            3
carved         3
purchase       3
casual         3
provided       3
cauliflower    3
center         3
certain        3
certificate    3
puree          3
dtype: int64
In [128]:
# Convert the Series of review texts into a single string
from wordcloud import WordCloud
summary_text_5_array = np.array(summary_text_5)
summary_text_5_str = np.array2string(summary_text_5_array)

# Generate a word cloud image
wordcloud_5 = WordCloud(max_font_size=40,background_color='white',width=500, height=250,colormap = 'plasma').generate(summary_text_5_str)

# Display the generated image (wordcloud_5, built just above):
plt.imshow(wordcloud_5)
plt.axis("off")
Out[128]:
(-0.5, 499.5, 249.5, -0.5)

We can find many big positive words above. People are very generous with their compliments after a great experience.

Which words did people use most often when they left a 1-star review for a Modern European restaurant?

In [106]:
count_vect_1 = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english') 
summary_text_1 = df['text_without_stopwords'][(df.stars == 1) & df.categories.str.contains("Modern European', 'Restaurants")]
# Learn the vocabulary dictionary and return term-document matrix.
bag_words_1 = count_vect_1.fit_transform(summary_text_1)
In [107]:
count_vect_1.inverse_transform(bag_words_1[0])
pd.options.display.max_columns = 999
df_1 = pd.DataFrame(data=bag_words_1.toarray(),columns=count_vect_1.get_feature_names())
In [109]:
# print out 10 most common words in our data
df_1.sum().sort_values()[-10:]
Out[109]:
place          9
didn           9
left           9
like           9
waiter        14
minutes       15
good          15
service       19
restaurant    20
food          21
dtype: int64
In [110]:
# print out 10 least common words in our data
df_1.sum().sort_values()[:10] # small sample size means the rarest words occur only twice
Out[110]:
half       2
non        2
ni         2
nearly     2
museum     2
mistake    2
tiny       2
mention    2
maybe      2
treated    2
dtype: int64
In [126]:
# Convert the Series of review texts into a single string
summary_text_1_array = np.array(summary_text_1)
summary_text_1_str = np.array2string(summary_text_1_array)

# Generate a word cloud image
wordcloud = WordCloud(max_font_size=40,background_color='white',width=500, height=250,colormap = 'plasma').generate(summary_text_1_str)

# Display the generated image:
plt.imshow(wordcloud)
plt.axis("off")
Out[126]:
(-0.5, 499.5, 249.5, -0.5)

There are not as many negative words as I expected. That's cool. People are friendly. I would guess that the bigger words represent the aspects in which the restaurants did badly.
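Words such as "food", "service" and "restaurant" top both the 5-star and the 1-star lists, so raw counts alone do not separate the two groups. One rough way to surface what is specific to negative reviews is to compare relative frequencies between the groups. This is only a sketch on invented sentences; `relative_freq` is a hypothetical helper and the whitespace split stands in for the `CountVectorizer` tokenization.

```python
from collections import Counter

# Hypothetical, already-cleaned review texts.
toy_five = ["great food great service", "amazing food lovely staff"]
toy_one = ["waited forty minutes rude waiter", "cold food rude service"]

def relative_freq(docs):
    """Map each word to its share of all words in `docs`."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

freq_five = relative_freq(toy_five)
freq_one = relative_freq(toy_one)

# Positive difference: the word takes up a larger share of 1-star text than of 5-star text.
vocab = set(freq_five) | set(freq_one)
diff = {w: freq_one.get(w, 0.0) - freq_five.get(w, 0.0) for w in vocab}

most_negative = sorted(diff, key=diff.get, reverse=True)[:3]
print(most_negative)
```

Words shared by both groups, like "food", end up near zero or below, while words that mostly appear in complaints rise to the top.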

Exceptional Work

In [67]:
# Exceptional work
from os import path
from wordcloud import WordCloud
In [127]:
# Join the 5-star review texts into a single string
# (np.array2string would abbreviate an array this large with '...')
summary_text = df['text_without_stopwords']
summary_text_str = ' '.join(summary_text[df.stars == 5].astype(str))

# lower max_font_size
wordcloud = WordCloud(max_font_size=40,background_color='white',width=500, height=250,colormap = 'magma').generate(summary_text_str)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
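One thing to keep in mind when building the input string for a word cloud over all 5-star reviews: by default `np.array2string` abbreviates arrays with more than 1000 elements, keeping only the first and last few items, so most of the text never reaches the word cloud. The sketch below illustrates this on a made-up array and shows `' '.join(...)` as one alternative that keeps every review.

```python
import numpy as np

# 2000 hypothetical review texts, more than the default summarization threshold of 1000.
toy_texts = np.array(["review %d" % i for i in range(2000)], dtype=object)

# array2string summarizes large arrays: only the edge items survive, separated by '...'.
as_repr = np.array2string(toy_texts)

# Joining keeps every element.
as_joined = " ".join(toy_texts)

print(len(as_repr), len(as_joined))
print("review 1000" in as_repr, "review 1000" in as_joined)
```

For small subsets, such as the Modern European reviews above, the array stays under the threshold and nothing is dropped.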