Savoring Insights: Unveiling Hidden Gems in Yelp Restaurant Reviews
Complete GitHub repository here: https://github.com/utkaxxh/YelpTips
Introduction:
In the vast landscape of online reviews, uncovering valuable insights can be like finding hidden treasures. Recently, I embarked on a project that delved into the world of text mining and data visualization, aiming to extract and present tips from restaurant reviews on Yelp. This enthralling journey involved web scraping in Python, the creation of a robust logic to filter diverse tips, and the visualization of these nuggets of wisdom. Join me as I share the steps and discoveries made in this flavorful exploration.
Harvesting the Bounty: Web Scraping with Python
The project kicked off with a quest for raw data, and what better source than the vast repository of Yelp restaurant reviews? Leveraging the power of Python, I engaged in web scraping to collect the raw HTML pages containing user-generated reviews. This initial step set the stage for the subsequent extraction of valuable tips embedded within the sea of textual information.
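For context, here is a minimal sketch of how such review pages could be fetched and saved. The URL pattern, the "start" pagination parameter, and the output layout are assumptions for illustration only, not the project's exact scraper (the full code lives in the GitHub repository).

# Hypothetical sketch: download successive review pages for one restaurant and save the raw HTML.
import os
import time
import requests

def save_review_pages(business_url, out_dir, pages=5, page_size=20):
    """Fetch review pages for one restaurant and store each as a raw HTML text file."""
    os.makedirs(out_dir, exist_ok=True)
    headers = {'User-Agent': 'Mozilla/5.0'}  # a browser-like header; bare requests are often blocked
    for i in range(pages):
        url = '{}?start={}'.format(business_url, i * page_size)  # assumed pagination scheme
        resp = requests.get(url, headers=headers)
        if resp.status_code != 200:
            print('Skipping page {} (HTTP {})'.format(i + 1, resp.status_code))
            continue
        with open(os.path.join(out_dir, 'reviewpage_{}.txt'.format(i + 1)), 'w') as f:
            f.write(resp.text)
        time.sleep(2)  # be polite between requests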
Crafting a Logic for Comprehensive Tips
The true beauty of this project lay in the development of a logic that went beyond the conventional. Instead of merely focusing on renowned cuisines, the goal was to extract tips covering a spectrum of aspects—ranging from the taste of the dishes to the availability of parking space and the overall ambiance of the restaurant.
The logic became a symphony of algorithms, carefully orchestrated to sift through the textual landscape and identify tips that transcended the ordinary. Ambiguous mentions and diverse perspectives were embraced, enriching the dataset with nuanced insights that reflected the multifaceted nature of user experiences.
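To make the idea concrete, here is a simplified, hypothetical sketch of aspect-based filtering: tag each review sentence with the aspects it mentions. The keyword sets below are illustrative stand-ins, not the project's actual lexicons (the real logic relies on the dishes.txt and List.txt word lists used later).

from nltk.tokenize import sent_tokenize

# Illustrative aspect keywords; the real lists would be far richer.
ASPECTS = {
    'taste': {'delicious', 'flavor', 'spicy', 'bland', 'tasty'},
    'parking': {'parking', 'valet', 'garage'},
    'ambiance': {'ambiance', 'ambience', 'decor', 'cozy', 'noisy', 'music'},
}

def tag_aspects(review_text):
    """Return {aspect: [sentences mentioning it]} for one review."""
    tips = {}
    for sentence in sent_tokenize(review_text):
        words = set(sentence.lower().split())
        for aspect, keywords in ASPECTS.items():
            if words & keywords:
                tips.setdefault(aspect, []).append(sentence)
    return tips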
Text Mining Unveils Hidden Gems
With a curated dataset in hand, the next chapter unfolded in the realm of text mining. Natural Language Processing (NLP) algorithms became the tools to unravel the sentiments and nuances embedded within the reviews. Through the magic of NLP, I decoded the language of users, transforming raw text into a treasure trove of actionable insights.
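As one illustration of this kind of NLP step, the snippet below scores sentences with NLTK's VADER sentiment analyzer. This is a common choice shown as a sketch, not necessarily the exact model used in the project.

# Assumed sketch: attach a compound sentiment score (-1 to 1) to each sentence.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def score_sentences(sentences):
    """Return (sentence, compound score) pairs using VADER."""
    analyzer = SentimentIntensityAnalyzer()  # requires nltk.download('vader_lexicon')
    return [(s, analyzer.polarity_scores(s)['compound']) for s in sentences]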
"""
@author: utkarsh
"""
# all the pacakages required for the code
from bs4 import BeautifulSoup
import os
import sys
##########################################################################################################################################
##########################################################################################################################################
'''
This function returns a dictionary with each restro folder name as its key and the scraped HTML review pages as its values,
e.g. {'FishmarketRestaurant.txt': [reviewpage_1.txt, reviewpage_2.txt, reviewpage_3.txt, reviewpage_4.txt, reviewpage_5.txt, reviewpage_6.txt, reviewpage_7.txt,
reviewpage_8.txt, reviewpage_9.txt, reviewpage_12.txt, reviewpage_13.txt, reviewpage_14.txt, reviewpage_15.txt, reviewpage_16.txt,
reviewpage_17.txt, reviewpage_18.txt, reviewpage_19.txt]}
'''
def readRestaurant(inputdir):
    restro = {}
    for restroName in os.listdir(inputdir):
        if restroName == '.DS_Store':
            continue
        else:
            folder = os.path.join(inputdir, restroName)
            for reviews in os.listdir(folder):
                restro.setdefault(restroName, []).append(str(os.path.join(folder, reviews)))
    return restro
##########################################################################################################################################
##########################################################################################################################################
'''
The function accepts input and output directory paths as parameters and creates one text file per restro containing all the extracted reviews for that restro
'''
def extractRestaurant(inputpath, outputpath):
    restro = readRestaurant(inputpath)
    reviews_counter = 0
    for k, v in restro.items():
        # removing all the spaces, if any, in the file name
        k = k.replace(" ", '')
        filename = outputpath + '/' + str(k) + '.txt'
        try:
            restro_name = open(filename, 'a')
        except IOError:
            print("Could not open file:", filename)
            sys.exit()
        for files in v:
            try:
                html = open(files, 'r')
            except IOError:
                print("Could not read file:", files)
                sys.exit()
            # the saved pages are HTML, so parse them with the HTML parser
            soup = BeautifulSoup(html, 'html.parser')
            reviews = soup.findAll('div', {'class': 'review-content'})
            for review in reviews:
                review_content = review.find('p', {'lang': 'en'})
                if review_content is None:
                    continue
                data = review_content.text
                restro_name.write(data)
                restro_name.write('\n')
                reviews_counter = reviews_counter + 1
            html.close()
        restro_name.close()
        print('Reviews successfully written to file ' + k)
    print("Total reviews extracted: " + str(reviews_counter))

if __name__ == "__main__":
    base_input_path = '/Users/utkarsh/Desktop/Project/Final_model/Data/Chinese'
    base_output_path = '/Users/utkarsh/Desktop/Project/Final_model/Data/test'
    extractRestaurant(base_input_path, base_output_path)
Data Visualization: From Words to Pictures
To make these insights accessible and engaging, the journey culminated in the realm of data visualization. The power of visualization turned the extracted tips into a captivating narrative. Through charts and graphs, the patterns of user preferences, the distribution of positive and negative sentiments, and the prominence of different aspects in reviews were brought to life.
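As a hedged example of the kind of chart involved, the sketch below draws a simple bar plot of how often each aspect appears among the extracted tips using matplotlib; the numbers in the usage comment are placeholders, not results from the project.

import matplotlib.pyplot as plt

def plot_aspect_counts(aspect_counts):
    """Bar chart of how many tips mention each aspect."""
    aspects = list(aspect_counts.keys())
    counts = [aspect_counts[a] for a in aspects]
    plt.figure(figsize=(6, 4))
    plt.bar(aspects, counts, color='steelblue')
    plt.ylabel('Number of tips')
    plt.title('Aspect prominence in Yelp reviews')
    plt.tight_layout()
    plt.show()

# Example usage with placeholder numbers:
# plot_aspect_counts({'taste': 120, 'parking': 18, 'ambiance': 45})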
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import collections
import re
from nltk.corpus import stopwords
import sys
import time
## this function will return the unique sentences in the review file for a given restaurant
#######################################################################################
def UniqueReviews(filepath):
    # read the input
    try:
        f = open(filepath)
    except IOError:
        print("Could not read file:", filepath)
        sys.exit()
    text = f.read().strip()
    f.close()
    # split sentences
    sentences = sent_tokenize(text)
    adj_sentence = set()
    # keep only the unique sentences in the reviews
    counter = 0
    for sentence in sentences:
        counter += 1
        adj_sentence.add(sentence)
    adj_sentence = list(adj_sentence)
    print("The following tips are generated from {} unique sentences".format(len(adj_sentence)))
    return adj_sentence
#######################################################################################
def tokenize(string):
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in re.findall(r'\w+', string.lower()) if word not in stop_words]
    return filtered_words
#######################################################################################
def count_ngrams(lines, min_length=2, max_length=4):
    lengths = range(min_length, max_length + 1)
    ngrams = {length: collections.Counter() for length in lengths}
    queue = collections.deque(maxlen=max_length)

    # Helper function to add n-grams at start of current queue to dict
    def add_queue():
        current = tuple(queue)
        for length in lengths:
            if len(current) >= length:
                ngrams[length][current[:length]] += 1

    # Loop through all lines and words and add n-grams to dict
    for line in lines:
        for word in tokenize(line):
            queue.append(word)
            if len(queue) >= max_length:
                add_queue()

    # Make sure we get the n-grams at the tail end of the queue
    while len(queue) > min_length:
        queue.popleft()
        add_queue()

    return ngrams
#######################################################################################
def loadwords(fname):
    newLex = set()
    try:
        lex_conn = open(fname)
    except IOError:
        print("Could not read file:", fname)
        sys.exit()
    for line in lex_conn:
        line = line.lower()
        newLex.add(line.strip())
    lex_conn.close()
    return newLex
#######################################################################################
def list_of_ngrams(ngrams, num=20):
    """Return the num most common n-grams of each length as a flat list of strings."""
    grams_set = set()
    for n in sorted(ngrams):
        for gram, count in ngrams[n].most_common(num):
            grams_set.add(' '.join(gram))
    return list(grams_set)
#######################################################################################
def print_most_frequent(ngrams, num=20):
    """Print num most common n-grams of each length in the ngrams dict."""
    print("\nLogic behind our tips prediction:-")
    for n in sorted(ngrams):
        print('----- {} most common {}-grams -----'.format(num, n))
        for gram, count in ngrams[n].most_common(num):
            print('{0}: {1}'.format(' '.join(gram), count))
        print('')
#######################################################################################
def run(path, filename):
    dishes = loadwords('/Users/utkarsh/Desktop/BIA_project/dishes.txt')
    recommender_list = loadwords('/Users/utkarsh/Desktop/BIA_project/List.txt')
    unique_sentences = UniqueReviews(path)
    time.sleep(1)
    print('\nExtracted from the restaurant: {}'.format(filename))
    ngrams = count_ngrams(unique_sentences)
    print_most_frequent(ngrams)
    time.sleep(4)
    ngrams_list = list_of_ngrams(ngrams)
    # map each frequent n-gram to the review sentences that mention it
    ngrams_sent = dict()
    for grams in ngrams_list:
        for sentence in unique_sentences:
            if grams in sentence:
                ngrams_sent.setdefault(grams, []).append(str(sentence))
    for grams, sentences in ngrams_sent.items():
        if grams in dishes:
            print("\nYou can try dish: {}".format(grams))
            print("------------------------------------------------")
            max_sentence = max(len(s) for s in sentences)
            count = 0
            for sentence in sentences:
                count += 1
                words = word_tokenize(sentence)
                for word in words:
                    if len(sentence) == max_sentence or word in recommender_list:
                        sentence = sentence.replace(grams, '**' + str(grams) + '**')
                        print("Sample review : {}".format(sentence))
                        break  # show each matching sentence only once
            print("Also, the tip appears in another {} reviews.".format(count))

dirpath = '/Users/utkarsh/Desktop/BIA_project/Restaurants/'
filename = '2.TheMasalaWala.txt'
path = dirpath + filename
run(path, filename)
Discoveries and Reflections:
As the project reached its zenith, the insights uncovered were not just about restaurants; they mirrored the diverse perspectives and expectations of users. The project became a testament to the richness of data mining and visualization, demonstrating the transformative power of turning raw information into a meaningful story.
Conclusion:
In the world of Yelp restaurant reviews, where opinions abound, this project stood out as a testament to the art of extraction and interpretation. The combination of web scraping, logic crafting, text mining, and data visualization transformed a sea of words into a mosaic of insights. Through this project, I not only explored the depths of data exploration but also savored the satisfaction of uncovering hidden gems within the digital tapestry of user experiences on Yelp.