Data Science

A New Way to “See” Reviews with Word Clouds

Information overload is a very real phenomenon as digital and information technology progresses. While companies relish harvesting personal data for increasingly effective advertising, the individual can rarely make use of the same amount of information. Consider the simple act of choosing a restaurant. Going on Yelp or any other review site is a good start to narrowing things down, but how exactly does one choose?


With a service like Yelp, there are currently two primary ways of assessing a restaurant. First, there is the star rating, a numeric score between 1 and 5 in increments of a tenth. For some, this is a useful enough indicator of whether to visit the venue. Yet the star rating is usually followed by the total number of reviews, which complicates things: a restaurant rated 4.1 stars with over 300 reviews is not necessarily worse than a restaurant rated 4.9 stars with only 50 reviews.

Then there are the reviews themselves, which present a slew of problems of their own. While we tend to value reviews highly for the insight they give, not all are created equal, and they all have their biases. Reading reviews is not a quick process, and, as human nature dictates, we often scan for the negative reviews while paying far less attention to the many more positive ones. Depending on the person, the entire process may even take longer than actually going to the restaurant; we all know someone like this.

How Reading Hundreds of Reviews Can Feel

Therefore, this project offers a simple but novel way for users to assess businesses. Using word clouds, users can easily grasp a summary of all reviews for a location. This approach offers a deeper look into what a place has to offer than a star rating, without bogging users down in an endless torrent of reviews. Using parameters within the wordcloud library, the size of each word can also be scaled to its frequency, so a few bad reviews will not overshadow an abundance of positive ones.
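To make the idea of frequency scaling concrete, here is a minimal sketch using only the Python standard library. It is not how the wordcloud library computes sizes internally; the helper name `scale_font_sizes`, the linear mapping, and the `min_size`/`max_size` bounds are all assumptions made for illustration, and the sample words are invented.

```python
from collections import Counter


def scale_font_sizes(words, min_size=10, max_size=60):
    """Map each word to a font size that grows linearly with its count (hypothetical helper)."""
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    low = min(counts.values())
    span = top - low
    sizes = {}
    for word, count in counts.items():
        # When every word is equally frequent, give them all the largest size.
        fraction = 1.0 if span == 0 else (count - low) / span
        sizes[word] = round(min_size + fraction * (max_size - min_size))
    return sizes


# Invented sample: many positive words and a single negative one.
sample = ["great"] * 8 + ["fresh"] * 4 + ["slow"]
sizes = scale_font_sizes(sample)
for word, size in sorted(sizes.items(), key=lambda pair: -pair[1]):
    print(word, size)
```

In this toy input the lone complaint ("slow") ends up at the minimum size, while the most repeated praise dominates, which is the effect described above.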

Foursquare API

This project was done with a free personal developer account with Foursquare, a review site similar to Yelp or Google Maps. Foursquare offers a good number of daily calls but, without a paid account, limits the number of tips (a.k.a. reviews) returned for each venue to only two. Therefore, while the script shown is intended to work with an enterprise account, the word clouds generated below were created from manually scraped tips.

import matplotlib.pyplot as plt
import requests
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

# The NLTK tokenizer, stopword, and WordNet data must be downloaded once with nltk.download()


def wordcloud_tips(client_id, client_secret, version, venue_id, fig_length=10, fig_height=10):
    """
    Create a word cloud from user tips (reviews) from Foursquare
    based on frequency and ranking after tokenizing and cleaning with
    NLTK

    Arguments:
        client_id -string- API client ID
        client_secret -string- API client secret
        version -string- client version
        venue_id -string- unique identification code from API; also embedded in URL on Foursquare site
        fig_length -int- length of the output word cloud
        fig_height -int- height of word cloud

    Returns:
        None; displays a plot of the WordCloud object
    """
    # Create the URL (offset=0 starts from the first tip)
    reviews_url = 'https://api.foursquare.com/v2/venues/{}/tips?client_id={}&client_secret={}&v={}&limit=100&offset=0'.format(
        venue_id,
        client_id,
        client_secret,
        version)

    # Make the API request
    response = requests.get(reviews_url).json()

    # Create a list of all tip text
    reviews = []

    for item in response['response']['tips']['items']:
        reviews.append(item['text'])

    # Join the tips into one string for the tokenizer
    reviews_text = ' '.join(reviews)

    # Tokenize
    tokens = word_tokenize(reviews_text)

    # Remove punctuation
    tokens = [word for word in tokens if word.isalpha()]

    # Lowercase
    tokens = [word.lower() for word in tokens]

    # Remove stopwords (build the set once instead of once per token)
    stop_words = set(stopwords.words("english"))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize, first as verbs and then as nouns
    lemma = WordNetLemmatizer()
    tokens = [lemma.lemmatize(word, pos="v") for word in tokens]
    tokens = [lemma.lemmatize(word, pos="n") for word in tokens]

    # Rejoin tokens into one string
    final_text = ' '.join(tokens)

    # Create and generate a word cloud image
    wordcloud = WordCloud(background_color='white', width=800, height=400).generate(final_text)

    # Display the generated image
    plt.figure(figsize=(fig_length, fig_height))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
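
The request in the script fetches a single page of tips. For venues with more tips than one call returns, the same `limit` and `offset` query parameters could be used to page through them. The sketch below only builds the request URLs and makes no network calls; the helper name `tip_page_urls`, the `total_tips` argument, and the placeholder credentials and version string are assumptions for illustration, and whether a given account tier allows this many tips is not something the sketch can verify.

```python
from urllib.parse import urlencode

TIPS_ENDPOINT = "https://api.foursquare.com/v2/venues/{}/tips"


def tip_page_urls(venue_id, client_id, client_secret, version, total_tips, page_size=100):
    """Build one request URL per page of tips (hypothetical helper)."""
    urls = []
    for offset in range(0, total_tips, page_size):
        query = urlencode({
            "client_id": client_id,
            "client_secret": client_secret,
            "v": version,
            "limit": page_size,
            "offset": offset,
        })
        urls.append(TIPS_ENDPOINT.format(venue_id) + "?" + query)
    return urls


# Placeholder values only; 250 tips means three pages of up to 100 each.
urls = tip_page_urls("VENUE_ID", "CLIENT_ID", "CLIENT_SECRET", "20200101", total_tips=250)
for url in urls:
    print(url)
```

Each URL could then be passed to `requests.get` and the resulting `items` lists concatenated before tokenizing.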

Using manual scraping (i.e., copy and paste), all reviews were compiled for two restaurants: Brothers Taco and Tout Suite, both highly rated places in Downtown Houston. Both have a rating of 8.9/10 at the time of writing, but without reading the actual reviews and risking getting sucked into them, it would be hard to infer anything else about these places.
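
For tips gathered by hand, the API request can be skipped and the pasted text cleaned directly. The following is a rough, standard-library-only sketch of that idea, not the exact pipeline used for the images in this post: a regular expression and a tiny stopword set stand in for NLTK's `word_tokenize` and English stopword list, lemmatization is omitted, the helper name `tip_frequencies` is hypothetical, and the sample tips are invented rather than real reviews.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; NLTK's English list is far longer.
STOPWORDS = {"the", "a", "an", "and", "is", "are", "was", "it", "to", "of", "for", "in", "so"}


def tip_frequencies(tips):
    """Count cleaned words across a list of pasted tip strings (hypothetical helper)."""
    text = " ".join(tips).lower()
    # Keep alphabetic runs only, which drops punctuation and digits.
    tokens = re.findall(r"[a-z]+", text)
    return Counter(token for token in tokens if token not in STOPWORDS)


# Invented sample tips, not real reviews.
pasted_tips = [
    "The red salsa is so fresh!",
    "Fresh tortillas and great salsa.",
    "Great breakfast tacos, great prices.",
]
frequencies = tip_frequencies(pasted_tips)
print(frequencies.most_common(3))
```

A mapping like `frequencies` can be handed to `WordCloud.generate_from_frequencies` instead of passing a joined string to `generate`.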

Instead, using Natural Language Processing and Word Cloud, the following comes up for Brothers Taco:

And Tout Suite:

While still verbal in nature, word clouds have a visual aspect that allows users to get an instant notion of how other customers felt about the venue. Adjectives like delicious, great, homemade, and fresh paint a persuasive picture that encourages users to try the venue. Meanwhile, popular items like macaroons and red salsa are easily visible as recommendations.

Furthermore, this method adds no bias of its own, since the word clouds are generated straight from user reviews. As a result, they can be continually updated over time to reflect the feelings of the populace. Whereas a rating becomes less informative over time as thousands of users rate a venue and converge on a consensus about general quality, word clouds can change based on seasonal items or even decor!

As a middle ground, word clouds are not meant to replace star ratings or reviews but to augment them. Those who wish to know more about a business without having to dig through reviews can get a more holistic sense with a word cloud. Even regular review readers can use word clouds as a launch pad for finding pertinent reviews about a specific matter or dish.

Future Considerations

  • Use bigrams to provide more context on nouns
  • Allow specific word clouds for individual categories like service, parking, etc.
  • Implement sentiment analysis to further filter out ambiguous words
  • Circumvent free account restrictions with other web scraping approaches
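
As one possible starting point for the bigram idea, the sketch below counts adjacent word pairs with the standard library. The helper name `bigram_counts` and the sample tokens are invented for illustration; in practice the input would be the cleaned tokens from the script, and pairs that straddle two different tips would need to be filtered out.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs, joined with a space (hypothetical helper)."""
    pairs = zip(tokens, tokens[1:])
    return Counter(" ".join(pair) for pair in pairs)


# Invented tokens standing in for the cleaned output of the pipeline.
tokens = ["red", "salsa", "great", "red", "salsa", "fresh", "tortilla"]
counts = bigram_counts(tokens)
print(counts.most_common(2))
```

Because the keys are plain strings such as "red salsa", the counts could be passed to `WordCloud.generate_from_frequencies` so that each pair is drawn as a single phrase.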
