
Amazon Review Scraper

Product reviews contain valuable information: they reflect word-of-mouth, consumers' preferences, and the voice of the customer (VOC). Continuously monitoring product reviews is therefore essential for manufacturers, who can not only react proactively to the VOC but also incorporate consumers' preferences and ideas into their product designs.

I would like to introduce my Amazon review scraper (the Python code is available upon request). It automatically scrapes all reviews of every cell phone listed in the "Cell Phones & Accessories" category on Amazon, and you can customize the list of cell phones to match your interests. Scraping all reviews of all cell phones can take more than a week.
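To give a flavor of the scraping step (the actual scraper is written in Python; this is only a rough R sketch using the rvest package), here is a minimal example. The URL, the ASIN, and the CSS selector are assumptions and will likely need adjusting to Amazon's current markup and anti-scraping measures:

library(rvest)

# Example product-review URL; the ASIN "B00EXAMPLE" and the ".review-text"
# selector are placeholders, not the actual scraper's values
url = "https://www.amazon.com/product-reviews/B00EXAMPLE/"
page = read_html(url)
reviews = html_text(html_nodes(page, ".review-text"), trim = TRUE)
writeLines(reviews, "review.txt")   # the same file is used in the wordcloud steps below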

We need further steps to extract meaningful information: the raw review text has to be quantified to retrieve its intrinsic meaning, context, and sentiment. We can easily score how positive or negative each review is (sentiment analysis); a list of positive and negative opinion words (sentiment words) for English was downloaded from here. We can also extract the hidden topics of each review and group them (topic modeling, e.g., latent Dirichlet allocation). The R package "topicmodels" can help you apply topic models to your text data.
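Neither step requires much code. Below is a minimal sketch of both ideas: a lexicon-based sentiment score that counts matches against the downloaded opinion word lists (the file names positive-words.txt and negative-words.txt are assumptions), and an LDA topic model fitted with the "topicmodels" package (k = 5 topics is an arbitrary choice):

library(tm)
library(topicmodels)

reviews = readLines("review.txt")                 # one review per line (assumption)

# --- Sentiment scoring against an opinion lexicon ---
pos_words = readLines("positive-words.txt")       # assumed format: one word per line
neg_words = readLines("negative-words.txt")

score_review = function(text) {
  words = unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))
  sum(words %in% pos_words) - sum(words %in% neg_words)
}
scores = sapply(reviews, score_review)            # >0 leans positive, <0 leans negative

# --- Topic modeling with latent Dirichlet allocation ---
corpus = Corpus(VectorSource(reviews))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
dtm = DocumentTermMatrix(corpus)
dtm = dtm[rowSums(as.matrix(dtm)) > 0, ]          # drop reviews left empty by cleaning
lda_model = LDA(dtm, k = 5, control = list(seed = 1234))
terms(lda_model, 10)                              # top 10 terms per topic
topics(lda_model)                                 # most likely topic for each review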

The above figure, called a "wordcloud", represents the importance of the words that appear in Amazon reviews of the iPhone 6: the more often a word appears, the larger it is drawn. Let's create this wordcloud figure from the Amazon reviews with R. I collected some Amazon reviews of the iPhone 6, and the R package "wordcloud" helps us visualize the most common words in the reviews and get a general feeling for them.

Step 0) Load Libraries

library(wordcloud)
library(tm)
library(SnowballC)
library(RColorBrewer)

Step 1) Create Corpus

You need to load the text into a so-called corpus, which is a collection of documents that the tm package can process.

temp_text = readLines("review.txt")

jeopCorpus = Corpus(VectorSource(temp_text))
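To verify that the corpus was built, you can check how many documents it contains and look at the first one:

length(jeopCorpus)      # number of documents (one per line of review.txt)
inspect(jeopCorpus[1])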

Step 2) Text Cleansing

We will convert each document in the corpus to a plain text document.

jeopCorpus = tm_map(jeopCorpus, PlainTextDocument)

Then, we will remove all punctuation and stopwords. Stopwords are commonly used words in the English language such as I, me, my, etc. You can see the full list of stopwords using stopwords('english').

jeopCorpus = tm_map(jeopCorpus, removePunctuation)
jeopCorpus = tm_map(jeopCorpus, removeWords, stopwords('english'))
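For example, the first few entries of the English stopword list are:

head(stopwords('english'))
# e.g. "i" "me" "my" "myself" "we" "our"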

Step 3) Stemming

We will perform stemming. This means that all the words are converted to their stem (Ex: learning -> learn, walked -> walk, etc.). This will ensure that different forms of the word are converted to the same form and plotted only once in the wordcloud.

jeopCorpus = tm_map(jeopCorpus, stemDocument)
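As a quick sanity check, you can stem a few words directly with SnowballC's wordStem(), which stemDocument() uses under the hood:

wordStem(c("learning", "walked", "phones"))
# returns "learn" "walk" "phone"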

Step 4) Plot wordcloud

Finally, the following command plots the wordcloud. Enjoy!

wordcloud(jeopCorpus, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8,"BrBG"))
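If you prefer to save the figure to a file rather than plot it on screen, you can wrap the same call in a graphics device (the file name is just an example):

png("iphone6_wordcloud.png", width = 800, height = 800)
wordcloud(jeopCorpus, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8,"BrBG"))
dev.off()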
