A couple of decades ago, we could only learn about a company after starting to work there, or from rumors. Neither is a great way. Let's use R to systematically harvest company reviews from Glassdoor and visualize them with R's excellent NLP and graphics packages.
Since I mainly use R, I went with Hadley Wickham's rvest package. First, we need to obtain a URL and figure out how to move between pages. It turns out that we can change the page we view simply by modifying a page number in the URL (much easier than I expected).
```r
##### Load Goodies #####
library(tidyverse)
library(rvest)
library(stringi)  # for stri_trans_general(), used below

##### Initial Setup #####
i   <- 1  # for the while loop and for changing the URL
max <- X  # number of pages to scrape

URL_Begin <- ('Begin')  # part of the URL before the page number
URL_End   <- ('END')    # part of the URL after the page number

##### Initial Data Frame #####
data <- data.frame(Posted_Date = c('xx'),
                   Summary     = c('xx'),
                   Pros        = c('xx'),
                   Cons        = c('xx'),
                   Title       = c('xx'))
```
We need a URL, which will be specific to each company on Glassdoor. The URL separates into three parts: the first part is unique to the company, the page number comes next, and the third part depends on the filters you set. For example, if you only want reviews from full-time employees, filter.employmentStatus=REGULAR will be included in the URL.
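For illustration, here is how the three parts might fit together. The company path and review ID below are made up; the real values come from your browser's address bar for the company you want to scrape:

```r
# Hypothetical example of the three URL parts (company path and ID are invented)
URL_Begin <- "https://www.glassdoor.com/Reviews/Example-Company-Reviews-E12345_P"
URL_End   <- ".htm?filter.employmentStatus=REGULAR"

# Gluing in the page number gives the URL for, say, page 3 of the reviews:
url <- paste(URL_Begin, 3, URL_End, sep = "")
```

Changing the number between the two fixed parts is all it takes to walk through the review pages.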
The next thing to do before scraping is to figure out the HTML or CSS selectors. I used SelectorGadget (creator's website). Since I am only interested in a few attributes, figuring them out was very simple. Now we are ready to scrape reviews with this code:
```r
while (i <= max) {
  # Read the current page
  save <- read_html(paste(URL_Begin, i, URL_End, sep = ""))

  Posted_Date <- save %>% html_nodes('div time') %>% html_text()              # Posted date
  Summary     <- save %>% html_nodes('.summary span') %>% html_text()         # Summary
  Pros        <- save %>% html_nodes('.pros') %>% html_text()                 # Pros
  Cons        <- save %>% html_nodes('.cons') %>% html_text()                 # Cons
  Title       <- save %>% html_nodes('#ReviewsFeed .hideHH') %>% html_text()  # Title

  # Combining data
  x    <- as.data.frame(cbind(Posted_Date, Summary, Pros, Cons, Title))
  data <- rbind(data, x)

  # Dealing with non-ASCII characters
  data$Summary <- stri_trans_general(data$Summary, "latin-ascii")
  data$Pros    <- stri_trans_general(data$Pros, "latin-ascii")
  data$Cons    <- stri_trans_general(data$Cons, "latin-ascii")
  data$Title   <- stri_trans_general(data$Title, "latin-ascii")

  # Adding lag time between requests
  lag <- runif(1, 60, 240)
  Sys.sleep(lag)

  # Advancing the counter
  i <- i + 1
}
```
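Once the loop finishes, the data frame still carries the placeholder 'xx' row from the initial setup, and `html_text()` often leaves stray whitespace around the text. A small cleanup sketch, using a made-up two-row stand-in for the scraped data frame:

```r
# Hypothetical stand-in for the scraped data frame, including the
# placeholder row used to initialize it (values are invented)
data <- data.frame(Posted_Date = c('xx', ' Jan 1, 2019 '),
                   Summary     = c('xx', ' Great place '),
                   Pros        = c('xx', ' Good pay '),
                   Cons        = c('xx', ' Long hours '),
                   Title       = c('xx', ' Current Employee '),
                   stringsAsFactors = FALSE)

# Drop the placeholder row and trim leftover whitespace in every column
data   <- data[data$Summary != 'xx', ]
data[] <- lapply(data, trimws)
```

Doing this once at the end is cheaper than cleaning inside the loop.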
I added a random lag between requests so the scraper wouldn't be too rough on the server, so it may take some time to collect all the reviews.
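As a small taste of the NLP step promised at the start, here is a sketch of counting the most frequent words in the Pros column with the tidytext and dplyr packages. The sample review texts are made up; with the real data you would feed in the scraped `Pros` column instead:

```r
library(dplyr)
library(tidytext)

# Invented review snippets standing in for the scraped Pros column
pros <- data.frame(Pros = c("Great pay and great benefits",
                            "Flexible hours, great coworkers"),
                   stringsAsFactors = FALSE)

word_counts <- pros %>%
  unnest_tokens(word, Pros) %>%          # one lowercase word per row
  anti_join(stop_words, by = "word") %>% # drop common stop words like "and"
  count(word, sort = TRUE)               # most frequent words first
```

The resulting counts feed straight into ggplot2 for a bar chart or word cloud.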