Google

Let’s analyze 3,320 reviews from Googlers.

We have seen what people said about Airbnb (link) and Microsoft (link and link). Now it is Google's turn. I scraped a total of 3,320 reviews from 2015 to 2017. So, let’s take a look at what Googlers said. But as usual, we need to load the goodies.

##### Load Goodies #####
library(tidyverse)
library(stringr)
library(tidytext)
library(ggraph)
library(igraph)

##### Create Theme for GGPLOT2 #####
theme_moma <- function(base_size = 12, base_family = "Helvetica") {
  theme(
    plot.background = element_rect(fill = "#F7F6ED"),
    legend.key = element_rect(fill = "#F7F6ED"),
    legend.background = element_rect(fill = "#F7F6ED"),
    panel.background = element_rect(fill = "#F7F6ED"),
    panel.border = element_rect(colour = "black", fill = NA, linetype = "dashed"),
    panel.grid.minor = element_line(colour = "#7F7F7F", linetype = "dotted"),
    panel.grid.major = element_line(colour = "#7F7F7F", linetype = "dotted")
  )
}

I used ‘xx’ as a placeholder first observation in the initial data frame, and the data frame also carries some columns we don’t need, so we exclude both first.

##### Processing #####
data2 <- data[2:6]
data2 <- data2 %>%
  filter(Summary != "xx")

Next is date conversion.

##### Date Management #####
data2$Posted_Date <- as.Date(data2$Posted_Date, format = " %b %d, %Y")
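
To see what that format string does, here is a minimal sketch on a made-up date string. The sample value is only my assumption about how the scraped dates look, and `%b` matches abbreviated month names in the current locale, so this assumes an English locale.

```r
# Hypothetical sample value; the real Posted_Date strings may differ
sample_date <- " Dec 31, 2017"

# %b = abbreviated month name, %d = day of month, %Y = four-digit year
parsed <- as.Date(sample_date, format = " %b %d, %Y")
format(parsed, "%Y-%m-%d")

# A string that does not match the format becomes NA rather than an error
unmatched <- as.Date("2017-12-31", format = " %b %d, %Y")
is.na(unmatched)
```

Checking for NA after the conversion is a cheap way to spot rows whose date did not follow the expected pattern.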

For Google, I didn’t apply any filter in Glassdoor, so there are both domestic (US) and international reviews. Because Glassdoor packs employment status, title, and location into the Title column, we need to split them.

##### Separating Employee Type and Locations #####
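# Split at " in " to peel off the location, then at "ee - " to separate
# the employment status from the job title
data2 <- data2 %>%
  separate(col = Title, into = c("Employee_Type", "Location"), sep = ' in ') %>%
  separate(col = Employee_Type, into = c("Employee_Type", "Title"), sep = 'ee - ')

# Splitting at "ee - " leaves "Employ" behind, so restore the full word
data2$Employee_Type <- str_replace_all(data2$Employee_Type, "Employ", "Employee")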
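
Here is a small sketch of the same two-step split on made-up Title strings. The strings are my assumption of what the column looks like, and `fill = "right"` is added only so the row without a location yields NA quietly.

```r
library(tidyr)
library(stringr)

# Made-up examples of the combined Title column
toy <- data.frame(
  Title = c("Current Employee - Software Engineer in Mountain View, CA",
            "Former Employee - Anonymous Employee in London, England (UK)",
            "Current Employee - Anonymous Employee"),
  stringsAsFactors = FALSE
)

# Step 1: everything after " in " is the location (NA when it is missing)
step1 <- separate(toy, col = Title, into = c("Employee_Type", "Location"),
                  sep = " in ", fill = "right")

# Step 2: split the employment status from the job title
step2 <- separate(step1, col = Employee_Type,
                  into = c("Employee_Type", "Title"), sep = "ee - ")

# Step 3: put back the "ee" that the separator consumed
step2$Employee_Type <- str_replace_all(step2$Employee_Type, "Employ", "Employee")
step2
```

The third row shows where the NA locations come from: a reviewer who gave no location simply has no " in " part to split on.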

Next, we classify the reviews as Domestic or International. It’s very straightforward, as an international review has a country abbreviation in parentheses. So, we can just use “(” to distinguish between the two.

##### US or International #####
data2 <- data2 %>%
  mutate(Location_2 = ifelse(str_detect(Location, "[(]")==TRUE,"International","Domestic"))
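
A quick sketch of that rule on three made-up locations, including a missing one:

```r
library(stringr)

# Made-up locations: one US, one international, one not disclosed
loc <- c("Mountain View, CA", "London, England (UK)", NA)

# str_detect() returns NA for a missing location, and ifelse() passes it on
location_2 <- ifelse(str_detect(loc, "[(]"), "International", "Domestic")
location_2
```

Reviews without a location stay NA instead of being counted as Domestic, which is why NA gets its own bar in the chart.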

Let’s visualize.

##### Visualize - 1 #####
ggplot(data2, aes(x = Location_2)) + 
  geom_bar() + 
  theme_moma() +
  geom_text(stat='count',aes(label=..count..), vjust = -0.5)
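
A side note if you run this on a recent ggplot2: from version 3.3.0 onward, `after_stat(count)` is the newer spelling of `..count..`. A minimal sketch on toy data (the values are made up):

```r
library(ggplot2)

# Toy stand-in for data2$Location_2
toy <- data.frame(Location_2 = c("Domestic", "Domestic", "International", NA))

p <- ggplot(toy, aes(x = Location_2)) +
  geom_bar() +
  # after_stat(count) refers to the count computed by stat = "count"
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5)

p
```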

Well, why wouldn’t Googlers want to disclose where they work? 😐 I doubt anyone is going to chase them. Oh well, since NA makes up almost half of the reviews, let’s just ignore the location of the reviews, for now anyway.

So, it’s time to process their comments.

##### Creating New Variables for Processing #####
data2<- data2 %>%
  mutate(Summary_2 = Summary,
         Pros_2 = Pros,
         Cons_2 = Cons)

##### Change to lower #####
data2$Summary_2 <- str_to_lower(data2$Summary_2)
data2$Pros_2 <- str_to_lower(data2$Pros_2)
data2$Cons_2 <- str_to_lower(data2$Cons_2)

##### Remove Punct #####
data2$Summary_2 <- str_replace_all(data2$Summary_2, "[:punct:]","")
data2$Pros_2 <- str_replace_all(data2$Pros_2, "[:punct:]","")
data2$Cons_2 <- str_replace_all(data2$Cons_2, "[:punct:]","")

##### Remove Numbers #####
data2$Summary_2 <- str_replace_all(data2$Summary_2, "[:digit:]","")
data2$Pros_2 <- str_replace_all(data2$Pros_2, "[:digit:]","")
data2$Cons_2 <- str_replace_all(data2$Cons_2, "[:digit:]","")
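
The three cleaning steps can also be wrapped in one small function. This is only a sketch: `clean_text` is a hypothetical helper, not part of the script above, and the character classes are written with double brackets here.

```r
library(stringr)

# Hypothetical helper: lower-case, then strip punctuation, then strip digits
clean_text <- function(x) {
  x <- str_to_lower(x)
  x <- str_replace_all(x, "[[:punct:]]", "")
  str_replace_all(x, "[[:digit:]]", "")
}

# Made-up review snippet
cleaned <- clean_text("Great Work-Life Balance, 10/10!")
cleaned
```

Notice that punctuation is deleted rather than replaced with a space, so "Work-Life" comes out as "worklife". That matters for the next step.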

Next, reviewers write work/life balance in many different ways. We need to normalize every variation to the same word: ‘wlb.’

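##### Work Life Balance & Other Words #####
# Punctuation was stripped above, so "work-life balance" now reads
# "worklife balance"; it is listed before the bare "worklife" so the
# longer phrase is replaced first.
worklife <- c("work life balance", "work-life balance",
              "work/life balance", "worklife balance",
              "work life", "work&life", "worklife")

# Only the cleaned text columns need the replacement
text_cols <- c("Summary_2", "Pros_2", "Cons_2")

for (pattern in worklife) {
  for (col in text_cols) {
    data2[[col]] <- str_replace_all(data2[[col]], fixed(pattern), "wlb")
  }
}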
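
Because punctuation was already removed at this point, variants such as "work-life balance" arrive here as "worklife balance". As an alternative to looping over a list of phrases, a single regular expression can cover the remaining spellings. This is a sketch on made-up snippets, not what the script above does:

```r
library(stringr)

# Made-up snippets, already lower-cased and stripped of punctuation
x <- c("good work life balance", "worklife balance is poor",
       "no work life", "worklife")

# "work", an optional space, "life", optionally followed by " balance"
normalized <- str_replace_all(x, "work ?life( balance)?", "wlb")
normalized
```

The optional " balance" group is what keeps the result from ending up as "wlb balance".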

Alright, let’s count the titles.

##### Title #####
data2 %>%
  group_by(Title) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# A tibble: 590 x 2
                      Title count
                      <chr> <int>
 1       Anonymous Employee  1843
 2        Software Engineer   233
 3 Senior Software Engineer    58
 4          Product Manager    37
 5          Program Manager    34
 6          Account Manager    29
 7       Account Strategist    26
 8  Staff Software Engineer    26
 9    Software Engineer III    21
10                   Intern    20
# ... with 580 more rows

There are 3,321 observations, and Anonymous Employee accounts for 55.5% of them. Well, I’ll just ignore the review distribution by title then. Seriously, what are these Googlers afraid of? 😐
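
As a quick check on that share, here is the arithmetic using the two numbers quoted above:

```r
anonymous <- 1843   # top row of the title table
total     <- 3321   # number of observations

anonymous_pct <- round(100 * anonymous / total, 1)
anonymous_pct
```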

Okay, then let’s move to creating a bigram chart.

#### Overall - Pros #####
#Step 1: Unnest
data_pros <- data2 %>% select(Pros_2) %>% 
  unnest_tokens(words, Pros_2, token = 'ngrams',n = 2)

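#Step 2: Separate
# sep belongs to separate(), not inside c(); with it in the right place
# no stray third column is created, so select(1:2) is no longer needed
data_pros_split <- data_pros %>%
  separate(words, c("from", "to"), sep = " ")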

#Step 3: Remove stopwords
data_pros_clean <- data_pros_split %>%
  filter(!from %in% stop_words$word) %>%
  filter(!to %in% stop_words$word)

#Step 4: Count
data_pros_counts <- data_pros_clean %>% 
  count(from, to)

pros_bigram <- data_pros_counts %>%
  filter(n > 15) %>%
  graph_from_data_frame()

arrow_control <- grid::arrow(type = "closed", length = unit(.15, "inches"))
Pros_chart <- ggraph(pros_bigram) +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = arrow_control) +
  geom_node_point(color = "lightgreen", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() + theme( plot.background = element_rect(fill = "#F7F6ED")) +
  ggtitle("Pros")

Pros_chart
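
If the four steps feel abstract, here is a minimal sketch of the same pipeline on two made-up sentences. The tiny stop word list `toy_stop` is a stand-in I made up to keep the example predictable; the real script uses `stop_words` from tidytext.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(igraph)

# Two made-up "reviews" and a made-up stop word list
toy <- tibble(text = c("smart people and free food",
                       "free food and smart people"))
toy_stop <- c("and", "the", "of")

toy_counts <- toy %>%
  unnest_tokens(words, text, token = "ngrams", n = 2) %>%  # overlapping word pairs
  separate(words, c("from", "to"), sep = " ") %>%          # one column per word
  filter(!from %in% toy_stop, !to %in% toy_stop) %>%       # drop pairs with a stop word
  count(from, to)                                          # frequency of each pair

toy_counts

# The first two columns become the edges; n is kept as an edge attribute
toy_graph <- graph_from_data_frame(toy_counts)
vcount(toy_graph)
ecount(toy_graph)
```

Each surviving bigram is one arrow in the chart, pointing from its first word to its second.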

Isn’t that almost the same as Microsoft’s? Smart colleagues, competitive pay. But there is apparently one thing that Microsoft lacks… FREE FOOD!

Let’s move to Cons.

#### Overall - Cons #####
#Step 1: Unnest
data_cons <- data2 %>% select(Cons_2) %>% 
  unnest_tokens(words, Cons_2, token = 'ngrams',n = 2)

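#Step 2: Separate
# sep belongs to separate(), not inside c(); with it in the right place
# no stray third column is created, so select(1:2) is no longer needed
data_cons_split <- data_cons %>%
  separate(words, c("from", "to"), sep = " ")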

#Step 3: Remove stopwords
data_cons_clean <- data_cons_split %>%
  filter(!from %in% stop_words$word) %>%
  filter(!to %in% stop_words$word)

#Step 4: Count
data_cons_counts <- data_cons_clean %>% 
  count(from, to)

cons_bigram <- data_cons_counts %>%
  filter(n > 10) %>%
  graph_from_data_frame()

arrow_control <- grid::arrow(type = "closed", length = unit(.15, "inches"))
cons_chart <- ggraph(cons_bigram) +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = arrow_control) +
  geom_node_point(color = "lightgreen", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void() + theme( plot.background = element_rect(fill = "#F7F6ED")) +
  ggtitle("Cons")

cons_chart
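
Since the Pros and Cons pipelines differ only in the column and the threshold, the counting part could be folded into one function. This is a sketch only: `bigram_counts` and its arguments are hypothetical names, and the extra `is.na()` filter is my addition for reviews too short to form a bigram.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical helper, not part of the original script
bigram_counts <- function(txt, min_n, stop_list = stop_words$word) {
  tibble(text = txt) %>%
    unnest_tokens(words, text, token = "ngrams", n = 2) %>%
    separate(words, c("from", "to"), sep = " ") %>%
    filter(!is.na(from), !is.na(to)) %>%                  # skip texts with no bigram
    filter(!from %in% stop_list, !to %in% stop_list) %>%  # drop stop word pairs
    count(from, to) %>%
    filter(n > min_n)                                     # keep frequent pairs only
}

# Made-up input with a one-word stop list; with the real data the calls
# would be bigram_counts(data2$Pros_2, 15) and bigram_counts(data2$Cons_2, 10)
toy_result <- bigram_counts(c("free food", "free food", "the food"),
                            min_n = 1, stop_list = c("the"))
toy_result
```

The result can go straight into `graph_from_data_frame()` as before.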

Yeah, that looks familiar: management, review process, politics, and growth. But one interesting comment is “reallyno cons.” Wow. Even Microsoft doesn’t have that compliment. Mountain View also showed up; I’d think that’s because the commute can be brutal for those living in San Francisco.

If you scroll back up to the Pros chart, you’ll see that “wlb” shows up in both Pros and Cons. Hm, well, let’s take a look at those who complain about wlb. Could the lack of work-life balance come from a specific title?

##### wlb in Cons #####
data2 %>%
  filter(str_detect(Cons_2,'wlb') == TRUE) %>%
  group_by(Title) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# A tibble: 43 x 2
                                 Title count
                                 <chr> <int>
 1                  Anonymous Employee    71
 2                   Software Engineer    11
 3                      Administrative     2
 4                             Analyst     2
 5        Associate Account Strategist     2
 6 Associate Product Marketing Manager     2
 7                   Contracts Manager     2
 8                      Senior Analyst     2
 9            Senior Software Engineer     2
10               Software Engineer III     2
# ... with 33 more rows
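
Raw counts mostly mirror how many reviews each title has, so a rough way to compare titles is the share of their reviews that mention wlb in Cons. Here is a minimal sketch that uses only the counts printed in the two tables above, for the titles that appear in both. With counts this small, treat the percentages as a sanity check rather than a finding.

```r
# Counts copied from the title table and the wlb-in-Cons table
wlb_share <- data.frame(
  Title    = c("Anonymous Employee", "Software Engineer",
               "Senior Software Engineer", "Software Engineer III"),
  reviews  = c(1843, 233, 58, 21),
  wlb_cons = c(71, 11, 2, 2)
)

# Percentage of each title's reviews that mention wlb in Cons
wlb_share$pct <- round(100 * wlb_share$wlb_cons / wlb_share$reviews, 1)
wlb_share
```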

Well, “Anonymous Employee” doesn’t help.

TL;DR Googlers are afraid of disclosing their titles. Not sure what they are so scared of. The Pros and Cons are quite similar to those of Microsoft. Pros: smart colleagues, excellent pay, and free food. Cons: management, long commute, politics.