Please refer to this post for scraping code. As usual, we need to load the goodies.

First, we will remove the initial row. Next, we need to clean up the Posted_Date. One funny thing about the Posted_Date is that for the first 300 observations the format is DDMMMYYYY, but from observation 301 on the format is MMMDDYYYY. Therefore, we need a separate date conversion for each part.
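A minimal sketch of the two-format conversion on toy rows. The data frame name `reviews`, the exact date strings (including the separators), and the split after row 2 (standing in for row 300) are assumptions for illustration, not the scraped data.

```r
# Month abbreviations (%b) are locale dependent; force English-style names.
invisible(Sys.setlocale("LC_TIME", "C"))

# Toy stand-in for the scraped data; the first row plays the unwanted initial row.
reviews <- data.frame(
  Posted_Date = c("junk", "12 Jan 2016", "03 Feb 2016", "Mar 04, 2015", "Apr 15, 2014"),
  stringsAsFactors = FALSE
)

# Drop the initial row.
reviews <- reviews[-1, , drop = FALSE]

# In the post the format changes after observation 300; here it changes after 2.
n_first <- 2
idx <- seq_len(nrow(reviews))

parsed <- rep(as.Date(NA), nrow(reviews))
parsed[idx <= n_first] <- as.Date(reviews$Posted_Date[idx <= n_first], format = "%d %b %Y")
parsed[idx >  n_first] <- as.Date(reviews$Posted_Date[idx >  n_first], format = "%b %d, %Y")
reviews$Posted_Date <- parsed
```

Parsing by row position mirrors the description above; if the switch point were unknown, one could instead try both formats and keep whichever one does not return NA.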

Next, we need to clean up the Title. The original Title comes in the form of “Employee Status – Title in Location.” The original form looks more like a combination of three variables. We need to do some data munging.

I split on “ee -” instead of just “-” because some users used “-” to indicate the job level. So we need to add an extra step in the process.
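Here is one way the split could look in base R. The example strings, the plain hyphen, and the output names `Employee_Status`, `Job_Title`, and `Location` are assumptions; the real titles may use a different dash character.

```r
# Toy titles; the second one uses " - " again to indicate a job level.
Title <- c(
  "Current Employee - Software Engineer in Redmond, WA",
  "Former Employee - Program Manager - Level 62 in Seattle, WA"
)

# Everything up to the first "ee - " is the employee status. Anchoring on
# "ee - " keeps the later " - " (job level) inside the title.
Employee_Status <- sub("^(.*?ee) - .*$", "\\1", Title, perl = TRUE)
rest <- sub("^.*?ee - ", "", Title, perl = TRUE)

# The greedy ".*" makes the split happen at the last " in ".
Job_Title <- sub("^(.*) in .*$", "\\1", rest)
Location  <- sub("^.* in ", "", rest)

data.frame(Employee_Status, Job_Title, Location)
```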

Let’s see the columns we have.

Okay. That is much better. Now is the time to process the comments. As I want to see how the string operations affect the data, I’ll create new columns.

In writing, capitalization matters. But it does not in this case, so everything goes to lowercase.

After glancing at the data, I found that users refer to the company and the CEO in different ways. I’ll just change them to ‘company’ and ‘ceo.’
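A rough sketch of the lowercasing and replacement steps. The list of name variants below is hypothetical; the actual variants would come from reading the comments.

```r
# Toy comments (made up for illustration).
Pros <- c(
  "Microsoft is great. MSFT pays well.",
  "Satya Nadella is inspiring; Ballmer was not."
)

# Work on a new column so the original text stays available for comparison.
Pros_2 <- tolower(Pros)

# Hypothetical variants of the company name.
Pros_2 <- gsub("\\b(microsoft|msft)\\b", "company", Pros_2, perl = TRUE)

# Hypothetical variants of the CEO name; longer alternatives come first so
# that a full name is replaced once rather than word by word.
Pros_2 <- gsub("\\b(satya nadella|steve ballmer|nadella|ballmer|satya)\\b",
               "ceo", Pros_2, perl = TRUE)
```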

Another term that needs fixing is “work-life balance.” There are seven major variations. We need to align them to a single form.
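The alignment can be sketched with a single regular expression. The variants covered here and the target form `worklifebalance` are assumptions; they are not necessarily the seven variations found in the data.

```r
# Toy examples of how users might spell the term.
x <- c(
  "good work/life balance",
  "great work-life balance",
  "worklife balance is ok",
  "work life balance",
  "wlb is poor"
)

# Optional separator between the words, plus the abbreviation "wlb".
pattern <- "work[ /-]?life[ -]?balance|\\bwlb\\b"

# Collapse every variant to one token so it survives later tokenization.
y <- gsub(pattern, "worklifebalance", x, perl = TRUE)
```

Using a single token is a design choice: it keeps the term together when the text is later split into bigrams.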

Finally, we remove punctuation and numbers.
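A small base R sketch of this step, on a made-up comment. Replacing with a space (rather than nothing) and then squeezing whitespace is my choice here, so that words separated only by punctuation do not get glued together.

```r
x <- "great pay!!! 10/10, would work again."

# Replace runs of punctuation and digits with a single space.
y <- gsub("[[:punct:][:digit:]]+", " ", x)

# Collapse repeated whitespace and trim the ends.
y <- trimws(gsub("\\s+", " ", y))
```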

At this point, there are too many unique titles. I’ll use ifelse() to group them.
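Nested ifelse() calls with grepl() are one way to do the grouping. The keyword patterns and the group labels other than "Product Mngt" are hypothetical.

```r
# Toy titles, already lowercased.
title <- c(
  "software engineer", "senior sde", "program manager",
  "product manager", "anonymous employee", "marketing manager"
)

# The first matching pattern wins; anything unmatched falls into "Other".
Group <- ifelse(grepl("engineer|sde|developer", title), "Engineering",
         ifelse(grepl("program manager", title), "Program Mngt",
         ifelse(grepl("product manager", title), "Product Mngt",
         ifelse(grepl("anonymous", title), "Anonymous", "Other"))))
```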

Before we proceed to text analytics, let’s do an EDA.

So, what group do most of our reviewers work for?
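Counting reviewers per group can be sketched as below; the `Group` values are toy data, not the scraped reviews.

```r
# Toy group labels, one per review.
Group <- c("Engineering", "Engineering", "Anonymous",
           "Program Mngt", "Engineering", "Anonymous")

# Frequency of each group, largest first.
group_counts <- sort(table(Group), decreasing = TRUE)
group_counts
```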

No surprise there, as Microsoft is an engineering-intensive company. Now let’s see the time-series plot.
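The data behind such a plot is a count of reviews per year and group. A minimal sketch, assuming a `reviews` data frame with a parsed `Posted_Date` and a `Group` column (toy values below):

```r
reviews <- data.frame(
  Posted_Date = as.Date(c("2012-05-01", "2012-09-10", "2013-01-15",
                          "2013-07-07", "2013-11-30")),
  Group = c("Engineering", "Anonymous", "Engineering", "Program Mngt", "Anonymous"),
  stringsAsFactors = FALSE
)

# Extract the year, then count reviews for every Year x Group combination.
reviews$Year <- format(reviews$Posted_Date, "%Y")
yearly <- as.data.frame(
  table(Year = reviews$Year, Group = reviews$Group),
  responseName = "n",
  stringsAsFactors = FALSE
)
```

The resulting long table (Year, Group, n) can be handed to any plotting function, with one line per group.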

It seems like Glassdoor’s popularity among Microsoft employees gained traction over time, but it started to dwindle in the past two years. In terms of Group, users have increasingly opted for anonymity. Also, the Program Management group has participated more since 2012.

Next, let’s take a look at the Pros and Cons. Instead of using a word cloud, which only surfaces the most common words, I’ll create a network diagram of bigrams.

First, we create bigrams from Pros_2.

unnest_tokens() will pair two consecutive words together as one observation. Then we will split them into ‘from’ and ‘to.’
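A sketch of the tokenizing and splitting steps with tidytext and tidyr. The toy `reviews` tibble and the column name `bigram` are assumptions; only `Pros_2`, ‘from’, and ‘to’ come from the text.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# One toy comment in place of the cleaned Pros_2 column.
reviews <- tibble(Pros_2 = c("smart people and great benefits"))

pros_bigrams <- reviews %>%
  # Each row becomes one pair of consecutive words.
  unnest_tokens(bigram, Pros_2, token = "ngrams", n = 2) %>%
  # Split the pair into the two ends of a future network edge.
  separate(bigram, into = c("from", "to"), sep = " ")
```

For the sentence above this yields four rows: smart/people, people/and, and/great, great/benefits.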

Now we’ve got two columns. However, there are stop words. Tidytext has a built-in stop words dataset, which we can use to filter them out.

Next, we count the occurrences of each bigram, which will be used to differentiate the connections later in the network graph.
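The filtering and counting steps might look like this. The toy bigrams and the name `bigram_counts` are assumptions; `stop_words` is the dataset shipped with tidytext.

```r
library(dplyr)
library(tidytext)

# Toy bigrams, already split into 'from' and 'to'.
pros_bigrams <- tibble(
  from = c("health", "of", "health", "career"),
  to   = c("insurance", "the", "insurance", "growth")
)

bigram_counts <- pros_bigrams %>%
  # Drop a bigram if either word is a stop word.
  filter(!from %in% stop_words$word, !to %in% stop_words$word) %>%
  # Count each remaining pair, most frequent first.
  count(from, to, sort = TRUE)
```

Note that `stop_words` combines several lexicons, so it can remove more words than expected; filtering on its `lexicon` column is one way to be more selective.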

Alright, we are now ready to create a network diagram. Since plotting every bigram would significantly clutter the plot area, let’s set the minimum occurrence to 30.
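The post does not say which package builds the graph; the sketch below assumes igraph, and the counts are made up for illustration.

```r
library(igraph)

# Toy bigram counts (invented numbers).
bigram_counts <- data.frame(
  from = c("smart", "great", "health", "free"),
  to   = c("people", "benefits", "insurance", "soda"),
  n    = c(120, 85, 40, 3),
  stringsAsFactors = FALSE
)

# Keep only bigrams that occur at least 30 times.
frequent <- bigram_counts[bigram_counts$n >= 30, ]

# The first two columns define the edges; 'n' is stored as an edge attribute.
g <- graph_from_data_frame(frequent, directed = TRUE)
```

The edge attribute `n` can then drive edge width or transparency when the graph is drawn.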

So… I expected compensation to be the most frequent term. It turns out that people at Microsoft seem to appreciate each other’s intelligence the most. Then come benefits, pay, and career growth.

Let’s repeat the same process with Cons by just changing ‘Pros’ in the first step to ‘Cons.’

Ok… politics, stack ranking, and review system, which is quite typical for a gigantic company. So… it seems like rank-and-file employees don’t really like their bosses. Both middle managers and upper managers are mentioned in the Cons section.

Although we got pretty interesting Pros and Cons, the data itself is disproportionately represented by the Engineering group. It is possible that most of the bigrams come from Engineering. Let’s take a look at the Product Management group.

The process is still the same; all we need to do is add filter(Group == "Product Mngt") in the first step.
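As a sketch, the extra step is a single dplyr filter in front of the bigram pipeline. The toy `reviews` tibble is an assumption.

```r
library(dplyr)

# Toy reviews with the grouped titles.
reviews <- tibble(
  Group  = c("Engineering", "Product Mngt", "Product Mngt", "Anonymous"),
  Pros_2 = c("smart people", "great benefits", "health insurance", "good pay")
)

# Keep only the group of interest before tokenizing.
product_reviews <- reviews %>%
  filter(Group == "Product Mngt")
```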

Okay. It is still about the same Pros: smart people, health insurance, and great benefits. What about Cons?

TL;DR: Although comments on Glassdoor are subjective, they are still better than knowing nothing at all about a company.