H1B or H-1B refers to a visa that an employer can apply for temporary workers. We will not discuss its usefulness or controversies that have been covered extensively by the media. On the contrary, we will look at the hard cold facts officially published by United States Department of Labor (DoL.) The data in this series is fiscal year 2017 and is released at the DoL web site (link) under “Disclosure Data” section.
But as usual, we need to load goodies.
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | ##### Load Goodies ##### library(tidyverse) library(readxl) library(ggrepel) ##### Create Theme for GGPLOT2 ##### theme_moma <- function(base_size = 12, base_family = "Helvetica") {   theme(     plot.background = element_rect(fill = "#F7F6ED"),     legend.key = element_rect(fill = "#F7F6ED"),     legend.background = element_rect(fill = "#F7F6ED"),     panel.background = element_rect(fill = "#F7F6ED"),     panel.border = element_rect(colour = "black", fill = NA, linetype = "dashed"),     panel.grid.minor = element_line(colour = "#7F7F7F", linetype = "dotted"),     panel.grid.major = element_line(colour = "#7F7F7F", linetype = "dotted")   ) } | 
The DoL released the data in an Excel file format alongside an explanation in a PDF. Let’s load the data and take a look at its dimension.
| 1 2 3 4 5 | ##### Load Data ##### data <- read_xlsx('H-1B_Disclosure_Data_FY17.xlsx') ##### Dimension ##### dim(data) | 
| 1 2 | > dim(data) [1] 528146     40 | 
Wow, 528,416 observations and 40 variables. I am sure that with that many variables, there are many interesting facts to discover. But as this is the first article in the series and is an introduction, we will only use only a few variables: CASE_STATUS, CASE_SUBMITTED, DECISION_DATE, VISA_CLASS, EMPLOYMENT_START_DATE, EMPLOYER_NAME, and TOTAL_WORKERS.
| 1 2 3 4 | ##### Data 2 ##### data2 <- select(data, CASE_STATUS, CASE_SUBMITTED, DECISION_DATE,                        VISA_CLASS, EMPLOYMENT_START_DATE, EMPLOYER_NAME,                        TOTAL_WORKERS) | 
Let’s take a look at the VISA_CLASS first.
| 1 2 3 4 5 | ##### Visa Type ##### ggplot(data, aes(x = VISA_CLASS)) +    geom_bar() +    geom_text(stat='count',aes(label=..count..), vjust = -0.5) +   theme_moma() | 

The majority of the filing in the release is H1B visa which, I believe, applies to every nationality. The exception would be Chile, Singapore, and Australian in which they seem to have their specific visa type but is still considered non-immigrant visa. Since other visas are insignificant, there should be no need to analyze them separately.
Case Submitted
According to the explanation from DoL, the data is extracted from the Office of Foreign Labor Certification’s iCERT Visa Portal System where the decision date was between October 1, 2016, and June 30, 2017, inclusive.
It was an interesting criterion to use. If DoL uses the decision date, could there be a backlog in which an application submitted in 2012 but had just been approved/denied in 2017? We can use min() and max() on CASE_SUBMITTED.
| 1 2 3 4 | ##### CASE_SUBMITTED ##### data2 %>%   summarise(Min = min(CASE_SUBMITTED),          Max = max(CASE_SUBMITTED)) | 
| 1 2 3 4 5 6 7 | > data2 %>% +   summarise(Min = min(CASE_SUBMITTED), +          Max = max(CASE_SUBMITTED)) # A tibble: 1 x 2          Min        Max       <dttm>     <dttm> 1 2011-03-23 2017-06-30 | 
Mar 23, 2011? That’s 5 to 6 years of waiting for one single decision. 😐
Since the submitted date went back as far as six years, let’s create a time series chart.
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | ##### CASE_SUBMITTED Plot ##### data2 %>%   select(CASE_SUBMITTED) %>%   arrange(CASE_SUBMITTED) %>%   mutate(Month = format(CASE_SUBMITTED, "%m"),          Year = format(CASE_SUBMITTED, "%y")) %>%                  select(Month, Year) %>%   mutate(Count = n()) %>%   ggplot(aes(x = Month, group = Count)) +    geom_line(stat = "count") +   theme_moma() +   facet_grid(.~Year) +   geom_point(stat = "count")  +   theme(axis.text.x = element_text(angle = 90, size = 7),         panel.grid.minor = element_line(colour = "#F7F6ED", linetype = "dotted"),         panel.grid.major = element_line(colour = "#F7F6ED", linetype = "dotted")) +   geom_text_repel(stat='count',aes(label=..count..), size = 3) | 
 It seems like most of the decision has been made relatively quick. But we can do better than that to find the average time it takes to get a final decision.
It seems like most of the decision has been made relatively quick. But we can do better than that to find the average time it takes to get a final decision.
Processing Time
We can subtract CASE_SUBMITTED and DECISION_DATE to obtain the processing time.
| 1 2 3 4 | ##### Processing Time ##### data2 %>%   mutate(Processing_Time = as.numeric((DECISION_DATE - CASE_SUBMITTED)/86400)) %>%   select(Processing_Time) %>% summary() | 
| 1 2 3 4 5 6 7 8 9 10 11 | > data2 %>% +   mutate(Processing_Time = as.numeric((DECISION_DATE - CASE_SUBMITTED)/86400)) %>% +   select(Processing_Time, CASE_STATUS) %>%  +   summary()  Processing_Time   CASE_STATUS         Min.   :   0.00   Length:528146       1st Qu.:   6.00   Class :character    Median :   6.00   Mode  :character    Mean   :  31.21                       3rd Qu.:   6.00                       Max.   :2214.00 | 
They are fairly quick. The mean is only 31 days.
Employer
When it comes to H1B coverage, the media often cites tech giant such as Facebook and Microsoft. But are tech giants the only petitioners? Let’s do some simple count.
| 1 2 3 4 5 | ##### Employers ##### data %>%   group_by(EMPLOYER_NAME) %>%   summarise(count = n()) %>%   arrange(desc(count)) | 
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | > data %>% +   group_by(EMPLOYER_NAME) %>% +   summarise(count = n()) %>% +   arrange(desc(count)) # A tibble: 65,132 x 2                        EMPLOYER_NAME count                                <chr> <int>  1                   INFOSYS LIMITED 17059  2 TATA CONSULTANCY SERVICES LIMITED 10806  3             CAPGEMINI AMERICA INC  8208  4         IBM INDIA PRIVATE LIMITED  7673  5     TECH MAHINDRA (AMERICAS),INC.  6903  6                     ACCENTURE LLP  5573  7           DELOITTE CONSULTING LLP  5449  8            ERNST & YOUNG U.S. LLP  5143  9                       GOOGLE INC.  4532 10             MICROSOFT CORPORATION  4042 # ... with 65,122 more rows | 
Wow. There were 65,312 employers filed the H1B applications. By just eyeballing, it seems like these top 10 employees accounted for almost 100,000 submitted applications. What about the rest?
| 1 2 3 4 5 6 | ##### Summary ##### data %>%   group_by(EMPLOYER_NAME) %>%   summarise(count = n()) %>%   arrange(desc(count)) %>%   summary() | 
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 | > ##### Summary ##### > data %>% +   group_by(EMPLOYER_NAME) %>% +   summarise(count = n()) %>% +   arrange(desc(count)) %>% +   summary()  EMPLOYER_NAME          count            Length:65132       Min.   :    1.000    Class :character   1st Qu.:    1.000    Mode  :character   Median :    1.000                       Mean   :    8.109                       3rd Qu.:    3.000                       Max.   :17059.000 | 
Only three applications on average.
No. of Application VS. No. of Workers
When we apply for, say a new passport, we can only put one applicant in an application, right? So, how about an H1B application?
At first, I thought one application is for one worker. But it seems like I am wrong.
| 1 2 3 4 | ##### Worker VS Application ##### data %>%   select(EMPLOYER_NAME, CASE_NUMBER, TOTAL_WORKERS, CASE_STATUS) %>%   arrange(desc(TOTAL_WORKERS)) | 
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | > data %>% +   select(EMPLOYER_NAME, CASE_NUMBER, TOTAL_WORKERS, CASE_STATUS) %>% +   arrange(desc(TOTAL_WORKERS)) # A tibble: 528,146 x 4              EMPLOYER_NAME        CASE_NUMBER TOTAL_WORKERS CASE_STATUS                      <chr>              <chr>         <dbl>       <chr>  1 DELOITTE CONSULTING LLP I-200-16307-558116           155      DENIED  2              APPLE INC. I-200-16319-287614           150   CERTIFIED  3              APPLE INC. I-200-16319-496930           150   CERTIFIED  4              APPLE INC. I-200-16319-541705           150   CERTIFIED  5              APPLE INC. I-200-16319-087529           150   CERTIFIED  6              APPLE INC. I-200-16319-172125           150   CERTIFIED  7              APPLE INC. I-200-16319-608407           150   CERTIFIED  8              APPLE INC. I-200-16319-401391           150   CERTIFIED  9              APPLE INC. I-200-16319-872666           150   CERTIFIED 10              APPLE INC. I-200-16319-217547           150   CERTIFIED # ... with 528,136 more rows | 
Deloitte has submitted one application that has 155 workers. Apple, on the contrary, has filed multiple applications each with its unique case number with 150 workers each.
Consequently, the total number of workers whose employers applied an H1B visa for is not merely the sum of the case number. But the sum of the TOTAL_WORKERS by EMPLOYER_NAME.
Total Workers
But, in general, how many workers are in an application?
| 1 2 3 4 5 | ##### How many workers in a application? ##### data2 %>%   ggplot(aes(x = as.factor(TOTAL_WORKERS))) +    geom_bar() +   theme_moma() | 

Oh okay. 90% of the application only has one worker. That makes sense. But the other 10% has more than one worker. Let’s modify the chart a bit.
| 1 2 3 4 5 6 7 8 | ##### Second Try ##### data2 %>%   filter(TOTAL_WORKERS > 1) %>%   ggplot(aes(x = as.factor(TOTAL_WORKERS))) +    geom_bar() +   theme_moma() +   geom_text(stat='count', aes(label=..count..), size = 3, vjust = -0.5) +   theme(axis.text.x = element_text(size = 7)) | 

15 workers per applicant? But then I am not a legal expert. 🙂
Before we end this Overview post, how many of these applications have been approved?
| 1 2 3 4 5 6 | ##### Approval Rate ##### data2 %>%   ggplot(aes(x = CASE_STATUS)) +   geom_bar() +   theme_moma() +   geom_text(stat='count', aes(label=..count..), size = 3, vjust = -0.5) | 

