The data for this Exploratory Data Analysis (EDA) is published on Kaggle (link). In this EDA, we will find out where the highly rated hospitals in the US are.

As usual, let’s load libraries and data.

There are many ways to plot the data on the map such as ggmap. Since the dataset doesn’t give longitude and latitude coordinates, I will utilize the ZIP code through a package called “ggcounty” from (link) and FIPS county code from U.S. Census. Next, we will set up some functions to use throughout the EDA.

Next, let’s process data.

As usual, let’s start with glimpse().

There are 4,807 observations and 29 variables. Many of them are factors whose levels are not yet sorted. Also, missing values are recorded as the string “Not Available.” So, I will sort the factor levels and change “Not Available” to NA.
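The cleanup described above can be sketched roughly as follows. This is a minimal illustration in Python on invented rows, not the code used for this post; the helper name `clean_rating` is hypothetical, and only the “Not Available” placeholder and the column name come from the dataset.

```python
# Toy rows standing in for the hospital data; the values are invented.
rows = [
    {"Hospital.overall.rating": "3"},
    {"Hospital.overall.rating": "Not Available"},
    {"Hospital.overall.rating": "5"},
    {"Hospital.overall.rating": "1"},
]

def clean_rating(value):
    """Turn the 'Not Available' placeholder into None (the NA equivalent)."""
    return None if value == "Not Available" else int(value)

ratings = [clean_rating(r["Hospital.overall.rating"]) for r in rows]

# Distinct rating levels in sorted order, with missing values excluded.
levels = sorted({r for r in ratings if r is not None})

print(ratings)
print(levels)
```

Keeping missing values as a real NA rather than a string means they can be counted and excluded explicitly later on.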

From now on we will call the dataset “eda.” There are 4,807 observations (or hospitals) in the dataset covering 53 states and territories. There are many exciting features such as “Hospital Overall Rating,” “Readmission.national.comparison,” and other types of comparison. For a start, I’d like to focus on the overall quality of the hospital. So, let’s see how the score is distributed.

Ah, isn’t that a normally distributed chart?! Excluding “NA” of course. But that’s a whole lot of NA. Isn’t that about 25% of the data? 😐 Next, let’s see how many states are in this dataset.

Texas has a whole lot of representation in this dataset. But since over 25% of the ratings are NA, I think it is worthwhile to see how those NAs are spread across the states.

As expected, a whole lot of hospitals in Texas cannot report the data. But in some states and territories, such as Puerto Rico and South Dakota, more than 70% of hospitals cannot complete the survey. Is there any state that is worse than that? Yes, Maryland. It seems like ALL hospitals in Maryland cannot report the data. I’m curious why.
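The per-state comparison above boils down to the share of missing ratings in each state. Here is a small Python sketch of that calculation; the records are invented for illustration and the function name `missing_share_by_state` is hypothetical.

```python
from collections import defaultdict

# Invented (state, rating) pairs; None marks a missing overall rating.
hospitals = [
    ("TX", 4), ("TX", None), ("TX", None),
    ("MD", None), ("MD", None),
    ("SD", 3), ("SD", None),
]

def missing_share_by_state(records):
    """Return the percentage of hospitals with a missing rating per state."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for state, rating in records:
        totals[state] += 1
        if rating is None:
            missing[state] += 1
    return {state: 100.0 * missing[state] / totals[state] for state in totals}

shares = missing_share_by_state(hospitals)

# List states from the highest to the lowest share of missing ratings.
for state, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{state}: {pct:.0f}% missing")
```

Looking at the share rather than the raw count keeps a large state like Texas from dominating just because it has more hospitals.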

Hm, so I would guess the majority of the hospitals that cannot report scores are relatively small hospitals? Let’s drill deeper; the “Hospital.Ownership” variable may shed some light.

“Voluntary non-profit – Private” hospitals stand out as the ones that could not do the reporting. I’d guess they are small hospitals that may not offer many services and thus simply have too few measures.

For now, let’s include those that cannot report out and make some charts.

There are eight scoring features in the dataset, including:

  • Hospital overall rating
  • Mortality.national.comparison
  • Readmission.national.comparison
  • Patient.experience.national.comparison
  • Efficient.use.of.medical.imaging.national.comparison

The most interesting feature is, of course, the overall rating. I’d think that hospitals with a score of five would probably have “Above.Average” in the other features. But we will never know until we take a look.

Yes, as expected, if a hospital gets a 5 rating, it generally scores either Above or Same. But come to think of it, two charts are interesting: Mortality and Readmission. In the other features, “Above” is better than “Below.” But for those two it is the opposite. However, those with five stars mostly fall into “Above.” Hm, so either this is a miscoding or 5-star hospitals do, in fact, have higher readmission rates.

It is certainly good to know about the quality of the hospitals. But a bar chart or scatter plot may not be the most appropriate chart in this case. Let’s try a choropleth. But I am not certain we should use “Hospital.overall.rating” since over 25% of its values are missing. Fortunately, “Hospital.overall.rating” is not the only measure available in this dataset. One interesting feature of this dataset is that a hospital can still report out some other measures even though it doesn’t have a score for “Hospital.overall.rating.” Let’s take a look.

128 hospitals could report the “Mortality.national.comparison” measure. Well, they could report something. 🙂 So, I’ll create a new criterion. I will give a score of 3 to “Above the national average,” 2 to “Same as the national average,” and 1 to “Below the national average.” I’ll reverse the scores for the Readmission and Mortality rates, as “Above average” indicates the worse condition there.
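To make the criterion concrete, here is a rough Python sketch of it. It assumes that the new score is the sum of the per-measure points and that an unreported measure contributes zero; the post doesn't spell this out in the text, so treat both as assumptions. The names `BASE`, `REVERSED`, `measure_score`, and `total_score` are hypothetical.

```python
# Points for each comparison label, matched case-insensitively.
BASE = {
    "above the national average": 3,
    "same as the national average": 2,
    "below the national average": 1,
}

# Measures where "above" is treated as the worse outcome, so the points flip.
REVERSED = {"Mortality.national.comparison", "Readmission.national.comparison"}

def measure_score(column, value):
    """Score one measure; anything unrecognized (e.g. 'Not Available') gets 0."""
    points = BASE.get(value.strip().lower())
    if points is None:
        return 0
    return 4 - points if column in REVERSED else points  # 3 <-> 1, 2 stays 2

def total_score(hospital):
    """Assumption: the combined score is the sum over all measures."""
    return sum(measure_score(col, val) for col, val in hospital.items())

example = {
    "Mortality.national.comparison": "Below the national average",
    "Readmission.national.comparison": "Above the national average",
    "Patient.experience.national.comparison": "Same as the national average",
    "Efficient.use.of.medical.imaging.national.comparison": "Not Available",
}
print(total_score(example))
```

In this example the hospital earns 3 for mortality (reversed), 1 for readmission (reversed), 2 for patient experience, and 0 for the unreported imaging measure.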

Let’s see the new score.

Well, at least we decreased the NA by half! 😀 Those with a score of zero are hospitals that truly cannot report anything out. Alright, now is the time to do the choropleth map. The data gives the ZIP code which will be handy when combining with the FIPS data. Let’s download the FIPS from

Unfortunately, there is a whole lot of preparation we need to do before joining them together; some of it is more annoying than the rest. There are small differences in how county names are spelled: LASALLE or LA SALLE, ST. MARY’S or ST. MARY. All of those small differences must be fixed.
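One way to handle this kind of cleanup is to normalize both sides to a common spelling before joining. The Python sketch below is only an illustration: the `ALIASES` table and the choice of which spelling counts as canonical are assumptions, and a real run would need many more entries.

```python
import re

# Hypothetical spelling fixes, keyed by the punctuation-free upper-case form.
ALIASES = {
    "LASALLE": "LA SALLE",
    "ST MARYS": "ST MARY",
}

def normalize_county(name):
    """Upper-case, drop periods and apostrophes, collapse spaces, apply aliases."""
    cleaned = name.upper().replace("’", "'")
    cleaned = re.sub(r"[.']", "", cleaned)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return ALIASES.get(cleaned, cleaned)

for raw in ["LaSalle", "LA SALLE", "St. Mary's", "ST. MARY"]:
    print(raw, "->", normalize_county(raw))
```

Applying the same function to the county names in both tables means the two spellings of each county end up identical, which is all the join needs.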

And now is the time for joining FIPS and the hospital data.
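As a rough picture of what the join does, here is a Python sketch that attaches a county FIPS code to each hospital by matching on state and county name. The hospital records are invented, the lookup holds only two entries, and `attach_fips` is a hypothetical helper, not the code used for this post.

```python
# Tiny lookup: (state, normalized county name) -> county FIPS code.
fips_lookup = {
    ("CA", "LOS ANGELES"): "06037",
    ("NY", "KINGS"): "36047",
}

# Invented hospital records for illustration.
hospitals = [
    {"name": "Hospital A", "state": "CA", "county": "LOS ANGELES", "score": 12},
    {"name": "Hospital B", "state": "NY", "county": "KINGS", "score": 9},
    {"name": "Hospital C", "state": "CA", "county": "UNKNOWN", "score": 7},
]

def attach_fips(records, lookup):
    """Split records into those that found a FIPS code and those that did not."""
    matched, unmatched = [], []
    for rec in records:
        code = lookup.get((rec["state"], rec["county"]))
        if code is None:
            unmatched.append(rec)
        else:
            matched.append({**rec, "fips": code})
    return matched, unmatched

matched, unmatched = attach_fips(hospitals, fips_lookup)
print(len(matched), "matched,", len(unmatched), "unmatched")
```

Keeping the unmatched records around is handy: they are usually the ones with a county spelling that still needs fixing.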

Next, we will create a base map using the “ggcounty” package.

Now the map is ready. Let’s plot the hospital data on the map.

Hm, that’s interesting. The white counties are counties that are not in the survey. I am not surprised that the white or dark blue counties are in the Midwest. But it seems like hospital quality on the West Coast (California, Oregon, and Washington) is on the lower end compared to the East. Let’s take a look at California in particular.

It is mostly on the lighter blue side. But what is wrong with those two counties? They are huge yet have a low score. Let’s plot the hospitals on the map.

Hm, apparently, there are a lot of hospitals in San Francisco, Los Angeles, and San Diego. The counties in question though have either 1 or 2 hospitals in the survey. I’d think they must be small hospitals that may not have the capacity to track all the kinds of stuff in the questionnaire.

Let’s see if the assumption holds for a state that has a huge number of hospitals clustered in only a couple of counties… New York!


Some counties have a really high score, but that comes from just a couple of hospitals. In contrast, New York City, where there are significantly more hospitals, scores around the middle.
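A quick way to check this effect is to look at the hospital count next to each county’s average score. The Python sketch below uses invented numbers and hypothetical county labels; it only illustrates the idea that an average built from one or two hospitals deserves less trust.

```python
from collections import defaultdict
from statistics import mean

# Invented (county, score) pairs for illustration.
scored = [
    ("Big county", 10), ("Big county", 12), ("Big county", 8),
    ("Small county", 17),
]

def summarize(records):
    """Return the mean score and the number of hospitals for each county."""
    by_county = defaultdict(list)
    for county, score in records:
        by_county[county].append(score)
    return {c: {"mean": mean(s), "n": len(s)} for c, s in by_county.items()}

summary = summarize(scored)
for county, stats in summary.items():
    note = " (few hospitals, read with care)" if stats["n"] < 3 else ""
    print(f"{county}: mean {stats['mean']}, n={stats['n']}{note}")
```

The cutoff of three hospitals used for the note is arbitrary; the point is simply to show the count alongside the average.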

I think that makes sense. There are 8 million people in NYC, and taking good care of all of them is simply impossible. In contrast, in a small county where the population is not so high, healthcare management would be much easier and thus result in a higher score.

I think this is a pretty good dataset for showing the overall state of healthcare in the US. But I feel like the reach of this survey is somewhat limited. There must be some criteria that dictate whether a hospital is included. So if you were to move to a rural area, this dataset might not help you out that much, as hospitals in the area may not even be in this dataset. But if you move to a big city such as New York City, Los Angeles, or San Francisco, this dataset can really help you figure out what hospital you should go to. And to answer the question above, I’d say this chart can help.

Hm, so it seems like Indiana has the highest-scoring hospital in the dataset. Also, its hospitals overall appear to have a score of more than 15. Delaware and Rhode Island have good overall scores. Yep, those are all rural states, while highly populated states such as California and New York have a very broad score distribution.

TL;DR: I think this is a good yet incomplete dataset. It can act as a good indicator, but take it with a grain of salt. First, it certainly cannot cover every hospital in the US. Second, some small hospitals seem to struggle with answering the surveys. Third, for some reason, some states such as Maryland cannot answer any of the questions. 😐