Click for Github files with code and analysis

You got upgraded to first class. But then downgraded at your hotel. And then sideswiped by an automatic scooter. But not before a lovely stroll on the beach. The paramedic asks, "How was your trip?"

I've always been interested in stuff like this. Many times, we have feelings about multiple aspects of a phenomenon and are then asked to give some overall evaluation. So how does it work?

I don't really know.

But I found an interesting dataset on coffee to use as a case study.

1. Data & research questions

Data

The data comes courtesy of GitHub user jldbc, who scraped 1,000+ Arabica coffee reviews from the Coffee Quality Institute. For each review, a trained tester rated 9 dimensions of a cup of coffee and then gave an overall impression. The 9 dimensions are:

Acidity (From flat to sharp)
Aftertaste (Self explanatory)
Aroma (Rating of fresh and brewed smell)
Balance (How everything comes together)
Body (How thick does it taste)
Clean Cup (A lack of defects)
Flavor (Combination of aroma and taste)
Sweetness (How sweet the coffee is)
Uniformity (How uniform the taste is)

Each rating is on a 1-10 scale. Along with this data, each observation has (among other things) the country of origin of the beans.

Research questions

My questions are:

Which aspect of coffee is most important for assigning an overall rating?
Which countries receive overall ratings better than their component ratings would predict?

Hypotheses

My initial hypothesis was that flavor and aftertaste would be the most important components. By construction, "flavor" contains a lot of information: it combines aroma and taste. As for aftertaste, Daniel Kahneman hypothesized the peak-end rule: we tend to encode experiences by their peak and their end. Since aftertaste occurs at the end of the act of drinking coffee, it should color our experience of the coffee.

I also thought that Asian countries would receive overall ratings better than their component ratings. Even though this is a blind taste test, it is impossible for it to be truly blind. It is somewhat obvious (to me, and certainly to a professional) if coffee comes from Indonesia vs Colombia. That said, I think we have fewer priors for coffee from Asia. And I think this novelty aspect will give these countries a boost.

2. Correlations

A lot of these ratings are correlated with each other. But really, there are two groups of correlated ratings. The first group is aftertaste, flavor, balance, acidity and body. These represent potency. Think a dark, smoky Sumatra. The second group is clean cup, sweetness, and uniformity, which represent lightness. Think a smooth Guatemalan blend with a hint of cherry.

The potency and lightness “principal components” (or combinations of variables) explain 46% and 25% of the variation in the 9 features. Or, we could summarize about 7/10 of the information in the 9 features by just talking about potency and lightness.
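To make "share of variation explained" concrete, here is a minimal sketch of the calculation. It runs on synthetic stand-in ratings, not the coffee data, and the two latent factors (`potency`, `lightness`) and the noise levels are assumptions chosen purely for illustration, so the printed shares will not match the 46% and 25% above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two hypothetical latent factors driving the nine ratings (illustrative only).
potency = rng.normal(size=n)
lightness = rng.normal(size=n)

# Five "potency" ratings, three "lightness" ratings, and aroma loosely tied to potency.
potency_cols = [potency + 0.6 * rng.normal(size=n) for _ in range(5)]
lightness_cols = [lightness + 0.6 * rng.normal(size=n) for _ in range(3)]
aroma = 0.5 * potency + rng.normal(size=n)
X = np.column_stack(potency_cols + lightness_cols + [aroma])

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component.

    PCA on standardized columns is the eigen-decomposition of the
    correlation matrix, so the eigenvalues give the variance shares.
    """
    corr = np.corrcoef(X, rowvar=False)        # 9 x 9 correlation matrix
    eigvals = np.linalg.eigvalsh(corr)[::-1]   # eigenvalues, largest first
    return eigvals / eigvals.sum()

ratios = explained_variance_ratio(X)
print("first two components:", ratios[:2], "combined:", ratios[:2].sum())
```

Adding up the first two entries of `ratios` is the same arithmetic as the "about 7/10" figure: 0.46 + 0.25 = 0.71.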

3. Correlation approach

Now back to the first question. Which component is most important for assigning an overall rating?

Just going by correlations, the answers are flavor, aftertaste, and balance. These 3 make sense. Flavor takes into account both aroma and taste. And if you like both of those, you’ll probably like the coffee. Aftertaste makes sense, as well. We upweight things at the end of our experiences. Balance, like flavor, also appears to be a kind of aggregate rating, which makes it predictive of our overall coffee enjoyment.

Now look at acidity and sweetness. I would bet that in the general population, these two positions are flipped.

I was surprised at aroma’s middling position. Aroma is the first thing you encounter with the coffee, so it would make sense to me that it would affect your later evaluation. My only explanation for the result is that filling out the review form is somewhat involved. So your initial impression, based on aroma, fades in importance by the time you’re done. But I would hypothesize that if reviewers made snap judgements about coffee (like when somebody asks you, “How is your coffee?”) aroma would have a stronger correlation with overall rating.
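For anyone who wants to reproduce this kind of ranking, here is a minimal sketch. The ratings are synthetic and the `weights` vector is a made-up assumption, not something estimated from the coffee data. The column names are hypothetical too.

```python
import numpy as np

features = ["acidity", "aftertaste", "aroma", "balance", "body",
            "clean_cup", "flavor", "sweetness", "uniformity"]

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, len(features)))   # synthetic component ratings

# Hypothetical overall rating that leans on flavor, aftertaste, and balance.
weights = np.array([0.1, 0.5, 0.1, 0.4, 0.1, 0.0, 0.6, 0.0, 0.0])
overall = X @ weights + 0.5 * rng.normal(size=n)

def rank_by_correlation(X, y, names):
    """Return (name, Pearson r) pairs sorted from strongest to weakest |r|."""
    r = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
    return sorted(zip(names, r), key=lambda pair: abs(pair[1]), reverse=True)

ranking = rank_by_correlation(X, overall, features)
for name, r in ranking:
    print(f"{name:11s} {r:+.2f}")
```

Each correlation looks at one component in isolation, which is exactly the limitation discussed next.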

This approach is fine, but can be improved upon. The problem is that we are not controlling for anything. For example, maybe if the aftertaste is good, flavor does not matter. To get around this issue, let's run some regressions.

4. Regression approach

An explanation

This approach will utilize a simple regression. The dependent variable is overall rating. As regressors, we'll use the 9 features plus country-level fixed effects. Country-level fixed effects means estimating a separate intercept for each country. This is called "fixed effects" because the idea is that there is something about country X's coffee that is (1) not captured by variables in our model, (2) unique to country X, and (3) fixed over time.

Fixed effects are an econometric way of specifying "Special Sauce". If a country has a significant fixed effect, positive or negative, this is saying: "After controlling for the stuff in your model, there is just something about this country that makes its ratings higher or lower."
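Mechanically, a country fixed effect is just a dummy column per country in the regression. Here is a minimal sketch on synthetic data. The three country labels, the `true_effect` values, and the two stand-in ratings are all assumptions for illustration, and the helper name `fit_fixed_effects` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
countries = np.array(["A", "B", "C"])            # hypothetical country labels
true_effect = {"A": 0.0, "B": 0.3, "C": -0.2}    # made-up "special sauce"

n = 600
country = rng.choice(countries, size=n)
X = rng.normal(size=(n, 2))                      # two stand-in component ratings
beta = np.array([0.5, 0.3])
sauce = np.array([true_effect[c] for c in country])
y = X @ beta + sauce + 0.1 * rng.normal(size=n)

def fit_fixed_effects(X, y, groups):
    """OLS with one intercept (dummy column) per group and no global intercept."""
    labels = np.unique(groups)
    dummies = (groups[:, None] == labels[None, :]).astype(float)
    design = np.column_stack([X, dummies])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    slopes = coef[: X.shape[1]]
    effects = {str(label): value for label, value in zip(labels, coef[X.shape[1]:])}
    return slopes, effects

slopes, effects = fit_fixed_effects(X, y, country)
print("slopes:", slopes)
print("country effects:", effects)
```

Many regression packages instead keep a global intercept and drop one country as the baseline. The fit is identical, and each remaining effect is then read relative to that baseline country.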

An issue with this approach is "multicollinearity". As we saw from the correlation matrix, all these variables are correlated with each other. This makes it hard to see the unique effect of each variable. It's like if a bunch of stressful things all happened at once. It would be difficult to ascribe X amount of stress to event A, Y amount to event B, etc. Statistically, this has the effect of increasing standard errors on our regression coefficients - making them seem less significant than they are.
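One common way to put a number on multicollinearity is the variance inflation factor (VIF). This is a standard diagnostic, not something computed in the original analysis, and the data below is synthetic. A VIF near 1 means a variable is nearly unrelated to the others, and larger values mean its standard error is being inflated.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the others."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(3)
shared = rng.normal(size=400)

# Three columns built from one shared signal, versus three unrelated columns.
correlated = np.column_stack([shared + 0.3 * rng.normal(size=400) for _ in range(3)])
independent = rng.normal(size=(400, 3))

vif_correlated = variance_inflation_factors(correlated)
vif_independent = variance_inflation_factors(independent)
print(vif_correlated)     # well above 1
print(vif_independent)    # close to 1
```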

Results

Our model performed pretty well. The coefficient of determination (R^2) was 0.61, so 61% of the variation in ratings is explained by our model. And the mean squared error (MSE) was quite low -- 0.18. Of our 9 features, all were statistically significant at the 5% level except for sweetness, uniformity, and body.
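For reference, both metrics are simple to compute by hand. The four ratings below are made up for illustration and are unrelated to the coffee data.

```python
import numpy as np

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot: share of the variance in `actual` explained by `predicted`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_squared_error(actual, predicted):
    """Average squared gap between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((actual - predicted) ** 2)

# Made-up overall ratings and model predictions.
actual = [7.0, 7.5, 8.0, 8.5]
predicted = [7.2, 7.4, 7.9, 8.6]

r2 = r_squared(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(r2, mse)
```

Note that MSE is in squared rating points, so whether 0.18 counts as "low" depends on how spread out the overall ratings are. R^2 already makes that comparison.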

Qualitatively, the results are similar to those obtained by the correlation approach. But quantitatively, we see that flavor, aftertaste, and balance are the key drivers. The effects of acidity and aroma on overall rating are statistically significant but too small to matter in practice. The same goes for clean cup, which has virtually no effect.

5. Special Sauce

Because we included fixed effects, we can see which countries receive overall coffee ratings that are better than what their component ratings would predict. Here, Vietnam and Burundi top the list.

Vietnamese coffee

Vietnam is the world's second-largest producer of coffee, though 97% of it is Robusta. Arabica coffee was introduced to Vietnam by French missionaries in 1857. Then the French brought over Robusta in 1908, which grew better. From the Vietnam War to the late 80s, coffee production grew slowly. Around the late 80s, the government started investing in coffee, encouraging households to grow it and trying to improve the quality of the beans. I imagine at least some of those improvements worked.

Burundian coffee

Unlike Vietnam, Burundi grows about 93% Arabica and 7% Robusta. Coffee was first brought to Burundi in the 30s by Belgians. After independence in the 60s, the private sector ran the industry into the ground. The government later took control of the coffee industry, but didn't do very well either. Now, at the prodding of the World Bank, coffee is being privatized once again. Burundian coffee is affected by the Potato Taste Defect, where coffee tastes like raw potatoes. This leads some people to avoid buying it. But honestly, reviews of Burundian coffee seem really interesting -- "berry", "sugar" and "raisin" being three words people use to describe its taste.

I'm probably going to buy some.

The Taiwan quandary

So the model was pretty useful. But it didn't do well on every country. After looking at the residuals (the gaps between actual and predicted ratings), I found that Taiwan is a consistent outlier. Of the 10 largest prediction errors, 7 were from Taiwanese coffee. When I removed Taiwan, R^2 jumped to 0.75, standard errors shrank, and the MSE dropped to 0.16. When I ran my model on just Taiwan, it did terribly: component ratings explain just 20% of the variation in overall ratings. So something is up with Taiwanese coffee. I then ran a statistical test of that intuition, testing whether the coefficients of my model differ for Taiwanese vs non-Taiwanese coffee. And they do. Taiwan remains a mystery.
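The test isn't named above, but a Chow test is one standard way to ask whether two groups share the same regression coefficients. Here is a minimal sketch on synthetic data, assuming a plain OLS model without the country fixed effects. The group flag, coefficients, and noise level are all made up for illustration.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ coef) ** 2))

def chow_f_statistic(X, y, in_group):
    """F statistic for H0: both groups share one set of coefficients."""
    k = X.shape[1] + 1                  # slopes plus intercept
    n = len(y)
    ssr_pooled = ssr(X, y)              # one model fit to everyone
    ssr_split = ssr(X[in_group], y[in_group]) + ssr(X[~in_group], y[~in_group])
    return ((ssr_pooled - ssr_split) / k) / (ssr_split / (n - 2 * k))

rng = np.random.default_rng(4)
n = 400
X = rng.normal(size=(n, 2))
in_group = np.arange(n) < 60            # hypothetical flag for the outlier country

# Case 1: the flagged group follows different coefficients.
beta_rest, beta_group = np.array([0.6, 0.3]), np.array([0.1, -0.2])
y_different = np.where(in_group, X @ beta_group, X @ beta_rest) + 0.2 * rng.normal(size=n)

# Case 2: everyone follows the same coefficients.
y_same = X @ beta_rest + 0.2 * rng.normal(size=n)

f_different = chow_f_statistic(X, y_different, in_group)
f_same = chow_f_statistic(X, y_same, in_group)
print(f_different, f_same)
```

Under the null hypothesis the statistic follows an F distribution with (k, n - 2k) degrees of freedom, so a large value, as in the first case, is evidence that the two groups need separate coefficients.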

6. Takeaways

We learned a couple of things.

First, the most important coffee qualities are flavor, aftertaste, and balance. These are the qualities that are most predictive of overall coffee rating.

Second, everyone should try Vietnamese or Burundian coffee. There's something special about them. Coffee from these two countries receives an overall rating above what its component ratings would predict.