A study looking into the hidden formula that drives Metacritic made headlines this week, but Kotaku has discovered some critical errors that call it into question.
Earlier this week, Full Sail University professor Adams Greenwood-Ericksen held a GDC session in San Francisco in which he shared some of his research on the effects of Metacritic, the aggregation site that takes media reviews from hundreds of outlets and outputs them as a single number, or Metascore.
Metacritic has taken some heat over the past few years for refusing to reveal the formula they use to produce their scores. It's not a simple average: Metacritic admits they give more weight to some outlets when crunching the numbers. But they've never said how that weighting system works.
So when Greenwood-Ericksen said he had a model that replicated Metacritic's scores, people took notice. Gamasutra ran an article titled "Metacritic's weighting system revealed," and it got a whole lot of video game developers and reporters talking. The system categorized outlets in six different "tiers" and gave heavy weight to sites like IGN and Wired (and significantly less weight to other big sites like Giant Bomb).
Shortly afterwards, Metacritic came out firing. They took to Facebook to shoot down the formula, calling it "wildly, wholly inaccurate," and they accused Gamasutra of running a misleading headline. (When reached by Kotaku for comment, Gamasutra editor Kris Graft apologized: "Yeah, I feel that the main issue was a poor headline, and we apologize for the confusion over this. It's also unfortunate that a session with inaccurate information like this got into the show.")
Some, however, have remained skeptical of Metacritic's accusations, as the aggregator still won't share the formula that they use.
However, today Kotaku discovered a flaw in Greenwood-Ericksen's formula: at least two of the listed weights, for the outlets The Sixth Axis and Play UK, are incorrect.
Let's start from the beginning. Greenwood-Ericksen's model, devised from Metacritic data spanning from 2005 or 2006 until 2011, assigns a numerical weight, like 1.5 or 0.5, to each video game outlet. The formula: look at a video game's Metacritic page, take all of the review scores listed, multiply each one by the weight associated with its outlet, add them all together, and divide by the total number of scores. This model has successfully replicated something like 50 scores, Greenwood-Ericksen said.
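The arithmetic behind that model can be sketched in a few lines of Python. The outlet names, scores, and weights below are illustrative placeholders, not values from the study, and the division by the number of reviews (rather than by the sum of the weights, as a conventional weighted mean would use) follows the formula exactly as described above.

```python
# Sketch of the weighted-average model described above. Outlet names,
# scores, and weights here are hypothetical, not values from the study.

def model_metascore(reviews, weights):
    """reviews: outlet -> critic score (0-100); weights: outlet -> weight."""
    weighted_sum = sum(score * weights.get(outlet, 1.0)
                       for outlet, score in reviews.items())
    # Per the formula as described: divide by the number of scores,
    # not by the sum of the weights as a standard weighted mean would.
    return round(weighted_sum / len(reviews))

# Hypothetical review scores and weights for one game's Metacritic page:
reviews = {"IGN": 85, "Wired": 90, "Giant Bomb": 80}
weights = {"IGN": 1.5, "Wired": 1.0, "Giant Bomb": 0.5}
print(model_metascore(reviews, weights))  # 86
```

Note that if every weight were 1.0, this would reduce to a plain average; the weights are what let two outlets with the same score pull the final number in different directions.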
Except, while plugging in the numbers and testing out the formula today, I discovered that the math just didn't work for the PS3 game Swords & Soldiers. When I tried to get the Metascore, I found that my results were 7-8 points off. (The math did work for some of the other games I experimented with, like Venetica.)
So I reached out to Greenwood-Ericksen, who I've been chatting with throughout the day.
"Looks like [Swords & Soldiers] was the development case for The Sixth Axis, and also for Play UK," he told me via Gchat. "So those two weights were actually set using that erroneous data."
I asked exactly what that means.
"It means you caught us making a mistake," he said. "It also means that at least one of those two weights of the 189 are probably off… So those particular weights are unreliable. The good news is that it suggests the process still works, one of us just made a mistake somewhere in applying it… It's embarrassing, certainly. On the one hand, I'm glad somebody spotted the issue. On the other hand, I wish we'd done it before we were so far into the public spotlight."
I asked what makes him think there are no other mistakes like this in the study.
"I don't think we'd have made a mistake like that one twice, but it's always possible," he said. "Certainly I'm going to have to check our work over again to make sure."
But the Full Sail professor doesn't believe that these flaws invalidate the study: the point, he says, was not to determine the values of each weight, but to show that it's possible to figure out the weight behind each outlet.
Greenwood-Ericksen and I had a long conversation on the phone this morning, before I started digging into this formula. He wanted to make it clear that these weights are just one part of a larger study, one that makes a number of other conclusions about Metacritic, like its strong connections to sales data, and he told me that the goal was never to show off an accurate model of how Metacritic weights scores.
"One of the things that virtually everybody missed was that this was a model," he said. "We didn't go down under the basement with a flashlight and find out what the results were. A lot of words like 'revealed' and 'discovered' were all kinds of inaccurate."
The professor said he was pleased by Metacritic's Facebook response, even though the aggregator called his work inaccurate. He's pleased because it offered new information: Metacritic said they use fewer than six tiers, for example, and that publication weights are much closer together than they were in Greenwood-Ericksen's model.
It seems like Greenwood-Ericksen is on the right track, even if the model didn't quite fit in this case. As he continues crunching numbers and trying to figure out exactly how each Metascore works, the truth behind this formula could eventually come out.
Greenwood-Ericksen said he wishes Metacritic would be more transparent about the formula that they use. It'd certainly preempt situations like this.
"I think the community, and Metacritic as well, would be better served by transparency on this," he said. "Part of what makes them so unpopular and what creates so much resentment is that people have the sense that there's this sort of arbitrary magical process that produces this score. I don't think that's the case. I think Metacritic is actually trying very hard to get a reasonable score to represent the quality of the product.
"I just don't think that's what comes across because they're opaque about this particular issue."
Photo: Gualtiero Boffi/Shutterstock