With Great Data Comes Great Responsibility

During a recent EconTalk episode, economist Russ Roberts opened a discussion with legendary baseball data analyst Bill James by debunking a commonly repeated refrain in today’s politically contentious environment:

People are entitled to their own opinions, but not to their own facts.

On the surface, the observation sounds true enough, but in this age of big data it is not only untrue but becoming increasingly absurd. Facts are simply data, and all accurate data are facts. As the sheer tonnage of data available for analysis grows toward something approaching infinity, so does the number of “facts” at our disposal. Taking every relevant fact into account when attempting to draw a conclusion or support a thesis is highly impractical; there are simply too many of them, and they tend to be messy and contradictory. Data are like stars in the night sky. An imaginative mind can contort them into any number of different constellations, but doing so always requires looking only at certain stars and ignoring those that don’t fit the pattern. Complicating matters further, we can see only a small fraction of the stars that are actually there, and we don’t know what we can’t see.

I was reminded of this principle last week when an outlet called Reveal News reported on a study showing considerably higher mortgage denial rates among people of color than among whites. The Mortgage Bankers Association immediately responded with a statement characterizing the Reveal story as “deeply flawed” and “a disservice to the important issues of access to credit and fair lending.”

Taken together, the Reveal story and the MBA response constitute a wonderful example of entitlement to one’s own facts. The Reveal story makes an arguably true claim based on a rather narrowly defined set of facts. The most important facts to Reveal’s analysis were: 1) the race/ethnicity of the mortgage applicant, and 2) the ultimate credit decision. Using these “stars” (and these alone), the story draws a constellation of persistent lending discrimination 50 years after passage of the Fair Housing Act. MBA counters that Reveal’s constellation willfully ignores some fairly important stars: credit score, debt-to-income (DTI) ratio, and loan-to-value (LTV) ratio, the variables that form the backbone of every underwriting decision.
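To see how much the choice of stars can matter, consider a minimal simulation. Everything in it is assumed for illustration: the two applicant groups, the score distributions, and the score-only denial rule are all fictional, and the sketch is in no way a reconstruction of Reveal’s or MBA’s actual data. It shows only how a large raw denial-rate gap can arise, and then largely dissolve, once an omitted underwriting variable is brought into view:

```python
import numpy as np
import pandas as pd

# Hypothetical, simulated data -- not Reveal's or the MBA's actual records.
# In this toy world, denials are driven ENTIRELY by credit score, but the
# two fictional applicant groups draw scores from different distributions.
rng = np.random.default_rng(42)
n = 100_000
group = np.where(rng.random(n) < 0.8, "A", "B")
score = np.where(group == "A",
                 rng.normal(720, 60, n),   # group A: higher assumed mean score
                 rng.normal(670, 60, n))   # group B: lower assumed mean score

# A colorblind rule: denial probability depends only on the score.
p_deny = 1 / (1 + np.exp((score - 640) / 25))
denied = rng.random(n) < p_deny

df = pd.DataFrame({"group": group, "score": score, "denied": denied})

# Using only the two "stars" of group and outcome, a large raw gap appears:
print(df.groupby("group")["denied"].mean())

# Stratifying by the omitted variable, the within-band gaps mostly dissolve:
df["band"] = pd.cut(df["score"], bins=[0, 620, 680, 740, np.inf])
print(df.groupby(["band", "group"], observed=True)["denied"].mean().unstack())
```

The converse is just as instructive: if the within-band gaps in a real dataset refused to collapse, that residual would bolster Reveal’s constellation rather than MBA’s. The point is not who is right, but that the answer turns on stars neither party can see.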

At the heart of this discrepancy lie two issues that big data sets alone cannot solve and may actually exacerbate: 1) unknown data, and 2) analyst bias.


You Don’t Know What You Don’t Know

The Reveal story acknowledges that lenders claim to rely on DTI, LTV, and credit scores (though it argues that the way in which credit scores are computed disproportionately hurts minority applicants). It goes on to assert that these data were not available to its study because “the industry has fought to keep [them] hidden.”

Whether that accusation against the industry is true or not, it isn’t particularly germane to the validity of the analysis. The fact is that data exist that would likely bolster either Reveal’s claim or MBA’s rebuttal, but neither party had them. Absent those data, each made a prediction about what the unknown numbers would say based on its own preconceptions.
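When the decisive variable is missing, the honest move is to make the dependence on it explicit. The toy sensitivity analysis below is one way to do that; every number in it is assumed, and the logistic denial rule and score figures are invented purely for illustration. It sweeps over rival assumptions about the unknown quantity, here an unobserved average credit-score difference between two fictional groups, and shows how the conclusion hinges on the assumption:

```python
import numpy as np

# Toy sensitivity analysis -- every number here is assumed, not sourced.
# Question: how large a denial-rate gap could an *unobserved* credit-score
# difference produce on its own, under a score-only denial rule?
def avg_denial_rate(mean_score, n=200_000, sd=60, cutoff=640, scale=25):
    rng = np.random.default_rng(0)
    scores = rng.normal(mean_score, sd, n)
    return float((1 / (1 + np.exp((scores - cutoff) / scale))).mean())

observed_gap = 0.10                 # hypothetical raw gap to be explained
base_rate = avg_denial_rate(720)    # assumed majority-group score distribution

for score_gap in (0, 20, 40, 60):   # rival assumptions about the missing data
    predicted_gap = avg_denial_rate(720 - score_gap) - base_rate
    verdict = ("scores alone could explain it"
               if predicted_gap >= observed_gap
               else "a residual gap remains")
    print(f"assumed score gap {score_gap:>2}: "
          f"predicted denial gap {predicted_gap:.1%} -> {verdict}")
```

Neither party published anything like this; each simply asserted its preferred assumption. And that reliance on preconceptions introduces the next issue.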


Everyone Has Biases

Those who claim to be blindly “following the data” without any preconceived notion of what it might say are either being dishonest or deluding themselves. Eliminating bias is rarely practical. In most cases, the best we can do is try to be cognizant of our own biases (which is hard; it is much easier to spot others’ biases than our own) and make a concerted effort to keep them in check.

A quick perusal of the Reveal News website indicates that it tends to favor stories that fit a certain narrative—that traditionally underserved communities are the victims of various discriminatory policies and practices. Reveal may be correct about this, but there is little doubt that the site’s reporting—the stories it chooses to run and the data it chooses to include or ignore—is a function of this bias.

The MBA is an industry association representing the interests of mortgage lenders. As such, it has biases of its own. It claims to take fair lending seriously and is likely biased toward questioning the validity of any study that impugns the motives of its membership.

My personal biases tend to be more closely aligned with the MBA’s than with Reveal’s. My sense is that mortgage lenders respond to incentives the same way everybody else does: they are strongly incentivized to close loans, not to reject creditworthy applications. Lenders profit only when loans close, and rejections cost them real money. It would be foolish of them to turn away applications on the basis of anything other than cold, hard, colorblind underwriting criteria.

The Value of Self-Skepticism and Humility

I may have reason to believe that my biases are rooted in common sense, but they are still biases. Reveal doubtless feels the same way about its biases. Just because Reveal News is basing its reporting on a flawed study doesn’t necessarily mean that it is wrong about the existence of lending discrimination. Like Reveal, I don’t have access to the core underwriting data that would presumably vindicate MBA’s claim. Reveal believes the absence of these data reflects the lenders’ having something to hide. I’m inclined to think that lenders are simply following the law and protecting their customers’ privacy. But neither of us can say for sure, because we don’t actually have the data. We are simply making suppositions about data we don’t have, based entirely on our biases.

This is an example of a study that clearly did not have enough data to be valid, regardless of the ultimate accuracy of its conclusions. But it speaks to a broader danger that can accompany even studies without glaring methodological flaws. Every conclusion we draw is colored by our biases, and the world would be a better place for data analysts and for humanity in general if everyone were willing to be a little more humble. When I refer to humility in this context, I again borrow from Russ Roberts, who frequently speaks and writes about the topic; here is one representative example. At its heart, humility is the ability to draw a reasonable, objective conclusion based on a well-designed study of a sufficiently large, representative sample of data and still be able to sincerely ask oneself, “What if I’m wrong?”

This is best achieved by being willing to apply the same skepticism to our own conclusions that we have learned to apply to dubious conclusions advanced by others. During a recent episode of More or Less: Behind the Statistics (another of my favorite podcasts), economist Tim Harford presents a postcard-sized summary of such checks. I cite them here with an invitation to apply them in the mirror:

  1. Observe your feelings: Beware of confirmation bias and the natural human tendency to remember and repeat not only those things we are already inclined to believe are true, but, more dangerously, things we want to believe are true.
  2. Understand the claim: Headlines often distort what the underlying data are actually saying. While it is reasonable to write headlines and opening paragraphs that invite readers to continue on, we should be careful to ensure that these do not convey misleading information.
  3. Get the backstory: Almost as important as the factual accuracy of the analysis itself is the reason we think it is important. Agenda-driven analyses lead to the types of problems identified in item (1).
  4. Put things in perspective: Figures should be discussed and presented in appropriate context (often in comparison to other, related things) in order to be meaningful.
  5. Embrace imprecision: Roughly correct numbers are not only easier to remember and work with, they are sometimes a more accurate representation of reality. Giving false impressions of precision is misleading and tempts consumers of the results to draw inferences about the data that cannot reasonably be made (a brief numerical sketch follows this list).
  6. Be curious: Be willing to look for and embrace confounding variables and other uncomfortable results. As Harford puts it: “Treat surprising or counterintuitive claims not with suspicion nor open arms, but as mysteries to be solved. It’s fun.”
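To make item (5) concrete, here is a minimal sketch. The survey figures are invented for illustration; the point is only that the trailing digits of a point estimate can imply far more certainty than the sample supports:

```python
import numpy as np

# Hypothetical survey figures, invented for illustration.
n, k = 900, 93                     # respondents, and those answering "yes"
p_hat = k / n                      # point estimate of the proportion
se = np.sqrt(p_hat * (1 - p_hat) / n)            # standard error of a proportion
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se    # approximate 95% interval

print(f"point estimate: {p_hat:.4%}")           # 10.3333% -- spurious digits
print(f"95% interval:   {lo:.1%} to {hi:.1%}")  # roughly 8.3% to 12.3%
# "About 1 in 10" conveys everything this sample can support; quoting
# "10.3333%" invites inferences the data cannot reasonably sustain.
```

Rounding to “about 1 in 10” is not a loss of information; it is an honest statement of how much information there was.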

One of the most interesting things I have learned over the course of countless model validation engagements is that an arrogant modeler is a bad modeler. The better we get at cracking bigger and more sophisticated datasets, the more important humility becomes, because the resulting conclusions can seem so unassailable. We would be better served to constantly ask ourselves whether there is anything we have missed. At RiskSpan, I work with an extraordinarily talented team of data analysts, modelers, and model validators, a group of people who do good work precisely because, paradoxically, they take pride in their humility.