All Data Are Wrong, but Some Are Useful

Aleksi Aaltonen
Jan 18, 2024

Algorithms and complex models have been all the rage across management disciplines for some time. Yet data matter just as much: if you put a turkey in, it’s not going to come out as an eagle, however much you cook the data with your algorithms.

The problem is that you can never get data completely right. The moment you capture a phenomenon as data, you impose artificial distinctions on reality. Some distinctions are better than others and perhaps capture more important aspects of the phenomenon of interest but, to paraphrase the old saying:

All data are wrong, but some are useful.

Data are often treated as an unproblematic, ready-to-use resource measured by its volume. Such an assumption may be necessary and justified in the context of a particular research effort, but it should not obscure the fact that data are a much more complex matter than the implicit assumptions embedded in our statistical methods suggest.

Against this background, it is not surprising that critical scholarship on data typically faults one or another information system for not representing the world ‘as it really is’. This position rightly calls for attention to important nuances and omissions in any data or representation, but it fails to acknowledge a hard limitation of computational technology:

It is impossible to represent the world as it really is, in its full richness, as data.

If we only accept systems (or statistical analyses) that represent, for instance, certain people perfectly, we will never build any system. It is always possible to try to make the collected data more detailed and accurate, but the price of more nuanced, and in this sense better, data is increasing complexity. Complexity has a very real and often escalating cost for any effort to build an application in practice. Let me give you an example.

It is easier to work with a binary male vs. female variable than a multilevel classification in which diverse ways to define gender identity are not mutually exclusive. The latter will undoubtedly capture the gender identities of a diverse population better, but it also imposes a practical burden on the production and use of the data. We could even think of allowing people to describe their gender identity freely as text. This would provide maximum flexibility in capturing people’s gender identities, but it would also largely defeat the purpose for which we typically collect data, that is, to classify things so that we can work with a limited number of categories. If everyone is allowed to create their own category, there are no categories at all.
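The trade-off can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the option list and the free-text answers are made up for this example and do not come from any real survey, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical option list, chosen only for illustration.
OPTIONS = ["woman", "man", "non-binary", "transgender", "self-described"]


def single_choice_answers(options):
    """Mutually exclusive categories: one possible answer per option."""
    return len(options)


def multi_select_answers(options):
    """Non-exclusive categories: any non-empty subset of options is a valid answer."""
    return 2 ** len(options) - 1


def group_sizes(responses):
    """Count how many respondents share each distinct answer, compared verbatim."""
    return Counter(responses)


# Made-up free-text answers; nothing constrains how people phrase them.
free_text = ["woman", "Woman", "female", "man", "mostly masculine"]

print(single_choice_answers(["female", "male"]))  # 2 possible answers
print(multi_select_answers(OPTIONS))              # 31 possible answers
print(len(group_sizes(free_text)))                # 5 groups for 5 respondents
```

With two exclusive options there are two groups to analyze. Five non-exclusive options already allow 31 distinct answers. With free text, taken verbatim, each of the five made-up respondents ends up in a group of one, so any grouping has to be imposed afterwards by whoever cleans the data.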

Even if we can never have perfect data, we can strive for better data.

However, not being able to produce perfect data should not be used as an excuse for not striving for better data. We have increasingly powerful computational systems that enable us to handle much more complexity than before and, consequently, allow us to add detail to data where it matters the most. That is, the price of complexity we have to pay for better data does not stay the same but can decrease as technology rapidly evolves. We must use this opportunity to continuously improve how we capture important aspects of reality in data, and to minimize the violence that our data do to those who might not fit into traditional ways of categorizing the world.

This post is based on the notes I wrote for my opening statement in the “Algorithmic Bias and Data Injustice: Dark Side or Dark Matter?” panel symposium at the 83rd Annual Meeting of the Academy of Management (AoM) on August 7, 2023, in Boston, MA.

Aleksi Aaltonen

I am a management scholar and thinker who writes about data and the production of academic knowledge — www.aleksi.info