We expect data scientists and analysts to be objective and to base their conclusions on data. While the job title implies that “data” is the fundamental material of the work, it is entirely possible to lie with it. Data scientists are affected by unconscious biases, peer pressure and urgency, and on top of that there are inherent risks in the process of data analysis and interpretation that lead to lying. It happens all the time, even when the intentions are truly honest. As the saying goes, “The road to Hell is paved with good intentions”.
As every industry in every country is affected by the data revolution, we need to be aware of the dangerous mechanisms that can affect the output of any data project.
Averages, averages everywhere
The average is the most overused aggregation metric, and it creates lies everywhere. Whenever an average is reported, unless the underlying data is roughly symmetric (for example normally distributed, and it rarely is), it tells you very little about what a typical observation looks like. When the distribution is skewed, the average is pulled towards the long tail and stops describing the typical case. The average is not a robust statistic: it is very sensitive to outliers and to any deviation from a normal distribution.
And while this has been known to statisticians for decades, the average is still used in business, institutions and governments as a core statistic that drives billions, even trillions of dollars’ worth of decisions. So what’s the solution? Don’t use it! Stop doing that this instant and start thinking consciously about data distributions before reporting a statistical measure that only works in rare cases. As a first step, summarize your data with the median and with the 1st and 99th percentiles.
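To make the point concrete, here is a minimal sketch using only Python's standard `statistics` module. The salary figures are invented for illustration (nine typical employees and one executive); they are not real data.

```python
import statistics

# Hypothetical monthly salaries: nine typical employees and one executive.
salaries = [2800, 3000, 3100, 3200, 3300, 3400, 3500, 3700, 4000, 50000]

mean = statistics.mean(salaries)      # 8000: pulled up by the single outlier
median = statistics.median(salaries)  # 3350: close to what most people earn

# quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
cut_points = statistics.quantiles(salaries, n=100, method="inclusive")
p1, p99 = cut_points[0], cut_points[98]

# How many people actually earn less than "the average"?
below_mean = sum(1 for s in salaries if s < mean)

print(f"mean={mean:.0f}  median={median:.0f}  p1={p1:.0f}  p99={p99:.0f}")
print(f"{below_mean} of {len(salaries)} salaries are below the mean")
```

In this toy data set nine out of ten people earn less than the average, so reporting the mean alone would describe almost nobody. The median and the two tail percentiles together give a much more honest summary.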
“The Average” has been standing on the pedestal of data science (hell, of any science) for far too long. It has so many blind followers who don’t question it that we can almost consider it a religion. Why? Because the normal distribution assumptions made in the natural sciences a long time ago have spilled over into other fields, especially business analytics and other corporate data applications. This has poisoned generations of analysts who to this day still lie with averages.
Fitting data to hypothesis – confirmation bias
Now this is a classic. It starts even before you are handed the problem to solve with data, although the way the problem is handed over also feeds this bias. The way a data scientist views the case or problem to be solved can fundamentally change a process that is supposed to be objective. The bias intensifies when there are strong emotions, either expressed or implied, about the matter in question. It is typically very hard to spot, and spotting it is what separates truly exceptional data scientists from the average ones (pun intended).
A typical situation is a rushed analysis: there’s pressure to deliver the outcome fast because an important decision depends on it. A lot of biases kick in, but confirmation bias is the one that offers data scientists the easiest “way out”. The data scientist rushes to answer the question or solve the problem as soon as possible, which means that the first spurious correlation discovered can become the answer. In these situations evidence is sought only to confirm the hypothesis, hence “fitting data to hypothesis”.
This happens when preconceived notions about the “right” solution steer the data scientist in the wrong direction, where they start looking for proof. Objective data exploration doesn’t take place; instead the data is tweaked and squeezed to reach a conclusion that was already defined. A very important thing to do here is to define robust requirements from the very beginning and to collect evidence and data for conflicting hypotheses: evidence that supports the hypothesis, evidence that rejects it, and evidence that does neither. The last kind is also very important. Because of the “itch” to find a pattern or explanation (see more about it in the next item), the data scientist might miss the fact that there is not enough data to answer the question at all. That’s also fine, and maybe the question needs to be redefined.
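How easily a “first spurious correlation” turns up can be shown with a small simulation. This is only a sketch under stated assumptions: the target and all 200 candidate features are independent random noise, the sample sizes are arbitrary, and `pearson` is a helper written here for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)          # fixed seed so the run is repeatable
n_rows, n_features = 30, 200    # few observations, many candidate "drivers"

# Everything below is pure noise: no feature has any real link to the target.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]

correlations = [pearson(feature, target) for feature in features]
strongest = max(correlations, key=abs)

print(f"strongest correlation among {n_features} noise features: {strongest:+.2f}")
```

With only 30 rows and 200 candidates, the strongest correlation usually looks far from zero even though nothing real is there. A rushed analyst who stops at the first “hit” would report exactly this kind of result, which is why the number of hypotheses tried should be tracked alongside the ones that were confirmed.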
Finding “patterns” – a.k.a. clustering illusion
The human brain is so good at identifying patterns that it starts seeing them where they don’t exist. This is a lethal trap for the data scientist. Many data scientists are hired to “find” patterns, so the more patterns they find, the better they are presumed to be at their job. This false success metric leads to a lot of work being focused on the search for patterns, segments and “something peculiar”. More often than people expect, there is just a lot of noise and everything’s normal (pun intended, but normality not assumed).
This leads to tricky situations where the business gets patterns that don’t exist, makes decisions based on them, and eventually influences the actual population until these patterns really do emerge. Amazing. A very simple example is finding customer segments and trying to get customers to “convert” from one segment to another. When one “segment” is targeted and pushed towards another “segment”, the magic happens and there’s an actual impact. But this is very dangerous and can lead to many wrong and costly decisions.
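The segment example can be sketched in a few lines. The code below is a deliberately simple one-dimensional k-means written for illustration (the function `kmeans_1d`, the “spend” variable and all numbers are assumptions, not a reference implementation). The input is uniform random noise with no real groups in it.

```python
import random

def kmeans_1d(values, k, iterations=20, seed=0):
    """Very small 1-D k-means: returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical "customer spend": uniform noise, so there are no true segments.
rng = random.Random(42)
spend = [rng.uniform(0, 100) for _ in range(300)]

centers, clusters = kmeans_1d(spend, k=3)
for center, cluster in zip(centers, clusters):
    print(f"segment around {center:5.1f}: {len(cluster)} customers")
```

The algorithm reports three “segments” because it was asked for three, not because the data contains them. Before acting on a segmentation, it is worth checking whether the same procedure produces equally convincing segments on shuffled or simulated noise.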
Don’t be a data liar
This is definitely not an exhaustive list, and you should read about other cognitive biases that can affect your judgement and the quality of your insights. But these are very common traps that I have seen data scientists fall into, unintentionally making up lies instead of searching for the truth. Objectivity is not an easily achievable goal, and it requires a lot of discipline. With all of this data out there, the role of the data scientist will only become more important.
The most successful data scientists will put enormous focus on staying aware of the biases they can have and the lies these biases can lead to.