“What if we add these variables?” is the kind of question that can ruin your analytics project. Curiosity is a data scientist’s best friend, but it comes with a curse. Some call it analysis paralysis, others simply over-analysis; I call these situations “analytic rabbit holes”. Any data science project, be it in-depth statistical research, a machine learning model, or a simple business analysis, involves the same basic steps. Some sources break them down more finely and some describe them more generally, but the view I describe here makes the most sense from a real-world business perspective.
The process goes as follows: a data scientist defines a hypothesis, explores the data, and gains insights that help explain the hypothesis better. Then the loop begins: the new information makes it possible to refine the hypothesis and start “digging deeper”, repeating the data exploration, the insight generation and… the refinement of the hypothesis, again and again. It’s important to be conscious of this loop from the very beginning. Falling into an analytic rabbit hole starts here if one thing isn’t defined: the decision that the analysis is meant to support.
If the decision is not defined, or if it’s not the main goal of the analytic investigation, the project will go down the drain and into the rabbit hole. Why? Because over-analysis begins when the data scientist starts focusing on the hypothesis instead of the decision. The two approaches might look very similar, but in reality this is the fundamental difference between a successful data science project and an “analytic rabbit hole”. I am going to describe the two approaches and how one leads to success while the other is doomed to fail.
Hypothesis-focused. As the data exploration goes on, the hypothesis is constantly refined and new insights are discovered. The curse of this process is that, since the goal is to find the perfect answer to the hypothesis, the data scientist will fall into many traps, such as spurious correlations, where relationships are “discovered” between variables that are correlated but otherwise unrelated. Eventually the sheer number of ways to slice and analyze the data starts to take its toll: the hypothesis is broken into sub-segments, each of which has its own series of data points, assumptions and conflicting conclusions. A typical end for such a project is a happy data scientist presenting these immense findings to a non-technical team who get lost in the details before the data scientist reaches the second bullet point. The question that knocks the effort down goes something like this: “Can we do something about it?” That’s it. Weeks of work, and one question derails the whole effort.
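To make the spurious-correlation trap concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the target and all candidate variables are pure random noise, and the sample size, the number of candidates, and the function names are hypothetical. It tracks the strongest correlation seen so far as more and more unrelated variables are “added” to the analysis.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_noise_correlation(n_rows=30, n_candidates=200, seed=0):
    """Return the running maximum of |r| between a noise target and noise candidates.

    Entry k-1 of the result is the strongest absolute correlation found
    after trying the first k candidate variables.
    """
    rng = random.Random(seed)
    # Hypothetical setup: the target is noise, so no candidate can truly explain it.
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best_so_far = 0.0
    running_best = []
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_rows)]
        best_so_far = max(best_so_far, abs(pearson(candidate, target)))
        running_best.append(best_so_far)
    return running_best


if __name__ == "__main__":
    running_best = strongest_noise_correlation()
    for k in (1, 5, 20, 100, 200):
        print(f"after {k:>3} variables: strongest |r| = {running_best[k - 1]:.2f}")
```

Because the running maximum can never go down, every extra variable can only make the “best” correlation look stronger, even though none of the variables has anything to do with the target. That is the mechanism behind many rabbit-hole findings.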
Decision-focused. The focus of this exploration is to find ways to influence and improve a decision, and to test as soon as possible whether the change moves the needle. Then, and only then, can the hypothesis be refined. This doesn’t close the analytic loop, but it ensures that the data scientist focuses on discovering insights that can improve the impact of the underlying decision. In this case the focus is on how the project’s output affects the environment, and both the data scientist and the business can learn from how the environment responds to the data-refined actions. Hypothesis testing without any actual intervention that uses the generated insights is a perfect example of an analytic rabbit hole.
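One possible way to check whether an intervention “moves the needle” is a simple comparison between a control group and a group that received the data-refined action. The sketch below assumes the outcome is a yes/no success (a conversion, for example) and uses a standard two-proportion z-test. The function name and the counts in the usage example are made up for illustration; a real project may well call for a different test or experimental design.

```python
import math


def two_proportion_z_test(control_successes, control_n, treated_successes, treated_n):
    """Return (lift, z, two-sided p-value) for treated vs. control success rates."""
    p_control = control_successes / control_n
    p_treated = treated_successes / treated_n
    # Pooled success rate under the assumption that the action changed nothing.
    pooled = (control_successes + treated_successes) / (control_n + treated_n)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treated_n))
    z = (p_treated - p_control) / std_err
    # Standard normal CDF expressed through the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_treated - p_control, z, p_value


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    lift, z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"lift = {lift:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

The point is less the statistics than the order of work: the action is taken and measured first, and the response of the environment then tells you which refinement of the hypothesis is worth the time.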
While this may sound trivial, data scientists waste an enormous amount of time on hypothesis-focused projects. Left unchallenged, this hypothesis-focused philosophy might even ruin individual careers, or destroy the trust placed in the whole data science department. And believe me, it’s very tempting to wake up your inner geek and fall into the analytic rabbit hole every time you are handed a very cool and interesting hypothesis.
A data scientist’s gut feeling says that the main task of the job is to answer complex questions and gain in-depth insights. In reality it’s all about solving problems, and the only way to solve a problem is to act on it. Our goal as data scientists is to support tough and complex decisions with actionable, data-based recommendations. We are the ultimate internal consultants who drive actions through insights. And action with some insights is always better than no action with all the insights that could ever be discovered. So never forget to ask yourself: “What is the decision that this analysis supports?” It might save the project, and maybe even your career as a data scientist.