## Experimental Studies vs Observational Studies

Causal Inference is concerned with a very specific kind of prediction problem: predicting the results of an action, manipulation, or intervention. Hence, it becomes imperative to learn Causal Relationships in data, to make effective Decisions.

The gold standard of learning Causal Relationships has been Randomized control trials, which is a methodology for Experimental Studies. Scientists have been performing such experiments for decades, where one group is served with a potential drug (treatment group) and another is given a placebo(control group).

However, it is not possible to design experiments like these. For instance, if we want to study Causal Relationship between smoking and lung cancer, no one will smoke for years to please a Scientist. This kind of study needs to be performed using observational data, called Observational Studies.

## Motivating Causal Inference in Observational Studies

Why is Causal Inference needed in Observational Studies, when a mature discipline like Statistics exists? It turns out that Causal Inference is a very special kind of analysis. In fact, it is an addition to Statistics.

Traditional Statistics loves to answer the *what *type of questions. For instance, what is the correlation between Exercise and Cholesterol levels? At the most, they can answer the *how *i.e. How much is the correlation? But, it cannot answer questions like why. Another important aspect of the correlational analysis is that it lacks context, which basically gives the answers to *why*. Let’s understand this with the famous concept in Statistics, called Simpson’s Paradox.

## Simpson’s Paradox

Named after Edward Simpson (born 1922), the statistician who first popularized it, the paradox refers to the existence of data in which a statistical association that holds for an entire population is reversed in every subpopulation. Let’s understand this with an example.

Consider a study that measures weekly exercise and cholesterol. The image on the left-hand side shows the plot of Cholesterol vs weekly Exercise. From the data, it seems that as exercise increases, cholesterol levels increase. However, it is counterintuitive.

As Sherlock Holmes says, ‘It is a capital mistake to theorize before you have all the evidence.’ In this case, we need additional data to get the context. That data is the age of the patient. The graph on the right-hand side shows how the age-segregated data in order to compare same-age people and thereby eliminate the possibility that the high exercisers in each group we examine are more likely to have high cholesterol due to their age, and not due to exercising.

This is a classic example of Simpson’s Paradox where the entire population shows one picture, while a subpopulation shows the reverse. The key here is the story behind the data. If we know that older people, who are more likely to exercise, are also more likely to have high cholesterol regardless of exercise, then the reversal is easily explained, and easily resolved. Age is a common *cause* (also called Confounder) of both treatments (exercise) and outcomes (cholesterol).

This highlights the importance of Causal Analysis since it gives us the complete picture of what’s happening and why something is happening. We know that Exercise does have an effect on Cholesterol levels. However, plainly looking at correlational analysis won’t help.

## Next Steps: Towards Interventions

Note that the above knowledge won’t help us make any decision from the data. It requires another set of tools apart from the one available to us from Statistical Inference. Let me repeat again, Causal Inference is concerned with a very specific kind of prediction problem: predicting the results of an action, manipulation, or intervention. Thus, we need a systematic framework to do the following:

- Explore potential Causal Relationships and build Causal Assumptions
- Model the Causal Assumptions
- Test the Model
- Perform Interventions to get actionable insights.

We will cover these steps in another article. We hope that this article has been informative. However, we do not claim any guarantees regarding the accuracy or completeness of the article.

**References**:

- Statistics by David Freedman
- Causal Inference by Dr Judea Pearl
- In one of our previous posts i.e.
**Mind over Data – Towards Causality in AI**, we elaborated on the need for Causality. However, that was from an AI/ML or Intelligent Systems standpoint.

Featured Image Credit: **Wikipedia**