# Statistics - (Case-control|retrospective) sampling

## 1 - About

Case-control sampling is used in studies in which subjects are **not randomized** to exposed or unexposed groups. Instead, the subjects are observed in order to determine both:

- their exposure
- and their outcome status

The exposure is thus **NOT** determined by the researcher.

Case-control sampling is most effective when the prior probabilities of the classes are very unequal, that is, when cases are rare in the population.

## 2 - Articles Related

## 3 - Example

For example, in a study trying to show that people who smoke (the exposure) are more likely to be diagnosed with lung cancer (the outcome):

- the cases would be persons with lung cancer,
- the controls would be persons without lung cancer (not necessarily healthy),
- and some of each group would be smokers (the exposed).

If a larger proportion of the cases than of the controls smoke, this suggests, but does not conclusively show, that the hypothesis is valid.
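The comparison above can be sketched in a few lines of Python. All counts are hypothetical and invented for illustration only. The odds ratio is not named in the text above; it is added here as an assumption, since it is a common way to summarize such a comparison.

```python
# Hypothetical counts, invented for illustration (not real study data).
CASES_SMOKERS, CASES_TOTAL = 80, 100        # persons with lung cancer
CONTROLS_SMOKERS, CONTROLS_TOTAL = 40, 100  # persons without lung cancer


def exposure_proportion(exposed: int, total: int) -> float:
    """Share of a group that carries the exposure."""
    return exposed / total


def odds_ratio(cases_exposed: int, cases_total: int,
               controls_exposed: int, controls_total: int) -> float:
    """Odds of exposure among cases divided by odds of exposure among controls."""
    case_odds = cases_exposed / (cases_total - cases_exposed)
    control_odds = controls_exposed / (controls_total - controls_exposed)
    return case_odds / control_odds


p_cases = exposure_proportion(CASES_SMOKERS, CASES_TOTAL)
p_controls = exposure_proportion(CONTROLS_SMOKERS, CONTROLS_TOTAL)
ratio = odds_ratio(CASES_SMOKERS, CASES_TOTAL, CONTROLS_SMOKERS, CONTROLS_TOTAL)

print(f"smokers among cases:    {p_cases:.2f}")
print(f"smokers among controls: {p_controls:.2f}")
print(f"odds ratio:             {ratio:.2f}")
```

A ratio above 1 means the exposure is more common among the cases. As stated above, this is suggestive only and does not establish the hypothesis.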

### 3.1 - Control Group

The control group should represent those at risk of becoming a case. Controls should come from the same population as the cases, and their selection should be independent of the exposures.

Controls may have the same disease as the cases but at a different grade or severity, so that they still differ from the outcome of interest. However, because the difference between the cases and the controls is then smaller, the study has less power to detect an exposure effect.

## 4 - Why ?

### 4.1 - Expensive and long

The most obvious way to study the risk factors for a disease would be to take a large group of people, maybe 1,000 or 100,000, record their risk factors, follow them for maybe 20 years, and see who gets the disease and who doesn't.

That is actually a good way to do things, except that it is very expensive and takes a long time: you have to recruit a lot of people and wait for many years.

It also yields few cases. With a risk of 0.05 and 1,000 people, we'd get only 50 cases, so it's not very practical.
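As a quick check of the arithmetic behind a prospective study, the sketch below computes the expected number of cases for a few cohort sizes. The 5% risk is an illustrative assumption, and `expected_cases` is a hypothetical helper name.

```python
def expected_cases(n_subjects: int, risk: float) -> float:
    """Expected number of subjects who develop the disease."""
    return n_subjects * risk


RISK = 0.05  # assumed probability of developing the disease over the follow-up

for n_subjects in (1_000, 100_000):
    print(f"{n_subjects:>7} subjects -> {expected_cases(n_subjects, RISK):.0f} expected cases")
```

Even a cohort of 1,000 people followed for years produces only a few dozen cases, which is the quantity that matters for estimation.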

Case-control sampling is a lot more attractive. It works retrospectively rather than prospectively.

Rather than taking people and following them forward in time, you sample people who are known to have the disease (heart disease, for example): the cases.

You also get a comparison sample of people who do not have the disease: the controls. Then you record the risk factors of both groups.

So it's much cheaper, and it's much quicker to do. And that's why case-control sampling is a very commonly used technique in epidemiology.

### 4.2 - Low Percentage of cases

In many modern data sets, we'll have very imbalanced situations.

For instance, with five clicks in 1,000 impressions, the click-through rate (CTR) is 0.5%, but you don't need to use all the 0/1 data to fit the model.

The main point is that the variance of your parameter estimates is ultimately driven by the number of cases you have, which is the smaller class.
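A minimal sketch of this idea with NumPy: keep every case and only a random subset of the controls. The function name `case_control_sample`, the default of five controls per case, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def case_control_sample(X, y, controls_per_case=5, seed=0):
    """Keep all cases (y == 1) and a random subset of controls (y == 0)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    case_idx = np.flatnonzero(y == 1)
    control_idx = np.flatnonzero(y == 0)
    # Never ask for more controls than are available.
    n_controls = min(len(control_idx), controls_per_case * len(case_idx))
    kept_controls = rng.choice(control_idx, size=n_controls, replace=False)
    keep = np.sort(np.concatenate([case_idx, kept_controls]))
    return X[keep], y[keep]


# Simulated data: 1,000 impressions, 5 clicks, 2 arbitrary features.
rng = np.random.default_rng(42)
X_full = rng.normal(size=(1000, 2))
y_full = np.zeros(1000, dtype=int)
y_full[:5] = 1

X_sub, y_sub = case_control_sample(X_full, y_full, controls_per_case=5)
print("rows kept:", len(y_sub), "cases kept:", int(y_sub.sum()))
```

The subsample keeps all five cases, so little information about the rare class is lost, while the model is fitted on far fewer rows.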

## 5 - Logistic Regression

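Case-control sampling combines well with logistic regression.

With case-control samples, we can still estimate the regression parameters <math>B_i</math> accurately (if our model is correct), but the estimate of the constant term <math>B_0</math> is incorrect, because the proportion of cases in the sample differs from the proportion in the population.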

We can correct the estimated intercept by a simple transformation: <MATH> B^*_0 = B_0 + \log \left ( \frac{\displaystyle \pi}{\displaystyle 1-\pi} \right ) - \log \left ( \frac{\displaystyle \tilde{\pi}}{\displaystyle 1-\tilde{\pi}} \right ) </MATH>

where:

- <math>\tilde{\pi}</math> is the proportion of cases in the case-control sample,
- and <math>\pi</math> is the true proportion of cases in the population.
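The intercept correction can be sketched as a small Python helper: the log odds of the population prevalence are added and the log odds of the sample prevalence are subtracted. The names `logit` and `correct_intercept` and all numeric values below are hypothetical, chosen only for illustration.

```python
import math


def logit(p: float) -> float:
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def correct_intercept(b0_hat: float, pi: float, pi_tilde: float) -> float:
    """Shift an intercept fitted on a case-control sample to the population scale.

    b0_hat   -- intercept estimated from the case-control sample
    pi       -- proportion of cases in the population
    pi_tilde -- proportion of cases in the case-control sample
    """
    return b0_hat + logit(pi) - logit(pi_tilde)


# Hypothetical values: 0.5% cases in the population, 5 cases among 30 sampled rows.
b0_hat = -1.0
pi = 0.005
pi_tilde = 5 / 30

b0_star = correct_intercept(b0_hat, pi, pi_tilde)
print(f"fitted intercept:    {b0_hat:.4f}")
print(f"corrected intercept: {b0_star:.4f}")
```

When cases are over-represented in the sample (<math>\tilde{\pi} > \pi</math>), the corrected intercept is lower than the fitted one, which pulls the predicted probabilities back down to the population scale. The slope estimates <math>B_i</math> are left unchanged.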