What Is a Case-Control Study? | Definition & Examples

Published on February 4, 2023 by Tegan George. Revised on June 22, 2023.

A case-control study is an observational study design that compares a group of participants possessing a condition of interest to a very similar group lacking that condition. Here, the participants possessing the attribute of study, such as a disease, are called the “cases,” and those without it are the “controls.”

It’s important to remember that the case group is chosen because its members already possess the attribute of interest. The point of the control group is to facilitate comparison, e.g., testing whether a suspected exposure occurs systematically more often in the case group than in the control group.

Table of contents

  • When to use a case-control study
  • Examples of case-control studies
  • Advantages and disadvantages of case-control studies
  • Frequently asked questions

Case-control studies are a type of observational study often used in fields like medical research, environmental health, or epidemiology. While most observational studies are qualitative in nature, case-control studies can also be quantitative , and they often are in healthcare settings. Case-control studies can be used for both exploratory and explanatory research , and they are a good choice for studying research topics like disease exposure and health outcomes.

A case-control study may be a good fit for your research if it meets the following criteria.

  • Data on exposure (e.g., to a chemical or a pesticide) are difficult to obtain or expensive.
  • The disease associated with the exposure you’re studying has a long incubation period or is rare or under-studied (e.g., AIDS in the early 1980s).
  • The population you are studying is difficult to contact for follow-up questions (e.g., asylum seekers).

Retrospective cohort studies use existing secondary research data, such as medical records or databases, to identify a group of people with a common exposure or risk factor and to observe their outcomes over time. Case-control studies conduct primary research , comparing a group of participants possessing a condition of interest to a very similar group lacking that condition in real time.


Case-control studies are common in fields like epidemiology, healthcare, and psychology.

Example: Environmental health case-control study
You are interested in whether exposure to contaminated drinking water is associated with an increased risk of gastrointestinal illness. Here, the case group would be individuals who have been diagnosed with a gastrointestinal illness, while the control group would be similar individuals without one.

You would then collect data on your participants’ exposure to contaminated drinking water, focusing on variables such as the source of said water and the duration of exposure, for both groups. You could then compare the two to determine if there is a relationship between drinking water contamination and the risk of developing a gastrointestinal illness.

Example: Healthcare case-control study
You are interested in the relationship between the dietary intake of a particular vitamin (e.g., vitamin D) and the risk of developing osteoporosis later in life. Here, the case group would be individuals who have been diagnosed with osteoporosis, while the control group would be individuals without osteoporosis.

You would then collect information on dietary intake of vitamin D for both the cases and controls and compare the two groups to determine if there is a relationship between vitamin D intake and the risk of developing osteoporosis.

Example: Psychology case-control study
You are studying the relationship between early-childhood stress and the likelihood of later developing post-traumatic stress disorder (PTSD). Here, the case group would be individuals who have been diagnosed with PTSD, while the control group would be individuals without PTSD.

Case-control studies are a solid research method choice, but they come with distinct advantages and disadvantages.

Advantages of case-control studies

  • Case-control studies are a great choice if you have any ethical considerations about your participants that could preclude you from using a traditional experimental design .
  • Case-control studies are time efficient and fairly inexpensive to conduct because they require fewer subjects than other research methods .
  • If multiple exposures lead to a single outcome, case-control studies can incorporate that. As such, they truly shine when used to study rare outcomes or outbreaks of a particular disease.

Disadvantages of case-control studies

  • Case-control studies, like other observational studies, run a high risk of research biases. They are particularly susceptible to observer bias, recall bias, and interviewer bias.
  • If the exposure of interest is very rare, conducting a case-control study can be very time-consuming and inefficient.
  • Case-control studies in general have low internal validity and are not always credible.

Case-control studies by design focus on one singular outcome. This makes them very rigid and not generalizable , as no extrapolation can be made about other outcomes like risk recurrence or future exposure threat. This leads to less satisfying results than other methodological choices.


A case-control study differs from a cohort study because cohort studies are more longitudinal in nature and do not necessarily require a control group .

While one may be added if the investigator so chooses, members of the cohort are primarily selected because of a shared characteristic among them. In particular, retrospective cohort studies are designed to follow a group of people with a common exposure or risk factor over time and observe their outcomes.

Case-control studies, in contrast, require both a case group and a control group, as suggested by their name, and usually are used to identify risk factors for a disease by comparing cases and controls.

A case-control study differs from a cross-sectional study because case-control studies are naturally retrospective in nature, looking backward in time to identify exposures that may have occurred before the development of the disease.

On the other hand, cross-sectional studies collect data on a population at a single point in time. The goal here is to describe the characteristics of the population, such as their age, gender identity, or health status, and understand the distribution and relationships of these characteristics.

Cases and controls are selected for a case-control study based on their inherent characteristics. Participants who already possess the condition of interest form the “cases,” while similar participants without it form the “controls.”

Keep in mind that, by definition, the case group is chosen because its members already possess the attribute of interest. The point of the control group is to facilitate comparison, e.g., testing whether a suspected exposure occurs systematically more often in the case group than in the control group.

The strength of the association between an exposure and a disease in a case-control study is usually measured with an odds ratio (OR). Because participants are sampled based on their outcome status, a case-control study cannot measure disease risk directly, but when the outcome is rare the odds ratio closely approximates the relative risk (RR).

No, case-control studies cannot establish causality as a standalone measure.

As observational studies , they can suggest associations between an exposure and a disease, but they cannot prove without a doubt that the exposure causes the disease. In particular, issues arising from timing, research biases like recall bias , and the selection of variables lead to low internal validity and the inability to determine causality.


What Is A Case Control Study?

By Julia Simkus (Editor at Simply Psychology), Saul McLeod, PhD (Editor-in-Chief), and Olivia Guy-Evans, MSc (Associate Editor)

A case-control study is a research method where two groups of people are compared – those with the condition (cases) and those without (controls). By looking at their past, researchers try to identify what factors might have contributed to the condition in the ‘case’ group.

Explanation

A case-control study looks at people who already have a certain condition (cases) and people who don’t (controls). By comparing these two groups, researchers try to figure out what might have caused the condition. They look into the past to find clues, like habits or experiences, that are different between the two groups.

The “cases” are the individuals with the disease or condition under study, and the “controls” are similar individuals without the disease or condition of interest.

The controls should have similar characteristics (e.g., age, sex, demographics, health status) to the cases to mitigate the effects of confounding variables.

Case-control studies identify any associations between an exposure and an outcome and help researchers form hypotheses about a particular population.

Researchers will first identify the two groups, and then look back in time to investigate which subjects in each group were exposed to the condition.

If the exposure is found more commonly in the cases than the controls, the researcher can hypothesize that the exposure may be linked to the outcome of interest.
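
As a rough illustration of that comparison, here is a minimal Python sketch; the counts are invented for illustration and do not come from any study cited here.

```python
# Hypothetical exposure counts in a case-control comparison.
cases_exposed, cases_total = 40, 100        # people with the condition
controls_exposed, controls_total = 20, 100  # similar people without it

case_rate = cases_exposed / cases_total
control_rate = controls_exposed / controls_total

print(f"Exposure among cases:    {case_rate:.0%}")    # 40%
print(f"Exposure among controls: {control_rate:.0%}")  # 20%

# A higher exposure rate among cases suggests (but does not prove)
# that the exposure may be linked to the outcome.
if case_rate > control_rate:
    print("Exposure is more common among cases; worth investigating further.")
```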

Figure: Schematic diagram of case-control study design. From Kenneth F. Schulz and David A. Grimes (2002), Case-control studies: research in reverse. The Lancet, 359(9304), 431–434.

Strengths

Quick, inexpensive, and simple

Because these studies use already existing data and do not require any follow-up with subjects, they tend to be quicker and cheaper than other types of research. Case-control studies also do not require large sample sizes.

Beneficial for studying rare diseases

Researchers in case-control studies start with a population of people known to have the target disease instead of following a population and waiting to see who develops it. This enables researchers to identify current cases and enroll a sufficient number of patients with a particular rare disease.

Useful for preliminary research

Case-control studies are beneficial for an initial investigation of a suspected risk factor for a condition. The information obtained from case-control studies then enables researchers to conduct further data analyses to explore any relationships in more depth.

Limitations

Subject to recall bias

Participants might be unable to remember when they were exposed or omit other details that are important for the study. In addition, those with the outcome are more likely to recall and report exposures more clearly than those without the outcome.

Difficulty finding a suitable control group

It is important that the case group and the control group have almost the same characteristics, such as age, gender, demographics, and health status.

Forming an accurate control group can be challenging, so sometimes researchers enroll multiple control groups to bolster the strength of the case-control study.

Do not demonstrate causation

Case-control studies may show an association between exposures and outcomes, but they cannot demonstrate causation.

A case-control study is an observational study in which researchers analyze two groups of people (cases and controls) to look at factors associated with particular diseases or outcomes.

Below are some examples of case-control studies:
  • Investigating the impact of exposure to daylight on the health of office workers (Boubekri et al., 2014).
  • Comparing serum vitamin D levels in individuals who experience migraine headaches with their matched controls (Togha et al., 2018).
  • Analyzing correlations between parental smoking and childhood asthma (Strachan and Cook, 1998).
  • Studying the relationship between elevated concentrations of homocysteine and an increased risk of vascular diseases (Ford et al., 2002).
  • Assessing the magnitude of the association between Helicobacter pylori and the incidence of gastric cancer (Helicobacter and Cancer Collaborative Group, 2001).
  • Evaluating the association between breast cancer risk and saturated fat intake in postmenopausal women (Howe et al., 1990).

Frequently asked questions

1. What’s the difference between a case-control study and a cross-sectional study?

Case-control studies are different from cross-sectional studies in that case-control studies compare groups retrospectively while cross-sectional studies analyze information about a population at a specific point in time.

In  cross-sectional studies , researchers are simply examining a group of participants and depicting what already exists in the population.

2. What’s the difference between a case-control study and a longitudinal study?

Case-control studies compare groups retrospectively, while longitudinal studies can compare groups either retrospectively or prospectively.

In a  longitudinal study , researchers monitor a population over an extended period of time, and they can be used to study developmental shifts and understand how certain things change as we age.

In addition, case-control studies can be conducted with relatively small groups of cases and controls, whereas longitudinal studies typically follow a large group of subjects over an extended period.

3. What’s the difference between a case-control study and a retrospective cohort study?

Case-control studies are retrospective as researchers begin with an outcome and trace backward to investigate exposure; however, they differ from retrospective cohort studies.

In a  retrospective cohort study , researchers examine a group before any of the subjects have developed the disease, then examine any factors that differed between the individuals who developed the condition and those who did not.

Thus, in a retrospective cohort study the groups are defined by exposure and the outcome is then ascertained, whereas in a case-control study the groups are defined by outcome and exposure is traced backward.

References

Boubekri, M., Cheung, I., Reid, K., Wang, C., & Zee, P. (2014). Impact of windows and daylight exposure on overall health and sleep quality of office workers: a case-control pilot study. Journal of Clinical Sleep Medicine, 10(6), 603-611.

Ford, E. S., Smith, S. J., Stroup, D. F., Steinberg, K. K., Mueller, P. W., & Thacker, S. B. (2002). Homocyst(e)ine and cardiovascular disease: a systematic review of the evidence with special emphasis on case-control studies and nested case-control studies. International Journal of Epidemiology, 31(1), 59-70.

Helicobacter and Cancer Collaborative Group. (2001). Gastric cancer and Helicobacter pylori: a combined analysis of 12 case control studies nested within prospective cohorts. Gut, 49 (3), 347-353.

Howe, G. R., Hirohata, T., Hislop, T. G., Iscovich, J. M., Yuan, J. M., Katsouyanni, K., … & Shunzhang, Y. (1990). Dietary factors and risk of breast cancer: combined analysis of 12 case—control studies. JNCI: Journal of the National Cancer Institute, 82 (7), 561-569.

Lewallen, S., & Courtright, P. (1998). Epidemiology in practice: case-control studies. Community eye health, 11 (28), 57–58.

Strachan, D. P., & Cook, D. G. (1998). Parental smoking and childhood asthma: longitudinal and case-control studies. Thorax, 53 (3), 204-212.

Tenny, S., Kerndt, C. C., & Hoffman, M. R. (2021). Case Control Studies. In StatPearls . StatPearls Publishing.

Togha, M., Razeghi Jahromi, S., Ghorbani, Z., Martami, F., & Seifishahpar, M. (2018). Serum Vitamin D Status in a Group of Migraine Patients Compared With Healthy Controls: A Case-Control Study. Headache, 58 (10), 1530-1540.

Further Information

  • Schulz, K. F., & Grimes, D. A. (2002). Case-control studies: research in reverse. The Lancet, 359(9304), 431-434.
  • What is a case-control study?


Statistics By Jim

Case Control Study: Definition, Benefits & Examples

By Jim Frost

What is a Case Control Study?

A case control study is a retrospective, observational study that compares two existing groups. Researchers form these groups based on the existence of a condition in the case group and the lack of that condition in the control group. They evaluate the differences in the histories between these two groups looking for factors that might cause a disease.


By evaluating differences in exposure to risk factors between the case and control groups, researchers can learn which factors are associated with the medical condition.

For example, medical researchers study disease X and use a case-control study design to identify risk factors. They create two groups using available medical records from hospitals. Individuals with disease X are in the case group, while those without it are in the control group. If the case group has more exposure to a risk factor than the control group, that exposure is a potential cause for disease X. However, case-control studies establish only correlation and not causation. Be aware of spurious correlations!

Case-control studies are observational studies because researchers do not control the risk factors—they only observe them. They are retrospective studies because the scientists create the case and control groups after the outcomes for the subjects (e.g., disease vs. no disease) are known.

This post explains the benefits and limitations of case-control studies, controlling confounders, and analyzing and interpreting the results. I close with an example case control study showing how to calculate and interpret the results.

Learn more about Experimental Design: Definition, Types, and Examples .

Related posts : Observational Studies Explained and Control Groups in Experiments

Benefits of a Case Control Study

A case control study is a relatively quick and simple design. These studies frequently use existing patient data, and the experimenters form the groups after the outcomes are known. Researchers do not conduct an experiment. Instead, they look for differences between the case and control groups that are potential risk factors for the condition. Small groups and individual facilities can conduct case-control studies, unlike other more intensive types of experiments.

Case-control studies are perfect for evaluating outbreaks and rare conditions. Researchers simply need to let a sufficient number of known cases accumulate in an established database. The alternative would be to select a large random sample and hope that the condition afflicts it eventually.

A case control study can provide rapid results during outbreaks where the researchers need quick answers. They are ideal for the preliminary investigation phase, where scientists screen potential risk factors. As such, they can point the way for more thorough, time-consuming, and expensive studies. They are especially beneficial when the current state of science knows little about the connection between risk factors and the medical condition. And when you need to identify potential risk factors quickly!

Cohort studies are another type of observational study that are similar to case-control studies, but there are some important differences. To learn more, read my post about Cohort Studies .

Limitations of a Case Control Study

Because case-control studies are observational, they cannot establish causality and provide lower quality evidence than other experimental designs, such as randomized controlled trials . Additionally, as you’ll see in the next section, this type of study is susceptible to confounding variables unless experimenters correctly match traits between the two groups.

A case-control study typically depends on health records. If the necessary data exist in sources available to the researchers, all is good. However, the investigation becomes more complicated if the data are not readily available.

Case-control studies can incorporate biases from the underlying data sources. For example, researchers frequently obtain patient data from hospital records. The population of hospital patients is likely to differ from the general population. Even the control patients are in the hospital for some reason—they likely have serious health problems. Consequently, the subjects in case-control studies are likely to differ from the general population, which reduces the generalizability of the results.

A case-control study cannot estimate incidence or prevalence rates for the disease. The data from these studies do not allow you to calculate the probability of a new person contracting the condition in a given period nor how common it is in the population. This limitation occurs because case-control studies do not use a representative sample.

Case-control studies cannot determine the time between exposure and onset of the medical condition. In fact, case-control studies cannot reliably assess each subject’s exposure to risk factors over time. Longitudinal studies, such as prospective cohort studies, can better make those types of assessment.

Related post : Causation versus Correlation in Statistics

Use Matching to Control Confounders

Because case-control studies are observational studies, they are particularly vulnerable to confounding variables and spurious correlations . A confounder correlates with both the risk factor and the outcome variable. Because observational studies don’t use random assignment to equalize confounders between the case and control groups, they can become unbalanced and affect the results.

Unfortunately, confounders can be the actual cause of the medical condition rather than the risk factor that the researchers identify. If a case-control study does not account for confounding variables, it can bias the results and make them untrustworthy.

Case-control studies typically use trait matching to control confounders. This technique involves selecting study participants for the case and control groups with similar characteristics, which helps equalize the groups for potential confounders. Equalizing confounders limits their impact on the results.

Ultimately, the goal is to create case and control groups that have equal risks for developing the condition/disease outside the risk factors the researchers are explicitly assessing. Matching facilitates valid comparisons between the two groups because the controls are similar to cases. The researchers use subject-area knowledge to identify characteristics that are critical to match.

Note that you cannot assess matching variables as potential risk factors. You’ve intentionally equalized them across the case and control groups and, consequently, they do not correlate with the condition. Hence, do not use the risk factors you want to evaluate as trait matching variables.
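
As an illustration of trait matching, the sketch below pairs each case with one unused control of the same sex and a similar age. The records, the matching variables, and the five-year age tolerance are all invented; real studies choose matching variables from subject-area knowledge, as described above.

```python
# Minimal 1:1 matching sketch. All records and the age tolerance are invented.
cases = [
    {"id": "case-1", "age": 62, "sex": "F"},
    {"id": "case-2", "age": 45, "sex": "M"},
]
control_pool = [
    {"id": "ctrl-1", "age": 30, "sex": "M"},
    {"id": "ctrl-2", "age": 60, "sex": "F"},
    {"id": "ctrl-3", "age": 47, "sex": "M"},
]

AGE_TOLERANCE = 5  # years

matched_pairs = []
available = list(control_pool)
for case in cases:
    for control in available:
        same_sex = control["sex"] == case["sex"]
        close_age = abs(control["age"] - case["age"]) <= AGE_TOLERANCE
        if same_sex and close_age:
            matched_pairs.append((case["id"], control["id"]))
            available.remove(control)  # each control is matched at most once
            break

print(matched_pairs)  # [('case-1', 'ctrl-2'), ('case-2', 'ctrl-3')]
```

Matched designs are then usually analyzed with methods that respect the pairing (for example, conditional logistic regression), which is beyond this sketch.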

Learn more about confounding variables .

Statistical Analysis of a Case Control Study

Researchers frequently include two controls for each case to increase statistical power for a case-control study. Adding even more controls per case provides few statistical benefits, so studies usually do not use more than a 2:1 control to case ratio.

For statistical results, case-control studies typically produce an odds ratio for each potential risk factor. The equation below shows how to calculate an odds ratio for a case-control study.

\[
\text{OR} = \frac{\text{odds of exposure among cases}}{\text{odds of exposure among controls}} = \frac{a/c}{b/d} = \frac{a \times d}{b \times c}
\]

where a = exposed cases, b = exposed controls, c = unexposed cases, and d = unexposed controls.

Notice how this ratio takes the exposure odds in the case group and divides it by the exposure odds in the control group. Consequently, it quantifies how much higher the odds of exposure are among cases than the controls.

In general, odds ratios greater than one flag potential risk factors because they indicate that exposure was higher in the case group than in the control group. Furthermore, higher ratios signify stronger associations between exposure and the medical condition.

An odds ratio of one indicates that exposure was the same in the case and control groups. Nothing to see here!

Ratios less than one might identify protective factors.

Learn more about Understanding Ratios .
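
As a minimal sketch of the arithmetic (the cell counts below are placeholders, not data from any study):

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (exposed_cases / unexposed_cases) / (exposed_controls / unexposed_controls)

# Placeholder counts: exposure is twice as common among cases (30%) as controls (15%).
print(odds_ratio(exposed_cases=30, unexposed_cases=70,
                 exposed_controls=15, unexposed_controls=85))  # ~2.43
```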

Now, let’s bring this to life with an example!

Example Odds Ratio in a Case-Control Study

The Kent County Health Department in Michigan conducted a case-control study in 2005 for a company lunch that produced an outbreak of vomiting and diarrhea. Out of multiple lunch ingredients, researchers found the following exposure rates for lettuce consumption.

                        Cases (ill)    Controls (not ill)
Ate lettuce                  53                33
Did not eat lettuce           1                 7

By plugging these numbers into the equation, we can calculate the odds ratio for lettuce in this case-control study.

\[
\text{OR} = \frac{53/1}{33/7} = \frac{53 \times 7}{33 \times 1} \approx 11.2
\]

The study determined that the odds ratio for lettuce is 11.2.

This ratio indicates that the odds of having eaten lettuce were 11.2 times higher among those with symptoms than among those without. These results raise a big red flag for contaminated lettuce being the culprit!

Learn more about Odds Ratios.
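
To reproduce the lettuce calculation, here is a short Python sketch. The 95% confidence interval uses the standard normal approximation on the log odds ratio (Woolf's method), which is one common choice; the original investigation may have used a different method.

```python
import math

# Counts from the outbreak table above.
exposed_cases, unexposed_cases = 53, 1        # ill attendees who did / did not eat lettuce
exposed_controls, unexposed_controls = 33, 7  # well attendees who did / did not eat lettuce

odds_ratio = (exposed_cases / unexposed_cases) / (exposed_controls / unexposed_controls)

# Normal-approximation (Woolf) 95% confidence interval on the log odds ratio.
se_log_or = math.sqrt(1 / exposed_cases + 1 / unexposed_cases +
                      1 / exposed_controls + 1 / unexposed_controls)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.1f}, 95% CI ({lower:.1f}, {upper:.1f})")
# OR = 11.2, 95% CI roughly (1.3, 95.5)
```

The very wide interval reflects the single ill attendee who did not eat lettuce: the association is strong, but the estimate is imprecise.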

Epidemiology in Practice: Case-Control Studies (NIH)

Interpreting Results of Case-Control Studies (CDC)


Case-Control Studies


Introduction

Cohort studies have an intuitive logic to them, but they can be very problematic when one is investigating outcomes that only occur in a small fraction of exposed and unexposed individuals. They can also be problematic when it is expensive or very difficult to obtain exposure information from a cohort. In these situations a case-control design offers an alternative that is much more efficient. The goal of a case-control study is the same as that of cohort studies, i.e., to estimate the magnitude of association between an exposure and an outcome. However, case-control studies employ a different sampling strategy that gives them greater efficiency.

Learning Objectives

After completing this module, the student will be able to:

  • Define and explain the distinguishing features of a case-control study
  • Describe and identify the types of epidemiologic questions that can be addressed by case-control studies
  • Define what is meant by the term "source population"
  • Describe the purpose of controls in a case-control study
  • Describe differences between hospital-based and population-based case-control studies
  • Describe the principles of valid control selection
  • Explain the importance of using specific diagnostic criteria and explicit case definitions in case-control studies
  • Estimate and interpret the odds ratio from a case-control study
  • Identify the potential strengths and limitations of case-control studies

Overview of Case-Control Design

In the module entitled Overview of Analytic Studies it was noted that Rothman describes the case-control strategy as follows:

"Case-control studies are best understood by considering as the starting point a source population , which represents a hypothetical study population in which a cohort study might have been conducted. The source population is the population that gives rise to the cases included in the study. If a cohort study were undertaken, we would define the exposed and unexposed cohorts (or several cohorts) and from these populations obtain denominators for the incidence rates or risks that would be calculated for each cohort. We would then identify the number of cases occurring in each cohort and calculate the risk or incidence rate for each. In a case-control study the same cases are identified and classified as to whether they belong to the exposed or unexposed cohort. Instead of obtaining the denominators for the rates or risks, however, a control group is sampled from the entire source population that gives rise to the cases. Individuals in the control group are then classified into exposed and unexposed categories. The purpose of the control group is to determine the relative size of the exposed and unexposed components of the source population. Because the control group is used to estimate the distribution of exposure in the source population, the cardinal requirement of control selection is that the controls be sampled independently of exposure status."

To illustrate this consider the following hypothetical scenario in which the source population is the state of Massachusetts. Diseased individuals are red, and non-diseased individuals are blue. Exposed individuals are indicated by a whitish midsection. Note the following aspects of the depicted scenario:

  • The disease is rare.
  • There is a fairly large number of exposed individuals in the state, but most of these are not diseased.

Figure: Map of Massachusetts overlaid with icons representing the source population; only a very small percentage have the rare disease, and most exposed individuals are not diseased.

If we somehow had exposure and outcome information on all of the subjects in the source population and looked at the association using a cohort design, we might find the data summarized in the contingency table below.

 

                Diseased    Non-diseased        Total
Exposed              700         999,300    1,000,000
Non-exposed          600       4,999,400    5,000,000

In this hypothetical example, we have data on all 6,000,000 people in the source population, and we could compute the probability of disease (i.e., the risk or incidence) in both the exposed group and the non-exposed group, because we have the denominators for both the exposed and non-exposed groups.

The table above summarizes all of the necessary information regarding exposure and outcome status for the population and enables us to compute a risk ratio as a measure of the strength of the association. Intuitively, we compute the probability of disease (the risk) in each exposure group and then compute the risk ratio as follows:

\[
\text{RR} = \frac{700/1{,}000{,}000}{600/5{,}000{,}000} = \frac{0.0007}{0.00012} \approx 5.8
\]

The problem , of course, is that we usually don't have the resources to get the data on all subjects in the population. If we took a random sample of even 5-10% of the population, we would have few diseased people in our sample, certainly not enough to produce a reasonably precise measure of association. Moreover, we would expend an inordinate amount of effort and money collecting exposure and outcome data on a large number of people who would not develop the outcome.

We need a method that allows us to retain all the people in the numerator of disease frequency (diseased people or "cases") but allows us to collect information from only a small proportion of the people that make up the denominator (population, or "controls"), most of whom do not have the disease of interest. The case-control design allows us to accomplish this. We identify and collect exposure information on all the cases, but identify and collect exposure information on only a sample of the population. Once we have the exposure information, we can assign subjects to the numerator and denominator of the exposed and unexposed groups. This is what Rothman means when he says,

"The purpose of the control group is to determine the relative size of the exposed and unexposed components of the source population."

In the above example, we would have identified all 1,300 cases, determined their exposure status, and ended up categorizing 700 as exposed and 600 as unexposed. We might have randomly sampled 6,000 members of the population (instead of 6 million) in order to determine the exposure distribution in the total population. If our sampling method was random, we would expect that about 1,000 would be exposed and 5,000 unexposed (the same ratio as in the overall population). We calculate a similar measure as the risk ratio above, but substituting in the denominator a sample of the population ("controls") instead of the whole population:

\[
\text{OR} = \frac{700/600}{1{,}000/5{,}000} = \frac{1.17}{0.20} \approx 5.8
\]

Note that when we take a sample of the population, we no longer have a measure of disease frequency, because the denominator no longer represents the population. Therefore, we can no longer compute the probability or rate of disease incidence in each exposure group. We also can't calculate a risk or rate difference measure for the same reason. However, as we have seen, we can compute the relative probability of disease in the exposed vs. unexposed group. The term generally used for this measure is an odds ratio , described in more detail later in the module.

Consequently, when the outcome is uncommon, as in this case, the risk ratio can be estimated much more efficiently by using a case-control design. One would focus first on finding an adequate number of cases in order to determine the ratio of exposed to unexposed cases. Then, one only needs to take a sample of the population in order to estimate the relative size of the exposed and unexposed components of the source population. Note that if one can identify all of the cases that were reported to a registry or other database within a defined period of time, then it is possible to compute an estimate of the incidence of disease if the size of the population is known from census data.   While this is conceptually possible, it is rarely done, and we will not discuss it further in this course.
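
The efficiency argument can be checked numerically. The sketch below (in Python, using the hypothetical Massachusetts numbers from the table above) computes the risk ratio from the full population and then the odds ratio from the same cases plus a proportionate sample of 6,000 controls.

```python
# Hypothetical source population from the table above.
exposed_cases, exposed_total = 700, 1_000_000
unexposed_cases, unexposed_total = 600, 5_000_000

risk_ratio = (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)

# Case-control sampling: keep all 1,300 cases, but sample only 6,000 controls
# from the whole source population. A proportionate random sample is expected
# to contain about 1,000 exposed and 5,000 unexposed people.
exposed_controls, unexposed_controls = 1_000, 5_000
odds_ratio = (exposed_cases / unexposed_cases) / (exposed_controls / unexposed_controls)

print(f"Risk ratio from the full population of 6,000,000: {risk_ratio:.2f}")  # 5.83
print(f"Odds ratio from 1,300 cases + 6,000 controls:     {odds_ratio:.2f}")  # 5.83
```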


A Nested Case-Control Study

Suppose a prospective cohort study were conducted among almost 90,000 women for the purpose of studying the determinants of cancer and cardiovascular disease. After enrollment, the women provide baseline information on a host of exposures, and they also provide baseline blood and urine samples that are frozen for possible future use. The women are then followed, and, after about eight years, the investigators want to test the hypothesis that past exposure to pesticides such as DDT is a risk factor for breast cancer. Eight years have passed since the beginning of the study, and 1,439 women in the cohort have developed breast cancer. Since they froze blood samples at baseline, they have the option of analyzing all of the blood samples in order to ascertain exposure to DDT at the beginning of the study, before any cancers occurred. The problem is that there are almost 90,000 women, and it would cost $20 to analyze each of the blood samples. If the investigators had been able to analyze all 90,000 samples, they would have found the results in the table below.

Table of Breast Cancer Occurrence Among Women With or Without DDT Exposure

 

                Breast Cancer    No Breast Cancer      Total
DDT exposed             360              13,276       13,636
Unexposed             1,079              75,234       76,313
Total                 1,439              88,510       89,949

If they had been able to afford analyzing all of the baseline blood specimens in order to categorize the women as having had DDT exposure or not, they would have found a risk ratio = 1.87 (95% confidence interval: 1.66-2.10). The problem is that this would have cost almost $1.8 million, and the investigators did not have the funding to do this.

While 1,439 breast cancers is a disturbing number, it is only 1.6% of the entire cohort, so the outcome is relatively rare, and it is costing a lot of money to analyze the blood specimens obtained from all of the non-diseased women. There is, however, another more efficient alternative, i.e., to use a case-control sampling strategy. One could analyze all of the blood samples from women who had developed breast cancer, but only a sample of the whole cohort in order to estimate the exposure distribution in the population that produced the cases.

If one were to analyze the blood samples of 2,878 of the non-diseased women (twice as many as the number of cases), one would obtain results that would look something like those in the next table.

 

                Breast Cancer    No Breast Cancer
DDT exposed             360                 432
Unexposed             1,079               2,446
Total                 1,439               2,878

Odds of exposure: 360/1,079 in the cases versus 432/2,446 in the non-diseased controls, giving an odds ratio of about 1.89.

Total samples analyzed = 1,439 + 2,878 = 4,317

Total cost = 4,317 × $20 = $86,340

With this approach a similar estimate of risk was obtained after analyzing blood samples from only a small sample of the entire population at a fraction of the cost with hardly any loss in precision. In essence, a case-control strategy was used, but it was conducted within the context of a prospective cohort study. This is referred to as a case-control study "nested" within a cohort study.
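
As a numeric check of the nested design, the following Python sketch recomputes both estimates and the assay costs from the two tables above (the 2:1 control sample is taken as given rather than drawn at random):

```python
# Full cohort of 89,949 women (first table above).
exp_cases, exp_noncases = 360, 13_276
unexp_cases, unexp_noncases = 1_079, 75_234

risk_exposed = exp_cases / (exp_cases + exp_noncases)
risk_unexposed = unexp_cases / (unexp_cases + unexp_noncases)
risk_ratio = risk_exposed / risk_unexposed  # ~1.87

# Nested case-control sample: all 1,439 cases plus 2,878 non-diseased controls
# (second table above).
exp_controls, unexp_controls = 432, 2_446
odds_ratio = (exp_cases / unexp_cases) / (exp_controls / unexp_controls)  # ~1.89

cost_per_assay = 20  # dollars per blood sample analyzed
cohort_cost = (exp_cases + exp_noncases + unexp_cases + unexp_noncases) * cost_per_assay
nested_cost = (exp_cases + unexp_cases + exp_controls + unexp_controls) * cost_per_assay

print(f"Cohort risk ratio:      {risk_ratio:.2f}  (assay cost ${cohort_cost:,})")
print(f"Nested case-control OR: {odds_ratio:.2f}  (assay cost ${nested_cost:,})")
```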

Rothman states that one should look upon all case-control studies as being "nested" within a cohort. In other words the cohort represents the source population that gave rise to the cases. With a case-control sampling strategy one simply takes a sample of the population in order to obtain an estimate of the exposure distribution within the population that gave rise to the cases. Obviously, this is a much more efficient design.

It is important to note that, unlike cohort studies, case-control studies do not follow subjects through time. Cases are enrolled at the time they develop disease and controls are enrolled at the same time. The exposure status of each is determined, but they are not followed into the future for further development of disease.

As with cohort studies, case-control studies can be prospective or retrospective. At the start of the study, all cases might have already occurred and then this would be a retrospective case-control study. Alternatively, none of the cases might have already occurred, and new cases will be enrolled prospectively. Epidemiologists generally prefer the prospective approach because it has fewer biases, but it is more expensive and sometimes not possible. When conducted prospectively, or when nested in a prospective cohort study, it is straightforward to select controls from the population at risk. However, in retrospective case-control studies, it can be difficult to select from the population at risk, and controls are then selected from those in the population who didn't develop disease. Using only the non-diseased to select controls as opposed to the whole population means the denominator is not really a measure of disease frequency, but when the disease is rare, the odds ratio using the non-diseased will be very similar to the estimate obtained when the entire population is used to sample for controls. This phenomenon is known as the rare-disease assumption. When case-control studies were first developed, most were conducted retrospectively, and it is sometimes assumed that the rare-disease assumption applies to all case-control studies. However, it actually only applies to those case-control studies in which controls are sampled only from the non-diseased rather than the whole population.

The difference between sampling from the whole population and only the non-diseased is that the whole population contains people both with and without the disease of interest. This means that a sampling strategy that uses the whole population as its source must allow for the fact that people who develop the disease of interest can be selected as controls. Students often have a difficult time with this concept. It is helpful to remember that it seems natural that the population denominator includes people who develop the disease in a cohort study. If a case-control study is a more efficient way to obtain the information from a cohort study, then perhaps it is not so strange that the denominator in a case-control study also can include people who develop the disease. This topic is covered in more detail in EP813 Intermediate Epidemiology.
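
The rare-disease point can be verified with the hypothetical Massachusetts numbers used earlier in this module; the sketch below compares controls sampled from the whole source population with controls sampled only from the non-diseased.

```python
exp_cases, unexp_cases = 700, 600
exp_total, unexp_total = 1_000_000, 5_000_000

# Benchmark: the risk ratio we would get with data on the whole population.
risk_ratio = (exp_cases / exp_total) / (unexp_cases / unexp_total)

# Controls sampled from the whole source population (diseased people included).
or_whole_population = (exp_cases / unexp_cases) / (exp_total / unexp_total)

# Controls sampled only from the non-diseased.
exp_nondiseased = exp_total - exp_cases        # 999,300
unexp_nondiseased = unexp_total - unexp_cases  # 4,999,400
or_nondiseased = (exp_cases / unexp_cases) / (exp_nondiseased / unexp_nondiseased)

print(f"Risk ratio:                          {risk_ratio:.3f}")           # 5.833
print(f"OR, controls from whole population:  {or_whole_population:.3f}")  # 5.833
print(f"OR, controls from non-diseased only: {or_nondiseased:.3f}")       # 5.837
```

Because the disease affects only a tiny fraction of the population, excluding the diseased from the control pool barely changes the exposure distribution, so the two odds ratios are nearly identical.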

Retrospective and Prospective Case-Control Studies

Students usually think of case-control studies as being only retrospective, since the investigators enroll subjects who have developed the outcome of interest. However, case-control studies, like cohort studies, can be either retrospective or prospective. In a prospective case-control study, the investigator still enrolls based on outcome status, but the investigator must wait for the cases to occur.

When is a Case-Control Study Desirable?

Given the greater efficiency of case-control studies, they are particularly advantageous in the following situations:

  • When the disease or outcome being studied is rare.
  • When the disease or outcome has a long induction and latent period (i.e., a long time between exposure and the eventual causal manifestation of disease).
  • When exposure data is difficult or expensive to obtain.
  • When the study population is dynamic.
  • When little is known about the risk factors for the disease, case-control studies provide a way of testing associations with multiple potential risk factors. (This isn't really a unique advantage to case-control studies, however, since cohort studies can also assess multiple exposures.)

Another advantage of their greater efficiency, of course, is that they are less time-consuming and much less costly than prospective cohort studies.

The DES Case-Control Study

A classic example of the efficiency of the case-control approach is the study by Herbst et al. (N. Engl. J. Med. 1971;284:878-81) that linked in-utero exposure to diethylstilbestrol (DES) with subsequent development of vaginal cancer 15-22 years later. In the late 1960s, physicians at MGH identified a very unusual cancer cluster. Eight young women between the ages of 15-22 were found to have cancer of the vagina, an uncommon cancer even in elderly women. The cluster of cases in young women was initially reported as a case series, but there were no strong hypotheses about the cause.

In retrospect, the cause was in-utero exposure to DES. After World War II, DES started being prescribed for women who were having troubles with a pregnancy -- if there were signs suggesting the possibility of a miscarriage, DES was frequently prescribed. It has been estimated that between 1945-1950 DES was prescribed for about 20% of all pregnancies in the Boston area. Thus, the unborn fetus was exposed to DES in utero, and in a very small percentage of cases this resulted in development of vaginal cancer when the child was 15-22 years old (a very long latent period). There were several reasons why a case-control study was the only feasible way to identify this association: the disease was extremely rare (even in subjects who had been exposed to DES), there was a very long latent period between exposure and development of disease, and initially they had no idea what was responsible, so there were many possible exposures to consider.

In this situation, a case-control study was the only reasonable approach to identify the causative agent. Given how uncommon the outcome was, even a large prospective study would have been unlikely to have more than one or two cases, even after 15-20 years of follow-up. Similarly, a retrospective cohort study might have been successful in enrolling a large number of subjects, but the outcome of interest was so uncommon that few, if any, subjects would have had it. In contrast, a case-control study was conducted in which eight known cases and 32 age-matched controls provided information on many potential exposures. This strategy ultimately allowed the investigators to identify a highly significant association between the mother's treatment with DES during pregnancy and the eventual development of adenocarcinoma of the vagina in their daughters (in-utero at the time of exposure) 15 to 22 years later.

For more information see the DES Fact Sheet from the National Cancer Institute.

An excellent summary of this landmark study and the long-range effects of DES can be found in a Perspective article in the New England Journal of Medicine. A cohort of both mothers who took DES and their children (daughters and sons) was later formed to look for more common outcomes. Members of the faculty at BUSPH are on the team of investigators that follow this cohort for a variety of outcomes, particularly reproductive consequences and other cancers.

Selecting & Defining Cases and Controls

The "case" definition.

Careful thought should be given to the case definition to be used. If the definition is too broad or vague, it is easier to capture people with the outcome of interest, but a loose case definition will also capture people who do not have the disease. On the other hand, if an overly restrictive case definition is employed, fewer cases will be captured, and the sample size may be limited. Investigators frequently wrestle with this problem during outbreak investigations. Initially, they will often use a somewhat broad definition in order to identify potential cases. However, as an outbreak investigation progresses, there is a tendency to narrow the case definition to make it more precise and specific, for example by requiring confirmation of the diagnosis by laboratory testing. In general, investigators conducting case-control studies should thoughtfully construct a definition that is as clear and specific as possible without being overly restrictive.

Investigators studying chronic diseases generally prefer newly diagnosed cases, because they tend to be more motivated to participate, may remember relevant exposures more accurately, and because it avoids complicating factors related to selection of longer duration (i.e., prevalent) cases. However, it is sometimes impossible to have an adequate sample size if only recent cases are enrolled.

Sources of Cases

Typical sources for cases include:

  • Patient rosters at medical facilities
  • Death certificates
  • Disease registries (e.g., cancer or birth defect registries; the SEER Program [Surveillance, Epidemiology and End Results] is a federally funded program that identifies newly diagnosed cases of cancer in population-based registries across the US )
  • Cross-sectional surveys (e.g., NHANES, the National Health and Nutrition Examination Survey)

Selection of the Controls

As noted above, it is always useful to think of a case-control study as being nested within some sort of a cohort, i.e., a source population that produced the cases that were identified and enrolled. In view of this there are two key principles that should be followed in selecting controls:

  • The comparison group ("controls") should be representative of the source population that produced the cases.
  • The "controls" must be sampled in a way that is independent of the exposure, meaning that their selection should not be more (or less) likely if they have the exposure of interest.

If either of these principles are not adhered to, selection bias can result (as discussed in detail in the module on Bias).


Note that in the earlier example of a case-control study conducted in the Massachusetts population, we specified that our sampling method was random so that exposed and unexposed members of the population had an equal chance of being selected. Therefore, we would expect that about 1,000 would be exposed and 5,000 unexposed (the same ratio as in the whole population), and came up with an odds ratio that was the same as the hypothetical risk ratio we would have had if we had collected exposure information from the whole population of six million:

\[
\text{OR} = \frac{700/600}{1{,}000/5{,}000} \approx 5.8
\]

What if we had instead been more likely to sample those who were exposed, so that we instead found 1,500 exposed and 4,500 unexposed among the 6,000 controls? Then the odds ratio would have been:

\[
\text{OR} = \frac{700/600}{1{,}500/4{,}500} = \frac{1.17}{0.33} = 3.5
\]

This odds ratio is biased because it differs from the true odds ratio.   In this case, the bias stemmed from the fact that we violated the second principle in selection of controls. Depending on which category is over or under-sampled, this type of bias can result in either an underestimate or an overestimate of the true association.
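
Continuing the same hypothetical numbers in Python, the correctly sampled and exposure-biased control groups give clearly different odds ratios:

```python
exp_cases, unexp_cases = 700, 600

# Controls sampled independently of exposure (about 1,000 exposed / 5,000 unexposed).
or_unbiased = (exp_cases / unexp_cases) / (1_000 / 5_000)

# Controls over-sampled among the exposed (1,500 exposed / 4,500 unexposed).
or_biased = (exp_cases / unexp_cases) / (1_500 / 4_500)

print(f"OR with correctly sampled controls: {or_unbiased:.2f}")  # 5.83
print(f"OR with exposure-biased controls:   {or_biased:.2f}")    # 3.50
```

In this instance, oversampling exposed controls pushes the estimate toward the null (an underestimate); undersampling them would have the opposite effect.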

A hypothetical case-control study was conducted to determine whether lower socioeconomic status (the exposure) is associated with a higher risk of cervical cancer (the outcome). The "cases" consisted of 250 women with cervical cancer who were referred to Massachusetts General Hospital for treatment for cervical cancer. They were referred from all over the state. The cases were asked a series of questions relating to socioeconomic status (household income, employment, education, etc.). The investigators identified control subjects by going door-to-door in the community around MGH from 9:00 AM to 5:00  PM. Many residents are not home, but they persist and eventually enroll enough controls. The problem is that the controls were selected by a different mechanism than the cases, AND the selection mechanism may have tended to select individuals of different socioeconomic status, since women who were at home may have been somewhat more likely to be unemployed. In other words, the controls were more likely to be enrolled (selected) if they had the exposure of interest (lower socioeconomic status). 


Sources for "Controls"

Population Controls:

A population-based case-control study is one in which the cases come from a precisely defined population, such as a fixed geographic area, and the controls are sampled directly from the same population. In this situation cases might be identified from a state cancer registry, for example, and the comparison group would logically be selected at random from the same source population. Population controls can be identified from voter registration lists, tax rolls, drivers license lists, and telephone directories or by "random digit dialing". Population controls may also be more difficult to obtain, however, because of lack of interest in participating, and there may be recall bias, since population controls are generally healthy and may remember past exposures less accurately.

Random Digit Dialing

Random digit dialing has been popular in the past, but it is becoming less useful because of the use of caller ID, answer machines, and a greater reliance on cell phones instead of land lines.

Ken Rothman points out several problems: random digit dialing provides an equal probability that any given phone will be dialed, but not an equal probability of reaching eligible control subjects, because households vary in the number of residents and the likelihood that someone will be home. In addition, random digit dialing doesn't make any distinction between residential and business phones.

 

Example of a Population-based Case-Control Study: Rollison et al. reported on a "Population-based Case-Control Study of Diabetes and Breast Cancer Risk in Hispanic and Non-Hispanic White Women Living in US Southwestern States" (Am J Epidemiol 2008;167:447–456).

"Briefly, a population-based case-control study of breast cancer was conducted in Colorado, New Mexico, Utah, and selected counties of Arizona. For investigation of differences in the breast cancer risk profiles of non-Hispanic Whites and Hispanics, sampling was stratified by race/ethnicity, and only women who self-reported their race as non-Hispanic White, Hispanic, or American Indian were eligible, with the exception of American Indian women living on reservations. Women diagnosed with histologically confirmed breast cancer between October 1999 and May 2004 (International Classification of Diseases for Oncology codes C50.0–C50.6 and C50.8–C50.9) were identified as cases through population-based cancer registries in each state."

"Population-based controls were frequency-matched to cases in 5-year age groups. In New Mexico and Utah, control participants under age 65 years were randomly selected from driver's license lists; in Arizona and Colorado, controls were randomly selected from commercial mailing lists, since driver's license lists were unavailable. In all states, women aged 65 years or older were randomly selected from the lists of the Centers for Medicare and Medicaid Services (Social Security lists). Of all women contacted, 68 percent of cases and 42 percent of controls participated in the study."

"Odds ratios and 95% confidence intervals were calculated using logistic regression, adjusting for age, body mass index at age 15 years, and parity. Having any type of diabetes was not associated with breast cancer overall (odds ratio = 0.94, 95% confidence interval: 0.78, 1.12). Type 2 diabetes was observed among 19% of Hispanics and 9% of non-Hispanic Whites but was not associated with breast cancer in either group."

In this example, it is clear that the controls were selected from the source population (principle 1), but less clear that they were enrolled independent of exposure status (principle 2), both because drivers' licenses were used for selection and because the participation rate among controls was low. These factors would only matter if they impacted on the estimate of the proportion of the population who had diabetes.

Hospital or Clinic Controls:

Patients at the same hospital or clinic as the cases are another common source of controls, but two criteria should be met when selecting them:

  • They have diseases that are unrelated to the exposure being studied. For example, for a study examining the association between smoking and lung cancer, it would not be appropriate to include patients with cardiovascular disease as control, since smoking is a risk factor for cardiovascular disease. To include such patients as controls would result in an underestimate of the true association.
  • Second, control patients in the comparison should have diseases with similar referral patterns as the cases, in order to minimize selection bias. For example, if the cases are women with cervical cancer who have been referred from all over the state, it would be inappropriate to use controls consisting of women with diabetes who had been referred primarily from local health centers in the immediate vicinity of the hospital. Similarly, it would be inappropriate to use patients from the emergency room, because the selection of a hospital for an emergency is different than for cancer, and this difference might be related to the exposure of interest.

The advantages of using controls who are patients from the same facility are:

  • They are easier to identify
  • They are more likely to participate than general population controls.
  • They minimize selection bias because they generally come from the same source population (provided referral patterns are similar).
  • Recall bias would be minimized, because they are sick, but with a different diagnosis.

Example: Several years ago the vascular surgeons at Boston Medical Center wanted to study risk factors for severe atherosclerosis of the lower extremities. The cases were patients who were referred to the hospital for elective surgery to bypass severe atherosclerotic blockages in the arteries to the legs. The controls consisted of patients who were admitted to the same hospital for elective joint replacement of the hip or knee. The patients undergoing joint replacement were similar in age and they also were following the same referral pathways. In other words, they met the "would" criterion: if one of the joint replacement surgery patients had developed severe atherosclerosis in their leg arteries, they would have been referred to the same hospital.

Friend, Neighbor, Spouse, and Relative Controls:

Occasionally investigators will ask cases to nominate controls from one of these categories, because such controls share characteristics that are hard to measure and adjust for, such as genotype, socioeconomic status, or environment, i.e., factors that can cause confounding. By matching cases and controls on these factors, confounding by them is controlled. However, one must be careful that the controls still satisfy the two fundamental principles; often they do not.

How Many Controls?

Since case-control studies are often used for uncommon outcomes, investigators often have a limited number of cases but a plentiful supply of potential controls. In this situation the statistical power of the study can be increased somewhat by enrolling more controls than cases. However, the additional power that is achieved diminishes as the ratio of controls to cases increases, and ratios greater than 4:1 have little additional impact on power. Consequently, if it is time-consuming or expensive to collect data on controls, the ratio of controls to cases should be no more than 4:1. However, if the data on controls are easily obtained, there is no reason to limit the number of controls.
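As an illustration of these diminishing returns, here is a minimal sketch (the case counts and control exposure prevalence are hypothetical, not from the module) based on the Woolf variance of the log odds ratio:

```python
# Sketch (hypothetical counts): how the precision of the log odds ratio improves
# as the control-to-case ratio grows, holding the number of cases fixed.
# Assumes 100 cases (30 exposed) and a 30% exposure prevalence among controls.
import math

cases_exposed, cases_unexposed = 30, 70

for ratio in range(1, 9):
    n_controls = 100 * ratio
    controls_exposed = 0.30 * n_controls
    controls_unexposed = n_controls - controls_exposed
    # Woolf variance of ln(OR): 1/a + 1/b + 1/c + 1/d
    var_ln_or = (1 / cases_exposed + 1 / cases_unexposed
                 + 1 / controls_exposed + 1 / controls_unexposed)
    print(f"{ratio}:1 controls per case -> SE(ln OR) = {math.sqrt(var_ln_or):.3f}")
```

The standard error shrinks noticeably going from 1:1 to 4:1 but barely changes thereafter.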

Methods of Control Sampling

There are three strategies for selecting controls that are best explained by considering the nested case-control study described on page 3 of this module:

  • Survivor sampling: This is the most common method. Controls consist of individuals from the source population who do not have the outcome of interest.
  • Case-base sampling (also known as "case-cohort" sampling): Controls are selected from the population at risk at the beginning of the follow-up period in the cohort study within which the case-control study was nested.
  • Risk-set sampling: In the nested case-control study a control would be selected from the population at risk at the point in time when a case was diagnosed.
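To make the three schemes concrete, the following minimal sketch applies each of them to a simulated toy cohort (the data and variable names are hypothetical, not from the module):

```python
# Sketch (hypothetical data): the three control-sampling schemes applied to a
# toy cohort. Each person has an event time (None if event-free through follow-up).
import random

random.seed(1)
follow_up_end = 12
cohort = [{"id": i,
           "event_time": random.choice([None, None, None, random.randint(1, follow_up_end)])}
          for i in range(20)]
cases = [p for p in cohort if p["event_time"] is not None]

# Survivor sampling: controls are drawn from those still event-free at the end.
survivors = [p for p in cohort if p["event_time"] is None]
survivor_controls = random.sample(survivors, k=min(len(cases), len(survivors)))

# Case-base (case-cohort) sampling: controls are drawn from everyone at risk at
# baseline, whether or not they later become a case.
case_base_controls = random.sample(cohort, k=len(cases))

# Risk-set sampling: for each case, a control is drawn from those still
# event-free at the time that case occurs (the case's "risk set").
risk_set_controls = []
for case in cases:
    risk_set = [p for p in cohort
                if p["id"] != case["id"]
                and (p["event_time"] is None or p["event_time"] >= case["event_time"])]
    if risk_set:
        risk_set_controls.append(random.choice(risk_set))

print(len(cases), "cases;",
      len(survivor_controls), "survivor controls,",
      len(case_base_controls), "case-base controls,",
      len(risk_set_controls), "risk-set controls")
```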

The Rare Outcome Assumption

It is often said that an odds ratio provides a good estimate of the risk ratio only when the outcome of interest is rare, but this is only true when survivor sampling is used. With case-base sampling or risk set sampling, the odds ratio will provide a good estimate of the risk ratio regardless of the frequency of the outcome, because the controls will provide an accurate estimate of the exposure distribution in the source population (i.e., not just in non-diseased people).
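A small numerical sketch (hypothetical risks, not from the module) shows how far the odds ratio obtained under survivor sampling drifts from the risk ratio as the outcome becomes more common:

```python
# Sketch (hypothetical cohort risks): with survivor sampling the odds ratio only
# approximates the risk ratio when the outcome is rare.
def or_vs_rr(risk_unexposed, risk_ratio):
    risk_exposed = risk_unexposed * risk_ratio
    # Odds ratio as estimated when controls are sampled from non-cases only.
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed

for baseline_risk in (0.01, 0.05, 0.20):
    print(f"baseline risk {baseline_risk:4.2f}: true RR = 2.0, "
          f"OR = {or_vs_rr(baseline_risk, 2.0):.2f}")
```

With a 1% baseline risk the odds ratio is essentially the risk ratio; with a 20% baseline risk it overstates it substantially.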

More on Selection Bias

Always consider the source population for case-control studies, i.e. the "population" that generated the cases. The cases are always identified and enrolled by some method or a set of procedures or circumstances. For example, cases with a certain disease might be referred to a particular tertiary hospital for specialized treatment. Alternatively, if there is a database or a disease registry for a geographic area, cases might be selected at random from the database. The key to avoiding selection bias is to select the controls by a similar, if not identical, mechanism in order to ensure that the controls provide an accurate representation of the exposure status of the source population.

Example 1: In the first example above, in which cases were randomly selected from a geographically defined database, the source population is also defined geographically, so it would make sense to select population controls by some random method. In contrast, if one enrolled controls from a particular hospital within the geographic area, one would have to at least consider whether the controls were inherently more or less likely to have the exposure of interest. If so, they would not provide an accurate estimate of the exposure distribution of the source population, and selection bias would result.

Example 2: In the second example above, the source population was defined by the patterns of referral to a particular hospital for a particular disease. In order for the controls to be representative of the "population" that produced those cases, the controls should be selected by a similar mechanism, e.g., by contacting the referring health care providers and asking them to provide the names of potential controls. By this mechanism, one can ensure that the controls are representative of the source population, because if they had had the disease of interest they would have been just as likely as the cases to have been included in the case group (thus fulfilling the "would" criterion).

Example 3: A food handler at a delicatessen who is infected with hepatitis A virus is responsible for an outbreak of hepatitis which is largely confined to the surrounding community from which most of the customers come. Many (but not all) of the infected cases are identified by passive and active surveillance. How should controls be selected? In this situation, one might guess that the likelihood of people going to the delicatessen would be heavily influenced by their proximity to it, and this would to a large extent define the source population. In a case-control study undertaken to identify the source, the delicatessen is one of the exposures being tested. Consequently, even if the cases were reported to the state-wide surveillance system, it would not be appropriate to randomly select controls from the state, the county, or even the town where the delicatessen is located. In other words, the "would" criterion doesn't work here, because anyone in the state with clinical hepatitis would end up in the surveillance system, but someone who lived far from the deli would have a much lower likelihood of having the exposure. A better approach would be to select controls who were matched to the cases by neighborhood, age, and gender. These controls would have similar access to go to the deli if they chose to, and they would therefore be more representative of the source population.

Analysis of Case-Control Studies

The computation and interpretation of the odds ratio in a case-control study has already been discussed in the modules on Overview of Analytic Studies and Measures of Association. Additionally, one can compute the confidence interval for the odds ratio, and statistical significance can also be evaluated by using a chi-square test (or a Fisher's Exact Test if the sample size is small) to compute a p-value. These calculations can be done using the Case-Control worksheet in the Excel file called EpiTools.XLS.

[Image: Case-Control worksheet in the Epi_Tools file]
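For readers who prefer code to the spreadsheet, the same calculations can be sketched in Python (the counts are hypothetical and SciPy is assumed to be available; this is not part of the Epi_Tools file):

```python
# Sketch (hypothetical counts): odds ratio, Woolf 95% confidence interval,
# chi-square test, and Fisher's exact test for a single 2x2 case-control table.
import math
from scipy.stats import chi2_contingency, fisher_exact

#              exposed  unexposed
table = [[45, 55],   # cases
         [25, 75]]   # controls

a, b = table[0]
c, d = table[1]

odds_ratio = (a * d) / (b * c)
se_ln_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se_ln_or)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se_ln_or)

chi2, p_chi2, _, _ = chi2_contingency(table)
_, p_fisher = fisher_exact(table)          # preferable when cell counts are small

print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
print(f"chi-square p = {p_chi2:.4f}, Fisher exact p = {p_fisher:.4f}")
```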

Advantages and Disadvantages of Case-Control Studies

Advantages:

  • They are efficient for rare diseases or diseases with a long latency period between exposure and disease manifestation.
  • They are less costly and less time-consuming than cohort studies; they are advantageous when exposure data are expensive or hard to obtain.
  • They are advantageous when studying dynamic populations in which follow-up is difficult.

Disadvantages:

  • They are subject to selection bias.
  • They are inefficient for rare exposures.
  • Information on exposure is subject to observation bias.
  • They generally do not allow calculation of incidence (absolute risk).


Introduction to study designs - case-control studies

Introduction


Learning objectives: This section provides a basic introduction to case-control studies and to the analysis and interpretation of their results. Case-control studies are among the most frequently used study designs because they are relatively easy to conduct compared with other designs. The section introduces the basic concepts, applications and strengths of the case-control study and also covers:

1. Issues in the design of case-control studies
2. Common sources of bias in a case-control study
3. Analysis of case-control studies
4. Strengths and weaknesses of case-control studies
5. Nested case-control studies

Read the resource text below.

Resource text

Case-control studies start with the identification of a group of cases (individuals with a particular health outcome) in a given population and a group of controls (individuals without the health outcome) to be included in the study.


In a case-control study, the prevalence of exposure to a potential risk factor(s) is compared between cases and controls. If exposure is more common among cases than among controls, it may be a risk factor for the outcome under investigation. A major characteristic of case-control studies is that data on potential risk factors are collected retrospectively and, as a result, may give rise to bias. This is a particular problem associated with case-control studies and therefore needs to be carefully considered during the design and conduct of the study.

1. Issues in the design of case-control studies

Formulation of a clearly defined hypothesis
As with all epidemiological investigations, a case-control study should begin with the formulation of a clearly defined hypothesis.

Case definition
It is essential that the case definition is clearly specified at the outset of the investigation to ensure that all cases included in the study are based on the same diagnostic criteria.

Source of cases
The source of cases needs to be clearly defined.

Selection of cases
Case-control studies may use incident or prevalent cases.

Incident cases comprise cases newly diagnosed during a defined time period. The use of incident cases is considered preferable, as the recall of past exposure(s) may be more accurate among newly diagnosed cases. In addition, the temporal sequence of exposure and disease is easier to assess among incident cases.

Prevalent cases comprise individuals who have had the outcome under investigation for some time. The use of prevalent cases may give rise to recall bias, as prevalent cases may be less likely to accurately report past exposure(s). As a result, the interpretation of results based on prevalent cases may prove more problematic, as it may be more difficult to ensure that reported events relate to a time before the development of disease rather than to the consequence of the disease process itself. For example, individuals may modify their exposure following the onset of disease. In addition, unless the effect of exposure on the duration of illness is known, it will not be possible to determine the extent to which a particular characteristic is related to the prognosis of the disease once it develops rather than to its cause.

Source of cases
Cases may be recruited from a number of sources; for example, they may be recruited from a hospital, a clinic, or GP registers, or they may be population based. Population-based case-control studies are generally more expensive and more difficult to conduct.

Selection of controls
A particular problem inherent in case-control studies is the selection of a comparable control group. Controls are used to estimate the prevalence of exposure in the population which gave rise to the cases. Therefore, the ideal control group would comprise a random sample from the general population that gave rise to the cases. However, this is not always possible in practice. The goal is to select individuals in whom the distribution of exposure status would be the same as that of the cases in the absence of an exposure-disease association. That is, if there is no true association between exposure and disease, the cases and controls should have the same distribution of exposure. The source of controls depends on the source of cases. In order to minimize bias, controls should be selected to be a representative sample of the population which produced the cases. For example, if cases are selected from a defined population such as a GP register, then controls should comprise a sample from the same GP register.


In case-control studies where cases are hospital based, it is common to recruit controls from the hospital population. However, the choice of controls from a hospital setting should not include individuals with an outcome related to the exposure(s) being studied. For example, in a case-control study of the association between smoking and lung cancer the inclusion of controls being treated for a condition related to smoking (e.g. chronic bronchitis) may result in an underestimate of the strength of the association between exposure (smoking) and outcome. Recruiting more than one control per case may improve the statistical power of the study, though including more than 4 controls per case is generally considered to be no more efficient.

Measuring exposure status
Exposure status is measured to assess the presence or level of exposure for each individual for the period of time prior to the onset of the disease or condition under investigation, when the exposure would have acted as a causal factor. Note that in case-control studies the measurement of exposure is established after the development of disease and as a result is prone to both recall and observer bias. Various methods can be used to ascertain exposure status. These include:

  • Standardized questionnaires
  • Biological samples
  • Interviews with the subject
  • Interviews with spouse or other family members
  • Medical records
  • Employment records
  • Pharmacy records

The procedures used for the collection of exposure data should be the same for cases and controls.

2. Common sources of bias in case-control studies

Due to the retrospective nature of case-control studies, they are particularly susceptible to the effects of bias, which may be introduced as a result of a poor study design or during the collection of exposure and outcome data. Because the disease and exposure have already occurred at the outset of a case-control study, there may be differential reporting of exposure information between cases and controls based on their disease status. For example, cases and controls may recall past exposure differently (recall bias). Similarly, the recording of exposure information may vary depending on the investigator's knowledge of an individual's disease status (interviewer/observer bias). Therefore, the design and conduct of the study must be carefully considered, as there are limited options for the control of bias during the analysis.

Selection bias in case-control studies
Selection bias is a particular problem inherent in case-control studies, where it gives rise to non-comparability between cases and controls. Selection bias in case-control studies may occur when 'cases (or controls) are included in (or excluded from) a study because of some characteristic they exhibit which is related to exposure to the risk factor under evaluation' [1]. The aim of a case-control study is to select controls who are representative of the population which produced the cases. Controls are used to provide an estimate of the exposure rate in that population. Therefore, selection bias may occur when those individuals selected as controls are unrepresentative of the population that produced the cases.


The potential for selection bias in case-control studies is a particular problem when cases and controls are recruited exclusively from hospitals or clinics. Hospital patients tend to have different characteristics from the general population; for example, they may have higher levels of alcohol consumption or cigarette smoking. If these characteristics are related to the exposures under investigation, then the estimate of exposure among controls may differ from that in the reference population, which may result in a biased estimate of the association between exposure and disease. Berksonian bias is a bias introduced in hospital-based case-control studies due to varying rates of hospital admission. As the potential for selection bias is likely to be less of a problem in population-based case-control studies, neighbourhood controls may be a preferable choice when using cases from a hospital or clinic setting. Alternatively, the potential for selection bias may be minimized by selecting controls from more than one source, such as by using both hospital and neighbourhood controls. Selection bias may also be introduced in case-control studies when exposed cases are more likely to be selected than unexposed cases.

3. Analysis of case-control studies

The odds ratio (OR) is used in case-control studies to estimate the strength of the association between exposure and outcome. Note that it is not possible to estimate the incidence of disease from a case control study unless the study is population based and all cases in a defined population are obtained.

The results of a case-control study can be presented in a 2x2 table as follows:

            Cases    Controls
Exposed       a          b
Unexposed     c          d

The odds ratio is the ratio of the odds of exposure among the cases to the odds of exposure among the controls (equivalently, the odds of disease in the exposed relative to the odds of disease in the unexposed) and is calculated as:

OR = (a × d) / (b × c)

Example: Calculation of the OR from a hypothetical case-control study of smoking and cancer of the pancreas among 100 cases and 400 controls. Table 1. Hypothetical case-control study of smoking and cancer of the pancreas.

              Cases    Controls
Smokers          60         100
Non-smokers      40         300
Total           100         400

OR = (60 × 300) / (100 × 40) = 4.5

The OR calculated from the hypothetical data in Table 1 estimates that smokers are 4.5 times more likely to develop cancer of the pancreas than non-smokers. NB: This odds ratio for smoking and cancer of the pancreas has been calculated without adjusting for potential confounders. Further analysis of the data would involve stratifying by levels of potential confounders such as age. The 2x2 table can then be extended to allow stratum-specific odds ratios for the confounding variable(s) to be calculated and, where appropriate, an overall summary measure adjusted for the effects of confounding, together with a statistical test of significance. In addition, confidence intervals for the odds ratio would also be presented.
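A minimal sketch of the stratified analysis described above (the age-stratified counts are hypothetical and chosen only to be consistent with Table 1; the Mantel-Haenszel formula is standard but is not given in the resource text):

```python
# Sketch (hypothetical age-stratified counts): crude odds ratio from Table 1 and
# a Mantel-Haenszel summary OR pooled over age strata.
# Each stratum: (exposed cases, unexposed cases, exposed controls, unexposed controls).
strata = {
    "<60 years":  (20, 25, 60, 200),
    ">=60 years": (40, 15, 40, 100),
}

# Crude OR from Table 1: (60 x 300) / (100 x 40)
print("crude OR:", (60 * 300) / (100 * 40))

# Mantel-Haenszel pooled OR: sum(a*d/n) / sum(b*c/n) over strata,
# where a = exposed cases, b = exposed controls, c = unexposed cases, d = unexposed controls.
num = den = 0.0
for a, c, b, d in strata.values():
    n = a + b + c + d
    num += a * d / n
    den += b * c / n
print("Mantel-Haenszel OR:", round(num / den, 2))
```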

4. Strengths and weaknesses of case-control studies

Strengths:

  • Cost effective relative to other analytical studies such as cohort studies.
  • Case-control studies are retrospective, and cases are identified at the beginning of the study; therefore there is no long follow-up period (as compared to cohort studies).
  • Efficient for the study of diseases with long latency periods.
  • Efficient for the study of rare diseases.
  • Good for examining multiple exposures.

Weaknesses:

  • Particularly prone to bias, especially selection, recall and observer bias.
  • Case-control studies are limited to examining one outcome.
  • Unable to estimate incidence rates of disease (unless the study is population based).
  • Poor choice for the study of rare exposures.
  • The temporal sequence between exposure and disease may be difficult to determine.

References

1. Hennekens CH, Buring JE. Epidemiology in Medicine. Lippincott Williams & Wilkins; 1987.

  • Open access
  • Published: 07 January 2022

Identification of causal effects in case-control studies

  • Bas B. L. Penning de Vries 1 &
  • Rolf H. H. Groenwold 1 , 2  

BMC Medical Research Methodology, volume 22, Article number: 7 (2022)


Case-control designs are an important yet commonly misunderstood tool in the epidemiologist’s arsenal for causal inference. We reconsider classical concepts, assumptions and principles and explore when the results of case-control studies can be endowed a causal interpretation.

We establish how, and under which conditions, various causal estimands relating to intention-to-treat or per-protocol effects can be identified based on the data that are collected under popular sampling schemes (case-base, survivor, and risk-set sampling, with or without matching). We present a concise summary of our identification results that link the estimands to the (distribution of the) available data and articulate under which conditions these links hold.

The modern epidemiologist’s arsenal for causal inference is well-suited to make transparent for case-control designs what assumptions are necessary or sufficient to endow the respective study results with a causal interpretation and, in turn, help resolve or prevent misunderstanding. Our approach may inform future research on different estimands, other variations of the case-control design or settings with additional complexities.


Introduction

In causal inference, it is important that the causal question of interest is unambiguously articulated [ 1 ]. The causal question should dictate, and therefore be at the start of, investigation. When the target causal quantity, the estimand, is made explicit, one can start to question how it relates to the available data distribution and, as such, form a basis for estimation with finite samples from this distribution.

The counterfactual framework offers a language rich enough to articulate a wide variety of causal claims that can be expressed as what-if statements [ 1 ]. Another, albeit closely related, approach to causal inference is target trial emulation, an explicit effort to mitigate departures from a study (the ‘target trial’) that, if carried out, would enable one to readily answer the causal what-if question of interest [ 2 ]. While it may be too impractical or unethical to implement, making explicit what a target trial looks like has particular value in communicating the inferential goal and offers a reference against which to compare studies that have been or are to be conducted.

The counterfactual framework and emulation approach have become increasingly popular in observational cohort studies. Case-control studies, however, have not yet enjoyed this trend. A notable exception is given by Dickerman et al. [ 3 ], who recently outlined an application of trial emulation with case-control designs to statin use and colorectal cancer.

In this paper, we give an overview of how observational data obtained with case-control designs can be used to identify a number of causal estimands and, in doing so, recast historical case-control concepts, assumptions and principles in a modern and formal framework.

Preliminaries

Identification versus estimation.

An estimand is said to be identifiable if the distribution of the available data is compatible with exactly one value of the estimand, or therefore, if the estimand can be expressed as a functional of the available data distribution. Identifiability is a relative notion as it depends on which data are available as well as on the assumptions one is willing to make. Identification forms a basis for estimation with finite samples from the available data distribution [ 4 ]. Once the estimand has been made explicit and an identifying functional established, estimation is a purely statistical problem. While the identifying functional will often naturally translate into a plug-in estimator, there is, however, generally more than one way to translate an identifiability result into an estimator and different estimators may have important differences in their statistical properties. Moreover, while the estimand may be identifiable, there need not exist an estimator with the desired properties (see e.g. [ 5 ]). Here, our focus is on identification, so that the purely statistical issues of the next step in causal inference, estimation, can be momentarily put aside.

Case-control study nested in cohort study

To facilitate understanding, it is useful to consider every case-control study as being “nested” within a cohort study. A case-control study could be considered as a cohort study with missingness governed by the control sampling scheme. Therefore, when the observed data distribution of a case-control study is compatible with exactly one value of a given estimand, then so is the available or observed data distribution of the underlying cohort study. In other words, identifiability of an estimand with a case-control study implies identifiability of the estimand with the cohort study within which it is nested (conceptually). The converse is not evident and in fact may not be true. In this paper, the focus is on sets of conditions or assumptions that are sufficient for identifiability in case-control studies.

Set-up of underlying cohort study

Consider a time-varying exposure A_k that can take one of two levels, 0 or 1, at K successive time points t_k (k = 0, 1, ..., K−1), where t_0 denotes baseline (cohort entry or time zero). Study participants are followed over time until they sustain the event of interest or the administrative study end t_K, whichever comes first. We denote by T the time elapsed from baseline until the event of interest and let Y_k = I(T < t_k) indicate whether the event has occurred by t_k. The lengths between the time points are typically fixed at a constant (e.g., one day, week, or month). Figure 1 depicts twelve equally spaced time points over, say, twelve months with several possible courses of follow-up of an individual. As the figure illustrates, individuals can switch between exposure levels during follow-up, as in any truly observational study. Apart from exposure and outcome data, we also consider a (vector of) covariate(s) L_k, which describes time-fixed individual characteristics or time-varying characteristics typically relating to a time window just before exposure or non-exposure at t_k, k = 0, 1, ..., K−1.

[Figure 1. Illustration of possible courses of follow-up of an individual for a study with baseline t_0 and administrative study end t_12. Solid bullets indicate 'exposed'; empty bullets indicate 'not exposed'. The incident event of interest is represented by a cross.]

Causal contrasts

Although there are many possible contrasts, particularly with time-varying exposures, for simplicity we consider only two pairs of mutually exclusive interventions: (1) setting baseline exposure A_0 to 1 versus 0; and (2) setting all of A_0, A_1, ..., A_{K−1} to 1 ('always exposed') versus all to 0 ('never exposed'). For a = 0, 1, we let the counterfactual outcome Y_k(a) indicate whether the event has occurred by t_k under the baseline-only intervention that sets A_0 to a. By convention, we write \overline{1} = (1, 1, ..., 1) and \overline{0} = (0, 0, ..., 0), and let Y_k(\overline{1}) and Y_k(\overline{0}) indicate whether the event has occurred by t_k under the intervention that sets all elements of (A_0, A_1, ..., A_{K−1}) to 1 and all to 0, respectively. Further details about the notation and set-up are given in Supplementary Appendix A.

Case-control sampling

The fact that each time-specific exposure variable can take only one value per time point means that at most one counterfactual outcome can be observed per individual. This type of missingness is common to all studies. Relative to the cohort studies within which they are nested, case-control studies have additional missingness, which is governed by the control sampling scheme. In this paper, we focus on three well-known sampling schemes: case-base sampling, survivor sampling, and risk-set sampling. The next sections give an overview of conditions under which intention-to-treat and always-versus-never-exposed per-protocol effects can be identified with the data that are observed under these sampling schemes.

Case-control studies without matching

Table  1 summarises a number of identification results for case-control studies without matching. Each result consists of one of the three aforementioned sampling schemes, an estimand, a set of assumptions, and an identification strategy. Under the conditions of the “Sampling scheme” and “Assumptions” columns, an identifying functional of the estimand of the “Estimand” column is obtained by following the steps of the “Identification strategy” column. More formal statements and proofs are given in Supplementary Appendix B.

In all case-control studies that we consider in this section, cases are compared with controls with regard to their exposure status via an odds ratio, even when an effect measure other than the odds ratio is targeted. An individual qualifies as a case if and only if they sustain the event of interest by the administrative study end (i.e., Y K =1) and adhered to one of the protocols of interest until the time of the incident event. In Fig.  1 , the individual represented by row 1 is therefore regarded as a case (an exposed case in particular) in our investigation of intention-to-treat effects but not in that of per-protocol effects. Whether an individual (also) serves as a control depends on the control sampling scheme.

Case-base sampling

The first result in Table 1 describes how to identify the intention-to-treat effect as quantified by the marginal risk ratio

Pr(Y_K(1) = 1) / Pr(Y_K(0) = 1)

under case-base sampling. (For identification of a conditional risk ratio, see Theorem 2 of Supplementary Appendix B.) Case-base sampling, also known as case-cohort sampling, means that no individual who is at risk at baseline of sustaining the event of interest is precluded from selection as a control. Selection as a control, S, is further assumed independent of the baseline covariate L_0 and exposure A_0. Selecting controls from survivors only (e.g., rows 4, 5, 7 and 9 in Fig. 1) violates this assumption when survival depends on L_0 or A_0.

To account for baseline confounding, inverse probability weights could be derived from control data according to

W = A_0 / Pr(A_0 = 1 | L_0, S = 1) + (1 − A_0) / (1 − Pr(A_0 = 1 | L_0, S = 1)).     (1)

We then compute the odds of baseline exposure among cases and among controls in the pseudopopulation that is obtained by weighting everyone by subject-specific values of W. The ratio of these odds coincides with the target risk ratio under the three key identifiability conditions of consistency, baseline conditional exchangeability and positivity [1]. Consistency here means that for a = 0, 1, Y_K(a) = Y_K if A_0 = a; baseline conditional exchangeability that for a = 0, 1, A_0 is independent of Y_K(a) given L_0; and positivity that 0 < Pr(A_0 = 1 | L_0, S = 1) < 1.

The identification result for case-base sampling suggests a plug-in estimator: replace all functionals of the theoretical data distribution with sample analogues. For example, to obtain the weight for an individual with baseline covariate level l_0, replace the theoretical propensity score Pr(A_0 = 1 | L_0 = l_0, S = 1) with an estimate derived from a fitted model (e.g., a logistic regression model) that imposes parametric constraints on the distribution of A_0 given L_0 among the controls.
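A minimal sketch of such a plug-in estimator on simulated data (the data-generating model, sample sizes, and the use of scikit-learn are illustrative assumptions, not taken from the article):

```python
# Sketch (simulated data): plug-in IPW estimator for the marginal risk ratio
# under case-base sampling. The propensity score is fitted among controls only
# and the weighted exposure odds are compared between cases and controls.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
L0 = rng.normal(size=n)                                   # baseline covariate
A0 = rng.binomial(1, 1 / (1 + np.exp(-L0)))               # baseline exposure
risk = 1 / (1 + np.exp(-(-2.0 + 0.7 * A0 + 0.5 * L0)))    # outcome risk
YK = rng.binomial(1, risk)                                # event by end of study

# Case-base sampling: all cases, plus controls sampled from everyone at baseline.
case_idx = np.flatnonzero(YK == 1)
control_idx = rng.choice(n, size=len(case_idx), replace=False)

# Propensity model Pr(A0 = 1 | L0, S = 1) fitted on controls only.
ps_model = LogisticRegression().fit(L0[control_idx].reshape(-1, 1), A0[control_idx])

def weighted_exposure_odds(idx):
    ps = ps_model.predict_proba(L0[idx].reshape(-1, 1))[:, 1]
    a = A0[idx]
    w = a / ps + (1 - a) / (1 - ps)        # inverse probability weights (Eq. 1)
    return np.sum(w * a) / np.sum(w * (1 - a))

rr_hat = weighted_exposure_odds(case_idx) / weighted_exposure_odds(control_idx)
print("estimated marginal risk ratio:", round(rr_hat, 2))
```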

Survivor sampling

With survivor (cumulative incidence or exclusive) sampling, a subject is eligible for selection as a control only if they reach the administrative study end event-free. To identify the conditional odds ratio of baseline exposure versus baseline non-exposure given L_0,

[Pr(Y_K(1) = 1 | L_0) / Pr(Y_K(1) = 0 | L_0)] / [Pr(Y_K(0) = 1 | L_0) / Pr(Y_K(0) = 0 | L_0)],

selection as a control, S, is assumed independent of baseline exposure A_0 given L_0 and survival until the end of study (i.e., Y_K = 0).

As is shown in Supplementary Appendix B, Theorem 3, the above odds ratio is identified by the ratio of the baseline exposure odds given L 0 among the cases versus controls, provided the key identifiability conditions of consistency, baseline conditional exchangeability, and positivity are met.

All estimands in Table 1 describe a marginal effect, except for the odds ratio, which is conditional on the baseline covariates L_0. The corresponding marginal odds ratio

[Pr(Y_K(1) = 1) / Pr(Y_K(1) = 0)] / [Pr(Y_K(0) = 1) / Pr(Y_K(0) = 0)]

is not identifiable from the available data distribution under the stated assumptions (see remark to Theorem 3, Supplementary Appendix B). However, approximate identifiability can be achieved by invoking the rare event assumption (or rare disease assumption), in which case the marginal odds ratio approximates the marginal risk ratio.

Risk-set sampling for intention-to-treat effect

With risk-set (or incidence density) sampling, for all time windows [t_k, t_{k+1}), k = 0, ..., K−1, every subject who is event-free at t_k is eligible for selection as a control for the period [t_k, t_{k+1}). This means that study participants may be selected as a control more than once.

Consider the intention-to-treat effect quantified by the marginal (discrete-time) hazard ratio (or rate ratio)

Pr(Y_{k+1}(1) = 1 | Y_k(1) = 0) / Pr(Y_{k+1}(0) = 1 | Y_k(0) = 0).

(For identification of a conditional hazard ratio, see Theorem 5, Supplementary Appendix B.) For identification of the above marginal hazard ratio under risk-set sampling, it is assumed that selection as a control between t_k and t_{k+1}, S_k, is independent of the baseline covariates and exposure given eligibility at t_k (i.e., Y_k = 0). It is also assumed that the sampling probability among those eligible, Pr(S_k = 1 | Y_k = 0), is constant across time windows k = 0, ..., K−1. To this end, it suffices that the marginal hazard Pr(Y_{k+1} = 1 | Y_k = 0) remains constant across time windows and that every k-th sampling fraction Pr(S_k = 1) is equal, up to a proportionality constant, to the probability Pr(Y_{k+1} = 1, Y_k = 0) of an incident case in the k-th window (see remark to Theorem 4, Supplementary Appendix B). For practical purposes, this suggests sampling a fixed number of controls for every case from among the set of eligible individuals. To illustrate, consider Fig. 1 and note first of all that the individual represented by row 1 trivially qualifies as a case, because the individual survived until the event occurred. Because the event was sustained between t_5 and t_6, the proposed sampling suggests selecting a fixed number of controls from among those who are eligible at t_5. Thus, rows (and only rows) 4 through 9 as well as row 1 itself in Fig. 1 qualify for selection as a control for this case. Even though the individual of row 1 is a case, the individual may also be selected as a control when the individuals of rows 2, 3 and 6 (but not 8) sustain the event.

Once cases and controls are selected, we can start to derive inverse probability weights W according to Eq. 1 with S replaced with S 0 . We then compute the odds of baseline exposure among cases in the pseudopopulation that is obtained by weighting everyone by W and the odds of baseline exposure among controls weighted by W multiplied by the number of times the individual was selected as a control. The ratio of these odds coincides with the target hazard ratio under the three key identifiability conditions of consistency, baseline conditional exchangeability and positivity together with the assumption that the hazards in the numerator and denominator of the causal hazard ratio are constant across the time windows.

The consistency and exchangeability conditions are here slightly stronger than those of the previous subsections. Specifically, Theorem 4 (Supplementary Appendix B) requires consistency of the form: for all k = 1, ..., K and a = 0, 1, Y_k(a) = Y_k if A_0 = a. The exchangeability condition requires, for a = 0, 1, that conditional on L_0, the counterfactual outcomes Y_1(a), ..., Y_K(a) are jointly independent of A_0. The positivity condition takes the same form as in the previous subsections (i.e., 0 < Pr(A_0 = a | L_0, S_0 = 1) < 1).

Risk-set sampling for per-protocol effect

For the per-protocol effect quantified by the (discrete-time) hazard ratio (or rate ratio)

Pr(Y_{k+1}(\overline{1}) = 1 | Y_k(\overline{1}) = 0) / Pr(Y_{k+1}(\overline{0}) = 1 | Y_k(\overline{0}) = 0),

eligibility for selection as a control for the period [t_k, t_{k+1}) again requires that the respective subject is event-free at t_k (i.e., Y_k = 0). Selection as a control between t_k and t_{k+1}, S_k, is further assumed independent of covariate and exposure history up to t_k given eligibility at t_k (but see Supplementary Appendix B for a slightly weaker assumption). As for the intention-to-treat effect, it is also assumed that the probability of being selected as a control, S_k, given eligibility is constant across time windows. This assumption is guaranteed to hold if the marginal hazard Pr(Y_{k+1} = 1 | Y_k = 0) remains constant across time windows and every k-th sampling fraction Pr(S_k = 1) is equal, up to a proportionality constant, to the probability of an incident case in the k-th window. Figure 1 shows five incident events, yet only three qualify as a case (rows 2, 3 and 8) when it concerns per-protocol effects. When the first case emerges (row 2), all rows meet the eligibility criterion for selection as a control. When the second emerges, the individual of row 2, who fails to survive event-free until t_4, is precluded as a control. When the case of row 8 emerges, only the individuals of rows 4, 5, 7 and 9 are eligible as controls.

Once cases and controls are selected, we can start to derive time-varying inverse probability weights W_k from control data, analogous to Eq. 1 but conditioning on the covariate and exposure history up to t_k.

It is important to note that the weights are derived from control information but are nonetheless used to weight both cases and controls [6]. The denominators of the weights describe the propensity to switch exposure level. However, once the weights are derived, every subject is censored from the time that they fail to adhere to one of the protocols of interest for all downstream analysis. The uncensored exposure levels are therefore constant over time. We then compute the baseline exposure odds among cases, weighted by the weights W_k corresponding to the interval [t_k, t_{k+1}) of the incident event (i.e., Y_k = 0, Y_{k+1} = 1), as well as the baseline exposure odds among controls, weighted by the sum over k of W_k S_k, the weighted number of times selected as a control. The ratio of these odds equals the target hazard ratio under the three key identifiability conditions of consistency, sequential conditional exchangeability, and positivity, together with the assumption that the hazards in the numerator and denominator of the causal hazard ratio for the per-protocol effect are constant across the time windows. The consistency, exchangeability and positivity conditions take a somewhat different (stronger) form than in the previous subsections; we refer the reader to Supplementary Appendix A for further details.

Case-control studies with matching

Table  2 gives an overview of identification results for case-control studies with exact pair matching. Formal statements and proofs are given in Supplementary Appendix C, which also includes a generalisation of the results of Table  2 to exact 1-to- M matching. While the focus in this section is on exact covariate matching, for partial matching we refer the reader to Supplementary Appendix D, where we consider parametric identification by way of conditional logistic regression.

Pair matching involves assigning a single control exposure level, which we denote by A′, to every case. As for case-control studies without matching, in a case-control study with matching an individual qualifies as a case if and only if they sustain the event of interest by the administrative study end (i.e., Y_K = 1) and adhered to one of the protocols of interest until the time of the incident event. How a matched control exposure is assigned is encoded in the sampling scheme and the assumptions of Table 2. For example, for identification of the causal marginal risk ratio under case-base sampling, A′ is sampled from all study participants whose baseline covariate value matches that of the case, independently of the participants' baseline exposure value and whether they survive until the end of study. The matching is exact in the sense that the control exposure information is derived from an individual who has the same value for the baseline covariate as the case.

The identification strategy is the same for all results listed in Table 2. Only the case-control pairs (A_0, A′) with discordant exposure values (i.e., (1,0) or (0,1)) are used. Under the stated sampling schemes and assumptions, the respective estimands are identified by the ratio of the two types of discordant pairs: the number of pairs in which only the case is exposed divided by the number in which only the control is exposed, as in the sketch below.
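A minimal sketch of this discordant-pair calculation (the matched pairs below are hypothetical, not from the article):

```python
# Sketch (hypothetical matched pairs): with exact pair matching, the estimand is
# identified by the ratio of discordant pairs, i.e. pairs where only the case is
# exposed divided by pairs where only the control is exposed.
pairs = [  # (case exposure A0, matched control exposure A')
    (1, 0), (1, 0), (1, 1), (0, 0), (1, 0),
    (0, 1), (1, 0), (0, 0), (1, 1), (0, 1),
]

n_case_exposed_only = sum(1 for a0, a_prime in pairs if (a0, a_prime) == (1, 0))
n_control_exposed_only = sum(1 for a0, a_prime in pairs if (a0, a_prime) == (0, 1))

estimate = n_case_exposed_only / n_control_exposed_only
print(f"{n_case_exposed_only} / {n_control_exposed_only} = {estimate:.1f}")
```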

This paper gives a formal account of how and when causal effects can be identified in case-control studies and, as such, underpins the case-control application of Dickerman et al. [ 3 ]. Like Dickerman et al., we believe that case-control studies should generally be regarded as being nested within cohort studies. This view emphasises that the threats to the validity of cohort studies should also be considered in case-control studies. For example, in case-control applications with risk-set sampling, researchers often consider the covariate and exposure status only at, or just before, the time of the event (for cases) or the time of sampling (for controls). However, where a cohort study would require information on baseline levels or the complete treatment and covariate history of participants, one should suspect that this holds for the nested case-control study too. To gain clarity, we encourage researchers to move away from using person-years, -weeks, or -days (rather than individuals) as the default units of inference [ 7 ], and to realise that inadequately addressed deviations from a target trial may lead to bias (or departure from identifiability), regardless of whether the study that attempts to emulate it is a case-control or a cohort study [ 3 ].

What is meant by a cohort study differs between authors and contexts [ 8 ]. The term ‘cohort’ may refer to either a ‘dynamic population’, or a ‘fixed cohort’, whose “membership is defined in a permanent fashion” and “determined by a single defining event and so becomes permanent” [ 9 ]. While it may sometimes be of interest to ask what would have happened with a dynamic cohort (e.g., the residents of a country) had it been subjected to one treatment protocol versus another, the results in this paper relate to fixed cohorts.

Like the cohort studies within which they are (at least conceptually) nested, case-control studies require an explicit definition of time zero, the time at which a choice is to be made between treatment strategies or protocols of interest [ 3 ]. Given a fixed cohort, time zero is generally determined by the defining event of the cohort (e.g., first diagnosis of a particular disease or having survived one year since diagnosis). This event may occur at different calendar times for different individuals. However, while a fixed cohort may be ‘open’ to new members relative to calendar time, it is always ‘closed’ along the time axis on which all subject-specific time zeros are aligned.

In this paper, time was regarded as discrete. Since we considered arbitrary intervals between time points and because, in real-world studies, time is never measured in a truly continuous fashion, this does not represent an important limitation for practical purposes. It is however important to note that the intervals between interventions and outcome assessments (in a target trial) are an intrinsic part of the estimand that lies at the start of investigation. Careful consideration of time intervals in the design of the conceptual target trial and of the actual cohort or case-control study is therefore warranted.

We emphasize that identification and estimation are distinct steps in causal inference. Although our focus was on the former, identifying functionals often naturally translate into estimators. The task of finding the estimator with the most appealing statistical properties is not necessarily straightforward, however, and is beyond the scope of this paper.

We specifically studied two causal contrasts (i.e., pairs of interventions), one corresponding to intention-to-treat effects and the other to always-versus-never per-protocol effects of a time-varying exposure. There are of course many more causal contrasts, treatment regimes and estimands conceivable that could be of interest. We argue that also for these estimands, researchers should seek to establish identifiability before they select an estimator.

The conditions under which identifiability is to be sought for practical purposes may well include more constraints or obstacles to causal inference, such as additional missingness (e.g., outcome censoring) and measurement error, than we have considered here. While some of our results assume that hazards or hazard ratios remain constant over time, in many cases these are likely time-varying [ 10 , 11 ]. There are also more case-control designs (e.g., the case-crossover design) to consider. These additional complexities and designs are beyond the scope of this paper and represent an interesting direction for future research.

The case-control family of study designs is an important yet often misunderstood tool for identifying causal relations [ 12 – 15 ]. Although there is much to be learned, we believe that the modern arsenal for causal inference, which includes counterfactual thinking, is well-suited to make transparent for these classical epidemiological study designs what assumptions are sufficient or necessary to endow the study results with a causal interpretation and, in turn, help resolve or prevent misunderstanding.

Availability of data and materials

Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.

References

1. Hernán M, Robins J. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC; 2020.
2. Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol. 2016;183(8):758–64.
3. Dickerman BA, García-Albéniz X, Logan RW, Denaxas S, Hernán MA. Emulating a target trial in case-control designs: an application to statins and colorectal cancer. Int J Epidemiol. 2020;49(5):1637–46.
4. Petersen ML, Van der Laan MJ. Causal models and learning from data: integrating causal modeling and statistical estimation. Epidemiol (Camb, Mass). 2014;25(3):418.
5. Maclaren OJ, Nicholson R. Models, identifiability, and estimability in causal inference. In: 38th International Conference on Machine Learning, Workshop on the Neglected Assumptions in Causal Inference. ICML; 2021. https://sites.google.com/view/naci2021/home.
6. Robins JM. [Choice as an alternative to control in observational studies]: comment. Stat Sci. 1999;14(3):281–93.
7. Hernán MA. Counterpoint: epidemiology to guide decision-making: moving away from practice-free research. Am J Epidemiol. 2015;182(10):834–39.
8. Vandenbroucke JP, Pearce N. Incidence rates in dynamic populations. Int J Epidemiol. 2012;41(5):1472–79.
9. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins; 2008.
10. Lefebvre G, Angers J-F, Blais L. Estimation of time-dependent rate ratios in case-control studies: comparison of two approaches for exposure assessment. Pharmacoepidemiol Drug Saf. 2006;15(5):304–16.
11. Guess HA. Exposure-time-varying hazard function ratios in case-control studies of drug effects. Pharmacoepidemiol Drug Saf. 2006;15(2):81–92.
12. Knol MJ, Vandenbroucke JP, Scott P, Egger M. What do case-control studies estimate? Survey of methods and assumptions in published case-control research. Am J Epidemiol. 2008;168(9):1073–81.
13. Pearce N. Analysis of matched case-control studies. BMJ. 2016;352:i969.
14. Mansournia MA, Jewell NP, Greenland S. Case-control matching: effects, misconceptions, and recommendations. Eur J Epidemiol. 2018;33(1):5–14.
15. Labrecque JA, Hunink MM, Ikram MA, Ikram MK. Do case-control studies always estimate odds ratios? Am J Epidemiol. 2021;190(2):318–21.


Acknowledgments

None declared.

Funding

RHHG was funded by the Netherlands Organization for Scientific Research (NWO-Vidi project 917.16.430). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding body.

Author information

Authors and affiliations.

Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, PO Box 9600, 2300 RC, The Netherlands

Bas B. L. Penning de Vries & Rolf H. H. Groenwold

Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands

Rolf H. H. Groenwold


Contributions

BBLPdV devised the project and wrote the manuscript and supplementary material with substantial input from RHHG, who supervised the project. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Bas B. L. Penning de Vries .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Not applicable.

Competing interests.

The authors declare that they have no competing interests.


Supplementary Information

Additional file 1.

Supplementary material to ‘Identification of causal effects in case-control studies’.


About this article

Cite this article.

Penning de Vries, B.B.L., Groenwold, R.H.H. Identification of causal effects in case-control studies. BMC Med Res Methodol 22, 7 (2022). https://doi.org/10.1186/s12874-021-01484-7


Received : 26 August 2021

Accepted : 29 November 2021

Published : 07 January 2022

DOI : https://doi.org/10.1186/s12874-021-01484-7


  • Causal inference
  • Case-control designs
  • Identifiability


Level III Evidence: A Case-Control Study

  • First Online: 02 February 2019


  • Andrew D. Lynch 8 ,
  • Adam J. Popchak 8 &
  • James J. Irrgang 8  


Case-control studies are used to retrospectively determine the role of an exposure in the etiology of an outcome or condition of interest that is rare or takes a long time to develop. Because of the retrospective nature, case-control studies can be completed relatively quickly and at a smaller cost than a prospective observational study. However, the retrospective nature may introduce multiple types of bias into the data set, and results must be considered in light of the limitations of the retrospective study. This chapter outlines approaches to selecting cases and controls, designing studies to minimize bias, and basic approaches to statistical analysis.




Author information

Authors and affiliations.

Departments of Physical Therapy and Orthopaedic Surgery, University of Pittsburgh, Pittsburgh, PA, USA

Andrew D. Lynch, Adam J. Popchak & James J. Irrgang


Corresponding author

Correspondence to Andrew D. Lynch .

Editor information

Editors and affiliations.

UPMC Rooney Sports Complex, University of Pittsburgh, Pittsburgh, PA, USA

Volker Musahl

Department of Orthopaedics, Sahlgrenska Academy, Gothenburg University, Sahlgrenska University Hospital, Gothenburg, Sweden

Jón Karlsson

Department of Orthopaedic Surgery and Traumatology, Kantonsspital Baselland (Bruderholz, Laufen und Liestal), Bruderholz, Switzerland

Michael T. Hirschmann

McMaster University, Hamilton, ON, Canada

Olufemi R. Ayeni

Hospital for Special Surgery, New York, NY, USA

Robert G. Marx

Department of Orthopaedic Surgery, NorthShore University HealthSystem, Evanston, IL, USA

Jason L. Koh

Institute for Medical Science in Sports, Osaka Health Science University, Osaka, Japan

Norimasa Nakamura


Copyright information

© 2019 ISAKOS

About this chapter

Lynch, A.D., Popchak, A.J., Irrgang, J.J. (2019). Level III Evidence: A Case-Control Study. In: Musahl, V., et al. Basic Methods Handbook for Clinical Orthopaedic Research. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58254-1_32


DOI : https://doi.org/10.1007/978-3-662-58254-1_32

Published : 02 February 2019

Publisher Name : Springer, Berlin, Heidelberg

Print ISBN : 978-3-662-58253-4

Online ISBN : 978-3-662-58254-1




Case-Control Study


Study Design


Now that you have thoroughly assessed the situation, you have enough information to generate some hypotheses. The two suspected causal agents of the outbreak of Susser Syndrome are Quench-It and EnduroBrick. Use the case-control method to design a study that will allow you to compare the exposures to these products among your cases of Susser Syndrome and healthy controls of your choice. From all of your class work, you know that you want your hypotheses to be as explicit and detailed as possible.

1. Based on the information you gathered, which of the following hypotheses is the most appropriate for your case-control study?

  • (1) Those who consumed EnduroBrick are more likely to be diagnosed with Susser Syndrome than those who did not; (2) Those who consumed Quench-It are more likely to be diagnosed with Susser Syndrome than those who did not consume Quench-It.
  • Individuals diagnosed with Susser Syndrome are more likely to have been members of the Superfit Fitness Center than individuals without Susser Syndrome.
  • Individuals diagnosed with Susser Syndrome are likely to have had a different set of exposures than individuals not diagnosed with Susser Syndrome.

Now that you have hypotheses, the next step is to prepare the case definition. This requires us to understand how Susser Syndrome is diagnosed. The more certain you are about your diagnosis, the less error you will introduce into your study by incorrectly specifying cases. Based on information from the EDOH website, you decide that your case definition will be based on a clinical diagnosis of Susser Syndrome.

After you establish your case definition, you need to decide on the population from which the cases for your study will be obtained. Since the majority of cases from the recent outbreak were active members of the Superfit Fitness Center, you decide to base your study on this population.

Next you need to decide how you will classify your cases and controls based on exposure status. Remember, we are actually operating under two hypotheses here, each with its own unique exposure variable. Scientists working on the possible causal connection between consumption of EnduroBrick or Quench-It and the development of Susser Syndrome suggest that both exposures may have an induction time of at least 6 months. Under this hypothesis, any cases of Susser Syndrome that occurred within 6 months of initial consumption of either EnduroBrick or Quench-It could not plausibly have been caused by the exposure. Thus, you stipulate that at least 6 months must have elapsed since the initial exposure before an individual will be considered "exposed".
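To make the induction-time rule concrete, here is a minimal sketch of how exposure status might be coded. The date fields, the 183-day cut-off and the function name are illustrative assumptions, not part of the module itself; the reference date would be the diagnosis date for cases and the interview date for controls.

```python
from datetime import date
from typing import Optional

INDUCTION_DAYS = 183  # roughly 6 months; assumed cut-off for the induction-time rule


def exposure_status(first_consumed: Optional[date], reference_date: date) -> str:
    """Classify a participant as 'exposed' or 'unexposed' to EnduroBrick/Quench-It.

    first_consumed: date the product was first consumed (None if never consumed).
    reference_date: diagnosis date for cases, interview date for controls.
    """
    if first_consumed is None:
        return "unexposed"
    # Consumption counts as exposure only if it began at least ~6 months
    # before the reference date, per the induction-time hypothesis.
    if (reference_date - first_consumed).days >= INDUCTION_DAYS:
        return "exposed"
    return "unexposed"


print(exposure_status(date(2023, 1, 15), date(2023, 10, 1)))  # exposed
print(exposure_status(date(2023, 8, 1), date(2023, 10, 1)))   # unexposed
```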

Once all of these decisions have been made, it is time to create appropriate eligibility criteria for your cases and controls.

2. Which of the following do you think are the best eligibility criteria for the cases? [Aschengrau & Seage, pp. 239-243]

  • Cases should have been members of the Superfit Center in the last two years for at least 6 months (total) and consumed either EnduroBrick or Quench-It.
  • Cases should be correctly diagnosed with Susser Syndrome and be employed at Glop Industries.
  • Cases should be correctly diagnosed with Susser Syndrome and have been members of the Superfit Fitness Center for at least 6 months in the last two years.

Now you need to decide who is eligible to be a control.

You recall from your wonderful learning experience in P6400 that valid controls in a case-control study are individuals that, had they acquired the disease under investigation, would have ended up as cases in your study. The best way to ensure this is to sample controls from the same population that gave rise to the cases. To ensure that the controls accurately represent a sample of the distribution of exposure in the population giving rise to the cases, they should be sampled independently of exposure status.

3. Which of the following do you think are the best eligibility criteria for the controls?

  • Controls should be residents of Epiville who have not been diagnosed with Susser Syndrome.
  • Controls should be members of the Superfit Fitness Center who have been diagnosed with Susser Syndrome but have not consumed either EnduroBrick or Quench-It.
  • Controls should be members of the Superfit Center for at least 6 months in the last 2 years and not be diagnosed with Susser Syndrome at the time of data collection.

Now that the eligibility criteria have been set, you must determine the specifics of the case-control study design.

How many cases and controls should you recruit?

The answer to this question obviously depends on your time and resources. However, an equally important consideration is how much power you want the study to have. Conventionally, we want a study to have at least 80 percent power to detect a significant difference between the groups. If a study has less than 80 percent power, we consider it underpowered. This does not mean its results are incorrect, but if we observe a non-significant result in an underpowered study, we cannot tell whether there truly is no association or whether the study simply lacked the power to detect one.


After crunching the numbers, you determine that the study will require the following size to achieve a desired power of 80 percent:

  • Number of cases: 112
  • Number of controls: 224
  • Total number of subjects: 336

Bear in mind that the study is voluntary. Subjects, even when eligible, are in no way required to participate. Furthermore, subjects may drop out of the study before completion, further decreasing your sample size. Study participation depends in large part on the methods of recruitment. In-person recruitment is generally regarded as the most effective, followed by telephone interviews, and then mail invitations. The participation rate that you expect to achieve, given your method of recruitment, will help you to calculate approximately how many individuals you will need to contact in order to meet your sample size.
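As an illustration of how such a sample size and recruitment target might be computed, the sketch below uses a standard two-proportion power calculation from statsmodels. The assumed exposure prevalences among cases and controls, the 2:1 control-to-case ratio, the 5% significance level and the 60% participation rate are hypothetical inputs chosen for illustration, not figures taken from the module.

```python
import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical design inputs (not from the Epiville module itself)
p_exposed_controls = 0.30   # assumed exposure prevalence among controls
p_exposed_cases = 0.50      # assumed exposure prevalence among cases
alpha, power = 0.05, 0.80
controls_per_case = 2

# Effect size for comparing two proportions (Cohen's h)
es = proportion_effectsize(p_exposed_cases, p_exposed_controls)

# Number of cases needed for the desired power; ratio = controls per case
n_cases = math.ceil(NormalIndPower().solve_power(effect_size=es, alpha=alpha,
                                                 power=power, ratio=controls_per_case))
n_controls = controls_per_case * n_cases

# If, say, only 60% of eligible people agree to participate (assumed rate),
# inflate the number of people to contact accordingly.
participation_rate = 0.60
to_contact = math.ceil((n_cases + n_controls) / participation_rate)

print(f"cases: {n_cases}, controls: {n_controls}, people to contact: {to_contact}")
```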

Should you recruit cases and controls simultaneously, or all cases first and then all controls?

Website URL: http://epiville.ccnmtl.columbia.edu/


Tests of the null hypothesis in case-control studies

  • PMID: 6534405

The relative merits of the likelihood ratio statistic, the Wald statistic, and the score statistic are examined by an empirical evaluation based on matched case-control data. A mixture model for the relative-odds function is used. The likelihood ratio statistic is relatively constant for reasonable values of the mixture parameter, but the Wald statistic is unstable. The score statistic is shown to be independent of the mixture parameter. An exact expression is derived for the change in the score statistic upon deletion of risk sets, and an approximation is numerically evaluated.
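As a point of reference for readers less familiar with these statistics, the sketch below computes Wald and likelihood-ratio statistics for a single binary exposure using an ordinary (unmatched, unconditional) logistic regression on simulated data. This is a deliberately simplified illustration, not the matched-set mixture-model analysis described in the abstract; the simulation parameters are arbitrary assumptions.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 500
exposure = rng.binomial(1, 0.4, size=n)
# Simulated disease status with a modest true log-odds ratio of 0.5 (assumed value)
logit = -1.0 + 0.5 * exposure
disease = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(exposure)
fit = sm.Logit(disease, X).fit(disp=False)
fit0 = sm.Logit(disease, np.ones((n, 1))).fit(disp=False)  # null (intercept-only) model

beta, se = fit.params[1], fit.bse[1]
wald = (beta / se) ** 2          # Wald statistic, ~chi2(1) under the null
lrt = 2 * (fit.llf - fit0.llf)   # likelihood-ratio statistic, ~chi2(1) under the null

print(f"Wald: {wald:.2f} (p = {chi2.sf(wald, 1):.3f})")
print(f"LRT:  {lrt:.2f} (p = {chi2.sf(lrt, 1):.3f})")
```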



Case-control study of the characteristics and risk factors of hot clot artefacts on 18F-FDG PET/CT

  • Jacques Dzuko Kamga 1 ,
  • Romain Floch 1 ,
  • Kevin Kerleguer 1 ,
  • David Bourhis 1 , 2 ,
  • Romain Le Pennec 1 , 2 ,
  • Simon Hennebicq 1 ,
  • Pierre-Yves Salaün 1 , 2 &
  • Ronan Abgral 1 , 2  

Cancer Imaging, volume 24, Article number: 114 (2024)


Introduction

The pulmonary hot clot artifact (HCa) on 18F-FDG PET/CT is a poorly understood phenomenon, corresponding to the presence of a focal tracer uptake without an anatomical lesion on the combined CT scan. The hypothesis proposed in the literature is a microembolic origin. Our objectives were to determine the incidence of HCa, to analyze its characteristics and to identify associated factors.

All retrieved 18F-FDG PET/CT reports containing the keywords (artifact / vascular adhesion / no morphological abnormality) over the period June 2021–June 2023 at Brest University Hospital were reviewed for HCa. Each case was matched with 2 control patients (same daily work-list). The anatomical and metabolic characteristics of HCa were analyzed. Factors related to FDG preparation/administration, the patient and vascular history were investigated. Case-control differences between variables were tested using the chi-2 test and OR (qualitative variables) or Student's t-test (quantitative variables).

Of the 22,671 18F-FDG PET/CT performed over 2 years, 211 patients (0.94%) showed HCa. The focus was single in 97.6%, peripheral in 75.3%, and located independently in the right or left lung (51.1% vs. 48.9%). Mean ± SD values for SUVmax, SUVmean, MTV and TLG were 11.3 ± 16.5, 5.1 ± 5.0, 0.3 ± 0.3 ml and 1.5 ± 2.1 g respectively. The presence of vascular adhesion ( p  < 0.001), patient age ( p  = 0.002) and proximal venous access ( p  = 0.001) were statistically associated with the presence of HCa.

HCa is a real but rare phenomenon (incidence around 1%), mostly unique, intense, small in volume (< 1 ml), and associated with the presence of vascular FDG uptake, confirming the hypothesis of a microembolic origin due to probable vein wall trauma at the injection site.

18F-fluorodesoxyglucose positron emission tomography / computed tomography (18F-FDG PET/CT) is a functional imaging technique based on the study of glucose metabolism in cells. Although it is a whole-body scan, the analysis of the lungs remains fundamental in many contexts, not only in oncology. Indeed, FDG-PET/CT is now routinely recommended for the characterization of solid pulmonary nodules ≥ 8 mm and for the initial staging of non-small-cell lung cancer [ 1 ]. More recently, it can also be suggested for the management of infectious or inflammatory pathology, such as unknown chronic fever or sarcoidosis [ 2 ].

Numerous specific technical artifacts and potential pitfalls in the interpretation of PET/CT in the thoracic region, including normal variations in physiological uptake of 18F-FDG and benign conditions, have been well described [3]. Awareness of these pitfalls is crucial as they may lead to misinterpretation with consequences for patient management and therapeutic implications [4]. One cause of these false-positive results, called the "hot clot artefact" (HCa), is still poorly understood. HCa fulfils 3 criteria: (i) the presence of one or more focal pulmonary 18F-FDG uptake(s) without anatomical lesion on CT scan; (ii) the high level of visual and semi-quantitative metabolic activity of the foci; (iii) the disappearance or migration of foci on late or subsequent acquisition [4, 5, 6].

There is very little literature available on this subject, based mainly on the publication of several case reports, totaling approximately twenty cases (21 patients). Nevertheless, certain hypotheses have been proposed to explain this relatively rare phenomenon. Thus, pulmonary microvascular embolism due to clots formed at the 18F-FDG injection site as a result of the vascular lesion and the agglutinating nature of FDG is the most plausible mechanism, as some authors have reported para-venous injection, rapid injection or blood aspiration into the injector system [ 4 , 7 , 8 , 9 , 10 , 11 ].

Against this background, our aims were to determine the incidence of hot clot artefact in a large case-control PET/CT study, to analyze its 18F-FDG uptake characteristics and to identify its potential associated factors.

Materials and methods

This is a single-center retrospective observational case-control study conducted in the Nuclear Medicine Department of Brest University Hospital between June 2021 and June 2023. The study was conducted in accordance with the Declaration of Helsinki and was approved by the French Advisory Committee on Information Processing in Health Research (CCTIRS).

All patients who underwent a 18F-FDG PET/CT during the 2-year inclusion period were analysed, regardless of indication. First, examination reports available in the radiology information system (Xplore, EDL, Paris, France) were queried using an AI word recognition algorithm with the terms “artefact” and/or “vascular adhesion” and/or “no morphological abnormality”. All selected files were reviewed to authenticate HCa cases, defined as the presence of one or more focal pulmonary 18F-FDG uptake(s) without anatomical lesion on CT scan and disappearance of the focus or no appearance of pathological lesion on a subsequent scan. Finally, 2 control patients per case were included as those managed immediately before or after the selected case on the daily work list and using the same examination modality.
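A minimal sketch of this control-selection step, assuming a hypothetical worklist table with one row per examination; the column names (scanner, exam_date, exam_time, patient_id) are invented for illustration and not taken from the study's information system.

```python
import pandas as pd


def pick_worklist_controls(worklist: pd.DataFrame, case_ids: set) -> pd.DataFrame:
    """For each case, take the examinations performed immediately before and after it
    on the same daily work list (same scanner, same day) as its two controls."""
    controls = []
    ordered = worklist.sort_values(["scanner", "exam_date", "exam_time"])
    for _, group in ordered.groupby(["scanner", "exam_date"]):
        group = group.reset_index(drop=True)
        for i in range(len(group)):
            if group.loc[i, "patient_id"] in case_ids:
                # neighbours on the work list that are not themselves cases
                for j in (i - 1, i + 1):
                    if 0 <= j < len(group) and group.loc[j, "patient_id"] not in case_ids:
                        controls.append(group.loc[j])
    return pd.DataFrame(controls)
```

In practice, ties, repeat examinations and cases scanned back-to-back would need additional handling, but the sketch captures the "immediately before or after on the same daily work list" idea.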

18F-FDG PET/CT procedure

All 18F-FDG PET/CT images were acquired on two digital Biograph Vision 600 PET/CT scanner systems (Siemens Healthineers, Knoxville, TN, USA) with the same technical settings.

Standard patient preparation included at least 4 h fasting and serum blood glucose level < 7 mmol/L prior to intravenous injection of approximately 3 MBq/kg (0.08 mCi/kg) of FDG by a nuclear medicine technologist (NMT) via a catheter or a permanent device (implantable chamber, PICC line or midline). After injection, patients remained in a quiet room for approximately 60 min before acquisition.

First, a CT scan was obtained just after injection of intravenous iodine contrast agent (1.5 mL/kg), unless contraindicated. The CT was acquired on a 64-slice multidetector-row spiral scanner with the following parameters: 110 kVp tube voltage (automatic modulation, carekV®); 80 ref. mAs effective tube current with automatic dose modulation (care4D®); 0.5 s rotation time; 19.2 mm total collimation width; pitch 1; 512 matrix size; 0.98 × 0.98 mm pixels; 2 mm slice thickness.

Then, PET data were acquired in the craniocaudal direction using a whole-body protocol (2 min per step) and were reconstructed using an iterative ordered subset expectation maximization (OSEM) algorithm (True X® = point spread function (PSF) + time of flight (TOF) acquisition capabilities, 4 iterations, 5 subsets). Images were corrected for random coincidences, scatter and attenuation using the CT scan data and were smoothed with a Gaussian filter (full-width at half-maximum = 2 mm). The axial field of view was 263 mm and the overlap fraction was 49%. The reconstruction matrix was 440 × 440 voxels and the voxel size was 1.65 × 1.65 × 1.65 mm.

Image analysis

Hot clot artifacts (HCa) were visually characterized in terms of number (single or multiple) and location (right or left; lower lobe (LL) or middle lobe (ML) or upper lobe (UL), peripheral or intermediate or proximal).

Tracer uptake was determined using SUVs, calculated according to the following formula: SUV = tissue radioactivity concentration [kBq/mL] / (injected dose [kBq] / patient weight [g]). Various PET parameters were analyzed for each HCa using MIM software (MIM Software Inc., Cleveland, United States): SUVmax and SUVmean, corresponding to the maximum and average values of SUV respectively; MTV (metabolic target volume), defined as the summed volume in millilitres (mL) measured using an image gradient-based method (PET EDGE™) [12]; and TLG (total lesion glycolysis) in grams (g), defined as MTV × SUVmean.
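For clarity, the SUV and TLG definitions above translate directly into code; the following is a minimal sketch with illustrative numbers (no decay or calibration corrections are shown, and the example values are not data from the study).

```python
def suv(tissue_activity_kbq_per_ml: float, injected_dose_kbq: float, weight_g: float) -> float:
    """SUV = tissue activity concentration [kBq/mL] / (injected dose [kBq] / body weight [g])."""
    return tissue_activity_kbq_per_ml / (injected_dose_kbq / weight_g)


def tlg(mtv_ml: float, suv_mean: float) -> float:
    """Total lesion glycolysis (g) = metabolic volume (mL) x SUVmean."""
    return mtv_ml * suv_mean


# Example: a 5 kBq/mL focus, 210 MBq (3 MBq/kg) injected in a 70 kg (70,000 g) patient
print(round(suv(5.0, 210_000, 70_000), 2))   # ≈ 1.67
print(round(tlg(0.3, 5.1), 2))               # ≈ 1.53 g, same order as the mean TLG reported below
```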

Data collection

A different set of data was collected for each case and control patient, including: (i) clinical characteristics [gender (M/F), age, weight, height, blood glucose level, active cancer defined as patient with a history of known cancer who had not achieved a complete response for at least 6 months at the time of the PET-CT (yes/no), anticoagulant treatment or antiplatelet drug (yes/no), and previous history of venous thrombosis or pulmonary embolism (yes/no)]; (ii) FDG administration [venous access (proximal/distal), permanent device (yes/no), NMT in charge, injected activity, time between 18F-FDG injection and image acquisition, iodinated contrast administration (yes/no)]; (iii) imaging procedure [PET machine (PET1/PET2), FDG vessel adhesion at injection site defined as venous linear uptake (yes/no), FDG extravasation into soft tissues (yes/no)].

Statistical analyses were performed using EpiInfo software version 7.2.6.0.

Descriptive statistics were used to characterize the cohort. Qualitative variables were presented as number (n) and percentage (%). The association between dichotomous categorical variables and the presence of the hot clot artifact was measured by the odds ratio (OR) with a 95% confidence interval (95%CI). Significant differences were assessed using the chi-2 or Fisher exact test. Quantitative variables were expressed as mean ± standard deviation (SD) and compared between the case and control groups using Student's t-test. The level of significance was p < 0.05.
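As an illustration of this analysis (OR with 95% CI plus a chi-2 or Fisher exact test on a 2 × 2 table), the sketch below uses counts reconstructed approximately from the vascular-adhesion percentages reported later in the paper (64.9% of 211 cases, 42.2% of 422 controls), so its output should land close to the reported OR of 2.56. The confidence-interval formula (Woolf's logit method) is a standard choice assumed here, not a detail stated in the paper.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact, norm

# Rows: cases, controls; columns: exposed, unexposed (counts reconstructed from percentages)
table = np.array([[137, 74],
                  [178, 244]])

(a, b), (c, d) = table
odds_ratio = (a * d) / (b * c)
se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)          # Woolf's method for the SE of log(OR)
z = norm.ppf(0.975)
ci_low, ci_high = np.exp(np.log(odds_ratio) + np.array([-z, z]) * se_log_or)

chi2_stat, p_chi2, _, expected = chi2_contingency(table, correction=False)
_, p_fisher = fisher_exact(table)                    # fallback when expected counts are small

print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
print(f"chi2 = {chi2_stat:.2f}, p = {p_chi2:.3g}; Fisher p = {p_fisher:.3g}")
```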

Among the 22,671 18F-FDG PET/CT scans performed in our department between June 2021 and June 2023, 211 patients (98 M/113F, mean age ± SD 62.2 ± 15.4 years) had at least one pulmonary hot clot artefact, corresponding to an incidence of 0.94%. For further analysis of potential associated factors, 422 controls were selected, i.e. 2 per case.

The selection of case-control patients is described in the flowchart (Fig.  1 ).

Figure 1. Flowchart of case-control patient selection.

Hot clot artifact description

HCa were single, double or quintuple in 206 (97.6%), 4 (1.9%) and 1 case (0.5%) respectively, and were located in the right lung 112 times (51.1%) (58 in UL, 19 in ML and 35 in LL) and in the left lung 107 times (48.9%) (68 in UL and 39 in LL). The focus was peripheral (less than 2 cm from the pleura or fissure), proximal (less than 2 cm from the hilum) or intermediate (others) in 165 (75.3%), 23 (10.5%) and 31 (14.2%) cases respectively (Fig.  2 ).

Figure 2. Presentation of 2 illustrative cases of HCa.

(Case 1) A 54-year-old patient underwent an 18F-FDG PET scan as part of the staging of a left lung neoplasm. The MIP image showed FDG avidity of the tumour (star), FDG vascular uptake in the elbow and right arm (dotted black arrow), lymph node uptake in the right subclavicular region (black arrow), and 5 lung foci (blue arrows; 3 peripheral sub-scissural foci in the middle lobe, 1 peripheral sub-pleural focus in the left upper lobe, and 1 peripheral sub-pleural focus in the right upper lobe) without anatomical lesions opposite, corresponding to a quintuple case of HCa.

(Case 2) A 60-year-old patient with oral squamous cell carcinoma underwent 18F-FDG PET scans for staging (top row) and follow-up (bottom row). Focal FDG uptake in the peripheral subpleural region of the left upper lobe (blue arrow) on PET (B) and fused PET-CT images (C), with no CT abnormalities (A), disappeared on the second scan, confirming a case of HCa.

The mean values ± SD [range] of SUVmax, SUVmean, MTV and TLG were 11.3 ± 16.5 [0.9–142.0], 5.1 ± 5.0 [0.7–35.6], 0.3 ± 0.3 ml [0.1–1.5] and 1.5 ± 2.1 g [0.2–18.8], respectively. Only 3/219 MTV values (1.4%) were greater than 1 ml.

Associated factors

Clinical characteristics.

There was no significant difference in clinical characteristics between case and control patients (Table  1 ), except for age (mean ± SD 62.2 ± 15.4 vs. 65.9 ± 13.8, p  = 0.002).

FDG administration

Venous access (proximal vs. distal vs. permanent device) was associated with the occurrence of HCa (p = 0.001). The distribution of cases and controls by FDG administration is shown in Table 2.

Imaging analysis

There was no difference in FDG extravasation into soft tissues between cases and controls, in contrast to FDG venous linear uptake at the injection site on images, which was more frequent in the HCa case group than in the control group (64.9% vs. 42.2%, respectively; OR = 2.56, 95%CI 1.79–3.70, p < 0.001) (Table 3).

Our finding of a 0.94% incidence of pulmonary hot clot artifact (HCa) on 18F-FDG PET/CT (211/22,671 scans, for a total of 219 HCa) confirms the idea of a rare phenomenon. However, it has to be considered as a pitfall for physicians when interpreting images. Only Hany et al. found comparable results (p = 0.2 with a χ2 test), reporting an artifact in 3 patients out of 750 examinations carried out over a 9-month period, i.e. a frequency of 0.4% [7]. To the best of our knowledge, this is the largest series investigating the incidence of HCa, as the literature on this subject is sparse and mostly consists of case reports [4, 5, 7, 8, 9, 10, 11, 13] (Table 4).
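The comparison with Hany et al. can be checked directly from the two reported counts (211 of 22,671 scans here versus 3 of 750 in their series); a quick sketch follows. With so few events in the smaller series, a Fisher exact test would arguably be preferable to the chi-squared test.

```python
from scipy.stats import chi2_contingency, fisher_exact

# [scans with HCa, scans without HCa] in each series
table = [[211, 22671 - 211],   # present series (0.94%)
         [3, 750 - 3]]         # Hany et al. (0.4%)

chi2_stat, p, dof, expected = chi2_contingency(table)   # Yates-corrected 2 x 2 chi-squared
_, p_fisher = fisher_exact(table)
print(f"chi2 = {chi2_stat:.2f}, p = {p:.2f}; Fisher exact p = {p_fisher:.2f}")
# the chi-squared p-value comes out around 0.2, consistent with the comparison reported above
```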

In our series, the HCa was almost exclusively single (206/211 = 97.6%). This finding is in accordance with the literature, as 19 of 21 (90.4%) published cases reported a single artifact. We even observed an atypical example of a quintuple artifact in the same patient, as described by Ha et al. We found a balanced distribution of artifacts between the 2 lungs (51.1% versus 48.9%), redressing with a large population sample the right-lung predominance (65%) extracted from the literature (12 patients, 17 artifacts). In our results, HCa were subpleural in approximately three quarters of cases (75.3%), showing a tropism of the artifact for the peripheral region of the lung, where the blood vessels are of smaller caliber, and supporting the theory of a microscopic phenomenon with an embolic origin [4, 7, 8, 9, 11].

Regarding the metabolic characteristics of HCa, we found a high mean SUVmax (11.3) but with a large range [0.91 to 145], compared with the values calculated from data on 12 cases in the literature (mean SUVmax ± SD = 40.6 ± 49.1, with a maximum of 185.1 and a minimum of 3.4) [4, 8, 10, 11, 13]. These findings demonstrate very high SUVmax values, especially for possible lesion sizes below the spatial resolution of CT, as already suggested in several case reports [4, 9, 13]. However, this very high variability of SUV parameters does not allow, in current practice, the use of a threshold to distinguish an artifact from a pathological lesion prior to its morphological expression. Nevertheless, its volume could be an interesting tool. Indeed, the mean MTV ± SD was 0.3 ± 0.3 ml in our series; interestingly, 99% of foci (216/219) presented an MTV lower than 1 ml. This again confirms that this artifact is a very low-volume phenomenon, such as a micro-embolism. Therefore, an MTV value < 1 ml could be added as a new criterion for defining hot clot artifacts, avoiding repeat examinations (18F-FDG PET/CT or chest CT), thus limiting health care costs and improving patient management (consequences of misinterpretation, radiation exposure).

In our results, we found a significant statistical association between the presence of FDG vascular adhesion at the injection site (64.9% of cases vs. 42.2% of controls) and the presence of a hot clot artifact (OR = 2.56, 95%CI 1.79–3.70; p < 0.0001). This correlation favors an embolic origin, as we imagine that the stasis of the radiopharmaceutical at the injection site probably reflects trauma to the vein wall, making it likely that a hot clot formed and migrated towards the lung. This hypothesis has already been raised in the literature. In fact, Sánchez-Sánchez et al. observed the presence of 18F-FDG extravasation in 3 of their 4 reported patients [11]. In addition, Farsad et al. described a para-venous injection in the 4 cases they reported [10]. The migration or disappearance of the HCa on late or subsequent scans and the absence of clinical consequences for all 21 published cases are consistent with this micro-embolic origin [4, 5, 7, 8, 9, 10, 11, 13]. Moreover, regarding patient preparation, proximal venous access was significantly more frequent in cases than in controls (94.3% of cases versus 84.4% of controls, p = 0.0012). This result may seem paradoxical, as distal veins are thinner and more fragile, and therefore probably at risk of HCa. One explanation might be that the systematic use of small-caliber catheters for distal access in our routine would ultimately be less traumatic and protect against this risk. Retrospectively, we verified that the association between HCa and FDG vessel adhesion on PET was independent of venous access type. In addition, there was no association between the nuclear medicine technologist (NMT) responsible for patient management and the presence of the hot clot artifact (p = 0.994). This does not suggest an isolated problem of competence in the venipuncture procedure, which appears to be fairly homogeneous within our department. Injection-acquisition time interval and injected activity were not correlated with the presence of hot clot artifact. However, these two parameters varied very little (about 60 min for the delay and 3 MBq (0.08 mCi)/kg body weight for the injected activity), as we routinely used procedural guidelines for PET imaging [14]. We found no statistical association between the PET machine used for acquisition (p = 0.736) and the presence of a HCa, but the 2 systems were of the same model with the same technical settings. However, a machine effect remains unlikely, as the cases reported in the literature were published over a wide time interval (2003 to 2020); therefore, differences related to technological advances in PET imaging (PSF + TOF acquisition capabilities, digital technology, etc.) during this period cannot be involved [15, 16, 17, 18, 19]. Finally, there was no statistical association between the administration of iodinated contrast and the presence of the hot clot artifact (p = 0.1941), even though both agents were injected into the same venous access, making a pro-coagulant interaction between FDG and iodinated contrast agent unlikely.

We chose a 1:2 case-control design using the daily PET work list in order to rule out any obvious effect of radiopharmaceutical production (chemical purity, batch number, etc.) or time dependence (seasonal period, pm vs. am, etc.) on HCa occurrence. Our results showed that controls were on average older than cases (65.9 versus 62.2 years; p = 0.0021). At first sight, this may seem surprising, given that older people have a more fragile blood vessel system. On the contrary, one explanation could be that platelet function is better in younger people [20, 21]. The mean age of cases reported in the literature was 55.3 years (17 patients) [4, 5, 7, 8, 9, 11, 13]. In addition, other clinical characteristics were comparable between the 2 groups, notably in terms of gender (p = 0.910), as reported in the literature (21 patients, 52% female and 48% male) [4, 5, 7, 8, 9, 10, 11, 13]. Finally, the presence of active cancer (p = 0.519), a history of deep vein thrombosis or pulmonary embolism (p = 0.818), anti-platelet drugs (p = 0.997) or anticoagulant treatment (p = 0.773) were not statistically associated with the presence of hot clot artifact. These factors were examined to identify potential circumstances associated with VTE that may or may not put patients at risk of thrombus formation.

This study had several limitations related to its single-center retrospective nature, which is a source of selection bias and limits external validity, even though we used a large case-control study design. Firstly, the word recognition query in the 22,671 reports may have slightly underestimated the incidence of artefacts if nuclear medicine physicians did not mention them. Secondly, it resulted in missing data on the venous catheter caliber used for tracer injection, which prevented its inclusion in the analysis of protective and confounding factors for HCa occurrence. As mentioned above, we believe that the paradoxical statistical relationship between proximal (risk factor) and distal (protective factor) venous access could be explained by the use of small-caliber catheters distally to minimize vascular trauma. Thirdly, it also prevented us from studying the effect of injection type (manual versus automatic), as all our patients were injected with an automated system. Further prospective studies are needed to assess the effect of injection type and catheter size on the occurrence of artifacts. Finally, this study was limited to the specific case of FDG, whereas the problem of false-positive results may also concern other radiopharmaceuticals used in PET/CT. For example, Sgard B et al. in 2020 reported a case of pulmonary artifact on PET/CT with prostate-specific membrane antigen (PSMA) radioligands in the setting of biochemical recurrence of prostate adenocarcinoma. They associated this PSMA uptake with a vascular malformation, which is different from a hot clot phenomenon [22].

Hot clot artefact is a real but rare phenomenon, occurring in about 1% of examinations and representing a pitfall in the interpretation of 18F-FDG PET scans. The results of our large case-control study suggest that this focal pulmonary tracer uptake is mostly unique, intense and small in volume (< 1 ml), often peripheral in location, and associated with the presence of vascular adhesion on images. This supports the hypothesis of a micro-embolic origin due to probable trauma to the vessel wall at the injection site.

Data availability

No datasets were generated or analysed during the current study.

Salaün PY, Abgral R, Malard O, et al. Actualisation des recommandations de bonne pratique clinique pour l’utilisation de la TEP en cancérologie [Update of the recommendations of good clinical practice for the use of PET in oncology]. Bull Cancer. 2019;106(3):262–74. French. https://doi.org/10.1016/j.bulcan.2019.01.002

Casali M, Lauri C, Altini C, et al. State of the art of 18F-FDG PET/CT application in inflammation and infection: a guide for image acquisition and interpretation. Clin Transl Imaging. 2021;9(4):299–339. https://doi.org/10.1007/s40336-021-00445-w . Epub 2021 Jul 10. PMID: 34277510; PMCID: PMC8271312.


Corrigan AJ, Schleyer PJ, Cook GJ. Pitfalls and Artifacts in the Use of PET/CT in Oncology Imaging. Semin Nucl Med. 2015;45(6):481 – 99. https://doi.org/10.1053/j.semnuclmed.2015.02.006 . PMID: 26522391.

Ozdemir E, Poyraz NY, Keskin M, et al. Hot-clot artifacts in the lung parenchyma on F-18 fluorodeoxyglucose positron emission tomography/CT due to faulty injection techniques: two case reports. Korean J Radiol. 2014 Jul-Aug;15(4):530–3. https://doi.org/10.3348/kjr.2014.15.4.530 . Epub 2014 Jul 9. PMID: 25053914; PMCID: PMC4105817.

Karantanis D, Subramaniam RM, Mullan BP, et al. Focal F-18 fluoro-deoxy-glucose accumulation in the lung parenchyma in the absence of CT abnormality in PET/CT. J Comput Assist Tomogr. 2007;31:800–5.


Hartman T. Pearls and pitfalls in thoracic imaging: variants and other difficult diagnoses. New York: Cambridge University Press; 2011. pp. 198–201.


Hany TF, Heuberger J, von Schulthess GK. Iatrogenic FDG foci in the lungs: a pitfall of PET image interpretation. Eur Radiol. 2003;13:2122–7.

Ha JM, Jeong SY, Seo YS, et al. Incidental focal F-18 FDG accumulation in lung parenchyma without abnormal CT findings. Ann Nucl Med. 2009;23:599–603.

El Yaagoubi Y, Prunier-Aesch C, Philippe L, et al. Hot-clot artifact in the lung parenchyma on 18F-fluorodeoxyglucose positron emission tomography/computed tomography mimicking malignancy with a homolateral non-small cell lung cancer. World J Nucl Med. 2020;20(2):202–4. https://doi.org/10.4103/wjnm.WJNM_75_20 . PMID: 34321977; PMCID: PMC8286006.

Farsad M, Ambrosini V, Nanni C et al. Focal lung uptake of 18F-fluorodeoxyglucose (18F-FDG) without computed tomography findings. Nucl Med Commun. 2005;26(9):827 – 30. https://doi.org/10.1097/01.mnm.0000175786.27423.42 . PMID: 16096587.

Sánchez-Sánchez R, Rodríguez-Fernández A, Ramírez-Navarro A et al. PET-TAC: captación pulmonar focal de FDG sin alteracion estructural en TAC [PET/CT: focal lung uptake of 18F-fluorodeoxyglucose on PET but no structural alterations on CT]. Rev Esp Med Nucl. 2010 May-Jun;29(3):131-4. Spanish. https://doi.org/10.1016/j.remn.2010.01.002 . Epub 2010 Mar 15. PMID: 20227797.

Geets X, Lee JA, Bol A, et al. A gradient-based method for segmenting FDG-PET images: methodology and validation. Eur J Nucl Med Mol Imaging. 2007;34(9):1427–38. https://doi.org/10.1007/s00259-006-0363-4 . Epub 2007 Mar 13. PMID: 17431616.

Fathinul Fikri A, Lau W. An intense F-FDG pulmonary microfocus on PET without detectable abnormality on CT: a manifestation of an iatrogenic FDG pulmonary embolus. Biomed Imaging Interv J. 2010;6:e37.

Boellaard R, Delgado-Bolton R, Oyen WJ, et al. European Association of Nuclear Medicine (EANM). FDG PET/CT: EANM procedure guidelines for tumour imaging: version 2.0. Eur J Nucl Med Mol Imaging. 2015;42(2):328–54. https://doi.org/10.1007/s00259-014-2961-x . Epub 2014 Dec 2. PMID: 25452219; PMCID: PMC4315529.


Townsend DW. Dual-modality imaging: combining anatomy and function. J Nucl Med. 2008;49(6):938–55. https://doi.org/10.2967/jnumed.108.051276 . Epub 2008 May 15. PMID: 18483101.

Surti S. Update on time-of-flight PET imaging. J Nucl Med. 2015;56(1):98–105. https://doi.org/10.2967/jnumed.114.145029 . Epub 2014 Dec 18. PMID: 25525181; PMCID: PMC4287223.

Panin VY, Kehren F, Michel C et al. Fully 3-D PET reconstruction with system matrix derived from point source measurements. IEEE Trans Med Imaging. 2006;25(7):907 – 21. https://doi.org/10.1109/tmi.2006.876171 . PMID: 16827491.

van der Vos CS, Koopman D, Rijnsdorp S, et al. Quantification, improvement, and harmonization of small lesion detection with state-of-the-art PET. Eur J Nucl Med Mol Imaging. 2017;44(Suppl 1):4–16. https://doi.org/10.1007/s00259-017-3727-z . Epub 2017 Jul 8. PMID: 28687866; PMCID: PMC5541089.

van Sluis J, de Jong J, Schaar J, et al. Performance characteristics of the Digital Biograph Vision PET/CT system. J Nucl Med. 2019;60(7):1031–6. https://doi.org/10.2967/jnumed.118.215418 . Epub 2019 Jan 10. PMID: 30630944.

Donato AJ, Machin DR, Lesniewski LA. Mechanisms of dysfunction in the Aging vasculature and role in Age-Related Disease. Circ Res. 2018;123(7):825–48. https://doi.org/10.1161/CIRCRESAHA.118.312563 . PMID: 30355078; PMCID: PMC6207260.


Gleerup G, Winther K. The effect of ageing on platelet function and fibrinolytic activity. Angiology. 1995;46(8):715-8. https://doi.org/10.1177/000331979504600810 . PMID: 7639418.

Sgard B, Montravers F, Fourquet A, de la Taille A, Gauthé M. Pulmonary Vein Varix Mimicking Prostate Cancer Metastasis on 68Ga-Prostate Specific Membrane Antigen-11 PET/CT. Clin Nucl Med. 2020;45(1):e39-e40. https://doi.org/10.1097/RLU.0000000000002803 . PMID: 31693611.


The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author information

Authors and affiliations.

Nuclear Medicine Department, CHRU Brest, Boulevard Tanguy Prigent, Brest, France

Jacques Dzuko Kamga, Romain Floch, Kevin Kerleguer, David Bourhis, Romain Le Pennec, Simon Hennebicq, Pierre-Yves Salaün & Ronan Abgral

UMR Inserm GETBO 1304, University of Western Brittany, Brest, France

David Bourhis, Romain Le Pennec, Pierre-Yves Salaün & Ronan Abgral


Contributions

Each author contributed to the submitted work as follows: JDK and RA are the guarantors of the paper. JDK, PYS and RA designed the study. JDK and DB performed the statistical analyses. JDK, RF, KK, SH and RA analyzed the data. JDK and RA drafted the manuscript. RLP and PYS revised the manuscript. All authors contributed to drawing up the manuscript. All authors declare having no conflict of interest.

Corresponding authors

Correspondence to Jacques Dzuko Kamga or Ronan Abgral .

Ethics declarations

Ethical approval.

The study was conducted in accordance with the Declaration of Helsinki and was approved by the French Advisory Committee on Information Processing in Health Research (CCTIRS).

Consent to participate

All patients have expressed their non-objection to the use of their medical information and images in an anonymized form.

Consent to publish

The patients in Fig.  2 (cases 1 and 2 ) have expressed their non-objection to the use of their medical information and images in an anonymized form.

Competing interests

The authors declare no competing interests.



Dzuko Kamga, J., Floch, R., Kerleguer, K. et al. Case-control study of the characteristics and risk factors of hot clot artefacts on 18F-FDG PET/CT. Cancer Imaging 24 , 114 (2024). https://doi.org/10.1186/s40644-024-00760-1


Received : 11 June 2024

Accepted : 07 August 2024

Published : 27 August 2024

DOI : https://doi.org/10.1186/s40644-024-00760-1


  • Hot clot artifact
  • False-positive
  • FDG-PET pitfall


Basic statistical analysis in genetic case-control studies

Geraldine M Clarke

1 Genetic and Genomic Epidemiology Unit, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.

Carl A Anderson

2 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

Fredrik H Pettersson

Lon R Cardon

3 GlaxoSmithKline, King of Prussia, Pennsylvania, USA.

Andrew P Morris

Krina T Zondervan

This protocol describes how to perform basic statistical analysis in a population-based genetic association case-control study. The steps described involve the (i) appropriate selection of measures of association and relevance of disease models; (ii) appropriate selection of tests of association; (iii) visualization and interpretation of results; (iv) consideration of appropriate methods to control for multiple testing; and (v) replication strategies. Assuming no previous experience with software such as PLINK, R or Haploview, we describe how to use these popular tools for handling single-nucleotide polymorphism data in order to carry out tests of association and visualize and interpret results. This protocol assumes that data quality assessment and control has been performed, as described in a previous protocol, so that samples and markers deemed to have the potential to introduce bias to the study have been identified and removed. Study design, marker selection and quality control of case-control studies have also been discussed in earlier protocols. The protocol should take ~1 h to complete.

INTRODUCTION

A genetic association case-control study compares the frequency of alleles or genotypes at genetic marker loci, usually single-nucleotide polymorphisms (SNPs) (see Box 1 for a glossary of terms), in individuals from a given population—with and without a given disease trait—in order to determine whether a statistical association exists between the disease trait and the genetic marker. Although individuals can be sampled from families (‘family-based’ association study), the most common design involves the analysis of unrelated individuals sampled from a particular outbred population (‘population-based association study’). Although disease-related traits are usually the main trait of interest, the methods described here are generally applicable to any binary trait.

Admixture

The result of interbreeding between individuals from different populations.

Cochran-Armitage trend test

Statistical test for analysis of categorical data when categories are ordered. It is used to test for association in a 2 × k contingency table ( k > 2). In genetic association studies, because the underlying genetic model is unknown, the additive version of this test is most commonly used.

Confounding

A type of bias in statistical analysis that occurs when a factor exists that is causally associated with the outcome under study (e.g., case-control status) independently of the exposure of primary interest (e.g., the genotype at a given locus) and is associated with the exposure variable but is not a consequence of the exposure variable.

Covariate

Any variable other than the main exposure of interest that is possibly predictive of the outcome under study; covariates include confounding variables that, in addition to predicting the outcome variable, are associated with exposure.

False discovery rate

The proportion of non-causal or false positive significant SNPs in a genetic association study.

False positive

Occurs when the null hypothesis of no effect of exposure on disease is rejected for a given variant when in fact the null hypothesis is true.

Family-wise error rate

The probability of one or more false positives in a set of tests. For genetic association studies, family-wise error rates reflect false positive findings of associations between allele/genotype and disease.

Hardy-Weinberg equilibrium (HWE)

Given a minor allele frequency of p, the probabilities of the three possible unordered genotypes (a/a, A/a, A/A) at a biallelic locus with minor allele A and major allele a are (1 − p)², 2p(1 − p) and p². In a large, randomly mating, homogeneous population, these probabilities should be stable from generation to generation.

Linkage disequilibrium (LD)

The population correlation between two (usually nearby) allelic variants on the same chromosome; they are in LD if they are inherited together more often than expected by chance.

r²

A measure of LD between two markers calculated according to the correlation between marker alleles.

Odds ratio

A measure of association derived from case-control studies; it is the ratio of the odds of disease in the exposed group compared with the non-exposed.

Penetrance

The risk of disease in a given individual. Genotype-specific penetrances reflect the risk of disease with respect to genotype.

Population allele frequency

The frequency of a particular allelic variant in a general population of specified origin.

Population stratification

The presence of two or more groups with distinct genetic ancestry.

Relative risk

The risk of disease or of an event occurring in one group relative to another.

Single-nucleotide polymorphism (SNP)

A genetic variant that consists of a single DNA base-pair change, usually resulting in two possible allelic identities at that position.

Following previous protocols on study design, marker selection and data quality control 1 – 3 , this protocol considers basic statistical analysis methods and techniques for the analysis of genetic SNP data from population-based genome-wide and candidate-gene (CG) case-control studies. We describe disease models, measures of association and testing at genotypic (individual) versus allelic (gamete) level, single-locus versus multilocus methods of association testing, methods for controlling for multiple testing and strategies for replication. Statistical methods discussed relate to the analysis of common variants, i.e., alleles with a minor allele frequency (MAF) > 1%; different analytical techniques are required for the analysis of rare variants 4 . All methods described are proven and used routinely in our research group 5 , 6 .

Conceptual basis for statistical analysis

The success of a genetic association study depends on directly or indirectly genotyping a causal polymorphism. Direct genotyping occurs when an actual causal polymorphism is typed. Indirect genotyping occurs when nearby genetic markers that are highly correlated with the causal polymorphism are typed. Correlation, or non-random association, between alleles at two or more genetic loci is referred to as linkage disequilibrium (LD). LD is generated as a consequence of a number of factors and results in the shared ancestry of a population of chromosomes at nearby loci. The shared ancestry means that alleles at flanking loci tend to be inherited together on the same chromosome, with specific combinations of alleles known as haplotypes. In genome-wide association (GWA) studies, common SNPs are typically typed at such high density across the genome that, although any single SNP is unlikely to have direct causal relevance, some are likely to be in LD with any underlying common causative variants. Indeed, most recent GWA arrays containing up to 1 million SNPs use known patterns of genomic LD from sources such as HapMap 7 to provide the highest possible coverage of common genomic variation 8 . CG studies usually focus on genotyping a smaller but denser set of SNPs, including functional polymorphisms with a potentially higher previous probability of direct causal relevance 2 .

A fundamental assumption of the case-control study is that the individuals selected in case and control groups provide unbiased allele frequency estimates of the true underlying distribution in affected and unaffected members of the population of interest. If not, association findings will merely reflect biases resulting from the study design 1 .

Models and measures of association

Consider a genetic marker consisting of a single biallelic locus with alleles a and A (i.e., a SNP). Unordered possible genotypes are then a/a , a/A and A/A . The risk factor for case versus control status (disease outcome) is the genotype or allele at a specific marker. The disease penetrance associated with a given genotype is the risk of disease in individuals carrying that genotype. Standard models for disease penetrance that imply a specific relationship between genotype and phenotype include multiplicative, additive, common recessive and common dominant models. Assuming a genetic penetrance parameter γ (γ > 1), a multiplicative model indicates that the risk of disease is increased γ-fold with each additional A allele; an additive model indicates that risk of disease is increased γ-fold for genotype a/A and by 2γ-fold for genotype A/A ; a common recessive model indicates that two copies of allele A are required for a γ-fold increase in disease risk, and a common dominant model indicates that either one or two copies of allele A are required for a γ-fold increase in disease risk. A commonly used and intuitive measure of the strength of an association is the relative risk (RR), which compares the disease penetrances between individuals exposed to different genotypes. Special relationships exist between the RRs for these common models 9 (see Table 1 ).

Disease penetrance functions and associated relative risks.

Disease model | Penetrance (a/a, A/a, A/A) | Relative risk (A/a, A/A)
Multiplicative | f0, γf0, γ²f0 | γ, γ²
Additive | f0, γf0, 2γf0 | γ, 2γ
Common recessive | f0, f0, γf0 | 1, γ
Common dominant | f0, γf0, γf0 | γ, γ

Shown are disease penetrance functions for genotypes a/a, A/a and A/A and the associated relative risks for genotypes A/a and A/A compared with the baseline genotype a/a under the standard disease models, where the baseline disease penetrance associated with genotype a/a is f0 and the genetic penetrance parameter is γ > 1 (ref. 9).

RR estimates based on penetrances can only be derived directly from prospective cohort studies, in which a group of exposed and unexposed individuals from the same population are followed up to assess who develops disease. In a case-control study, in which the ratio of cases to controls is controlled by the investigator, it is not possible to make direct estimates of disease penetrance, and hence of RRs. In this type of study, the strength of an association is measured by the odds ratio (OR). In a case-control study, the OR of interest is the odds of disease (the probability that the disease is present compared with the probability that it is absent) in exposed versus non-exposed individuals. Because of selected sampling, odds of disease are not directly measurable. However, conveniently, the disease OR is mathematically equivalent to the exposure OR (the odds of exposure in cases versus controls), which we can calculate directly from exposure frequencies 10 . The allelic OR describes the association between disease and allele by comparing the odds of disease in an individual carrying allele A to the odds of disease in an individual carrying allele a . The genotypic ORs describe the association between disease and genotype by comparing the odds of disease in an individual carrying one genotype to the odds of disease in an individual carrying another genotype. Hence, there are usually two genotypic ORs, one comparing the odds of disease between individuals carrying genotype A/A and those carrying a/a and the other comparing the odds of disease between individuals carrying genotype a/A and those carrying genotype a/a. Beneficially, when disease penetrance is small, there is little difference between RRs and ORs (i.e., RR ≈ OR). Moreover, the OR is amenable to analysis by multivariate statistical techniques that allow extension to incorporate further SNPs, risk factors and clinical variables. Such techniques include logistic regression and other types of log-linear models 11 .
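As an illustration of the logistic-regression formulation mentioned above, the sketch below fits a single-SNP model with an additive genotype coding (0, 1 or 2 copies of allele A) to simulated data. The minor allele frequency, effect size and sample size are arbitrary assumptions made for the example, and further SNPs, risk factors or clinical covariates could be added as extra columns of X.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, maf = 2000, 0.3

# Simulated genotypes coded as the number of A alleles (0, 1, 2), assuming HWE
genotype = rng.binomial(2, maf, size=n)
# Simulated disease status with an assumed true per-allele log-odds ratio of 0.4
logit = -1.5 + 0.4 * genotype
disease = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(genotype)          # intercept column + additive genotype code
fit = sm.Logit(disease, X).fit(disp=False)

beta, se = fit.params[1], fit.bse[1]
or_per_allele = np.exp(beta)
ci = np.exp(beta + np.array([-1.96, 1.96]) * se)
print(f"per-allele OR = {or_per_allele:.2f} "
      f"(95% CI {ci[0]:.2f}-{ci[1]:.2f}), p = {fit.pvalues[1]:.3g}")
```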

To work with observations made at the allelic (gamete) rather than the genotypic (individual) level, it is necessary to assume (i) that there is Hardy-Weinberg equilibrium (HWE) in the population, (ii) that the disease has a low prevalence ( < 10%) and (iii) that the disease risks are multiplicative. Under the null hypothesis of no association with disease, the first condition ensures that there is HWE in both controls and cases. Under the alternative hypothesis, the second condition further ensures that controls will be in HWE and the third condition further ensures that cases will also be in HWE. Under these assumptions, allelic frequencies in affected and unaffected individuals can be estimated from case-control studies. The OR comparing the odds of allele A between cases and controls is called the allelic RR (γ*). It can be shown that the genetic penetrance parameter in a multiplicative model of penetrance is closely approximated by the allelic RR, i.e., γ ≈ γ* ( ref. 10 ).

Tests for association

Tests of genetic association are usually performed separately for each individual SNP. The data for each SNP with minor allele a and major allele A can be represented as a contingency table of counts of disease status by either genotype count (e.g., a/a , A/a and A/A ) or allele count (e.g., a and A ) (see Box 2 ). Under the null hypothesis of no association with the disease, we expect the relative allele or genotype frequencies to be the same in case and control groups. A test of association is thus given by a simple χ 2 test for independence of the rows and columns of the contingency table.

CONTINGENCY TABLES AND ASSOCIATED TESTS

The risk factor for case versus control status (disease outcome) is the genotype or allele at a specific marker. The data for each SNP with minor allele a and major allele A in case and control groups comprising n individuals can be written as a 2 × k contingency table of disease status by either allele ( k = 2) or genotype ( k = 3) count.

Allele count

The 2 × 2 table of allele counts (each of the n individuals contributes two alleles, so the counts sum to 2n):

Allele        a        A        Total
Cases         m_11     m_12     m_1•
Controls      m_21     m_22     m_2•
Total         m_•1     m_•2     2n

  • The allelic odds ratio is estimated by OR_A = (m_12 m_21) / (m_11 m_22).
  • If the disease prevalence in a control individual carrying an a allele can be estimated and is denoted P_0, then the relative risk of disease in individuals with an A allele compared with an a allele is estimated by RR_A = OR_A / (1 − P_0 + P_0 OR_A).
  • An allelic association test is based on a simple χ² test for independence of rows and columns, X² = Σ_{i=1}^{2} Σ_{j=1}^{2} (m_ij − E[m_ij])² / E[m_ij], where E[m_ij] = m_i• m_•j / (2n). X² has a χ² distribution with 1 d.f. under the null hypothesis of no association.
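As an informal illustration of the allele-count calculations above, the following R sketch computes the allelic OR, the approximate RR under an assumed baseline prevalence P_0, and the 1 d.f. χ² test from a small 2 × 2 table; the counts and the value of P_0 are invented purely for illustration and are not part of the protocol data.

    # Hypothetical 2 x 2 table of allele counts (rows: cases, controls; columns: a, A)
    m <- matrix(c(60, 140,    # cases:    m_11 (allele a), m_12 (allele A)
                  90, 110),   # controls: m_21 (allele a), m_22 (allele A)
                nrow = 2, byrow = TRUE,
                dimnames = list(c("cases", "controls"), c("a", "A")))

    # Allelic odds ratio, OR_A = (m_12 * m_21) / (m_11 * m_22)
    or_A <- (m[1, 2] * m[2, 1]) / (m[1, 1] * m[2, 2])

    # Approximate relative risk, RR_A = OR_A / (1 - P0 + P0 * OR_A),
    # for an assumed prevalence P0 in control carriers of allele a
    p0   <- 0.05
    rr_A <- or_A / (1 - p0 + p0 * or_A)

    # 1 d.f. chi-squared test for independence of rows and columns
    test <- chisq.test(m, correct = FALSE)
    c(OR = or_A, RR = rr_A, X2 = unname(test$statistic), P = test$p.value)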

Genotype count

The 2 × 3 table of genotype counts for the n individuals:

Genotype      a/a      A/a      A/A      Total
Cases         n_11     n_12     n_13     n_1•
Controls      n_21     n_22     n_23     n_2•
Total         n_•1     n_•2     n_•3     n

  • The genotypic odds ratio for genotype A/A relative to genotype a/a is estimated by OR_AA = (n_13 n_21) / (n_11 n_23). The genotypic odds ratio for genotype A/a relative to genotype a/a is estimated by OR_Aa = (n_12 n_21) / (n_11 n_22).
  • If the disease prevalence in a control individual carrying an a/a genotype can be estimated and is denoted P_0, then the relative risk of disease in individuals with an A/A genotype compared with an a/a genotype is estimated by RR_AA = OR_AA / (1 − P_0 + P_0 OR_AA); similarly, RR_Aa = OR_Aa / (1 − P_0 + P_0 OR_Aa).
  • A genotypic association test is based on a simple χ² test for independence of rows and columns, X² = Σ_{i=1}^{2} Σ_{j=1}^{3} (n_ij − E[n_ij])² / E[n_ij], where E[n_ij] = n_i• n_•j / n. X² has a χ² distribution with 2 d.f. under the null hypothesis of no association. To test for a dominant (recessive) effect of allele A, counts for genotypes a/A and A/A ( a/a and A/a ) can be combined and the usual 1 d.f. χ² test for independence of rows and columns can be applied to the summarized 2 × 2 table.
  • A Cochran-Armitage trend test of association between disease and marker is given by T² = [Σ_{i=1}^{3} w_i (n_1i n_2• − n_2i n_1•)]² / { (n_1• n_2• / n) [Σ_{i=1}^{3} w_i² n_•i (n − n_•i) − 2 Σ_{i=1}^{2} Σ_{j=i+1}^{3} w_i w_j n_•i n_•j] }, where w = (w_1, w_2, w_3) are weights chosen to detect particular types of association. For example, to test whether allele A is dominant over allele a , w = (0,1,1) is optimal; to test whether allele A is recessive to allele a , the optimal choice is w = (0,0,1). In genetic association studies, w = (0,1,2) is most often used to test for an additive effect of allele A . T² has a χ² distribution with 1 d.f. under the null hypothesis of no association.
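As a minimal sketch of the trend test just described (using invented genotype counts rather than the protocol data), stats::prop.trend.test in R computes the same 1 d.f. Cochran-Armitage statistic; the additive weights w = (0, 1, 2) are passed via the score argument, and dominant or recessive weights can be substituted in the same way.

    # Hypothetical genotype counts, ordered a/a, A/a, A/A
    cases    <- c(30, 60, 40)
    controls <- c(60, 55, 25)

    # Cochran-Armitage trend test with additive weights w = (0, 1, 2)
    prop.trend.test(x = cases, n = cases + controls, score = c(0, 1, 2))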

In a conventional χ² test for association based on a 2 × 3 contingency table of case-control genotype counts, there is no sense of genotype ordering or trend: each of the genotypes is assumed to have an independent association with disease, and the resulting genotypic association test has 2 degrees of freedom (d.f.). Contingency table analysis methods allow alternative models of penetrance by summarizing the counts in different ways. For example, to test for a dominant model of penetrance, in which any number of copies of allele A increases the risk of disease, the contingency table can be summarized as a 2 × 2 table of genotype counts of a/a versus a combined count of both a/A and A/A genotypes. To test for a recessive model of penetrance, in which two copies of allele A are required for any increased risk, the contingency table is summarized into genotype counts of A/A versus a combined count of both a/a and a/A genotypes. To test for a multiplicative model of penetrance using contingency table methods, it is necessary to analyze by gamete rather than by individual: a χ² test applied to the 2 × 2 table of case-control allele counts is the widely used allelic association test.

The allelic association test with 1 d.f. will be more powerful than the genotypic test with 2 d.f. as long as the penetrance of the heterozygote genotype lies between the penetrances of the two homozygote genotypes. Conversely, if there is extreme deviation from the multiplicative model, the genotypic test will be more powerful. In the absence of HWE in controls, the allelic association test is not suitable and alternative methods must be used to test for multiplicative models. See the earlier protocol on data quality assessment and control for a discussion of criteria for retaining SNPs showing deviation from HWE 3 .

Alternatively, any penetrance model specifying some kind of trend in risk with increasing numbers of A alleles, of which additive, dominant and recessive models are all examples, can be examined using the Cochran-Armitage trend test 12 , 13 . The Cochran-Armitage trend test is a method of directing χ² tests toward these narrower alternatives. Power is very often improved, as long as the disease risks associated with the a/A genotype are intermediate to those associated with the a/a and A/A genotypes. In genetic association studies in which the underlying genetic model is unknown, the additive version of this test is most commonly used. Table 2 summarizes the various tests of association that use contingency table methods. Box 2 outlines contingency tables and associated tests in statistical detail.

Tests of association using contingency table methods.

Test                          d.f.   Contingency table description                                                            PLINK keyword
Genotypic association         2      2 × 3 table of case-control by genotype (a/a, A/a, A/A) counts                           GENO
Dominant model                1      2 × 2 table of case-control by dominant genotype pattern of inheritance (A/A or A/a versus a/a) counts    DOM
Recessive model               1      2 × 2 table of case-control by recessive genotype pattern of inheritance (A/A versus A/a or a/a) counts   REC
Cochran-Armitage trend test   1      2 × 3 table of case-control by genotype (a/a, A/a, A/A) counts                           TREND
Allelic association           1      2 × 2 table of case-control by allele (a, A) counts                                      ALLELIC

d.f. for tests of association based on contingency tables, along with the associated PLINK keyword, are shown for allele and genotype counts in case and control groups comprising n individuals at a bi-allelic locus with alleles a and A .

Tests of association can also be conducted with likelihood ratio (LR) methods in which inference is based on the likelihood of the genotyped data given disease status. The likelihood of the observed data under the proposed model of disease association is compared with the likelihood of the observed data under the null model of no association; a high LR value tends to discredit the null hypothesis. All disease models can be tested using LR methods. In large samples, the χ 2 and LR methods can be shown to be equivalent under the null hypothesis 14 .

More complicated logistic regression models of association are used when there is a need to include additional covariates to handle complex traits. Examples of this are situations in which we expect disease risk to be modified by environmental effects such as epidemiological risk factors (e.g., smoking and gender), clinical variables (e.g., disease severity and age at onset) and population stratification (e.g., principal components capturing variation due to differential ancestry 3 ), or by the interactive and joint effects of other marker loci. In logistic regression models, the logarithm of the odds of disease is the response variable, with linear (additive) combinations of the explanatory variables (genotype variables and any covariates) entering into the model as its predictors. For suitable linear predictors, the regression coefficients fitted in the logistic regression represent the log of the ORs for disease gene association described above. Linear predictors for genotype variables in a selection of standard disease models are shown in Table 3 .

Linear predictors for genotype variables in a selection of standard disease models.

Genotype      Multiplicative    Genotypic     Recessive    Dominant
a/a           μ                 μ             μ            μ
a/A           μ + β             μ + β1        μ            μ + β
A/A           μ + 2β            μ + β2        μ + β        μ + β

Interpretation: μ denotes the baseline log odds of disease for the a/a genotype. Under the multiplicative model, β provides an estimate of the log odds ratio for disease risk associated with each additional A allele (also called the haplotype relative risk); if β is significant, there is a multiplicative contribution to disease risk, in that the odds of disease increase multiplicatively with every additional A allele. Under the genotypic model, β1 and β2 provide estimates of the log odds ratios for disease risk in individuals with genotypes a/A and A/A, respectively, relative to an individual with genotype a/a; a likelihood ratio test of whether both β1 and β2 are significant is equivalent to the conventional 2 d.f. test for association in a 2 × 3 contingency table. Under the dominant model, β provides an estimate of the log odds ratio for disease risk in an individual with at least one A allele (genotype a/A or A/A) compared with an individual with no A alleles (genotype a/a); a test of whether β is significant corresponds to a 1 d.f. test for association in a 2 × 2 contingency table of disease outcome by genotype classified as carrying at least one A allele or not. Under the recessive model, β provides an estimate of the log odds ratio for disease risk in an individual with two A alleles (genotype A/A) compared with an individual with at most one A allele (genotype a/A or a/a); a test of whether β is significant corresponds to a 1 d.f. test for association in a 2 × 2 contingency table of disease outcome by genotype classified as A/A or not.
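Outside PLINK, the same logistic regression models can be sketched directly in R. The example below is a minimal illustration on simulated data: the variable names (status, geno, sex, age) are hypothetical, geno is coded as the number of A alleles so that its coefficient is the per-allele log OR of the multiplicative model in Table 3, and a likelihood ratio test of the genotype term is obtained by comparing nested models.

    # Simulated data frame for illustration only:
    # 'status' is a 0/1 disease indicator, 'geno' counts A alleles (0, 1 or 2),
    # 'sex' and 'age' stand in for epidemiological covariates.
    set.seed(1)
    d <- data.frame(status = rbinom(500, 1, 0.5),
                    geno   = sample(0:2, 500, replace = TRUE),
                    sex    = sample(0:1, 500, replace = TRUE),
                    age    = rnorm(500, mean = 50, sd = 10))

    # Additive coding of genotype: the coefficient of 'geno' is the log OR per A allele
    fit1 <- glm(status ~ geno + sex + age, data = d, family = binomial)
    exp(coef(fit1)["geno"])          # estimated OR per additional A allele

    # Likelihood ratio test of the genotype term against the null model
    fit0 <- glm(status ~ sex + age, data = d, family = binomial)
    anova(fit0, fit1, test = "Chisq")

    # A genotypic (2 d.f.) model would instead use factor(geno);
    # dominant or recessive models use indicators such as geno >= 1 or geno == 2.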

Multiple testing

Controlling for multiple testing to accurately estimate significance thresholds is a very important aspect of studies involving many genetic markers, particularly GWA studies. The type I error, also called the significance level or false-positive rate, is the probability of rejecting the null hypothesis when it is true; it indicates the proportion of false positives that an investigator is willing to tolerate in the study. The family-wise error rate (FWER) is the probability of making one or more type I errors in a set of tests. Lower FWERs restrict the proportion of false positives at the expense of reducing the power to detect association when it truly exists. A suitable FWER should be specified at the design stage of the analysis 1 . It is then important to keep track of the number of statistical comparisons performed and to correct the individual SNP-based significance thresholds for multiple testing so as to maintain the overall FWER. For association tests applied at each of n SNPs, per-test significance levels α* for a given FWER of α can be simply approximated using Bonferroni (α* = α/n) or Sidak 15 , 16 (α* = 1 − (1 − α)^(1/n)) adjustments. When tests are independent, the Sidak correction is exact; however, in GWA studies comprising dense sets of markers this is unlikely to be true, and both corrections are then very conservative. A similar but slightly less-stringent alternative to the Bonferroni correction is given by Holm 17 .

Alternatives to the FWER approach include false discovery rate (FDR) procedures 18 , 19 , which control the expected proportion of false positives among those SNPs declared significant. However, dependence between markers and the small number of expected true positives make FDR procedures problematic for GWA studies. Alternatively, permutation approaches aim to render the null hypothesis correct by randomization: essentially, the original P value is compared with the empirical distribution of P values obtained by repeating the original tests while randomly permuting the case-control labels 20 . Although Bonferroni and Sidak corrections provide a simple way to adjust for multiple testing by assuming independence between markers, permutation testing is considered to be the 'gold standard' for accurate correction 20 . Permutation procedures are computationally intensive in the setting of GWA studies and, moreover, apply only to the current genotyped data set; therefore, unless the entire genome is sequenced, they cannot generate truly genome-wide significance thresholds. Bayes factors have also been proposed for the measurement of significance 6 .

For GWA studies of dense SNPs and resequence data, a standard genome-wide significance threshold of 7.2 × 10^−8 for the UK Caucasian population has been proposed by Dudbridge and Gusnanto 21 . Other thresholds for contemporary populations, based on sample size and proposed FWER, have been proposed by Hoggart et al. 22 . Informally, some journals have accepted a genome-wide significance threshold of 5 × 10^−7 as strong evidence for association 6 ; more recently, the accepted standard is 5 × 10^−8 ( ref. 23 ). Finally, graphical techniques for assessing whether observed P values are consistent with expected values include log quantile-quantile P value plots, which highlight loci that deviate from the null hypothesis 24 .
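As a brief sketch of the simple corrections described above, the following R fragment applies Bonferroni, Holm and Benjamini-Hochberg FDR adjustments to a vector of P values with p.adjust, and computes the Bonferroni and Sidak per-test thresholds directly from their definitions; the P values here are randomly generated and merely stand in for the per-SNP results of an association scan.

    # Hypothetical per-SNP P values (uniform under the null, for illustration)
    set.seed(2)
    p <- runif(1000)

    alpha <- 0.05                           # target family-wise error rate (or FDR)
    n     <- length(p)

    # Per-test significance thresholds for a FWER of alpha
    alpha_bonf  <- alpha / n                # Bonferroni
    alpha_sidak <- 1 - (1 - alpha)^(1 / n)  # Sidak

    # Adjusted P values: SNPs with adjusted values below alpha are declared significant
    p_bonf <- p.adjust(p, method = "bonferroni")
    p_holm <- p.adjust(p, method = "holm")
    p_bh   <- p.adjust(p, method = "BH")    # Benjamini-Hochberg FDR

    c(bonferroni = sum(p_bonf < alpha),
      holm       = sum(p_holm < alpha),
      fdr_bh     = sum(p_bh   < alpha))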

Interpretation of results

A significant result in an association test rarely implies that a SNP is directly influencing disease risk; population association can be direct, indirect or spurious. A direct, or causal, association occurs when different alleles at the marker locus are directly involved in the etiology of the disease through a biological pathway. Such associations are typically only found during follow-up genotyping phases of initial GWA studies, or in focused CG studies in which particular functional polymorphisms are targeted. An indirect, or non-causal, association occurs when the alleles at the marker locus are correlated (in LD) with alleles at a nearby causal locus but do not directly influence disease risk. When a significant finding in a genetic association study is true, it is most likely to be indirect. Spurious associations can occur as a consequence of data quality issues or statistical sampling, or because of confounding by population stratification or admixture.

Population stratification occurs when cases and controls are sampled disproportionately from populations with distinct genetic ancestry. Admixture occurs when there has been genetic mixing of two or more groups in the recent past; for example, genetic admixture is seen in Native American populations in which there has been recent genetic mixing of individuals with both American Indian and Caucasian ancestry 25 . Confounding occurs when a factor exists that is associated with both the exposure (genotype) and the disease but is not a consequence of the exposure. As allele frequencies and disease frequencies are known to vary among populations of different genetic ancestry, population stratification or admixture can confound the association between the disease trait and the genetic marker; it can bias the observed association, or indeed can cause a spurious association.

Principal component analyses or multidimensional scaling methods are commonly used to identify and remove individuals exhibiting divergent ancestry before association testing. These techniques are described in detail in an earlier protocol 3 . To adjust for any residual population structure during association testing, the principal components from principal component analyses or multidimensional scaling methods can be included as covariates in a logistic regression. In addition, the technique of genomic control 26 can be used to detect and compensate for the presence of fine-scale or within-population stratification during association testing. Under genomic control, population stratification is treated as a random effect that causes the distribution of the χ² association test statistics to have an inflated variance and a higher median than would otherwise be observed. The test statistics are assumed to be uniformly affected by an inflation factor λ, the magnitude of which is estimated from a set of selected markers by comparing the median of their observed test statistics with the median of their expected test statistics under an assumption of no population stratification. If λ > 1, then population stratification is assumed to exist and a correction is applied by dividing the actual association test χ² statistic values by λ. As λ scales with sample size, λ1000, the inflation factor for an equivalent study of 1,000 cases and 1,000 controls calculated by rescaling λ, is often reported 27 . In a CG study, λ can only be determined if an additional set of markers specifically designed to indicate population stratification is genotyped. In a GWA study, an unbiased estimate of λ can be obtained using all of the genotyped markers; the effect on the inflation factor of potential causal SNPs in such a large set of genomic control markers is assumed to be negligible.
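A minimal sketch of the genomic-control calculation, assuming only a vector of 1 d.f. χ² association statistics: λ is estimated as the ratio of the observed median statistic to the expected median under the null (qchisq(0.5, 1) ≈ 0.455), the statistics are deflated by λ when λ > 1, and λ is rescaled to λ1000 using the commonly used rescaling to 1,000 cases and 1,000 controls; the statistics and sample sizes below are invented for illustration.

    # Hypothetical 1 d.f. chi-squared association statistics, one per SNP
    set.seed(3)
    chisq <- rchisq(100000, df = 1)

    # Genomic-control inflation factor: observed median over expected median
    lambda <- median(chisq) / qchisq(0.5, df = 1)

    # Apply the correction when lambda > 1
    chisq_gc <- if (lambda > 1) chisq / lambda else chisq

    # Rescale to an equivalent study of 1,000 cases and 1,000 controls
    n_cases    <- 2000                      # made-up sample sizes
    n_controls <- 3000
    lambda_1000 <- 1 + (lambda - 1) * (1 / n_cases + 1 / n_controls) / (2 / 1000)

    c(lambda = lambda, lambda_1000 = lambda_1000)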

Replication

Replication occurs when a positive association from an initial study is confirmed in a subsequent study involving an independent sample drawn from the same population as the initial study. It is the process by which genetic association results are validated. In theory, a repeated significant association between the same trait and allele in an independent sample is the benchmark for replication. However, in practice, so-called replication studies often comprise findings of association between the same trait and nearby variants in the same gene as the original SNP, or between the same SNP and different high-risk traits. A precise definition of what constitutes replication for any given study is therefore important and should be clearly stated 28 .

In practice, replication studies often involve different investigators with different samples and study designs aiming to independently verify reports of positive association and obtain accurate effect-size estimates, regardless of the designs used to detect effects in the primary study. Two commonly used strategies in such cases are an exact strategy, in which only marker loci indicating a positive association are subsequently genotyped in the replicate sample, and a local strategy, in which additional variants are also included, thus combining replication with fine-mapping objectives. In general, the exact strategy is more balanced in power and efficiency; however, depending on local patterns of LD and the strength of primary association signals, a local strategy can be beneficial 28 .

In the past, multistage designs have been proposed as cost-efficient approaches to allow the possibility of replication within a single overall study. The first stage of a standard two-stage design involves genotyping a large number of markers on a proportion of available samples to identify potential signals of association using a nominal P value threshold. In stage two, the top signals are then followed up by genotyping them on the remaining samples while a joint analysis of data from both stages is conducted 29 , 30 . Significant signals are subsequently tested for replication in a second data set. With the ever-decreasing costs of GWA genotyping, two-stage studies have become less common.

Standard statistical software (such as R ( ref. 31 ) or SPSS) can be used to conduct and visualize all the analyses outlined above. However, many researchers choose to use custom-built GWA software. In this protocol we use PLINK 32 , Haploview 33 and the R package car 34 . PLINK is a popular and computationally efficient software program that offers a comprehensive and well-documented set of automated GWA quality control and analysis tools. It is freely available, open-source software written in C++, which can be installed on Windows, Mac and Unix machines ( http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml ). Haploview ( http://www.broadinstitute.org/haploview/haploview ) is a convenient tool for visualizing LD; it interfaces directly with PLINK to produce a standard visualization of PLINK association results. Haploview is most easily run through a graphical user interface, which offers many advantages in terms of display functions and ease of use. car ( http://socserv.socsci.mcmaster.ca/jfox/ ) is an R package that contains a variety of functions for graphical diagnostic methods.

The next section describes protocols for the analysis of SNP data and is illustrated by the use of simulated data sets from CG and GWA studies (available as gzipped files from http://www.well.ox.ac.uk/ggeu/NPanalysis/ or .zip files as Supplementary Data 1 and Supplementary Data 2 ). We assume that SNP data for a CG study, typically comprising on the order of thousands of markers, will be available in a standard PED and MAP file format (for an explanation of these file formats, see http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped ) and that SNP data for a GWA study, typically comprising on the order of hundreds of thousands of markers, will be available in a standard binary file format (for an explanation of the binary file format, see http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed ). In general, SNP data for either type of study may be available in either format. The statistical analysis described here is for the analysis of one SNP at a time; therefore, apart from the requirement to take potentially differing input file formats into account, it does not differ between CG and GWA studies.

Computer workstation with Unix/Linux operating system and web browser

  • PLINK 32 software for association analysis ( http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml ).
  • Unzipping tool such as WinZip ( http://www.winzip.com ) or gunzip ( http://www.gzip.org )
  • Statistical software for data analysis and graphing such as R ( http://cran.r-project.org/ ) and Haploview 33 ( http://www.broadinstitute.org/haploview/haploview ).
  • SNPSpD 35 (Program to calculate the effective number of independent SNPs among a collection of SNPs in LD with each other; http://genepi.qimr.edu.au/general/daleN/SNPSpD/ )
  • Files: genome-wide and candidate-gene SNP data (available as gzipped files from http://www.well.ox.ac.uk/ggeu/NPanalysis/ or .zip files as Supplementary Data 1 and Supplementary Data 2 )

Identify file formats ● TIMING ~5 min

1 | For SNP data available in standard PED and MAP file formats, as in our CG study, follow option A. For SNP data available in standard binary file format, as in our GWA study, follow option B. The instructions provided here are for unpacking the sample data provided as gzipped files at http://www.well.ox.ac.uk/ggeu/NPanalysis/ . If using the .zip files provided as Supplementary Data 1 or Supplementary Data 2 , please proceed directly to Step 2.

▲ CRITICAL STEP The format in which genotype data are returned to investigators varies according to genome-wide SNP platforms and genotyping centers. We assume that genotypes have been called by the genotyping center, undergone appropriate quality control filters as described in a previous protocol 3 and returned as clean data in a standard file format.

  • (A) Download the file ‘cg-data.tgz’.

▲ CRITICAL STEP The simulated data used here have passed standard quality control filters: all individuals have a missing data rate of < 20%, and SNPs with a missing rate of > 5%, a MAF < 1% or an HWE P value < 1 × 10 − 4 have already been excluded. These filters were selected in accordance with procedures described elsewhere 3 to minimize the influence of genotype-calling artifacts in a CG study.

  • (B) Download the file ‘gwa-data.tgz’.

▲ CRITICAL STEP We assume that covariate files are available in a standard file format. For an explanation of the standard format for covariate files, see http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#covar .

▲ CRITICAL STEP Optimized binary BED files contain the genotype information and the corresponding BIM/FAM files contain the map and pedigree information. The binary BED file is a compressed file that allows faster processing in PLINK and takes less storage space, thus facilitating the analysis of large-scale data sets 32 .

▲ CRITICAL STEP The simulated data used here have passed standard quality control: all individuals have a missing data rate of < 10%. SNPs with a missing rate > 10%, a MAF < 1% or an HWE P value < 1 × 10 − 5 have already been excluded. These filters were selected in accordance with procedures described elsewhere 3 to minimize the influence of genotype-calling artifacts in a GWA study.

? TROUBLESHOOTING

Basic descriptive summary ● TIMING ~5 min

2 | To obtain a summary of MAFs in case and control populations and an estimate of the OR for association between the minor allele (based on the whole sample) and disease in the CG study, type ‘plink --file cg --assoc --out data’. In any of the PLINK commands in this protocol, replace the ‘--file cg’ option with the ‘--bfile gwa’ option to use the binary file format of the GWA data rather than the PED and MAP file format of the CG data.

▲ CRITICAL STEP PLINK always creates a log file called ‘data.log’, which includes details of the implemented commands, the number of cases and controls in the input files, any excluded data and the genotyping rate in the remaining data. This file is very useful for checking that the software has completed the commands successfully.

▲ CRITICAL STEP The options in a PLINK command can be specified in any order.

3 | Open the output file ‘data.assoc’. It has one row per SNP containing the chromosome [CHR], the SNP identifier [SNP], the base-pair location [BP], the minor allele [A1], the frequency of the minor allele in the cases [F_A] and controls [F_U], the major allele [A2] and statistical data for an allelic association test including the χ 2 -test statistic [CHISQ], the asymptotic P value [ P ] and the estimated OR for association between the minor allele and disease [OR].
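If it is convenient to inspect the strongest signals outside PLINK, the output file can also be read into R; the short sketch below (assuming the file ‘data.assoc’ produced in Step 2 is in the working directory) simply sorts the SNPs by P value and prints the top hits with their allele frequencies and ORs.

    # Read the PLINK allelic association output produced in Step 2
    assoc <- read.table("data.assoc", header = TRUE)

    # Ten SNPs with the smallest P values, with minor-allele frequencies and ORs
    top <- assoc[order(assoc$P), c("CHR", "SNP", "A1", "F_A", "F_U", "OR", "P")]
    head(top, 10)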

Single SNP tests of association ● TIMING ~5 min

4 | When there are no covariates to consider, carry out simple χ 2 tests of association by following option A. For inclusion of multiple covariates and covariate interactions, follow option B.

▲ CRITICAL STEP Genotypic, dominant and recessive tests will not be conducted if any one of the cells in the table of case-control by genotype counts contains fewer than five observations. This is because the χ² approximation may not be reliable when cell counts are small. For SNPs with MAFs < 5%, a sample of more than 2,000 cases and controls would be required to meet this threshold, and more than 50,000 would be required for SNPs with MAF < 1%. To change the threshold, use the ‘--cell’ option. For example, we could lower the threshold to 3 and repeat the χ² tests of association by typing ‘plink --file cg --model --cell 3 --out data’.

  • Open the output file ‘data.model’. It contains five rows per SNP, one for each of the association tests described in Table 2 . Each row contains the chromosome [CHR], the SNP identifier [SNP], the minor allele [A1], the major allele [A2], the test performed [TEST: GENO (genotypic association); TREND (Cochran-Armitage trend); ALLELIC (allelic association); DOM (dominant model); and REC (recessive model)], the cell frequency counts for cases [AFF] and controls [UNAFF], the χ 2 test statistic [CHISQ], the degrees of freedom for the test [DF] and the asymptotic P value [ P ].

▲ CRITICAL STEP To specify a genotypic, dominant or recessive model in place of a multiplicative model, include the model option ‘--genotypic’, ‘--dominant’ or ‘--recessive’, respectively. To include sex as a covariate, include the option ‘--sex’. To specify interactions between covariates, and between SNPs and covariates, include the option ‘--interaction’.

  • Open the output file ‘data.assoc.logistic’. If no model option is specified, the first row for each SNP corresponds to results for a multiplicative test of association. If the ‘--genotypic’ option has been selected, the first row will correspond to a test for additivity and the subsequent row to a separate test for deviation from additivity. If the ‘--dominant’ or ‘--recessive’ model options have been selected, then the first row will correspond to tests for a dominant or recessive model of association, respectively. If covariates have been included, each of these P values is adjusted for the effect of the covariates. The C ≥ 0 subsequent rows for each SNP correspond to separate tests of significance for each of the C covariates included in the regression model. Finally, if the ‘--genotypic’ model option has been selected, there is a final row per SNP corresponding to a 2 d.f. LR test of whether both the additive and the deviation-from-additivity components of the regression model are significant. Each row contains the chromosome [CHR], the SNP identifier [SNP], the base-pair location [BP], the minor allele [A1], the test performed [TEST: ADD (multiplicative model or genotypic model testing additivity), GENO_2DF (genotypic model), DOMDEV (genotypic model testing deviation from additivity), DOM (dominant model) or REC (recessive model)], the number of non-missing individuals included [NMISS], the OR, the coefficient z-statistic [STAT] and the asymptotic P value [ P ].

▲ CRITICAL STEP ORs for main effects cannot be interpreted directly when interactions are included in the model; their interpretation depends on the exact combination of variables included in the model. Refer to a standard text on logistic regression for more details 36 .

Data visualization ● TIMING ~5 min

5 | To create quantile-quantile plots to compare the observed association test statistics with their expected values under the null hypothesis of no association, and so assess the number, magnitude and quality of true associations, follow option A. Note that quantile-quantile plots are only suitable for GWA studies comprising hundreds of thousands of markers. To create a Manhattan plot to display the association test P values as a function of chromosomal location, and thus provide a visual summary of association test results that draws immediate attention to any regions of significance, follow option B. To visualize the LD between sets of markers in an LD plot, follow option C. Manhattan and LD plots are suitable for both GWA and CG studies comprising any number of markers. Otherwise, create customized graphics for the visualization of association test output using simple R 31 commands 37 (not detailed here).

  • Start R software.
  • Create a quantile-quantile plot ‘chisq.qq.plot.pdf’ with a 95% confidence interval based on output from the simple χ² tests of association described in Step 4A for trend, allelic, dominant or recessive models, wherein statistics have a χ² distribution with 1 d.f. under the null hypothesis of no association. Create the plot by typing ‘data <- read.table("[path_to]/data.model", header=TRUE); pdf("[path_to]/chisq.qq.plot.pdf"); library(car); obs <- data[data$TEST=="[model]",]$CHISQ; qqPlot(obs, distribution="chisq", df=1, xlab="Expected chi-squared values", ylab="Observed test statistic", grid=FALSE); dev.off()’, where [path_to] is the appropriate directory path and [model] identifies the association test output to be displayed: TREND (Cochran-Armitage trend), ALLELIC (allelic association), DOM (dominant model) or REC (recessive model). For simple χ² tests of association based on a genotypic model, in which test statistics have a χ² distribution with 2 d.f. under the null hypothesis of no association, use df=2 and [model] = GENO.
  • Create a quantile-quantile plot ‘pvalue.qq.plot.pdf’ based on −log10 P values from the tests of association using logistic regression described in Step 4B by typing ‘data <- read.table("[path_to]/data.assoc.logistic", header=TRUE); pdf("[path_to]/pvalue.qq.plot.pdf"); obs <- -log10(sort(data[data$TEST=="[model]",]$P)); exp <- -log10(c(1:length(obs))/(length(obs)+1)); plot(exp, obs, ylab="Observed (-logP)", xlab="Expected (-logP)", ylim=c(0,20), xlim=c(0,7)); lines(c(0,7), c(0,7), col=1, lwd=2); dev.off()’, where [path_to] is the appropriate directory path and [model] identifies the association test output to be displayed: ADD (multiplicative model), GENO_2DF (genotypic model), DOMDEV (genotypic model testing deviation from additivity), DOM (dominant model) or REC (recessive model).
  • Start Haploview. In the ‘Welcome to Haploview’ window, select the ‘PLINK Format’ tab. Click the ‘browse’ button and select the SNP association output file created in Step 4. We select our GWA study χ 2 tests of association output file ‘data.model’. Select the corresponding MAP file, which will be the ‘.map’ file for the pedigree file format or the ‘.bim’ file for the binary file format. We select our GWA study file ‘gwa.bim’. Leave other options as they are (ignore pairwise comparison of markers > 500 kb apart and exclude individuals with > 50% missing genotypes). Click ‘OK’.
  • Select the association results relevant to the test of interest by selecting ‘TEST’ in the dropdown tab to the right of ‘Filter:’, ‘ = ’ in the dropdown menu to the right of that and the PLINK keyword corresponding to the test of interest in the window to the right of that. We select PLINK keyword ‘ALLELIC’ to visualize results for allelic tests of association in our GWA study. Click the gray ‘Filter’ button. Click the gray ‘Plot’ button. Leave all options as they are so that ‘Chromosomes’ is selected as the ‘X-Axis’. Choose ‘P’ from the drop-down menu for the ‘Y-Axis’ and ‘−log10′ from the corresponding dropdown menu for ‘Scale:’. Click ‘OK’ to display the Manhattan plot.
  • To save the plot as a scalable vector graphics file, click the button ‘Export to scalable vector graphics:’ and then click the ‘Browse’ button (immediately to the right) to select the appropriate title and directory.
  • Using the standard MAP file, create the locus information file required by Haploview for the CG data by typing ‘cg.map <- read.table("[path_to]/cg.map"); write.table(cg.map[,c(2,4)], "[path_to]/cg.hmap", col.names=FALSE, row.names=FALSE, quote=FALSE)’, where [path_to] is the appropriate directory path.
  • Start Haploview. In the ‘Welcome to Haploview’ window, select the ‘LINKAGE Format’ tab. Click the ‘browse’ button to enter the ‘Data File’ and select the PED file ‘cg.ped’. Click the ‘browse’ button to enter the ‘Locus Information File’ and select the file ‘cg.hmap’. Leave other options as they are (ignore pairwise comparison of markers > 500 kb apart and exclude individuals with > 50% missing genotypes). Click ‘OK’. Select the ‘LD Plot’ tab.

Adjustment for multiple testing ● TIMING ~5 min

6 | For CG studies, typically comprising on the order of thousands of markers, control for multiple testing using Bonferroni’s adjustment (follow option A); Holm, Sidak or FDR methods (follow option B); or permutation (follow option C). Although Bonferroni, Holm, Sidak and FDR are simple to implement, permutation testing is widely recommended for accurately correcting for multiple testing and should be used when computationally feasible. For GWA studies, typically comprising hundreds of thousands of markers, select an appropriate genome-wide significance threshold (follow option D).

▲ CRITICAL STEP If some of the SNPs are in LD so that there are fewer than 40 independent tests, the Bonferroni correction will be too conservative. Use LD information from HapMap and SNPSpD ( http://genepi.qimr.edu.au/general/daleN/SNPSpD/ ) 35 to estimate the effective number of independent SNPs 1 . Derive the per-test significance rate α* by dividing α by the effective number of independent SNPs.

  • To obtain significance values adjusted for multiple testing for trend, dominant and recessive tests of association, include the --adjust option along with the model specification option --model-[x] (where [x] is ‘trend’, ‘rec’ or ‘dom’ to indicate whether trend, dominant or recessive test association P values, respectively, are to be adjusted) in any of the PLINK commands described in Step 4A. For example, adjusted significance values for a Cochran-Armitage trend test of association in the CG data are obtained by typing ‘plink --file cg --adjust --model-trend --out data’. Obtain significance values adjusted for an allelic test of association by typing ‘plink --file cg --assoc --adjust --out data’.
  • Open the output file ‘data.model.[x].adjusted’ for adjusted trend, dominant or recessive test association P values or ‘data.assoc.adjusted’ for adjusted allelic test of association P values. These files have one row per SNP containing the chromosome [CHR], the SNP identifier [SNP], the unadjusted P value [UNADJ] identical to that found in the original association output file, the genomic-control–adjusted P value [GC], the Bonferroni-adjusted P value [BONF], the Holm step-down–adjusted P value [HOLM], the Sidak single-step–adjusted P value [SIDAK_SS], the Sidak step-down–adjusted P value [SIDAK_SD], the Benjamini and Hochberg FDR control [FDR_BH] and the Benjamini and Yekutieli FDR control [FDR_BY]. To maintain a FWER or FDR of α = 0.05, only SNPs with adjusted P values less than α are declared significant.
  • To generate permuted P values, include the --mperm option along with the number of permutations to be performed and the model specification option --model-[x] (where [x] is ‘gen’, ‘trend’, ‘rec’ or ‘dom’ to indicate whether genotypic, trend, dominant or recessive test association P values are to be permuted) in any of the PLINK commands described in Step 4A. For example, permuted P values based on 1,000 replicates for a Cochran-Armitage trend test of association are obtained by typing ‘plink --file cg --model --mperm 1000 --model-trend --out data’, and permuted P values based on 1,000 replicates for an allelic test of association are obtained by typing ‘plink --file cg --assoc --mperm 1000 --out data’.
  • Open the output file ‘data.model.[x].mperm’ for permuted P values for genotypic, trend, dominant or recessive association tests or ‘data.assoc.mperm’ for permuted P values for allelic tests of association. These files have one row per SNP containing the chromosome [CHR], the SNP identifier [SNP], the point-wise estimate of the SNP’s significance [EMP1] and the family-wise estimate of the SNP’s significance [EMP2]. To maintain a FWER of α = 0.05, only SNPs with family-wise estimated significance of less than α are declared significant.
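As a small follow-up to option C, the permutation output can likewise be read into R to list the SNPs that remain significant at a FWER of 0.05; the file name below assumes the trend-test example given above (‘data.model.trend.mperm’) and follows the naming pattern described in this step.

    # Read the permutation output for the Cochran-Armitage trend test (Step 6C example)
    perm <- read.table("data.model.trend.mperm", header = TRUE)

    # SNPs whose family-wise empirical P value (EMP2) is below 0.05
    subset(perm, EMP2 < 0.05)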

Population stratification ● TIMING ~5 min

7 | For CG studies, typically comprising on the order of thousands of markers, calculate the inflation factor λ (follow option A). For GWA studies, obtain an unbiased estimate of the inflation factor λ using all genotyped SNPs (follow option B).

▲ CRITICAL STEP To assess the inflation factor in CG studies, an additional set of null marker loci, which are common SNPs not associated with the disease and not in LD with CG SNPs, must be available. We do not have any null loci data files available for our CG study.

  • To obtain the inflation factor, include the --adjust option in any of the PLINK commands described in Step 4B. For example, the inflation factor based on logistic regression tests of association for all SNPs and assuming multiplicative or genotypic models in the GWA study is obtained by typing ‘plink --bfile gwa --genotypic --logistic --covar gwa.covar --adjust --out data’.

  • Open the PLINK log file ‘data.log’, which records the inflation factor.

▲ CRITICAL STEP When the sample size is large, the inflation factor λ1000, for an equivalent study of 1,000 cases and 1,000 controls, can be calculated by rescaling λ according to the formula λ1000 = 1 + (λ − 1) × (1/n_cases + 1/n_controls) / (1/1,000 + 1/1,000), where n_cases and n_controls are the numbers of cases and controls in the study.

For general help on the programs and websites used in this protocol, refer to the relevant websites:

Step 1: If genotypes are not available in standard PED and MAP or binary file formats, both Goldsurfer2 (Gs2; see refs. 38 , 39 ) and PLINK have the functionality to read other file formats (e.g., HapMap, HapMart, Affymetrix, transposed file sets and long-format file sets) and convert these into PED and MAP or binary file formats.

Steps 2–6: The default missing genotype character is ‘0’. PLINK can recognize a different character as the missing genotype by using the ‘--missing-genotype’ option. For example, specify a missing genotype character of ‘N’ instead of ‘0’ in Step 2 by typing ‘plink --file cg --assoc --missing-genotype N --out data’.

● TIMING

None of the programs used take longer than a few minutes to run. Displaying and interpreting the relevant information are the rate-limiting steps.

ANTICIPATED RESULTS

Summary of results.

Table 4 shows the unadjusted P values for an allelic test of association in the CG region, together with the corresponding adjusted P values, for SNPs with significant P values. Here we have defined a P value to be significant if at least one of the adjusted values is smaller than the threshold required to maintain a FWER of 0.05. The top four SNPs are significant according to every method of adjustment for multiple testing. The last SNP is significant only according to the FDR method of Benjamini and Hochberg, so statements about its significance should be made with some caution.

SNPs in the CG study showing the strongest association signals.

Chr  SNP          Allelic test    Genomic    Bonferroni  Holm        Sidak        Sidak       FDR BH      FDR BY      Family-wise
                  of association  control                            single-step  step-down                           permutation
                  (unadjusted)    (adjusted) (adjusted)  (adjusted)  (adjusted)   (adjusted)  (adjusted)  (adjusted)  (adjusted)
3    rs1801282    3.92E-14        2.22E-05   1.45E-12    1.61E-12    1.61E-12     1.61E-12    1.61E-12    6.92E-12    9.90E-03
3    rs12636454   5.54E-07        4.99E-03   2.05E-05    2.22E-05    2.27E-05     2.22E-05    1.14E-05    4.89E-05    9.90E-03
3    rs4135247    1.27E-05        1.44E-02   4.71E-04    4.96E-04    5.21E-04     4.96E-04    1.64E-04    7.05E-04    9.90E-03
3    rs2120825    1.60E-05        1.56E-02   5.92E-04    6.08E-04    6.56E-04     6.08E-04    1.64E-04    7.05E-04    9.90E-03
3    rs3856806    3.62E-03        1.03E-01   1.34E-01    1.34E-01    1.38E-01     1.26E-01    2.97E-02    1.28E-01    9.90E-02

Shown are adjusted and unadjusted P values for those SNPs with significant P values in an allelic test of association according to at least one method of adjustment for multiple testing. Chr, chromosome; FDR, false discovery rate; BH, Benjamini and Hochberg; BY, Benjamini and Yekutieli.

Figure 1 shows an LD plot based on the CG data. Numbers within diamonds indicate the r² values. SNPs with significant P values ( P < 0.05; listed in Table 4 ) in the CG study are shown in white boxes. Six haplotype blocks of LD across the region have been identified and are marked in black. The LD plot shows that the five significant SNPs belong to three different haplotype blocks within the region studied: three of the five significantly associated SNPs are located in Block 2, which is a 52-kb block of high LD ( r² > 0.34). The two remaining significant SNPs are each located in a separate block, Block 3 and Block 5. These results indicate possible allelic heterogeneity (the presence of multiple independent risk-associated variants). Further fine mapping would be required to locate the precise causal variants.


Figure 1 | LD plot showing LD patterns among the 37 SNPs genotyped in the CG study. The LD between the SNPs is measured as r² and shown (× 100) in the diamond at the intersection of the diagonals from each SNP. r² = 0 is shown as white, 0 < r² < 1 is shown in gray and r² = 1 is shown in black. The analysis track at the top shows the SNPs according to chromosomal location. Six haplotype blocks (outlined in bold black lines), indicating markers that are in high LD, are shown. At the top, the markers with the strongest evidence for association (listed in Table 4 ) are boxed in white.

Quantile-quantile plot

Figure 2 shows the quantile-quantile plots for two different tests of association in the GWA data, one based on χ² statistics from a test of allelic association and the other based on −log10 P values from a logistic regression under a multiplicative model of association. These plots show only minor deviations from the null distribution, except in the upper tail, which corresponds to the SNPs with the strongest evidence for association. Because the majority of the results follow the null distribution and only a handful deviate from it, there is no indication of population structure that is unaccounted for in the analysis; the plots therefore give confidence in the quality of the data and the robustness of the analysis. Both plots are included here for illustration purposes only; typically only one (corresponding to the particular test of association used) is required.


Figure 2 | Quantile-quantile plots of the results from the GWA study of ( a ) a simple χ² allelic test of association and ( b ) a multiplicative test of association based on logistic regression, for all 306,102 SNPs that passed the standard quality control filters. The solid line indicates the middle of the first and third quartiles of the expected distribution of the test statistics. The dashed lines mark the 95% confidence interval of the expected distribution of the test statistics. Both plots show deviation from the null distribution only in the upper tails, which correspond to SNPs with the strongest evidence for association.

Manhattan plot

Figure 3 shows a Manhattan plot for the allelic test of association in the GWA study. SNPs with significant P values are easy to distinguish, corresponding to points with large −log10 P values. Three black ellipses mark regions on chromosomes 3, 8 and 16 that reach genome-wide significance ( P < 5 × 10^−8 ). Markers in these regions would then require further scrutiny through replication in an independent sample to confirm a true association.


Figure 3 | Manhattan plot of simple χ² allelic test of association P values from the GWA study. The plot shows −log10 P values for each SNP against chromosomal location. Values for each chromosome (Chr) are shown in different colors for visual effect. Three regions are highlighted in which markers have reached genome-wide significance ( P < 5 × 10^−8 ).

Supplementary Material

Acknowledgments.

G.M.C. is funded by the Wellcome Trust. F.H.P. is funded by the Wellcome Trust. C.A.A. is funded by the Wellcome Trust (WT91745/Z/10/Z). A.P.M. is supported by a Wellcome Trust Senior Research Fellowship. K.T.Z. is supported by a Wellcome Trust Research Career Development Fellowship.

Note: Supplementary information is available in the HTML version of this article.

COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.

Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/ .
