GENETIC EPIDEMIOLOGY GLOSSARY [M.Tevfik DORAK]

Accompanying Genetic Epidemiology Lecture Note & Presentation

ACCE project (analytic validity, clinical validity, clinical utility, associated-ELSI): A CDC sponsored project for evaluating data on emerging genetic tests. It takes its name from the four components of evaluation: analytic validity, clinical validity, clinical utility and associated ethical, legal and social implications (ELSI). For details, see CDC: ACCE Project website. See also Grosse & Khoury, 2006 for the clinical utility of genetic testing, Offit, 2008 for issues surrounding genomic disease profiling, and Pharoah, 2008 for the possible utility of genomic profiling in breast cancer risk assessment.

Additive genetic model: In a disease association study, if the risk conferred by an allele is increased r-fold for heterozygotes and 2r-fold for homozygotes, this corresponds to the additive model (Lewis, 2002; Minelli, 2005). Such data are best analyzed using the Armitage trend test for genotype frequencies or by logistic regression in which the genotypes are coded as (-1), 0, (+1). This genotype-based association test does not require the locus to be in Hardy-Weinberg equilibrium. If the association is with heterozygosity, the additive model test may be statistically non-significant despite the presence of an association; thus, a non-significant additive model test does not rule out an association. It has been pointed out that “genes do not generally act in a simple additive manner, but through complex networks involving gene-gene and gene-environment interactions” (Colhoun, 2003). In complex disease genetics, the effect that cannot be explained by an additive (or heterogeneity / non-interactive) model is attributed to the dominance (epistatic / interactive) model. See also multiplicative genetic model. See MODEL-online tool for genetic association analysis under different models.
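
The trend test mentioned above can be sketched in a few lines. This is only an illustration of the standard Cochran-Armitage statistic with a normal approximation; the function name and the genotype counts are hypothetical and are not taken from any tool cited in this glossary.

```python
import math

def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on a 2 x k table of genotype counts.

    cases/controls: counts per genotype in the same order (e.g. AA, AB, BB).
    weights: genotype scores; (0, 1, 2) and (-1, 0, +1) give the same z.
    Returns (z, two_sided_p).
    """
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls
    col_totals = [r + s for r, s in zip(cases, controls)]

    # Trend statistic: score-weighted difference between cases and controls.
    t = sum(w * (r * n_controls - s * n_cases)
            for w, r, s in zip(weights, cases, controls))

    # Variance of the statistic under the null hypothesis of no association.
    var = sum(w * w * n * (n_total - n) for w, n in zip(weights, col_totals))
    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):
            var -= 2 * weights[i] * weights[j] * col_totals[i] * col_totals[j]
    var *= n_cases * n_controls / n_total

    z = t / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided P from the normal approximation
    return z, p

# Hypothetical counts for genotypes AA, AB, BB (illustration only).
z, p = armitage_trend_test(cases=(10, 20, 30), controls=(30, 20, 10))
print(f"z = {z:.3f}, chi-squared (1 df) = {z * z:.2f}, P = {p:.2g}")
```

Because the statistic is unchanged by shifting or rescaling the scores, the (-1), 0, (+1) coding gives the same result as 0, 1, 2.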

Additive variance: The component of genetic variance (V_G) due to the additive effects of alleles segregating in the population. In evolutionary genetics, additive genetic variance (V_A) is a measure of the potential amount of evolutionary change caused by natural selection. The proportion of the additive variance to total phenotypic variance is the narrow heritability of a trait (the proportion of total genetic variance, including dominance variance, to total phenotypic variance is broad heritability). See Genetic Calculation Applets: Additive Variance Calculator and Genetic Variance from a Single Locus.
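
As a rough numerical illustration of these quantities, the sketch below uses the textbook single-locus parameterization (genotypic values +a, d and -a, allele frequencies p and q = 1 - p, random mating). The function name, the parameter values and the environmental variance v_e are assumptions made for the example; this is not the method used by the applets cited above.

```python
def single_locus_variance(p, a, d, v_e):
    """Variance components for one biallelic locus under random mating.

    p: frequency of the allele whose homozygote has genotypic value +a
    a: half the difference between the two homozygotes
    d: deviation of the heterozygote from the homozygote midpoint
    v_e: environmental variance (must make the total variance positive)
    """
    q = 1.0 - p
    alpha = a + d * (q - p)        # average effect of an allele substitution
    v_a = 2 * p * q * alpha ** 2   # additive variance, V_A
    v_d = (2 * p * q * d) ** 2     # dominance variance, V_D
    v_g = v_a + v_d                # total genetic variance, V_G
    v_p = v_g + v_e                # total phenotypic variance
    return {
        "V_A": v_a,
        "V_D": v_d,
        "V_G": v_g,
        "narrow_h2": v_a / v_p,    # V_A / V_P
        "broad_H2": v_g / v_p,     # V_G / V_P
    }

# Hypothetical values chosen only to make the arithmetic easy to follow.
result = single_locus_variance(p=0.5, a=1.0, d=0.5, v_e=1.0)
print(result)
```

With these inputs V_A is 0.5 and V_D is 0.0625, so the narrow heritability (0.32) is smaller than the broad heritability (0.36) by exactly the dominance contribution.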

Admixture mapping (mapping by admixture linkage disequilibrium, MALD): An association-based approach to localizing disease-causing variants that differ in frequency between two historically separated populations by a whole-genome scan. Fundamental to the use of admixture mapping is the knowledge that the disease of interest exhibits frequency differences across the two populations because of genetic differences. See Collins-Schramm, 2002; Smith, 2004; McKeigue, 2005; Seldin, 2007.

Adoption studies: A design for assessing the proportion of variance due to genetic and environmental sources. The assumptions are that the resemblance between an adopted child and biological parent is due only to genetic effects, while that between the adopted child and the adoptive parent is only environmental in origin. The important issues in the interpretation of adoption studies are that adoptees are a highly selected group of children, that age at adoption varies widely, and that contact may have been maintained between adoptees and their biological parents. The Colorado Adoption Project is a rare and successful example of a full adoption study.

Affected family-based controls (AFBAC) method: One of several family-based association study designs (Thomson, 1995). It uses the affected members of a family (both of them when there are two) as cases and the allele or alleles not transmitted to the affected case(s) as controls. See also HRR and TDT (AFBAC Software; FBAT software & manual).

Affected pedigree member (APM) method: Like the ASP method, a nonparametric and model-free method for testing linkage. This method compares observed allele sharing patterns among affected pedigree members (not necessarily siblings) against those expected under random assortment. An affected relative pair statistic is given by a weighted sum of the frequencies of observed identical-by-descent configurations.

Affected sibpair (ASP) method: A linkage study design that tests for excess sharing of marker alleles identical by descent in affected-affected sibpairs. This method is often described as a nonparametric and model-free alternative to the parametric lod score method.

Allele sharing method (affected sibling pairs method): A non-parametric test of linkage that does not require assumptions on model of inheritance, penetrance or disease allele frequency. For each parent, two siblings have a 50% chance of inheriting the same allele identical by descent (IBD). If the disease is genetically determined, two affected siblings are more likely to have inherited a common disease allele from one or both parents at the disease locus. So if affected siblings inherit IBD alleles at a given locus more often than expected by chance, it is likely that the shared alleles are responsible for the disease, or are in linkage with the trait allele (see Engelmark, 2004).
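
The idea of "sharing more often than expected by chance" can be illustrated with the simple mean test. The sketch assumes that IBD status (0, 1 or 2 alleles shared) is known without ambiguity for every affected sib pair, which real data rarely allow; the function name and counts are hypothetical.

```python
import math

def asp_mean_test(n_ibd0, n_ibd1, n_ibd2):
    """Mean IBD-sharing test for affected sib pairs.

    n_ibd0, n_ibd1, n_ibd2: numbers of pairs sharing 0, 1 and 2 alleles IBD.
    Returns (mean_sharing, z, one_sided_p) for the alternative of excess sharing.
    """
    n_pairs = n_ibd0 + n_ibd1 + n_ibd2
    # Proportion of alleles shared IBD, averaged over all pairs.
    mean_sharing = (n_ibd1 + 2 * n_ibd2) / (2 * n_pairs)
    # Without linkage a pair shares 0, 1/2 or 1 of its alleles with
    # probabilities 1/4, 1/2 and 1/4: mean 1/2 and variance 1/8 per pair.
    z = (mean_sharing - 0.5) / math.sqrt(0.125 / n_pairs)
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    return mean_sharing, z, p_one_sided

# Hypothetical counts for 100 affected sib pairs.
mean_sharing, z, p = asp_mean_test(n_ibd0=15, n_ibd1=50, n_ibd2=35)
print(f"mean sharing = {mean_sharing:.2f}, z = {z:.2f}, one-sided P = {p:.4f}")
```

A one-sided P value is used because only excess sharing is evidence for linkage.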

Alternative explanation: Chance effect (random error), bias (systematic error) and confounding are always alternative explanations for any observed association. See also Bias and Confounding Lecture Note and Presentation.

Analysis of molecular variance (AMOVA): A statistical (analysis of variance) method for analysis of molecular genetic data. It is used for partitioning diversity within and among populations using nucleotide sequence or other molecular data. AMOVA produces estimates of variance components and F-statistic analogs (designated as phi-statistics). The significance of the variance components and phi-statistics is tested using a permutational approach, eliminating the normality assumption that is inappropriate for molecular data (Excoffier, 1992). AMOVA can be performed using Arlequin or AMOVA. For examples, see Roewer, 1996; Sawkins, 2001; Stead, 2003; Watkins, 2003; see also AMOVA Lecture Note (EEB348); AMOVA & Population Differentiation; SMART Module on AMOVA.

Ancestry informative markers (AIMs): Genetic markers that show large differences in frequency across population groups. These loci are useful in ancestry determination as in case-control studies. These markers can be used in admixture mapping or admixture-matching in case-control studies. For sets of AIMs used in such studies, see Shriver, 2003; Choudhry, 2006; Tsai, 2006; Seldin, 2006; Tian, 2006 & 2007.

Association: A statistically significant correlation between an environmental exposure, a trait or a biochemical/genetic marker and a disease or condition. An association may be an artifact (random error-chance, bias, confounding) or a real one. Genotyping errors also cause spurious associations. In population genetics, an association may be due to confounders including population stratification (confounding by ethnicity), linkage disequilibrium, reverse causation or direct causation. Association studies may prove useful in identifying a genetic factor in a disease. A significant association should be presented together with a measure of the strength of association (odds ratio, relative risk or relative hazard and its 95% confidence interval) and when appropriate, with a measure of potential impact (attributable risk, prevented fraction, attributable fraction/etiologic fraction).
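
For a 2x2 table, the odds ratio and its 95% confidence interval can be obtained as in the sketch below. It uses the common logit (Woolf) approximation and assumes that no cell is zero; the function name and the counts are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z_crit=1.96):
    """Odds ratio with an approximate 95% confidence interval (Woolf method).

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    All four counts must be greater than zero.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the natural log of the odds ratio.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z_crit * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z_crit * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical study: 30 of 100 cases and 15 of 100 controls carry the marker.
or_, lower, upper = odds_ratio_ci(a=30, b=15, c=70, d=85)
print(f"OR = {or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 corresponds to a statistically significant association at the 5% level, but the interval also shows how precisely the strength of the association has been estimated.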

Balancing selection: Selection involving opposing forces in which selective advantages and disadvantages cancel each other out. Heterozygote advantage (or overdominant selection) is an example in which an allele selected against in the homozygous state is retained because of the superiority of heterozygotes. Other balanced states may occur including when: an allele is favored at one developmental stage and is selected against at another (antagonistic pleiotropy); an allele is favored in one sex and selected against in another (sexual antagonism); an allele is favored when it is rare and selected against when it is common (negative frequency dependent selection).

Beavis effect: Upward bias in the estimated effects of significant quantitative trait loci (QTL) in a genome scan. This overestimation of QTL effects is a statistical artifact and is named after William D Beavis following his simulation studies (see Xu, 2003). See also Rockman & Kruglyak, 2006 and Gibson, 2002.

Best inheritance model: In a genetic association study, the genetic model (additive, co-dominance, dominant, recessive, or dominance/heterozygote advantage) that yields the strongest association. When the best inheritance model is unknown, it is customary to test each model and choose the strongest result to indicate the best inheritance model. Alternatively, a statistical test to select the best fitting model may be used (such as RobustSNP or MAX-rank).

Bias: An estimator for a parameter is unbiased if its expected value is the true value of the parameter. Otherwise, the estimator is biased. Bias is the quantity E(q-hat) - q. If the expected value of the estimator q-hat is the same as the actual but unknown q, the estimator is unbiased (as in estimating the mean of normal, binomial and Poisson distributions). If bias tends to decrease as n gets larger, this is called asymptotic unbiasedness. One of the most recently reported biases in genetic epidemiology arises from differential genotype scoring (half-call rate) between case and control DNA samples in large studies using automated genotype scoring algorithms (Clayton, 2005). See reviews by Sackett, 1979; Choi & Noseworthy, 1992; Grimes & Schulz, 2002; Campbell, 2002; Potter, 2003; Delgado-Rodriguez & Llorca, 2004 (Bias Glossary); see also Bias & Confounding in Molecular Epidemiology, Bias of Ascertainment in Complex Disease Genetics, and Bias and Confounding Lecture Note & Presentation.

Bioinformatics: In the context of genetic epidemiology, bioinformatics is the "research, development or application of computational tools and approaches for expanding the use of biological data". In genetic epidemiology, bioinformatics approaches are useful in functional annotations of genetic variants, gene enrichment analysis of GWAS results, and applications of machine learning methods for the primary analysis of GWAS data. See Bioinformatics for Genetic Epidemiologists Presentation; Bioinformatics Tools and Bioinformatics Courses by Kristel Van Steen.

Candidate gene study: A study of specifically selected (candidate) genes in which variation is hypothesized to influence the risk of a disease. Most initial genetic association studies were candidate gene studies until the emergence of the agnostic genetic association studies covering the whole genome variation (see genome-wide association studies).

Carrier: A healthy person who is a heterozygote for a recessive trait. Also includes persons with balanced chromosomal translocations. The unfortunate use of ‘carrier’ to describe individuals positive for a genetic marker is wrong, and the use of ‘carrier frequency’ in that context should be replaced by ‘marker frequency’.

Carter effect: Higher incidence of a genetically determined condition in relatives when the index case is the less commonly affected sex. This phenomenon was first demonstrated in Dr Cedric Carter's study of pyloric stenosis, where the incidence is highest in the sons of affected women and lowest in daughters of affected men.

Case-control study: A design preferred over cohort studies for relatively rare diseases in which cases with a disease or exposure are compared with controls randomly selected from the same study base. This design yields odds ratio (as opposed to relative risk from cohort studies) as the measure of the strength of association. See Case-control Studies Chapter in Epidemiology for the Uninitiated, Epidemiologic Study Designs (PPT), and Case-Control Genetic Association Studies in EFG Summer School.

Case-only design: A study design that is used to assess deviations from purely multiplicative interactions (Begg & Zhang, 1994; Khoury & Flanders, 1996; Botto & Khoury, 2004). The case-only design has been shown to be more efficient for detecting gene and environment interactions than case-control studies (Piegorsch, 1994; Goldstein & Andrieu, 1999). It estimates departure from multiplicative risk ratios (if genotype and environmental exposure are not associated in the population) as opposed to odds or rate ratio (Schmidt & Schaid, 1999). The method cannot be used as a substitute for traditional case-control studies since it is limited to the detection of interactions only, and cannot estimate population-based risk. It has higher power than traditional designs in detection of gene-gene and gene-environment interactions (Yang, 1997 & 1999; Gauderman, 2002). The case-only design cannot be used to estimate additive interactions.
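
A minimal sketch of the case-only calculation follows. It assumes that genotype and exposure are independent in the source population (the condition noted above); under that assumption the odds ratio between genotype and exposure among cases estimates the departure from a multiplicative joint effect. The function name and counts are hypothetical.

```python
import math

def case_only_interaction(g1e1, g1e0, g0e1, g0e0, z_crit=1.96):
    """Case-only estimate of multiplicative gene-environment interaction.

    Arguments are counts of CASES only, cross-classified by genotype (g1 =
    variant present, g0 = absent) and exposure (e1 = exposed, e0 = unexposed).
    Returns (ratio, lower, upper); a ratio of 1 means no departure from a
    multiplicative joint effect. All counts must be greater than zero.
    """
    ratio = (g1e1 * g0e0) / (g1e0 * g0e1)
    se_log = math.sqrt(1 / g1e1 + 1 / g1e0 + 1 / g0e1 + 1 / g0e0)
    lower = math.exp(math.log(ratio) - z_crit * se_log)
    upper = math.exp(math.log(ratio) + z_crit * se_log)
    return ratio, lower, upper

# Hypothetical series of 300 cases.
ratio, lower, upper = case_only_interaction(g1e1=40, g1e0=60, g0e1=50, g0e0=150)
print(f"interaction ratio = {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

No controls enter the calculation, which is why the design cannot give the main effects of genotype or exposure and cannot assess additive interaction.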

Causal relationship: No matter how small it is, a P value does not signify causality (cause and effect relationship). To establish a causal relationship, the following non-statistical evidence is required: consistency (reproducibility), biological plausibility, dose-response, temporality (when applicable) and strength of the relationship (as measured by the effect size such as odds ratio/relative risk/hazard ratio). See Hill's Criteria of Causality & Seven Common Errors in Statistics.

Causal variant: A functional variant that is responsible for the association signal. A causal variant does not cause a complex disease on its own (polymorphisms are neither necessary nor sufficient for causation of multifactorial diseases) but causes some change in genome biology that contributes to the causation of the disease. Most GWAS index SNPs are not causal themselves; in fact, an index SNP has on average only around a 5% chance of being the causal SNP (Farh, 2014). Causal variants are typically distant (median 14 kb) from the index SNPs and many are not in tight LD with them (Farh, 2014).

Coalescence time: Number of generations to the most recent common ancestor carrying a mutation or DNA variant currently present in a given population. See Gil McVean’s lecture on the coalescent. See a lecture note on Introduction to Coalescent Theory.

Codominance: Equal effect on the phenotype of two alleles of the same locus (as opposed to recessive and dominant).

Codominant genetic model: In a disease association study, if the risk conferred by the AB genotype (heterozygote) individuals lies between that of AA (wildtype homozygote) and BB (minor allele homozygote) individuals, but not in the specific relationship of a multiplicative or additive model, this corresponds to codominant model (Lewis, 2002; Minelli, 2005). This model is the most powerful one (over additive, recessive or dominant) to detect associations when the inheritance model is not known (Lettre, 2007).

Cohort effect: The tendency for persons born in certain years to carry a relatively higher or lower risk of a given disease. This may have to be taken into account in case-control studies. For example the penetrance of BRCA1 is greater for women born after 1930 than for those born earlier (Narod, 1993), thus, the risk of breast cancer has increased among BRCA1 mutation carriers.

Cohort study: A longitudinal follow-up study which begins with a group of people who do not have the trait of interest at the outset but a proportion of whom will develop it during the follow-up. The outcome is modeled for the explanatory variables to obtain the relative risk. Cohort studies may be historical or prospective. See Cohort Studies Chapter in Epidemiology for the Uninitiated; Epidemiologic Study Designs (PPT).

Common-disease-common-variant (CDCV) hypothesis: This hypothesis predicts that the genetic risk for common diseases will often be due to disease-predisposing alleles with relatively high frequencies; there will be one or a few predominating disease alleles at each of the major underlying disease loci (Lander, 1996; Chakravarti, 1999; Weiss & Clark, 2002; Becker, 2004). The hypothesis speculates that the gene variation underlying susceptibility to common heritable diseases existed within the founding population of contemporary humans. Whether the CDCV hypothesis is true for most diseases is as yet unknown, but there are a few prototypical examples: the APOE ε4 allele in Alzheimer disease, Factor V Leiden in deep venous thrombosis and PPARγ Pro12Ala in type II diabetes. Recent studies have also shown the importance of rare variants in complex disease genetics (Liu, 2005; Kryukov, 2007).

Complementation: The production of a wildtype phenotype in spite of recessive mutations in two different genes because of the presence of normal copies of those genes on homologous chromosomes. If recessive mutations represent alleles of the same gene, this would be compound heterozygosity and would not complement each other to produce a wildtype phenotype because they both represent loss-of-function of the same gene. Deafness in humans can be caused by a recessive mutation at a number of genes, so it is not uncommon for two deaf parents to have children who hear.

Complex disease: The term complex trait/disease refers to any phenotype that does not exhibit classic Mendelian inheritance attributable to a single gene, although it may exhibit familial tendencies (familial clustering, concordance among relatives). The contrast between Mendelian diseases and complex diseases involves more than just a clear or unclear mode of inheritance. Other hallmarks of complex diseases include known or suspected environmental risk factors; seasonal, birth order, and cohort effects; late or variable age of onset; and variable disease progression. See Genetic Epidemiology for Complex Disorders: Principles and Practice (NHLBI Webcast) and Rannala, 2001 for a comprehensive review of complex disease genetics.

Compound heterozygote: An individual who is affected with an autosomal recessive disorder having two different mutations in the same gene on homologous chromosomes. An individual in whom each of the two alleles of the same locus carry a different mutation (for a recessive disorder). The C282Y and H63D mutations of HFE frequently occur as compound heterozygosity.

Confounding: The distortion of a measure of association because of a non-intermediate factor that is correlated with the variable of interest and independently associated with the outcome. An analysis done on observations that all have the same value of the confounder will not be confounded. This can be achieved by stratification for the confounder or by matching. See Taioli, 2002; Potter, 2003; Bias & Confounding in Molecular Epidemiology; Bias & Confounding (PPT).

Confounding variable: A variable that is associated with both the outcome and the exposure variables. A classic example is the relationship of heavy drinking or gambling with lung cancer. Here, the data should be controlled for smoking as it is related to both drinking/gambling and lung cancer. A positive confounder is related to exposure and response variables in the same direction (as in smoking); a negative confounder shows an opposite relationship to these two variables (age in a study of association between oral contraceptive use and myocardial infarction is a negative confounder). The data should be stratified before analysis if confounding is suspected. The Mantel-Haenszel test is designed to analyze stratified data to control for a confounding variable. Alternatively, a multivariable regression model can be used to adjust for the effects of known confounders. The best strategy to avoid confounding is randomization. See Bias and Confounding (PPT).
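
The effect of stratification can be seen in a small made-up example. The sketch computes the crude odds ratio and the Mantel-Haenszel pooled odds ratio for data stratified by a confounder; the function names and all counts are hypothetical and were chosen so that the exposure has no effect within either stratum.

```python
def crude_odds_ratio(strata):
    """Odds ratio after collapsing all strata into a single 2x2 table."""
    a = sum(s[0] for s in strata)
    b = sum(s[1] for s in strata)
    c = sum(s[2] for s in strata)
    d = sum(s[3] for s in strata)
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio across strata.

    Each stratum is (a, b, c, d): exposed cases, exposed controls,
    unexposed cases, unexposed controls.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical data stratified by smoking status (illustration only).
strata = [
    (80, 40, 20, 10),  # smokers: odds ratio within the stratum is 1
    (10, 20, 40, 80),  # non-smokers: odds ratio within the stratum is 1
]
print("crude OR:", crude_odds_ratio(strata))
print("Mantel-Haenszel OR:", mantel_haenszel_or(strata))
```

The crude odds ratio of 2.25 here is produced entirely by the confounder; the stratum-adjusted estimate is 1.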

Convenience sample: A sample that has not been specifically and randomly collected for the purpose of a specific study. Convenience samples are easy to find but may suffer from bias, mainly selection bias, and may not be representative of the population they come from.

Copy number variation (CNV): Gains and losses of genomic segments resulting in variation in the number of copies of a genomic region or gene per diploid genome. Most genes show this variation and the study of disease associations with CNV is becoming common. The reference gene in CNV studies is commonly RNase P (RPPH1), which invariably exists in two copies in the human diploid genome. See Redon, 2006; Estivill & Armengol, 2007; Sanger Institute CNV Project; Database of Genomic Variants; ABI TaqMan® Gene Copy Number Assays.

Cramer’s V: This measure of the strength of association for contingency tables of any size is a transformation of the Chi-squared value that adjusts for sample size. It provides a value between 0 and 1 for relative comparison of the strength of associations. For a 2x2 table, Cramer's V is equal to the Phi coefficient. Cramer’s V is most useful for large contingency tables and it can be used as a global linkage disequilibrium value for multiallelic loci (global D’ is another measure for multiallelic loci and both can be calculated on UNPHASED (manual; ref)). See GOLD-Disequilibrium Statistics; Online Cramer’s V calculation.
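
The transformation can be written out directly. The sketch below computes the Pearson Chi-squared statistic for an r x c table and scales it by the sample size and the smaller table dimension; it assumes that every row and column total is greater than zero, and the function name and example tables are hypothetical.

```python
import math

def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-squared statistic from observed and expected counts.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Scale by sample size and by the smaller of (rows - 1) and (columns - 1).
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))

# Hypothetical tables: complete association, no association, and a 2x3 table.
print(cramers_v([[50, 0], [0, 50]]))
print(cramers_v([[25, 25], [25, 25]]))
print(cramers_v([[10, 20, 30], [30, 20, 10]]))
```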

Crossing-over (recombination): The exchange of genetic material between non-sister chromatids of homologous chromosomes (i.e., between maternal and paternal chromosomes) during meiosis. This results in a new and unique combination of genes on the daughter chromosome, which will be passed on to the offspring (if that particular gamete is involved in fertilization). See a Demonstration of Crossing-Over (JAVA Applet) and Genetic Linkage Tutorial by F Clerget-Darpoux.

Determinism: The belief that genes determine the phenotype. This is only partially true for monogenic traits and diseases. The genetic determinism is also behind the wrong view that DNA sequence is the blueprint for life. Almost all phenotypes (including some of the monogenic disorders) are a result of complex interactions between genetics and environment, which incorporates epigenetic influences. The lack of genetic determinism is behind the lack of progress with risk prediction for complex diseases. See Genetic Determinism by Council for Responsible Genetics.

Dominance: In classic genetics, dominance is the property possessed by some alleles of determining the phenotype by masking the effects of the other allele (when heterozygous). Thus, homozygosity or heterozygosity for the dominant allele results in the same phenotype in complete dominance (if red is dominant over white, the petals of a flower heterozygous for red and white would be red). Incomplete dominance appears as a blend of the phenotypes corresponding to the two alleles (like pink petals as opposed to red or white). In codominance, both alleles equally contribute to the phenotype (red and white petals occur together). See also recessive.

Dominance variance: The component of genetic variance due to non-additive effects of alleles at the same locus (Cockerham, 1954). This component represents all genetic effects other than the additive effects and includes intra-locus allelic interactions. This component is commonly ignored in analysis of genetic associations but can be calculated without much trouble. Dominance variance modeling should not be mixed up with dominant models. See an applet on Genetic Variance from a Single Locus at HGSS.

Dominant allele: An allele that masks an alternative allele when both are present (in heterozygous form). Homozygous dominant and heterozygous genotypes contribute the same to the phenotype. Most common autosomal dominant diseases are due to mutations in transcription factor genes (Jimenez-Sanchez, 2001). See Clinical Genetics.

Dominant model: A genetic association analysis model that examines association with a dominant allele. The comparison groups are wild-type homozygous genotypes vs allele positivity (combining heterozygotes and homozygotes for the variant). See MODEL-online tool for genetic association analysis for different models. See also Lewis, 2002; Minelli, 2005.

Dominant-negative mutation: A (heterozygous) dominant mutation on one allele blocking the activity of wild-type protein still encoded by the normal allele (often by dimerizing with it) causing a loss-of-function phenotype. The phenotype is indistinguishable from that of homozygous dominant mutation. P53 mutations may act as dominant-negative (see also haploinsufficiency). See Clinical Genetics.

Dosage compensation: The phenomenon whereby females, who have two copies of the genes on the X chromosome, have the same level of the products of those genes as males (who have a single X chromosome). This is due to the random inactivation of one of the X chromosomes in females (Lyonization).

Effect modification: The situation in which a measure of effect changes over values of another variable (the association estimates are different in different subsets of the sample). The relative risk or odds ratio associated with exposure will be different depending on the value of the effect modifier. For example if in a disease association study, the odds ratios are different in different age groups or in different sexes, age or sex are effect modifiers. Effect modification is highly related to statistical interaction in regression models. Where an exposure decreases the risk for one value of the effect modifier and increases the risk for another value of effect modifier, this is called crossover (Thompson, 1991). See also Bias and Confounding Lecture Note and Presentation.

Effect size: In statistics, effect size is the strength of an association. It is usually quantified by calculation of relative risk, odds ratio or hazard ratio. Effect size complements the P value and should always accompany it when an association is reported. The practice of exclusive reporting of the P values for associations without an effect size is unacceptable. Effect size is one of the determinants of statistical power.

ELSI: Ethical, legal and social implications of genetic research. See: NHGRI: ELSI Research Program; CDC: ELSI Information; WHO: ELSI; Citrin & Modell, 2003; Kaye, 2012.

EM algorithm: A method for calculating maximum likelihood estimates with incomplete data. E (expectation)-step computes the expected values for missing data and M (maximization)-step computes the maximum likelihood estimates assuming complete data. It was first used in genetics (Ceppellini R et al, 1955) to estimate allele frequency for phenotype data when genotypes are not fully observable (this requires the assumption of HWE and calculation of expected genotypes from phenotype frequencies). See ARC CIGMR: EM Algorithm.
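
The two steps can be illustrated with ABO blood groups, a classic case in which phenotypes do not fully reveal genotypes (groups A and B each hide two genotypes). The choice of ABO, the function name and the counts are assumptions made for this example; the sketch assumes HWE, as noted above, and is not a reconstruction of the 1955 analysis.

```python
def abo_em(n_a, n_b, n_ab, n_o, tol=1e-10, max_iter=1000):
    """EM (gene counting) estimates of ABO allele frequencies under HWE.

    n_a, n_b, n_ab, n_o: numbers of individuals with blood groups A, B, AB, O.
    Returns (p, q, r): estimated frequencies of the A, B and O alleles.
    """
    n = n_a + n_b + n_ab + n_o
    p = q = r = 1.0 / 3.0  # arbitrary starting values
    for _ in range(max_iter):
        # E-step: split group A into AA and AO (and B into BB and BO)
        # in the proportions expected under HWE at the current estimates.
        n_aa = n_a * p / (p + 2 * r)
        n_ao = n_a - n_aa
        n_bb = n_b * q / (q + 2 * r)
        n_bo = n_b - n_bb
        # M-step: count alleles as if the completed genotype data were observed.
        p_new = (2 * n_aa + n_ao + n_ab) / (2 * n)
        q_new = (2 * n_bb + n_bo + n_ab) / (2 * n)
        r_new = (n_ao + n_bo + 2 * n_o) / (2 * n)
        change = max(abs(p_new - p), abs(q_new - q), abs(r_new - r))
        p, q, r = p_new, q_new, r_new
        if change < tol:
            break
    return p, q, r

# Hypothetical phenotype counts (they match HWE expectations for 0.3, 0.1, 0.6).
p, q, r = abo_em(n_a=4500, n_b=1300, n_ab=600, n_o=3600)
print(f"A = {p:.4f}, B = {q:.4f}, O = {r:.4f}")
```

Each iteration cannot decrease the likelihood, so the estimates settle at a maximum likelihood solution; a fixed tolerance on the change between iterations is used here as the stopping rule.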

ENCODE (Encyclopedia of DNA Elements) project: The Encyclopedia of DNA Elements (ENCODE) project has mapped regions of transcription, transcription factor binding, chromatin structure and histone modification in the whole of the human genome by analyzing multiple cell types. The results, presented in a series of papers published in September 2012 (see the lead paper in Nature), revolutionized modern genetics by assigning a function to 80% of the genome and challenging the junk DNA concept. The data from the ENCODE project are fully integrated into the UCSC genome browser. See also ENCODE website, GENCODE browser, and Nature ENCODE Explorer.

Enhancer: Enhancers are one of the seven major genomic landmarks involved in gene regulatory activities as described by the ENCODE project. They are genomic distal (in relation to transcription start sites (TSSs)) cis-regulatory elements that carry sequence information for transcription factor binding, regulate gene expression regardless of location and orientation (including trans effects), and control tissue-specific gene expression. Enhancers can be recognized by their DNAse I sensitivity, methylation status and unique histone modifications. They are usually located in intergenic regions, but may also be in exons. The cancer associations with 8q24 gene desert polymorphisms are due to functional alterations of the enhancer activity (Jia, 2009).

Environment: Almost anything that is not genetic. Environmental factors include diet (food, preservatives, coloring, composition of diet and amount); air (clean air, smog, pollution, tobacco, workplace chemical fumes, dust, humidity, temperature); radiation (sunlight, tanning lights, X rays, microwaves, radio waves); infectious agents (bacteria, viruses, fungi, parasites), hormonal exposures and in utero environment.

Epigenetics: The study of heritable changes in gene expression that occur without a change in DNA sequence. Epigenetic phenomena such as imprinting and paramutation violate Mendelian principles of heredity. Epigenetic studies link genotype to phenotype working out the chain of processes. See Epigenetics: Special Issue of Science, 2001; a review by Petronis, 2001; a lecture by Shuk-mei Ho.

Epistasis: Original meaning was related to the genetic interaction of two or more genes that encode enzymes catalyzing steps in a common pathway. It has come to be synonymous with almost any type of gene interaction. Formal definition is 'genetic variance due to non-additive effects of alleles at distinct loci' thus, it is included in the dominance variation component. The most extreme form of epistasis (interaction) results in a multiplicative model in which the total risk is the product of the individual risks at each locus (or allele). See a Review on Epistasis by Cordell; Commentary on Epistasis by JH Moore.

Epistatic interaction: In genetic epidemiology, an epistatic effect is the modification of the risk conferred by one marker by the presence of a marker from an unrelated gene (unlinked gene-gene interaction). For examples, see Kajiwara, 1994 (retinitis pigmentosa); Olson, 2002; Pastor, 2003; Robson, 2004 (Alzheimer Disease); and Martin, 2002 (KIR3DL in HIV-AIDS); a review on epistatic interaction (Cordell, 2002); Epistasis Blog and Software at Computational Genetics Laboratory.

Evolutionary-based haplotype association: An association study design which uses haplotypes grouped together based on their evolutionary (cladistic) relationships. Use of ancestral haplotype groups in association studies is an efficient way to increase power (Templeton, 1987; 1995; 2000; Schork, 1998; Seltman, 2003; Fejerman, 2004; Tzeng, 2005).

EWAS: Epigenome-wide association scans. See an example: Bell, 2012.

Ewens-Watterson neutrality test: Also called the E-W homozygosity statistic. Described by Ewens (1972) and Watterson (1978). A widely used test in population genetics to assess the selection acting on a locus. It compares the observed homozygosity of a given locus (F_o, the sum of the squared frequencies of its alleles) with the expected value (F_e) based on the number of alleles at the locus, neutrality expectations and the random mating assumption. The comparison yields a normalized deviate of F_o from F_e. Values of the deviate close to zero mean that the locus is evolving under neutrality (genetic drift only) and there is no selection; values significantly different from zero suggest selection. When F_o > F_e, the locus is undergoing purifying selection, and when F_e > F_o, the locus is under balancing selection (very common for HLA loci) (see Nielsen, 2001, Luikart, 2003, Harris & Meyer, 2006 for reviews). Alternative tests for neutrality include Tajima's D (Tajima, 1989) and Slatkin's exact test for neutrality (Slatkin, 1996; Slatkin & Muirhead, 2000). See also Basic Population Genetics.

Expression quantitative trait locus (eQTL): A polymorphic locus (like a SNP) that influences expression levels of a gene. This gene does not have to be the nearest gene. If the influenced (target) gene is in the vicinity (usually within 1 Mb), the eQTL is called a cis eQTL; if the target gene is far away or on a different chromosome, the eQTL is a trans eQTL. Major eQTL search tools are GTEx, GENEVAR, eQTL Browser-UChicago and SCAN.

Expressivity: The range of phenotypes resulting from a given genotype (cystic fibrosis, for example, may have a variable degree of severity). This is different from pleiotropy, which refers to a variety of different phenotypes resulting from the same genotype, and from penetrance, which refers to the proportion of carriers of a genotype who show the phenotype at all.

Extended haplotype homozygosity (EHH) test: The frequency of an allele corresponds to its age, which in turn, correlates with decay of LD with alleles of adjacent loci (an old allele has high frequency and is expected to show low LD with adjacent loci). The EHH test compares the age of an allele based on its frequency with its age based on its extended haplotype recombination. High frequency alleles in the middle of a high LD region (haplotype block) represent positive selection as opposed to neutral alleles that take a long time to reach high frequency accompanied by low LD with adjacent loci. For discussion and examples of EHH test, see Mueller & Andreoli, 2004, Miretti, 2005 and Wang, 2005. See also EHH web-tool.

External validity: The extent to which a study’s findings apply to populations other than the one that was being investigated. See also internal validity.

F₁: First filial (son or daughter) hybrids arising from a first cross. Subsequent generations are denoted by F₂, F₃ etc. In animal studies of quantitative trait locus (QTL) mapping, two animals with extremes of the phenotype (like lowest and highest blood pressure) are mated to generate F₁, and then F₁ x F₁ matings produce an F₂ generation with a wide spectrum of the phenotype, which is then used for mapping studies.

Falconer's multifactorial liability threshold model: Originally described and modeled in an analysis of polydactyly in guinea pigs (Wright S, 1934) and applied to human genetics by Douglas Falconer (Falconer DS. The inheritance of liability to certain diseases, estimated from the incidence among relatives. Ann Hum Genet 1965;29:51-76; see also Falconer, 1967; Fraser FC 1976 & 1980). Nicely explained in Falconer's polygenic threshold model for dichotomous nonmendelian characters in Human Molecular Genetics. See also a Lecture Note by Dr R Tissot; Genetic Calculation Applets: Calculator for Heritability in Threshold Traits; Understanding the Threshold Model and an example by Wanstrat & Wakeland.

False discovery rate (FDR): One of the methods developed to avoid spurious associations arising from multiple comparisons. The FDR procedure quantifies the false discovery problem. The calculated q value indicates the expected proportion of "significant" results which are false positives. A q value is chosen that is considered acceptable, and it is used to determine the P value cutoff to use to declare statistical significance. Corrected P values can also be derived by multiplying each P value by the total number of comparisons made and dividing this value by the rank of the P value (the smallest is ranked 1; the largest, ranked last, is not changed by this procedure). This approach tolerates more false positives than Bonferroni correction, and results in fewer false negatives. The FDR method was originally described by Benjamini & Hochberg (1995). Specifically for GWAS, a hidden Markov model (HMM)-based pooled local index of significance (PLIS) testing procedure has been developed based on the FDR principle (Wei, 2009 & 2012). See: Graham Horgan's FDR page (with an Excel file for calculation of FDR); and False Discovery Rate Calculator for 2x2 Contingency Tables.
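
The rank-based correction described above can be sketched as follows. This is a minimal illustration of the Benjamini & Hochberg adjustment, not a replacement for an established statistics package; the function name and the P values are hypothetical. The running minimum is the usual extra step that keeps the adjusted values in the same order as the raw P values.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    # Indices sorted from the smallest P value (rank 1) to the largest (rank m).
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        candidate = p_values[i] * m / rank  # P x (number of tests) / rank
        running_min = min(running_min, candidate)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P values from five comparisons.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted_p = bh_adjust(raw_p)
# Results declared significant at an FDR threshold of 0.05.
significant = [p for p, q in zip(raw_p, adjusted_p) if q <= 0.05]
```

With these inputs only the two smallest P values pass the 0.05 threshold, whereas the largest P value (0.6) is left unchanged by the adjustment.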

Founder effect: Coalescence of a mutation or DNA variant in a given population to one of the original population founders or his/her descendant.

Functional annotation: Assessment of functional consequences of a genetic variation either for candidate SNP selection or for causal assessment of SNPs found to be associated with a trait. See also Bioinformatics Tools.

Genetic architecture: The overall characteristics of genetic risk such as the number of risk alleles, their allele frequency spectrum and effect sizes, and the mode of interactions among them.

Genetic epidemiology: Genetic epidemiology is the epidemiological evaluation of the role of inherited causes of disease in families and in populations; it aims to detect the inheritance pattern of a particular disease, localize the gene and find a marker associated with disease susceptibility. Gene-gene and gene-environment interactions are also studied in genetic epidemiology of a disease. In its broad context, genetic epidemiology includes family studies, molecular epidemiologic studies with genetic components, and more traditional cohort and case-control studies with family history components. See Genetic Epidemiology Lecture Note and Presentation.

Genetic heterogeneity: Distinct alleles at the same or different loci that give rise independently to the same genetic disease. In clinical settings, genetic heterogeneity refers to the presence of a variety of genetic defects which cause the same disease, which may be mutations at different positions in the same gene, a finding common to many human diseases (including Alzheimer disease, cystic fibrosis, lipoprotein lipase deficiency and polycystic kidney disease).

Genetic nurturing: The process by which correlation between genotypes and phenotypes can be produced via the effect of parents’ genotypes on their offspring’s phenotypes through the parents’ phenotypes (See Shen & Feldman, 2020).

Genome-wide association study (GWAS): Simultaneous investigation of millions of genetic variants theoretically covering the whole genome in complex genetic diseases (Clark, 2005; Wang, 2005; Pearson, 2008; McCarthy, 2008). See NIH guide to GWAS; Presentation by G McVean; Lecture Note by D Clayton; Presentation by S Chanock; the WTCCC GWAS (PDF); GWAS Catalog (PPT) in NHGRI (NIH); GWAS Integrator (at HuGE Navigator). PLINK2, BEAGLE, GenABEL and SNPassoc are commonly used statistical analysis packages for GWAS. See also the Max-rank approach (Li, 2008) for ranking associations, and WGAviewer (Ge, 2008; Workshop Notes) for annotating, visualizing, and interpreting the full set of P values emerging from a GWAS. EIGENSOFT can be used to detect population stratification by a PCA algorithm in GWAS (Price, 2006). See Potential Criteria for Standardized Reporting of GWAS Results in Johnson & O’Donnell, 2009; PhenoScanner; GRASP; GWAS Central; GWAS Database (Japan); Statistics for GWAS by Laurent Briollais @ Bioinformatics.ca: PDF | PPT. See Bioinformatics Tools for more GWAS-related bioinformatics tools.

Genomic control: One method to adjust for population stratification bias in case-control association studies is to use a 'genomic control markers' panel (Reich & Goldstein, 2001). The panel consists of 20-50 polymorphic markers unlinked to the loci of interest. The information obtained from unlinked markers may be used in a variety of ways (genomic control, structured association, latent-class approach). The adjustment requires some statistical manipulation (Pritchard, 1999 & 2000; Bacanu, 2000; Devlin, 2001; Reich & Goldstein, 2001; Ardlie, 2002; Devlin, 2004; Purcell, 2004; Shi, 2004; Shmulewitz, 2004; Hao, 2004; Fu, 2005), which can be handled using a variety of statistical approaches (UPMC genomic control software; STRUCTURE & STRAT; ADMIXMAP; L-POP).
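
One common form of the genomic control adjustment can be sketched as below, assuming 1-df chi-square association statistics are available for the unlinked control markers. The inflation factor is taken as the median of those statistics divided by the median of a 1-df chi-square distribution (about 0.455), and the test statistic at the candidate locus is divided by it. The function names and all numbers are hypothetical, and a real panel would contain more markers than shown here.

```python
from statistics import median

# Approximate median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549


def genomic_inflation(null_marker_chi2):
    """Inflation factor estimated from unlinked (null) marker statistics."""
    inflation = median(null_marker_chi2) / CHI2_1DF_MEDIAN
    # Values below 1 are conventionally set to 1 (no deflation of the test).
    return max(inflation, 1.0)


def gc_corrected(chi2_stat, inflation):
    """Association statistic rescaled by the inflation factor."""
    return chi2_stat / inflation


# Hypothetical 1-df chi-square statistics at seven unlinked control markers.
control_chi2 = [0.10, 0.35, 0.50, 0.60, 0.91, 1.40, 2.70]
inflation = genomic_inflation(control_chi2)
# Hypothetical statistic at the candidate locus, before and after correction.
corrected = gc_corrected(8.0, inflation)
```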

Genotype: The two alleles inherited at a specific locus. If the alleles are the same, the genotype is homozygous, if different, heterozygous. In genetic association studies, genotypes can be used for analysis as well as alleles or haplotypes.

Genotype-environment (GxE) interaction (GEI): This term refers to both the modification of genetic risk factors by environmental risk and protective factors, and the role of specific genetic risk factors in determining individual differences in vulnerability to environmental risk factors. When GxE interaction is present, a specific environmental change influences the outcome in different ways depending on the genotype. This requires inclusion of a multiplicative interaction term into the statistical model. For reviews, see Ottman, 1996; Heath & Nelson, 2002; Cooper, 2003; Hemminki, 2006a (PDF) & 2006b; North & Martin, 2008; Understanding Gene-Environment Interactions; Environment, Genes, and Cancer; Online Book (Costa & Eaton): Gene-Environment Interactions; Gene-Environment & Cancer booklets by NCI and CCS. For an example, see Carbone, 2007 and see North & Martin, 2008 for a comprehensive and informative review.

Genotype relative risk (GRR): The risk of disease for one genotype at a locus versus another. It is usually assessed as having one copy of the allele of interest (Aa) vs having none (AA), which is GRR1; and having two copies of the allele (aa) vs having none, which is GRR2. In simple statistical analysis this is achieved by using dummy variables for each genotype, selecting the genotype AA as referent and obtaining odds ratios for other genotypes Aa and aa. Most of the time, what is presented is actually genotype odds ratio. See Schaid & Sommer, 1993; Risch & Merikangas, 1996; Camp, 1997.
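
The referent-genotype comparison described above can be sketched with simple case-control counts. As the entry notes, what this yields are genotype odds ratios rather than relative risks. The function name and the counts are hypothetical, and confidence intervals and covariate adjustment (for which logistic regression with dummy variables would be used) are left out.

```python
def genotype_odds_ratios(cases, controls, referent="AA"):
    """Odds ratio of each non-referent genotype versus the referent genotype."""
    odds_ratios = {}
    for genotype in cases:
        if genotype == referent:
            continue
        odds_ratios[genotype] = (cases[genotype] * controls[referent]) / (
            controls[genotype] * cases[referent]
        )
    return odds_ratios


# Hypothetical genotype counts; 'a' is the allele of interest.
case_counts = {"AA": 100, "Aa": 120, "aa": 60}
control_counts = {"AA": 150, "Aa": 100, "aa": 30}
genotype_ors = genotype_odds_ratios(case_counts, control_counts)
# genotype_ors["Aa"] estimates GRR1 and genotype_ors["aa"] estimates GRR2.
```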

GRIPS (strengthening the reporting of Genetic RIsk Prediction Studies): A set of recommendations for strengthening the reporting of Genetic RIsk Prediction Studies (GRIPS) consisting of a checklist of 25 items developed by a multidisciplinary workshop sponsored by the Human Genome Epidemiology (HuGE) Network. See: GRIPS statement, elaboration and explanation and the checklist. See also STROBE.

Haplotype: Linear arrangements of alleles on the same chromosome that have been inherited as a unit. A person has two haplotypes for any such series of loci, one inherited maternally and the other paternally. A haplotype may be characterized by a single allele unless a discrete chromosomal segment flanked by two alleles is meant. See a discussion of the use of haplotypes as opposed to individual SNPs: Clark, 2004.

Haplotype blocks: A chromosomal region with high linkage disequilibrium and low haplotype diversity. Probably flanked by recombinational hotspots, haplotype blocks are shorter in African populations (average 11kb) than in other populations (average 22kb) (Gabriel, 2002). Haplotype block lengths correlate with recombinational rate (Greenwood, 2004) but most haplotype-block boundaries do not occur at hotspots (Wall, 2003). All pairs of polymorphisms within a block are expected to show high linkage disequilibrium. Haplotype blocks are useful in association studies and a representative set of haplotype tagging SNPs can be used instead of the whole set of polymorphisms within a block (Zhang, 2004). Haploview is the most popular software for haplotype block analysis (Barrett, 2005) (see documentation and tutorial). HapBlock (Zhang, 2005), HaploBlock Finder and SNPTagger can also be used for haplotype block partitioning. For a review, see Cardon, 2003.

Haplotype relative risk (HRR) method: This method uses parental haplotypes not inherited by affected offspring as the control group (pseudocontrols) in a parent-trio design, and thus eliminates the potential problems of using unrelated individuals as controls in case-control association studies. Haplotyping is not necessary to use this method; it can be used for allelic associations. See Falk & Rubinstein, 1987; Knapp, 1993; Terwilliger & Ott, 1992.

HapMap (International Haplotype Mapping Project): A major international effort designed to obtain a map of haplotype blocks, the specific SNPs that identify the haplotypes (htSNPs) and linkage disequilibrium patterns in European, African and Asian populations (Manolio, 2008). HapMap has now been retired and all HapMap samples have been incorporated into the 1000 Genomes Project.

Hardy-Weinberg equilibrium (HWE): In an infinitely large, randomly mating population, gene and genotype frequencies remain stable as long as there is no selection, mutation, or migration. For a biallelic locus where the gene frequencies are p and q: p²+2pq+q²= 1. HWE should be assessed in controls in a case-control study and any deviation from HWE should alert for genotyping errors (Gomes, 1999; Lewis, 2002) but see also Zou & Donner, 2006. Relying only on HWE tests to detect genotyping errors is not recommended as this is a low power test (Leal, 2005). (Online HWE Analysis; HWE and Association Testing for SNPs in Case-Control Studies; HWE Tutorial in Life, 7th Ed; Basic Population Genetics).
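
A minimal sketch of how HWE might be checked in controls at a biallelic locus is given below. It uses the chi-square goodness-of-fit test with 1 degree of freedom, which is one of several possible tests (exact tests are generally preferred for small samples or rare alleles). The function name and the genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test (1 df) for HWE at a biallelic locus."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele a
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a 1-df chi-square distribution.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical control genotype counts that match p² : 2pq : q² with p = 0.6.
chi2_fit, p_fit = hwe_chi_square(360, 480, 160)
# Hypothetical counts with a heterozygote deficit at the same allele frequency.
chi2_dev, p_dev = hwe_chi_square(400, 400, 200)
```

The first set of counts shows no deviation, while the second gives a large chi-square value and would prompt a check for genotyping error or other causes of departure from HWE.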

Haseman-Elston regression: A sib-pair test for linkage between a quantitative trait and a marker locus (Haseman & Elston, 1972). It is a classical regression method using the squared sib-pair trait difference as a dependent variable and the proportion of shared alleles identical by descent by the sib pair as an independent variable, where a statistically significant negative regression coefficient suggests linkage. Since then it has been extended to multiple quantitative loci (Tiwari, 1997; Stoesz, 1997); revisited to incorporate information from full sibs and other pairs of relatives (Elston, 2000); applied to X-linked traits (Wiener, 2003); and further modified to increase its power (Wang, 2004).
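
The original regression can be sketched as an ordinary least-squares fit of squared sib-pair trait differences on the proportion of alleles shared identical by descent. The function name and the data are hypothetical, and the significance test of the slope, on which the inference about linkage rests, is omitted.

```python
def haseman_elston_slope(sib_pair_traits, ibd_proportions):
    """Least-squares slope of squared trait difference on IBD sharing."""
    y = [(a - b) ** 2 for a, b in sib_pair_traits]
    x = ibd_proportions
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return s_xy / s_xx


# Hypothetical trait values for six sib pairs and their IBD sharing
# (0, 0.5 or 1) at the marker locus.
traits = [(10, 16), (12, 17), (11, 14), (13, 15), (12, 13), (14, 14)]
ibd = [0, 0, 0.5, 0.5, 1, 1]
slope = haseman_elston_slope(traits, ibd)
```

In this made-up data set, pairs sharing more alleles identical by descent have more similar trait values, so the slope is negative, which is the direction that suggests linkage.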

Hazard ratio (relative hazard): Hazard ratio is an effect size which compares two groups differing in treatments or prognostic variables etc. If the hazard ratio is 2.0, then the rate of failure in one group is twice the rate in the other group. The computation of the hazard ratio assumes that the ratio is consistent over time, and that any differences are due to random sampling. Before performing any tests of hypotheses to compare survival curves, the proportionality of hazards assumption should be checked (and should hold for the validity of Cox's proportional hazard models).

Heritability: The proportion of the phenotypic variability due to genetic variance [(narrow-sense) h²= additive variance / total phenotypic variance]. Can be locus-specific or for all loci combined. A high h² does not mean that the trait cannot be influenced by the environment. In a different environment the same h² may not be that high. Heritability does not indicate the degree to which a trait is genetically determined; it measures the proportion of phenotypic variance that is the result of genetic variation (see Heritability 101 & Heritability 201 (Neale Lab); Introduction to Quantitative Genetics; Human Genetics Interactive Learning Exercises; Effect of Heritability on Response to Selection; Human Molecular Genetics: Chapter 19.2.2. See also UKBB Heritability Browser (Neale Lab)).

Heterozygosity: Presence of two different alleles at a locus in a diploid organism (see homozygosity). It is the result of inheritance of different alleles from parents. For relevance of heterozygosity in disease states, see Beckman, 1990; Vockley, 2000; Vladutiu, 2001. Rarely, heterozygosity, but neither homozygous genotype, causes a disease. For a review, see van Heyningen, 2004.

Heterozygote advantage: Also called overdominance (a form of balancing selection) and opposite of underdominance (homozygote advantage). For an example and a list of all known examples, see Gemmell & Slate, 2006 and Supplemental Table 1. Genome-wide heterozygosity has been reported to confer advantage for common diseases (Campbell, 2007) and in particular, in cancer (Assie, 2008). See also balancing selection.

High-throughput genotyping: Simultaneous genotyping of large numbers of samples. Most machines can run 4x96 (384) samples simultaneously (SNP typing, real-time PCR, sequencing) with a queuing system that would allow automatic continuation of the typing. A number of companies perform SNP high-throughput genotyping (including LGC in UK; DNAVision in Europe).

Homozygosity: Presence of two identical alleles at a locus in a diploid organism (see heterozygosity). It is the result of inheritance of identical alleles from both parents.

Homozygosity mapping: Recessive diseases require two copies of an allele for expression. Because of linkage disequilibrium, loci surrounding the disease locus will tend to be homozygous in affected individuals. Searching for homozygous segments in diseased individuals helps to locate the disease gene. This is called homozygosity mapping (Lander & Botstein, 1987).

htSNP: Haplotype-tagging SNP.

iCHAV: An "independent set of correlated highly associated variants" referring to the set of statistically similar, highly trait-associated variants which includes the causal SNP and the SNP that has shown the association. See Glubb, 2015.

Identity by descent (IBD): Alleles that trace back to a shared ancestor. For sibs, refers to inheritance of the same allele from a given parent.

Interaction: If the effect of one factor depends on the level of another factor, these two factors are said to interact. Factors A and B interact if the effect of factor A is not independent of the level of factor B. For example, when there are two main effects on a response variable, if their combined effect is higher than the sum of their main effects, they have an interaction (meaning a simple additive model is not sufficient to account for the observed data and a multiplicative term must be added). Also, there would be an interaction between the factors sex and treatment if the effect of treatment is not the same for males and females in a drug trial. Interaction is closely linked with effect modification in epidemiology. See Wikipedia: Statistical Interaction.

Internal validity: Internal validity is determined by the presence or absence of systematic error (bias) that causes the study findings to differ from the true values. A study that suffers from non-causal reasons for an association between an exposure and outcome (bias, confounding and serious random error) lacks internal validity. See also external validity.

Kin-cohort study: A study design for estimation of penetrance of a disease mutation. Individuals with and without family histories are included in the study sample and the family histories of the mutation carriers are compared with the family histories of the non-carriers. This design works only when the carrier frequency is more than 1% and when a founder effect is present (i.e., no genetic heterogeneity). Described by Wacholder, 1998.

Linkage: The tendency of 'genes' on the same chromosome to segregate together. This means that linked genes are transmitted to the same gamete more than 50% of the time. Genetic linkage reflects a lack of meiotic crossovers between two genes, one of which is usually a latent/unknown disease locus. A number of software packages are available to analyze linkage in pedigree data; the most commonly used ones are Linkage, Genehunter and Allegro (Genetic Analysis Software list and A Survey of Current (2003) Software for Linkage Analysis by F Dudbridge). See exercises on Gametes under Linkage and Linkage Pedigrees. For a general review, see genetic linkage in Kimball’s Biology and Tutorial by F Clerget-Darpoux. See also quasi-linkage.

Linkage disequilibrium (LD): The occurrence of two alleles at different loci together on the same chromosome (or gamete) more often than would be predicted by random chance. It is a measure of co-segregation of alleles in a population. Also called population 'gametic association' and may be defined as 'nonzero' if multilocus gamete frequencies are different from the product of allele frequencies at each locus. For details, see Basic Population Genetics; for software, see Genetic Epidemiology.
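
The difference between the gamete (haplotype) frequency and the product of allele frequencies can be computed directly, as sketched below for two biallelic loci. The sketch also returns two standard summary measures that are not defined in this entry, the absolute value of D' and r². The function name and the frequencies are hypothetical, and both loci are assumed to be polymorphic (allele frequencies strictly between 0 and 1).

```python
def ld_measures(p_AB, p_A, p_B):
    """D, |D'| and r² from one haplotype frequency and two allele frequencies."""
    # Observed haplotype frequency minus its expectation under independence.
    d = p_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r_squared


# Hypothetical frequencies: haplotype A-B = 0.4, allele A = 0.5, allele B = 0.5.
d, d_prime, r_squared = ld_measures(0.4, 0.5, 0.5)
# Under linkage equilibrium the haplotype frequency equals the product (0.25).
d_null, d_prime_null, r_squared_null = ld_measures(0.25, 0.5, 0.5)
```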

Linkage disequilibrium (LD) mapping of disease genes: Marker loci near a disease gene are often observed to be in LD with the disease; that is, the relative frequencies of marker alleles in affected individuals differ from those in the general population. LD occurs because each new disease-predisposing mutation originally appears on a single chromosome. Individuals who inherit a disease mutation are likely to also inherit the alleles of the original chromosome, at neighboring marker loci. Because recombination with the disease gene happens less often for nearby marker loci, markers in the immediate vicinity of the gene should remain in greater disequilibrium than more distant marker loci and this is the basis of associations with the disease. See Terwilliger & Weiss, 1998; Lazzeroni, 1998 & 2001; Pritchard, 2001; Tishkoff, 2002; Jorde, 2003 and GENESTAT: LD Mapping.

LOD score: The LOD score method for testing linkage was first proposed by Morton in 1955 (Morton, 1955). Stands for the logarithm of odds but it is not the logarithm of the odds for linkage but the logarithm of the likelihood ratio for a particular value of the recombination fraction vs. free recombination (θ = 0.5) (Elston, 1998; Borecki, 2001). Thus, the LOD score serves as a test of the null hypothesis of free recombination versus the alternative hypothesis of linkage. It is a statistical measure of the likelihood that two genetic markers occur together on the same chromosome and are inherited as a single unit of DNA. Determination of LOD scores requires pedigree analysis and a score of >+3 is traditionally taken as evidence for linkage (and -2 may mean the opposite). Linkage is between two genetic loci but not alleles. An example is the linkage between the hemochromatosis gene (HFE) and HLA-A. This means that within the same family all affected subjects will have the same HLA-A allele, i.e., there will be no recombination between HFE and HLA-A. LOD score has nothing to do with linkage disequilibrium. See also Significance of LOD Scores by Dave Curtis and a presentation on LOD Score.
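
For the simplest case, in which every meiosis can be scored unambiguously as recombinant or non-recombinant (phase-known data), the likelihood ratio above reduces to a closed form, sketched below with the recombination fraction written as theta. Real pedigree data usually need dedicated linkage software. The function name and the counts are hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD score for phase-known meioses at recombination fraction theta."""
    non_recombinants = meioses - recombinants
    # Log10 likelihood under free recombination (theta = 0.5).
    log_l_null = meioses * math.log10(0.5)
    if theta == 0:
        # Complete linkage is only compatible with zero recombinants.
        return -log_l_null if recombinants == 0 else float("-inf")
    log_l_theta = recombinants * math.log10(theta) + non_recombinants * math.log10(
        1 - theta
    )
    return log_l_theta - log_l_null


# Hypothetical family data: 1 recombinant among 10 informative meioses.
lod_at_0_1 = lod_score(1, 10, 0.1)
# With no recombinants in 10 meioses, the score at theta = 0 just exceeds 3.
lod_no_recombinants = lod_score(0, 10, 0.0)
```

Ten informative meioses without a recombinant are thus the minimum needed to reach the traditional threshold of +3 in this simplified setting.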

Loss of heterozygosity (LOH): Loss of heterozygosity refers to genomic deletions (or epigenetic events) that eliminate the normal (wildtype) copies of tumor suppressor genes in a person heterozygous for a mutation and an existing mutation is therefore uncovered due to hemizygosity.

Major gene: A gene whose variant(s) confer a high lifetime risk of a disease. The penetrance of a major gene might be conditional on the presence of the relevant variant of a modifier gene. All high-penetrance cancer predisposition genes (BRCA1, BRCA2, TP53, APC, MSH2, MLH1, PTEN, CDKN2A etc) are major genes. For an example, see Narod, 2002.

Manifesting heterozygotes: A heterozygote for a recessive autosomal gene mutation or a female heterozygote for a recessive sex-linked gene mutation who has the same phenotype as homozygotes for the same mutation. Manifesting heterozygotes usually have a milder form of the phenotype and may only have biochemical signs without clinical phenotype. This situation is an exception rather than a rule but occurs in a proportion of heterozygotes for most major autosomal recessive disease genes: CYP21A2 (Witchel, 1997), HFE (Bulaj, 1996; Burt, 1998), CFTR (Super, 1999), ATM (Fearon, 1997; Scott, 2002) and McArdle disease (Manfredi, 1993) are among the examples. See Medline, OMIM and Google searches for manifesting heterozygotes; see also Clinical Genetics.

Marker frequency: In model-free analysis of an HLA association study, the use of allele frequencies is not favored and it is recommended that marker frequencies (frequency per number of subjects corresponding to the dominant model) be used in comparisons (Svejgaard, 1994; Sasieni, 1997). The use of allele frequencies is appropriate when a multiplicative model is hypothesized (and when the locus is in Hardy-Weinberg equilibrium). See also Lewis, 2002.

Markov Chain Monte Carlo (MCMC) strategy: A randomized computational approach for identifying the most likely among many possible models. For MCMC applications in biostatistics, see Gelman, 1996; MCMC algorithm has been used in segregation and linkage analysis (Heath, 1997), analysis of association with polymorphic loci (Sham, 1995; CLUMP), LD estimation (Ayres, 2001), haplotype construction (Stephens, 2001; PHASE), and multilocus association analysis (Kilpikari, 2003; BAMA). WinBUGS is freely available software for MCMC applications.

Mendelian gene: A gene with a strong effect on phenotype, giving rise to a (near) one-to-one correspondence between genotype and phenotype. Phenotypes caused by such a gene are called Mendelian traits or Mendelian (single-gene) conditions.

Mendelian randomization: A natural randomization process that occurs at conception to determine a person's genotype. It is possible to use 'Mendelian randomization' to derive an estimate of the association that is free of the confounding and reverse causation typical of classical epidemiology. According to the second law of Mendel (random assignment of genes), the inheritance of one trait is independent of the inheritance of other traits. The distribution of genetic polymorphisms is largely unrelated to the confounders (socioeconomic or behavioral) that distort interpretations of observational epidemiological studies. The basis of Mendelian randomization is best seen in parent-offspring designs that study the way phenotype and alleles co-segregate during transmission from parents to offspring. This study design is closely analogous to that of randomized clinical trials as by Mendelian principles there should be an equal probability of either allele being randomly transmitted to the offspring. Due to Mendelian randomization, genetic association studies are less prone to confounding than conventional risk-factor epidemiology (pleiotropy and linkage disequilibrium can still produce confounding; see Lee & Ho, 2003). The Mendelian randomization concept can be used as a tool for epidemiological inference on environmental risk factors by examining the genetic counterpart of a suspected environmental exposure association free of confounding by conventional confounders (Davey-Smith & Ebrahim, 2003; Khoury, 2004; Davey Smith, 2005; Lawlor, 2007; Sheehan, 2008; Ogbuanu, 2009; Bochud & Rousson, 2010). See also a commentary on Mendelian Randomization by F Cambien.

Meta-analysis: A systematic approach yielding an overall answer by analyzing a set of studies that address a related question. This approach is best suited to questions that remain unanswered after a series of studies. Meta-analysis provides a weighted average of the measure of effect (such as odds ratio). The rationale is to increase the power by analyzing the sets of data together. The selection of studies to include in a meta-analysis is the main problem with this approach. Funnel Plot is an informal method to assess the effect of publication bias in this context. See also Introduction to Meta-Analysis by the Cochrane Collaboration; Meta-Analysis by Genstat; Meta-Analysis in Epidemiology by Stroup et al (2000); Methods for Meta-Analysis in Medical Research by AJ Sutton; Introduction to Meta-Analysis by Borenstein et al (2009), and Online Meta-Analysis Tests. See also a comparative study of meta-analysis and consortium studies in genetic associations (Janssens, 2009).
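
The weighted average mentioned above can be sketched with the fixed-effect inverse-variance method, one of several possible weighting schemes (random-effects models are often preferred when studies are heterogeneous). Odds ratios are pooled on the log scale, with each study weighted by the inverse of the squared standard error of its log odds ratio. The function name and the study results are hypothetical.

```python
import math


def pooled_odds_ratio(odds_ratios, log_or_standard_errors):
    """Fixed-effect inverse-variance pooled odds ratio with a 95% CI."""
    weights = [1 / se ** 2 for se in log_or_standard_errors]
    log_ors = [math.log(odds_ratio) for odds_ratio in odds_ratios]
    pooled_log = sum(w * x for w, x in zip(weights, log_ors)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    ci = (
        math.exp(pooled_log - 1.96 * pooled_se),
        math.exp(pooled_log + 1.96 * pooled_se),
    )
    return math.exp(pooled_log), ci


# Hypothetical odds ratios and standard errors of log(OR) from three studies.
study_ors = [1.2, 1.5, 1.1]
study_ses = [0.10, 0.25, 0.15]
pooled_or, pooled_ci = pooled_odds_ratio(study_ors, study_ses)
```

The pooled estimate lies closest to the studies with the smallest standard errors, and its confidence interval is narrower than that of any single study, which is the gain in power referred to above.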

microRNAs (miRNAs): Small non-protein-coding RNA molecules that regulate gene expression at the translational level. Changes in their expression levels and activity that may be due to polymorphisms are frequently found in disease states (see Kulkarni, 2011 for an example and Mishra, 2008; Chen, 2008 for reviews). The microRNA database miRBase lists known human miRNAs. See also Bioinformatics Tools for additional miRNA analysis tools.

Microsatellite: A DNA variant due to tandem repetition of a short DNA sequence (usually two to four nucleotides). Also called short tandem repeat (STR). As multiallelic markers, they provide higher polymorphism information content (PIC) than SNPs (see Schaid, 2004 for a comparative study). Average length of LD with microsatellites is 100kb, which is considerably higher than for SNPs (Bahram & Inoko, 2007). It is therefore possible to do a whole genome association study using 30K microsatellites (Tamiya, 2005).

Microsatellite instability (MSI): Changes in a microsatellite size in the tumor tissue compared with normal (germline) tissue. It is an indication of DNA mismatch repair defect and commonly seen in cancer.

Migrant studies: Studies on migrants based on the assumption that in migrants genetic components remain the same but environment has changed. If the rates of disease among migrants change in the new environment, this is taken as evidence for environmental influence. Considerations in the interpretation of migrant studies include the following: migrants are a highly selected group (usually younger, healthier and of higher socioeconomic status), age at migration varies (exposure to the relevant environmental factor may have already occurred) and most migrants may retain their lifestyle (environmental) factors. Successful examples of migrant studies are the increased colon cancer incidence in Japanese migrants to the USA (Cancers in Asian-Americans & Pacific Islanders: Migrant Studies; see also Parkin & Khlat, 1996; Kolonel, 2004) and the decreased risk of multiple sclerosis in people who migrate from high to low latitude countries in the first two decades of their lives (Gale & Martyn, 1995).

Misclassification: Errors in the classification of individuals by phenotype, exposures or genotype that can lead to errors in results. The probability of misclassification can be the same across all groups in a study (nondifferential) or vary among groups (differential). One group of major biases. See also Bias and Confounding Lecture Note and Presentation.

Mode of inheritance: The manner in which a particular genetic trait or disorder is passed from one generation to the next. Autosomal dominant or recessive, X-linked dominant or recessive, multifactorial and mitochondrial inheritance are examples. Complex traits encompass modes of inheritance involving more than a single genetic factor, reduced penetrance and variation due to environmental factors.

Modifier genes: Not all genes that influence the appearance of a trait contribute equally to the phenotype: major genes have a large influence, while modifier genes have a more subtle, secondary effect. Modifier genes alter the phenotypes produced by the alleles of other genes. There is no formal distinction between major and modifier genes; there is a continuum between the two and the cut-off is arbitrary. Modifier genes may affect the action of a major gene or the trait independently. See Narod, 2002 for modifier genes in BRCA1/BRCA2 carriers; Dipple, 2000 for modifier genes in simple Mendelian disorders.

Multifactor dimensionality reduction (MDR): Algorithms and software for the detection and characterization of epistasis (gene-gene interactions) and plastic reaction norms (gene-environment interactions) in genetic and epidemiological studies of common human diseases developed at the Computational Genetics Laboratory, Dartmouth Medical School, Lebanon, NH, USA (MDR website; MDR software). See also Ritchie, 2001; Hahn, 2003; Moore, 2004. Alternative algorithms to select most interesting subsets of polymorphisms include LOTUS (manual; Nickolov & Milanov, 2007) and PIA (Mechanic, 2008).

Multifactorial inheritance with a threshold: Quite often certain characters have a discontinuous binary distribution, meaning that they are present or not in an individual (cleft palate, pyloric stenosis, diabetes, leukemia, schizophrenia) but they are inherited as if they were multifactorial characters; this is due to a threshold effect that makes them appear as discontinuous. This is called multifactorial inheritance with a threshold. See Understanding the Threshold Model. See also Falconer's multifactorial liability threshold model.

Multiple enhancer variant hypothesis: Multiple variants in linkage disequilibrium found associated with a common trait in GWAS may impact multiple enhancers and cooperatively affect gene expression in the pathogenesis of a common trait. Therefore, a statistically similar SNP set cannot always be reduced to a single SNP representing the set (see Corradin, 2013).

Multiplicative genetic model: In a disease association study, if the risk conferred by an allele is increased r-fold for heterozygotes and r²-fold for homozygotes, this corresponds to the multiplicative genetic risk model (Lewis, 2002). These data should be analyzed using the allele frequencies (Sasieni, 1997). See also Additive genetic model. See MODEL-online tool for genetic association analysis for different models.
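
The difference between the multiplicative model and the additive model (as defined under Additive genetic model) can be made concrete by listing the risk of each genotype relative to non-carriers. The function and argument names below are hypothetical, and the snippet only restates the two definitions in code.

```python
def genotype_relative_risks(r, model):
    """Relative risks for 0, 1 and 2 copies of the risk allele."""
    if model == "multiplicative":
        # r-fold for heterozygotes, r²-fold for homozygotes.
        return (1.0, r, r ** 2)
    if model == "additive":
        # r-fold for heterozygotes, 2r-fold for homozygotes.
        return (1.0, r, 2 * r)
    raise ValueError("model must be 'multiplicative' or 'additive'")


# With r = 3 the two models disagree only for homozygotes (9 versus 6).
multiplicative_risks = genotype_relative_risks(3.0, "multiplicative")
additive_risks = genotype_relative_risks(3.0, "additive")
```

The two models coincide when r = 2, so distinguishing between them in practice depends on the size of r and on having enough homozygotes for the risk allele.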

Multivariable analysis: As opposed to univariable analysis, statistical analysis performed in the presence of more than one explanatory variable to determine the relative contributions of each is (or should be) called multivariable analysis (in practice, however, it is called univariate and multivariate analysis more frequently). It is a method to simultaneously assess contributions of multiple variables or adjust for the effects of known confounders. Multiple linear regression, multiple logistic regression, proportional hazards analysis are examples of multivariable analysis, which has no similarity whatsoever to multivariate analysis (see also Peter TJ, 2009). See a review on Multivariable Methods by MH Katz (and a book on Multivariable Analysis by MH Katz).

Multivariate analysis: Methods to deal with more than one related 'outcome/dependent variable' (like two outcome measures from the same individual) simultaneously with adjustment for multiple variables (covariates). When there is more than one dependent variable, it is inappropriate to do a series of univariate tests. Hotelling's T² test is used when there are two groups (like cases and controls) with multiple dependent measures, and multivariate analysis of variance (MANOVA) is used for more than two groups. Unfortunately, the word 'multivariate' is most frequently used instead of 'multivariable' analysis (which means multiple independent/explanatory variables but one outcome/dependent variable; see also Peter TJ, 2009). See multivariate analysis book, notes, lecture notes, slide presentation, glossary and MultiVariate Statistical Package-MVSP.

Multivariate analysis of variance (MANOVA): An extension of Hotelling's T² test to more than two groups with related multiple outcome measures. Groups are compared on all variables simultaneously (rather than one-by-one as ANOVA does).

Mutation: Any heritable change (not only point mutation) brought about by an alteration in the genetic material. Includes gene conversion, deletion, duplication, insertion and so forth. Mutation is preferred to polymorphism to describe a disease causing gene variation regardless of its frequency. Link to Human Gene Mutation Database (Cardiff, UK).

Nomenclature (reports): Any report of a human genetic study should conform to the requirements of HUGO Gene Nomenclature Committee - Guidelines and HGVS - Nomenclature for the Description of Sequence Variations (mirror; based on den Dunnen & Antonarakis, 2000; see also Wildeman, 2008 and Taschner & den Dunnen, 2011). The current NCBI policy on disease names is not to use (‘s) in them: see OMIM entries for Alzheimer Disease, Down Syndrome, Crohn disease and Hodgkin Lymphoma, for example.

Non-coding RNA (ncRNA): RNA species that do not code for a peptide. MicroRNAs (miRNA) and long intergenic non-coding RNAs (lincRNAs) belong to this class of RNAs. The number of non-coding RNA genes in the human genome is approaching the number of protein-coding genes (see the latest statistics in GENCODE, HGNC, VEGA and NCBI). For more information, see Ensembl ncRNA page.

Non-mendelian gene: A gene with some but not a strong effect on phenotype, giving rise to significant overlap of genotype distributions and lack of one-to-one correspondence between genotype and phenotype.

Null markers: Genetic loci that are not associated with any trait in any population, and are not in LD with any marker that shows associations with any trait. These markers are used in estimation of the inflation factor (λ) in the genomic control method to correct for population structure.
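
A commonly used estimator of the inflation factor divides the median of the 1-df chi-square statistics observed at the null markers by the median of the chi-square distribution with 1 degree of freedom (about 0.455). The sketch below assumes that estimator; the function names are hypothetical and the input statistics are made up for illustration.

```python
from statistics import median

# Approximate median of the chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549


def inflation_factor(null_chi2):
    """Median-based estimate of the inflation factor from null-marker statistics."""
    return median(null_chi2) / CHI2_1DF_MEDIAN


def corrected_statistic(chi2, lam):
    """Deflate a test statistic; values of lam below 1 are treated as 1."""
    return chi2 / max(lam, 1.0)


null_stats = [0.2, 0.9098, 3.0]        # made-up chi-square values at null markers
lam = inflation_factor(null_stats)      # about 2.0 for these values
print(lam, corrected_statistic(8.0, lam))
```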

Observational study: An epidemiologic study design in which subjects select themselves into groups (such as cases and controls) and the investigator is a passive observer (no intervention). An observational study may be descriptive or analytic. In an observational study, causality of the observed associations cannot be established, and residual confounding cannot be entirely ruled out. Bias may also play a role in generating a spurious correlation or deviating the results. Cohort studies are the highest-ranking observational study type in the hierarchy of causality pyramid (but still below interventional studies).

Odds ratio (OR): Also known as relative odds and approximate relative risk. It is the ratio of the odds of the risk factor in a diseased group and in a non-diseased (control) group (the ratio of the frequency of presence / absence of the marker in cases to the frequency of presence / absence of the marker in controls). The interpretation of the OR is that the risk factor increases the odds of the disease ‘OR’ times. OR is used in retrospective case-control studies (relative risk (RR) is the ratio of proportions in two groups which can be estimated in a prospective -cohort- study). These two and relative hazard (or hazard ratio) are measures of the strength/magnitude of an association. As opposed to the P value, these do not change with the sample size. OR and RR are considered interchangeable when certain assumptions are met, especially for large samples and rare diseases. Odds ratio is calculated as ad/bc where a,b,c,d are the entries in a 2x2 contingency table (hence the alternative definition as the cross-product ratio). In logistic regression, the coefficient b corresponds to the natural logarithm (log_e) of the odds ratio. There are statistical methods to test the homogeneity of odds ratios (Online Odds-Ratio Calculation with 95% CI; Odds Ratio-Relative Risk Calculation). See Grimes & Schulz: Making Sense of Odds and Odds Ratio. Obstet & Gynecol, 2008 for more.
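
The cross-product calculation can be sketched in a few lines of Python. The example below also adds a 95% confidence interval using the standard error of the log odds ratio (Woolf's logit method), which is an assumption of this sketch rather than something prescribed above. The cell layout (a = cases with the marker, b = cases without, c = controls with, d = controls without) and the counts are illustrative.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio ad/bc with an approximate 95% CI (Woolf's logit method).

    All four cells must be non-zero for this simple version.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts: 20/80 marker-positive/negative cases, 10/90 controls
print(odds_ratio_ci(20, 80, 10, 90))
```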

Overlapping genes: Genes that are encoded on the sense and anti-sense strands of the same chromosome region in opposite direction (for an example, see CYP21A2 and TNXB; Tee, 1995). Overlapping genes are frequent in viruses and plasmids / phages, which need to pack a lot of information in a small, compact genome (HIV is an example). Degeneracy of the genetic code facilitates the presence of overlapping genes. It has been suggested that overlapping gene groups are more likely to be disease-associated (Karlin, 2002).

Overmatching bias: When cases and controls are matched by a non-confounding variable that is associated with the exposure but not with the disease, this is called overmatching. Overmatching can lead to underestimation of an association. For a numerical example, see slides 41-49 in the Case-Control Studies presentation by Chen. See also Bland & Altman, 1994 and Sorensen & Gillman, 1995. Matching should only be considered for confounding variables but such known confounding can be controlled at the analysis phase in an unmatched design.

Paralog sequence variants (PSV): Sequence variants present on paralog copies of genes. These variants are a source of difficulty for genotyping design if known and, when unknown, a cause of genotyping errors (and extreme departures from Hardy-Weinberg equilibrium).

Penetrance: The proportion of individuals with a given genotype (heterozygotes for a dominant gene) who express an expected trait, even if mildly. If a disease gene is not causing the disease in all its carriers, its penetrance is low [not to be mixed with variable expression]. BRCA1 mutations show both age-dependent penetrance and overall reduced penetrance, the lifetime risk for a female mutation carrier being estimated at around 70%. Breast cancer is also an example of an autosomal condition where penetrance is sex-dependent. While male mutation carriers can develop breast cancer (particularly with BRCA2 mutations), females are at much greater risk. HFE has a very low penetrance, which is age and sex-dependent.

Permutation test: A statistical approach to examine statistical significance of associations based on Monte Carlo methods that accounts for the multiple comparisons issue (McIntyre, 2000; Becker & Knapp, 2004; Becker, 2005). Haploview and UNPHASED (ref) can perform permutation tests (both require data in linkage format). WHAP and BEAGLE are also freely available software packages that can analyze multiallelic markers.
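
The idea can be sketched for a single marker: shuffle case/control labels many times and count how often the permuted statistic is at least as extreme as the observed one. This is a minimal illustration and not the algorithm of any package named above; the function name, the choice of statistic (absolute difference in carrier frequency) and the data are assumptions. Correcting for multiple markers would typically compare each observed statistic with the permutation distribution of the maximum statistic across markers, which is not shown here.

```python
import random


def permutation_p_value(cases, controls, n_perm=999, seed=0):
    """Permutation p-value for a difference in carrier frequency.

    cases and controls are lists of 0/1 carrier indicators.
    """
    rng = random.Random(seed)

    def freq_diff(x, y):
        return abs(sum(x) / len(x) - sum(y) / len(y))

    observed = freq_diff(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of case/control labels
        if freq_diff(pooled[:n_cases], pooled[n_cases:]) >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (extreme + 1) / (n_perm + 1)


print(permutation_p_value([1] * 20, [0] * 20))
```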

PHASE: Software for haplotype reconstruction from multilocus population data that employs a Markov Chain Monte Carlo algorithm based on the coalescent model (Stephens, 2001). The newer version fastPHASE is for faster haplotype reconstruction and estimation of missing genotypes from population data (Scheet & Stephens, 2006). See PHASE or fastPHASE Download and PHASE Documentation. See also UNPHASED (manual; ref).

Phenocopy: A non-genetic condition resembling a genetically determined one. Such conditions confound the interpretation of pedigrees and therefore genetic counseling. Some teratogens may cause congenital anomalies mimicking genetically caused anomalies (thalidomide syndrome vs phocomelia). Deafness is another example of phenocopy which may be genetic (autosomal or sex-linked) or non-genetic (rubella embryopathy).

Phenome-wide association study (PheWAS): A method (which usually relies on electronic medical records) that investigates associations of single genetic markers with any number of phenotypes. See PheWas Catalog (incl. HLA associations); Verma, 2018 for a large PheWAS; UK Biobank PheWAS at PheWeb.

Phenotype: The visible or measurable (i.e., expressed) characteristics of an organism.

Pleiotropy: The potential for genotypes to have more than one specific phenotypic effect.

Polymorphism: The existence of two or more variants at a locus. Conventionally, the prevalence in the population should be above 1% to be referred to as a polymorphism; if prevalence is below this, variants are referred to as mutations (especially if they are disease-causing ones). Because of the confusion between polymorphism and mutation, the Human Genome Variation Society recommends the use of 'sequence variant', 'alteration' or 'allelic variant' for any genomic change regardless of their frequency or phenotypic effects. Polymorphism at a genetic locus is due to either balanced polymorphism (heterozygous advantage, frequency-dependent selection) or unequilibrium states (temporary polymorphism) as occurs during frequency-dependent selection and genetic drift (alleles becoming fixed or extinct).

Polymorphism information content (PIC): An index of informativeness of a genetic marker which takes into account the number of alleles and their frequencies (Botstein, 1980; Guo & Elston, 1999). For details, see a Lecture Note.
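
As an illustration of how the index combines the number of alleles and their frequencies, the sketch below uses the formula usually attributed to Botstein et al. (1980): PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2. The function name and the example frequencies are assumptions.

```python
from itertools import combinations


def polymorphism_information_content(freqs):
    """PIC from a list of allele frequencies that sum to 1."""
    if abs(sum(freqs) - 1.0) > 1e-8:
        raise ValueError("allele frequencies must sum to 1")
    sum_sq = sum(p * p for p in freqs)
    # Correction for matings that are uninformative despite heterozygosity
    cross = sum(2 * p * p * q * q for p, q in combinations(freqs, 2))
    return 1.0 - sum_sq - cross


print(polymorphism_information_content([0.5, 0.5]))            # biallelic marker
print(polymorphism_information_content([0.25, 0.25, 0.25, 0.25]))  # four equifrequent alleles
```

With equal frequencies, a marker with more alleles gets a higher PIC, which matches the intuition that multiallelic markers are more informative.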

Population genetics: The branch of genetics that deals with frequencies of alleles and genotypes in breeding populations. It also deals with selective influences on the genetic composition of the population (links to freeware population genetic data analysis software: Arlequin 2000, PopGene, GDA, Genetix, Tools for Population Genetic Analysis, GenePop, GeneStrut, SGS, GenAlEx; WinPop, Quanto, features of data analysis software; lectures on population genetics). See also Basic Population Genetics.

Population stratification: An example of 'confounding by ethnicity' in which the co-existence of different disease rates and allele frequencies within population sub-sections leads to a spurious association at the population level. Differing allele frequencies in ethnically different strata in a single population may lead to a spurious association or 'mask' an association by artificially modifying allele frequencies in cases and controls when there is no real association (for this to happen, the subpopulations should differ not only in allele frequencies but also in baseline risk of the disease being studied) (Mark, 1996; Altshuler, 1998). Confounding, cryptic relatedness (which increases overdispersion of the test statistics and leads to inflation of significance levels overall) and selection bias are potential consequences of population stratification (Thomas, 2005). It is notable that the consequences of population structure on association outcomes increase with sample size, i.e., a larger sample size is not a remedy for this issue and may make it worse (Marchini, 2004). Case-control association studies can still be conducted by using genomic controls (Devlin, 1999; Pritchard, 1999) even when population stratification is present. The software STRUCTURE and STRAT, ADMIXMAP or L-POP can be used to analyze case-control data with genomic controls. See Cardon & Palmer, 2003 for an example of spurious association due to population stratification; a presentation by David Clayton on Confounding by Stratification and Admixture. See presentations on Genetic Epidemiology and Pitfalls in Genetic Association Studies.

Predisposition gene: A gene that is necessary and sufficient to cause a disease. This is different from a 'susceptibility gene' (neither necessary nor sufficient for disease development).

Principal component analysis (PCA): In genetic epidemiology, PCA is used to detect population stratification in genome-wide association studies (Price, 2006). It is implemented in a software program called EIGENSOFT.

Protein-coding genes: Genes that code for peptides. The current human protein-coding gene number is around 20,000. See the latest statistics in GENCODE, HGNC, VEGA and NCBI. As of 2015, there are already more non-coding RNA genes than protein-coding genes in the human genome.

Proteomics: Proteomics is the study of proteins in aggregate. It applies to the translation from the mRNA to the primary protein products, and their maturation and modification to yield active proteins as components of a cell, tissue or organism. The collection of proteins in a given cell at a given stage of differentiation is called proteome. See the websites for the Human Proteome Organization (HUPO) and Human Proteome Project (HPP). See also a 'Review of Proteomics with Applications to Genetic Epidemiology' (Sellers & Yates, 2003) and Ahsan & Rundle, 2003.

Pseudoautosomal inheritance: The X and Y chromosomes share a common ancestor. There is a part of the X chromosome that has its homologous counterpart on the Y chromosome. The pattern of inheritance for a gene located on both the X and Y chromosomes may appear to be autosomal. The genes in these segments escape X inactivation. The major pseudoautosomal region (PAR1) at the tip of the short arms has very high recombination frequency (the sex-averaged recombination frequency is 28% which, for a region of only 2.6 Mb, is approximately 10 times the normal recombination frequency). The high figure is due to the obligatory crossover in male meiosis resulting in a crossover frequency approaching 50%. The minor pseudoautosomal region (PAR2) extends over 320 kb at the extreme tips of the long arms of the X and Y; crossover between the X and Y in this region is not so frequent. The part of the Y-chromosome between the two PARs is called the nonrecombining portion of the Y chromosome (NRY) which is exclusive to Y-chromosome. The genes within PARs of the X and Y chromosomes have a unique segregation pattern in that affected sibs will tend to be of the same sex. See Flaquer, 2008; Evolution of Sex Chromosomes in Human Molecular Genetics and Map Viewer: Y-chromosome.

Pseudo-SNP: Ectopic sequence variants (ESVs) and paralogous sequence variants (PSVs) (Estivill, 2002; Cheung, 2003). Pseudo-SNPs are one reason for genotyping errors and the main non-biological reason for violation of HWE (Leal, 2005).

Publication bias: Editors and authors tend to publish articles containing positive findings as opposed to negative result papers. This results in a belief that there is a consistent association while this may not be the case. Plots of relative risks by study may be used to check publication bias in meta-analyses. If publication bias is operating, one would expect that, of published studies, the larger ones report the smaller effects, as small positive trials are more likely to be published than negative ones. This can be examined using the funnel plot in which the effect size is plotted against sample size (Sterne & Egger, 2001). If this is done, the plot resembles an inverted funnel, with the results of the smaller studies being more widely scattered than those of the larger studies, as would be expected if there is no publication bias. One consequence of publication bias is that the first report of a given association may suffer from an inflated effect size (Ioannidis, 2001). See Publication Bias in Cochrane Collaboration.

Quantile-quantile plot (Q-Q plot): In a GWAS, the Q-Q plot is used to assess the number and magnitude of observed associations compared with the expectations under no association. The nature of deviations from the identity line provides clues whether the observed associations are true associations or may be due to systematic errors such as population stratification or cryptic relatedness. See WTCCC GWAS (PDF); Pearson, 2008 (Figure 1); McCarthy, 2008.
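
The coordinates of such a plot are usually computed on the -log10 scale, pairing each ordered observed P value with its expected quantile under a uniform distribution. The sketch below assumes the plotting position i/(n+1) for the i-th smallest P value (other conventions exist); the function name is hypothetical and plotting itself is left out.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 P values, in ascending order."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    # Expected quantiles under no association: i/(n+1) for i = 1..n
    expected = sorted(-math.log10(i / (n + 1)) for i in range(1, n + 1))
    return list(zip(expected, observed))


# P values that match the null expectation exactly fall on the identity line
for exp, obs in qq_points([0.25, 0.5, 0.75]):
    print(round(exp, 3), round(obs, 3))
```

Points lying above the identity line across the whole range suggest a systematic inflation, whereas a departure confined to the upper tail is more consistent with a limited number of true associations.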

Quantitative character: A character displaying a 'continuous' phenotypic range rather than discrete classes; characters measured rather than counted such as metabolic activity, height, length, width, arm span, body fat content, growth rate, milk production, blood pressure. The genetic variation underlying a continuous character distribution may be the result of segregation at a single genetic locus or more frequently, at numerous interacting loci which produce a cumulative effect on the phenotype (with contributions from the environment). A gene affecting a quantitative character is a quantitative trait locus, or QTL (should be seen as a continuous trait locus). See also Introduction to Genetic Epidemiology.

Quantitative genetics: The statistical study of the genetics of quantitative characters (biometrical genetics) as opposed to Mendelian (discrete) characters. Quantitative genetic characters are those that do not assort in a simple way in crosses. Examples include physiological activity, behavior, size and height. A major task of quantitative genetics is to determine the ways in which genes (QTL) interact with the environment to contribute to the formation of a given quantitative trait distribution (and the estimation of genetic and environmental variance). For a review, see Posthuma, 2003. See GusevLab for the latest research in quantitative genetics.

Quasi-dominance: Direct transmission, generation to generation, of a recessive trait giving the impression of dominance. It happens if the recessive gene is frequent or inbreeding is intense.

Quasi-linkage: The non-random segregation of non-homologous chromosomes, which can be a confounding factor in linkage studies of complex traits. This phenomenon results in significant linkage finding between unlinked markers. See a review by Sivagnanasundaram, 2004.

R: A language and environment for statistical computing and graphics. R is an open platform and offers thousands of programs (called libraries) to achieve a wide variety of statistical, graphical, biological tasks. It has a large number of (statistical) genetics- and phylogenetics-related libraries. To learn how to use R (not how to program R!), see this self-paced R course and the links within.

Random sampling: A method of selecting a sample from a target population or study base using simple or systematic random methods. In random sampling, each subject in the target population has equal chance of being selected to the sample. Sampling is a crucially important point in selection of controls for a case-control study. By randomization, systematic effects are turned into error (term), and there is an expected balancing out effect: known and unknown factors that might influence the outcome are assigned equally to the comparison groups. One disadvantage of randomization is generation of a potentially large error term. This can be avoided by using a block design. See Basic Concepts of Sampling and Wikipedia: Random Sampling.

Randomization: Randomization of the study population in groups, differing only for the factor of interest leads to a random distribution of known and unknown confounders in the different groups, therefore removing potential bias that might result in a spurious finding. See also Bias and Confounding Lecture Note and Presentation.

Recall bias: Bias in results due to systematic differences in the accuracy or completeness of recall of past exposures or family history. One group of major biases. See also Bias and Confounding Lecture Note and Presentation.

Receiver operating characteristics (ROC) curve analysis: Also called discrimination statistics; used in assessing the accuracy of diagnostic tests and the utility of predictive tests (but see Pencina, 2008; Pencina, 2012). See ROC in Clinical Research Calculators and Difference Between the Areas Under Two ROC Curves at Vassar, ROC101 by Tom Fawcett. See reviews on ROC analysis: Hanley & McNeil, 1982; Bewick, 2004; Obuchowski, 2005; Cook NR, 2007; Steyerberg, 2012. See also the Supplementary Data File for Mamtani, 2006 for the use of Stata in ROC analysis.

Recessive: A trait that is not expressed in heterozygotes (i.e., that can only be expressed in the homozygotes). Most common recessive disease genes are those encoding metabolic enzymes (Jimenez-Sanchez, 2001). See Clinical Genetics.

Recessive model: A genetic association analysis mode that examines association with a recessive allele. The comparison groups are variant homozygous genotypes vs the rest (combining heterozygotes for the variant and homozygotes for the wild-type allele). See MODEL-online tool for genetic association analysis for different models. See also Lewis, 2002; Minelli, 2005.
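
The grouping described above amounts to collapsing a 2x3 genotype table into a 2x2 table. A minimal sketch follows; the function name, the ordering of genotype counts and the counts themselves are assumptions made for illustration.

```python
def recessive_2x2(cases, controls):
    """Collapse genotype counts into a 2x2 table for the recessive model.

    Each argument is (wild-type homozygotes, heterozygotes, variant homozygotes).
    Returns (a, b, c, d): variant homozygotes vs the rest, in cases then controls.
    """
    a = cases[2]
    b = cases[0] + cases[1]
    c = controls[2]
    d = controls[0] + controls[1]
    return a, b, c, d


# Made-up genotype counts
a, b, c, d = recessive_2x2(cases=(50, 35, 15), controls=(55, 40, 5))
print((a, b, c, d), (a * d) / (b * c))  # 2x2 table and its odds ratio
```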

Relative recurrence risk (RRR): A measure of familial aggregation for a disease. This is the probability that a particular type of relative (sibling, cousin etc) of a proband is affected, divided by the prevalence of the disease in general population. These are quantities denoted by λ_R, where R denotes a relationship (S=sib, O=offspring, DZ= dizygotic twin, etc), and whose values are the risks of relatives of type R of affected individuals being themselves affected, divided by the population prevalence. In general, the risk of recurrence in first-degree relatives equals the square root of the incidence of the disease in general population (P^(1/2); where P = incidence in general population). For second and third degree relatives, corresponding figures are P^(3/4) and P^(7/8), respectively. See Genetic Epidemiology Lecture Note and Presentation.
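
The two calculations in this entry can be written out directly. The sketch below computes the recurrence risk ratio as risk in relatives divided by population prevalence, and the approximate recurrence risks from the power rule quoted above; function names and input values are hypothetical.

```python
def recurrence_risk_ratio(risk_in_relatives, population_prevalence):
    """Risk in relatives of type R divided by the population prevalence."""
    return risk_in_relatives / population_prevalence


def approximate_recurrence_risk(incidence, degree):
    """Approximate recurrence risk for first-, second- or third-degree relatives."""
    exponents = {1: 1 / 2, 2: 3 / 4, 3: 7 / 8}
    return incidence ** exponents[degree]


p = 0.01                                      # made-up population incidence
risk_first = approximate_recurrence_risk(p, 1)  # square root of the incidence
print(risk_first, recurrence_risk_ratio(risk_first, p))
```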

Relative risk (RR): The ratio of the risk of the phenotype among individuals with a particular exposure, genotype or haplotype to the risk among those without that exposure, genotype or haplotype. Also known as risk ratio. RR is measured in cohort studies (and its counterpart is the odds ratio in case-control studies).

Residual confounding: Confounding within stratum. If stratification is used to control confounding but the strata are broad (like a broad age range), there may still be residual confounding within stratum. Residual confounding is also used to describe confounding from factors that are not controlled at all or controlled but inaccurately measured.

Reverse causation: The possibility that an observed association may actually reflect the relationship in the opposite direction. Childhood infections are believed to reduce the risk for asthma but 'reverse causation' meaning that asthma may cause increased risk for infections to result in the observed association is a distinct possibility (see Pekkanen, 2004). Increased cancer risk associated with low lipid levels (Davey-Smith & Ebrahim, 2003) and the relationship between sleeping less and obesity may be examples of reverse causation. For a discussion of reverse causation, see Dowd & Town: Does X Really Cause Y.

R project for statistical computing: R is a language and environment for statistical computing and graphics which can be seen as a different implementation of the S language. R and a comprehensive set of programs written for a variety of statistical analysis are all available as Free Software. See the R Project Website & List of Contributed R Packages (including gap, genetics, popgen, qgen, GenABEL, SNPassoc).

Sampling: In genetic epidemiologic research, sampling is an important design consideration. The sampling unit, the sampling method, and the sample size are all critical. For example, sampling larger sibships yields more power per sampled subject than sampling independent sibpairs (Todorov, 1997). Use of extremely discordant (ED) and/or extremely discordant and concordant (EDAC) sibpairs increases the power (Gu, 1996; Gu, 1997).

Selection bias: A bias in results due to systematic differences between those who are selected for study and those who are not selected. See Bias and Confounding Lecture Note and Presentation.

Short tandem repeat (STR): See microsatellite.

Sibling recurrence risk (sibling risk ratio): The disease risk for a sibling of an affected individual compared to the disease risk in the general population. See relative recurrence risk.

Signal-to-noise ratio: In an association study of a complex disease, detection of a significant signal from a single locus is diminished due to genetic heterogeneity. This is a major problem in outbred populations. Isolated populations such as Finland, Iceland and Newfoundland with relative genetic and environmental homogeneity offer better opportunities to detect modest signals because of the lack of too much noise.

Single nucleotide polymorphism (SNP): A single nucleotide variation in the DNA code. It is the most common type of stable genetic variation and usually bi-allelic. SNPs may be silent -no change in phenotype- (sSNP), may cause a change in phenotype (cSNP) or may be in a regulatory region (rSNP) with potential to change phenotype. Thus the effects of SNPs, if any, are generally on gene expression or protein structure (Williams, 2007). Functional changes that may be caused by SNPs are gene transcription changes (promoter and intronic enhancer SNPs), truncated protein (nonsense coding region SNPs), structural changes (coding region SNPs), alternative splicing (intronic splice site SNPs), and mRNA stability changes (3’UTR SNPs). Synonymous SNPs are the most common ones. These are in non-coding regions and used as genetic markers. On average, each 1 kb of human genome contains 2-10 SNPs, i.e., one in every 100-500 nucleotides is polymorphic; most frequently a C to T (C>T) substitution (links to an Overview, SNP Consortium Website, dbSNP, SNP500 Cancer, SNPator, SNPedia, MedRefSNP, SNP Control, HapMap-B36 (Rel28), Ensembl (tutorial), GeneSNPs, MIT SNP DBase, Seattle SNPs, Regulatory-rSNP Guide, F-SNP). See also Bioinformatics Tools and GENEPI TOOLBOX. For simple SNP data analysis online, see SNPStats & HWA.

Single nucleotide variation (SNV): Same as SNP except that this variant does not exist in a population but in an individual (a private mutation).

SNP@Ethnos: A database of ethnically variant SNPs (Park, 2007). For ethnic distribution of HLA alleles, see IMGT/HLA Allele Ethnicity Tool.

Software: Software development for genetic epidemiology studies has gained momentum in recent years. For an up-to-date list of software, see Genetic Epidemiology.

Splice site variant: A DNA sequence variant that alters the sequence of a splice site. An "essential splice site" variant is a splice donor variant within the 2bp region at the 5' end of an intron; a "splice site" variant is a variant within 1-3bp of the exon or 3-8bp of an intron.

Stata: A powerful statistical package particularly useful for epidemiologic and longitudinal data management and analysis. It is mainly a command driven program produced by Stata Corporation. See the list of Stata capabilities, Stata starter kit with learning modules by UCLA.

Statistical power: The probability that a test will produce a significant difference at a given significance level is called the power of the test. This is equal to the probability of rejecting the null hypothesis when it is untrue, i.e., making the correct decision. It is 1 minus the probability of a type II error. The true differences between the populations compared (effect size), the sample size and the significance level chosen affect the power of a statistical test. Ideally, power should be at least 0.80 to detect a reasonable departure from the null hypothesis. See a discussion of statistical power; and online calculators: General Statistical Calculators Including a Power Calculator (UCLA); Statistical Power Calculator for Frequencies; Retrospective Power Calculation; Genetic Power Calculator; Wise Project Applets: Power Applet; Downloadable calculators: CaTS (Skol, 2006), Quanto (sample size or power calculation for association studies of genes, gene-environment or gene-gene interactions); Calculation of Power for Genetic Association Studies 'AssocPow' (Ambrosius, 2004), PS: Power and Sample Size Calculation; and Power & Sample Size Calculations on STATA.
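
To show how effect size, sample size and significance level enter the calculation, the sketch below approximates the power of a two-sided comparison of two proportions (for example, carrier frequencies in cases and controls) with equal group sizes. It uses the usual normal approximation and ignores the negligible contribution of the opposite tail; it is an illustration and not the method of any calculator listed above. The function name and the input values are assumptions.

```python
import math
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power to detect p1 vs p2 with n subjects in each group."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error under the null (pooled) and under the alternative
    se_null = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = math.sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    return std_normal.cdf((abs(p1 - p2) - z_alpha * se_null) / se_alt)


for n in (50, 100, 200):
    print(n, round(power_two_proportions(0.3, 0.5, n), 3))
```

Under these assumptions, roughly 100 subjects per group give a power just above the conventional 0.80 for frequencies of 0.3 versus 0.5.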

STROBE (STrengthening the Reporting of OBservational studies in Epidemiology): An international collaboration integrating epidemiology, statistics and other relevant disciplines to strengthen the reporting of observational studies in epidemiology. See the checklists for different types of epidemiologic studies.

Susceptibility gene: A gene that is neither necessary nor sufficient to cause a disease but increases the risk of its development. These low-penetrance genes would be detected by association studies but would show no evidence for linkage with the disease. See Greenberg, 1993; Greenberg & Doneshka, 1996. (Weakly penetrant predisposing genes may act as a susceptibility gene.)

T² test for genome association: Instead of examining the association of a single marker in a population-based case-control study of a complex disease, this test measures the strength of cumulative association of multiple markers. First described by Xiong et al (2002) and then extended to haplotype blocks by Fan & Knapp (2003).

Transmission disequilibrium test (TDT): A family-based study to compare the proportion of alleles transmitted (or inherited) from a heterozygous parent to a disease-affected child. Any significant deviation from 0.50 in transmission ratio implies an association (Spielman, 1993 & 1994; Lewis, 2002). See also FBAT software (manual) and SAGE (manual) for family-based association tests.
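
The comparison is commonly carried out as a McNemar-type chi-square test on transmissions from heterozygous parents: with b transmissions and c non-transmissions of the allele of interest, the statistic is (b - c)^2 / (b + c) with 1 degree of freedom. A minimal sketch, with a hypothetical function name and made-up counts:

```python
import math


def tdt(transmitted, untransmitted):
    """Transmission ratio, chi-square statistic (1 df) and P value for the TDT."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    ratio = transmitted / total
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return ratio, chi2, p_value


# Made-up counts: allele transmitted 60 times and not transmitted 40 times
print(tdt(60, 40))
```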

Variant: Because of the ambiguity in the definitions of mutation and polymorphism, any genetic change is called a sequence variation and such alleles are called variant (see Nomenclature for the Description of Sequence Variations and Cotton, 2001). See also polymorphism.

Variant call format (VCF): A format for files that contain information on genetic variants, created for the 1KG project (see Danecek, 2011). Usually, the first five columns are needed and these are (1) chromosome number, (2) chromosomal position in the right assembly, (3) rsID for the variant, (4) reference allele (REF), and (5) alternative (ALT) allele. Most databases provide data in this format and most bioinformatics suites accept the input file in this format (see for example, CADD).
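
As an illustration of the five columns listed above, the sketch below reads them from one tab-delimited VCF data line and skips header lines, which start with '#'. The class and function names are hypothetical, and the example record (including its rsID) is made up.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: str


def parse_vcf_line(line: str) -> Optional[Variant]:
    """Return the first five VCF columns, or None for header/comment lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("a VCF data line needs at least five columns")
    chrom, pos, variant_id, ref, alt = fields[:5]
    return Variant(chrom, int(pos), variant_id, ref, alt)


# Made-up record for illustration only
print(parse_vcf_line("1\t12345\trs0000001\tC\tT\t.\tPASS\t."))
```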

Whole genome amplification (WGA): Representational amplification of total genomic DNA to increase the quantity and quality for further studies. WGA improves amplification success with degraded DNA (Holbrook, 2005; Ballantyne, 2006). Reliability, robustness and accuracy of WGA methods in general have been shown in genotyping of highly polymorphic loci such as HLA (Gillespie, 2000; Shao, 2004) and SNP genotyping and sequencing (Dean, 2002; Lovmar, 2003; Hosono, 2003; Alsmadi, 2003; Tranah, 2003; Shao, 2004; Yan, 2004; Bannai, 2004; Paez, 2004; Barker, 2004; Holbrook, 2005; Thompson, 2005), and WGA is particularly useful in molecular (childhood cancer) epidemiology studies (Zheng, 2001; Yan, 2004). STR genotyping may require a little more attention (Dickson, 2005; Ballantyne, 2006). WGA may also be used with Illumina Golden-Gate assays (Cunningham, 2008). As long as a minimum of 10 nanograms of genomic DNA is used in WGA, SNP genotyping can be accurately performed on whole genome amplified DNA (Lovmar, 2003; Bergen, 2005a; Bergen, 2005b; Holbrook, 2005) with the possible exception of loci near the end of chromosomes (Tzvetkov, 2005). Commercially available WGA kits include GenomePlex (OmniPlex PCR-based WGA), REPLI-g (multiple displacement amplification) and GenomiPhi (multiple displacement amplification).

Whole transcriptome amplification (WTA): Representational amplification of total transcriptome. See SIGMA: Transplex whole transcriptome amplification (WTA) kit (manual) and QIAGEN: QuantiTect Whole Transcriptome Kit.

Winner’s curse: A type of bias similar to the “regression to the mean” that arises from the actual genetic effect in a replication study being smaller than its estimate from the first study. See Zhong & Prentice, 2010 for winner’s curse in GWAS, and Zollner & Pritchard, 2007 for a correction method for winner’s curse. An important implication of the winner’s curse is that if the sample size of a replication study is chosen on the basis of the odds ratio observed in the first study, then the replication will almost certainly be underpowered (Xiao & Boehnke, 2009). See Kraft, 2008 for winner’s and other curses in genetic epidemiology.

Y-chromosome: The male-specific sex chromosome in humans, which is much smaller than the X-chromosome. 95% of the Y-chromosome is not involved in recombinations with the X-chromosome and is called the male-specific or nonrecombining Y-chromosome. For Y-chromosome polymorphisms, see ISOGG: Y-DNA SNP Index; Nature Web Focus: Y chromosome.

Y-chromosome haplogroups: The tips of the Y-chromosome can recombine with corresponding parts of the X-chromosome (pseudoautosomal regions), but around 95% of the Y-chromosome is not involved in any recombination. This non-recombining region is therefore passed intact from one generation to the next as frozen blocks. These haplotypes are the Y-chromosome haplogroups (lineages) that are used in phylogenetic studies.

Genetic Epidemiology: Basic & Advanced

Glossaries of Genome / Human Genetics Terms

Genetics Glossary Biostatistics Glossary VEGA (Functional Genomics) Glossary

Genome Biology for Genetic Epidemiologists

Mehmet Tevfik DORAK, MD, PhD

Last edited on 8 June 2021
