
Biostatistics: Difference between revisions

m Section heading change: Biostatistics and Genetics → Biostatistics and genetics using a script
Reverting edit(s) by Shivani Saini 06 (talk) to rev. 1240687734 by Wburrow: non-constructive (RW 16.1)
 
(42 intermediate revisions by 32 users not shown)
Line 3:
{{for|the academic journal|Biostatistics (journal)}}
 
'''Biostatistics''' (also known as '''biometry''') is a branch of [[statistics]] that applies statistical methods to a wide range of topics in [[biology]]. It encompasses the design of biological [[experiment]]s, the collection and analysis of data from those experiments and the interpretation of the results.
 
== History ==
=== Biostatistics and genetics ===
 
Biostatistical modeling forms an important part of numerous modern biological theories. [[Genetics]] studies, since their beginning, have used statistical concepts to understand observed experimental results. Some geneticists even contributed to statistical advances by developing methods and tools. [[Gregor Mendel]] started genetics studies by investigating genetic segregation patterns in families of peas and used statistics to explain the collected data. In the early 1900s, after the rediscovery of Mendel's work on Mendelian inheritance, there were gaps in understanding between genetics and evolutionary Darwinism. [[Francis Galton]] tried to expand Mendel's discoveries with human data and proposed a different model, in which fractions of the heredity come from each ancestor and compose an infinite series. He called this the theory of "[[Francis Galton|Law of Ancestral Heredity]]". His ideas were strongly opposed by [[William Bateson]], who followed Mendel's conclusion that genetic inheritance comes exclusively from the parents, half from each of them. This led to a vigorous debate between the biometricians, who supported Galton's ideas, such as [[Raphael Weldon]], [[Arthur Dukinfield Darbishire]] and [[Karl Pearson]], and the Mendelians, who supported Bateson's (and Mendel's) ideas, such as [[Charles Davenport]] and [[Wilhelm Johannsen]]. Later, biometricians could not reproduce Galton's conclusions in different experiments, and Mendel's ideas prevailed. By the 1930s, models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian [[Modern synthesis (20th century)|modern evolutionary synthesis]].
 
Resolving these differences also made it possible to define the concept of population genetics and brought together genetics and evolution. The three leading figures in the establishment of [[population genetics]] and this synthesis all relied on statistics and developed its use in biology.
* [[Ronald Fisher]] worked alongside statistician Betty Allan developing several basic statistical methods in support of his work studying the crop experiments at [[Rothamsted Research]], published in Fisher's books [[Statistical Methods for Research Workers]] (1925) and [[The Genetical Theory of Natural Selection]] (1930), as well as Allan's scientific papers.<ref>{{Cite web |last=Centre for Transformative Innovation |first=Swinburne University of Technology |title=Allan, Frances Elizabeth (Betty) - Person - Encyclopedia of Australian Science and Innovation |url=https://www.eoas.info/biogs/P001468b.htm |access-date=2022-10-26 |website=www.eoas.info |language=en-gb}}</ref> Fisher went on to make many contributions to genetics and statistics. Some of them include the [[ANOVA]], [[p-value]] concepts, [[Ronald Fisher|Fisher's exact test]] and [[Ronald Fisher|Fisher's equation]] for [[population dynamics]]. He is credited with the sentence "Natural selection is a mechanism for generating an exceedingly high degree of improbability".<ref>{{cite journal|last1=Gunter|first1=Chris |title=Quantitative Genetics|journal=Nature|date=10 December 2008|volume=456|issue=7223 |pages=719|doi=10.1038/456719a|pmid=19079046 |bibcode=2008Natur.456..719G|doi-access=free}}</ref>
 
* [[Sewall G. Wright]] developed [[F-statistics|''F''-statistics]] and methods of computing them and defined the [[inbreeding coefficient]].
* [[J. B. S. Haldane]]'s book, ''The Causes of Evolution'', reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics. He also developed the theory of [[primordial soup]].
 
These and other biostatisticians, [[mathematical biology|mathematical biologists]], and statistically inclined geneticists helped bring together [[evolutionary biology]] and [[genetics]] into a consistent, coherent whole that could begin to be [[Statistics|quantitative]]ly modeled.
Line 20 ⟶ 19:
In parallel to this overall development, the pioneering work of [[D'Arcy Thompson]] in ''On Growth and Form'' also helped to add quantitative discipline to biological study.
 
Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not [[qualitative data|qualitatively]] apparent. One anecdote describes [[Thomas Hunt Morgan]] banning the [[Friden, Inc.|Friden calculator]] from his department at [[Caltech]], saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in [[placer mining]]."<ref>{{cite web|url=http://www.tilsonfunds.com/MungerUCSBspeech.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.tilsonfunds.com/MungerUCSBspeech.pdf |archive-date=2022-10-09 |url-status=live|title=Academic Economics: Strengths and Faults After Considering Interdisciplinary Needs|author=Charles T. Munger|date=2003-10-03}}</ref>
 
== Research planning ==
 
Any research in [[life sciences]] is proposed to answer a [[scientific question]] we might have. To answer this question with high certainty, we need [[Accuracy and precision|accurate]] results. The correct definition of the main [[hypothesis]] and the research plan will reduce errors when making decisions about a phenomenon. The research plan might include the research question, the hypothesis to be tested, the [[experimental design]], [[data collection]] methods, [[data analysis]] perspectives and the costs involved. It is essential to carry out the study based on the three basic principles of experimental statistics: [[randomization]], [[Replication (statistics)|replication]], and local control.
 
=== Research question ===
 
The research question will define the objective of a study. The research will be guided by the question, so it needs to be concise while focusing on interesting and novel topics that may improve science and knowledge in that field. To define the way to ask the [[scientific question]], an exhaustive [[literature review]] might be necessary, so that the research can add value to the [[scientific community]].<ref name=":3">{{cite journal|last1=Nizamuddin|first1=Sarah L.|last2=Nizamuddin|first2=Junaid|last3=Mueller|first3=Ariel|last4=Ramakrishna|first4=Harish|last5=Shahul|first5=Sajid S.|title=Developing a Hypothesis and Statistical Planning|journal=Journal of Cardiothoracic and Vascular Anesthesia|date=October 2017|volume=31|issue=5|pages=1878–1882|doi=10.1053/j.jvca.2017.04.020|pmid=28778775}}</ref>
 
=== Hypothesis definition ===
Line 40 ⟶ 39:
=== Sampling ===
 
Usually, a study aims to understand the effect of a phenomenon on a [[population]]. In [[biology]], a [[population]] is defined as all the [[individual]]s of a given [[species]] in a specific area at a given time. In biostatistics, this concept is extended to a variety of collections that can be studied: a [[population]] is not only the [[individuals]], but can also be the total of one specific component of their [[organism]]s, such as the whole [[genome]], or all the sperm [[cell (biology)|cells]], for animals, or the total leaf area, for a plant.
 
It is usually not possible to take [[Measurement|measures]] from all the elements of a [[population]]. Because of that, the [[Sampling (statistics)|sampling]] process is very important for [[statistical inference]]. [[Sampling (statistics)|Sampling]] is defined as randomly obtaining a representative part of the entire population in order to make subsequent inferences about the population. So, the [[Sample (statistics)|sample]] should capture as much of the [[Statistical variability|variability]] across a population as possible.<ref name=":2">{{cite journal| doi= 10.1177/0115426507022006629| pmid= 18042950| title= Biostatistics Primer: Part I| journal= Nutrition in Clinical Practice| volume= 22| issue= 6| pages= 629–35| year= 2017| last1= Overholser| first1= Brian R| last2= Sowinski| first2= Kevin M}}</ref> The [[sample size]] is determined by several things, from the scope of the research to the resources available. In [[clinical research]], the trial type, such as [[inferiority]], [[Equivalence (measure theory)|equivalence]], or [[superior (hierarchy)|superior]]ity, is key in determining sample [[size]].<ref name=":3" />
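
As a minimal sketch of the sampling idea described above, the following Python code draws a simple random sample without replacement and compares the sample mean with the population mean. The population of leaf areas, its size, the sample size and the fixed seeds are all hypothetical values chosen only for illustration; real studies may call for stratified or other sampling schemes.

```python
import random
import statistics


def simple_random_sample(population, n, seed=0):
    """Draw n elements without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)  # fixed seed only so that the sketch is reproducible
    return rng.sample(population, n)


# Hypothetical population: leaf areas (cm^2) of 1000 plants.
population_rng = random.Random(42)
population = [population_rng.gauss(50.0, 10.0) for _ in range(1000)]

sample = simple_random_sample(population, n=30)

# The sample mean is an estimate of the (normally unknown) population mean.
print("population mean:", statistics.mean(population))
print("sample mean:    ", statistics.mean(sample))
```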
 
=== Experimental design ===
[[Experimental designs]] uphold those basic principles of [[design of experiments|experimental statistics]]. There are three basic experimental designs to randomly allocate [[treatment group|treatments]] in all [[Quadrat|plots]] of the [[experiment]]. They are [[completely randomized design]], [[randomized block design]], and [[factorial designs]]. Treatments can be arranged in many ways inside the experiment. In [[agriculture]], the correct [[experimental design]] is the root of a good study and the arrangement of [[treatment group|treatments]] within the study is essential because [[environment (systems)|environment]] largely affects the [[Quadrat|plots]] ([[plants]], [[livestock]], [[microorganism]]s). These main arrangements can be found in the literature under the names of "[[lattice model (physics)|lattices]]", "incomplete blocks", "[[split plot]]", "augmented blocks", and many others. All of the designs might include [[Scientific control|control plots]], determined by the researcher, to provide an [[Estimation theory|error estimation]] during [[inference]].
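
The difference between two of these designs can be sketched in a few lines of Python. In the completely randomized design every plot receives a treatment at random, while in the randomized block design each block contains every treatment once, in random order. The treatment names, the number of replicates and blocks, and the helper function names are hypothetical and serve only as an illustration.

```python
import random


def completely_randomized(treatments, replicates, seed=0):
    """Assign each treatment to `replicates` plots, in fully random order."""
    rng = random.Random(seed)
    plots = [t for t in treatments for _ in range(replicates)]
    rng.shuffle(plots)
    return plots


def randomized_blocks(treatments, n_blocks, seed=0):
    """Randomize the order of all treatments independently within each block."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        block = list(treatments)
        rng.shuffle(block)  # local control: every block holds each treatment once
        layout.append(block)
    return layout


treatments = ["control", "A", "B", "C"]  # hypothetical treatments
crd = completely_randomized(treatments, replicates=3)
rcbd = randomized_blocks(treatments, n_blocks=3)

print("CRD plot order:", crd)
for i, block in enumerate(rcbd, start=1):
    print("block", i, block)
```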
 
In [[clinical studies]], the [[sample (statistics)|sample]]s are usually smaller than in other biological studies, and in most cases, the [[environment (systems)|environment]] effect can be controlled or measured. It is common to use [[Randomized controlled trial|randomized controlled clinical trials]], where results are usually compared with [[observational study]] designs such as [[case–control]] or [[cohort (statistics)|cohort]].<ref>{{cite journal|last1=Szczech|first1=Lynda Anne|last2=Coladonato|first2=Joseph A.|last3=Owen|first3=William F.|title=Key Concepts in Biostatistics: Using Statistics to Answer the Question "Is There a Difference?"|journal=Seminars in Dialysis|date=4 October 2002|volume=15|issue=5|pages=347–351|doi=10.1046/j.1525-139X.2002.00085.x|pmid=12358639|s2cid=30875225}}</ref>
Line 53 ⟶ 52:
Data collection methods must be considered in research planning, because they strongly influence the sample size and experimental design.
 
Data collection varies according to the type of data. For [[qualitative data]], collection can be done with structured questionnaires or by observation, considering the presence or intensity of disease and using scoring criteria to categorize levels of occurrence.<ref>{{cite journal|last1=Sandelowski|first1 = Margarete|title=Combining Qualitative and Quantitative Sampling, Data Collection, and Analysis Techniques in Mixed-Method Studies|journal=Research in Nursing & Health |date=2000|volume=23|issue=3|pages=246–255|doi=10.1002/1098-240X(200006)23:3<246::AID-NUR9>3.0.CO;2-H|pmid=10871540|citeseerx=10.1.1.472.7825|s2cid=10733556 }}</ref> For [[quantitative data]], collection is done by measuring numerical information using instruments.
 
In agriculture and biology studies, yield data and its components can be obtained by [[metric measure]]s. However, pest and disease injuries in plants are assessed by observation, using score scales for levels of damage. In genetic studies especially, modern methods for data collection in the field and laboratory should be considered, such as high-throughput platforms for phenotyping and genotyping. These tools allow bigger experiments and make it possible to evaluate many plots in less time than a purely human-based method of data collection.
Line 67 ⟶ 66:
==== Frequency tables ====
 
One type of table is the [[frequency]] table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of data. Frequency can be:<ref>{{Cite web|url=https://www.sangakoo.com/en/unit/absolute-relative-cumulative-frequency-and-statistical-tables|title=Absolute, relative, cumulative frequency and statistical tables – Probability and Statistics|last=Maths|first=Sangaku|website=www.sangakoo.com|language=en|access-date=2018-04-10}}</ref>
 
'''Absolute''': represents the number of times that a given value appears;
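
A frequency table can be sketched in Python with the standard library. The example below computes the absolute frequency of each value and, as an assumption based on the usual definitions, also the relative frequency (absolute frequency divided by the number of observations) and the cumulative absolute frequency. The data set and the function name are illustrative only.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, absolute, relative, cumulative absolute) frequency."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        absolute = counts[value]      # number of times the value appears
        cumulative += absolute        # running total of absolute frequencies
        rows.append((value, absolute, absolute / n, cumulative))
    return rows


data = [2, 3, 3, 3, 3, 3, 4, 4, 11]  # illustrative data set
table = frequency_table(data)
for value, absolute, relative, cumulative in table:
    print(value, absolute, round(relative, 3), cumulative)
```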
Line 121 ⟶ 120:
==== Histograms ====
 
[[File:Example histogram.png|thumb|'''Example of a histogram.'''|350x350px]]The [[histogram]] (or frequency distribution) is a graphical representation of a dataset tabulated and divided into uniform or non-uniform classes. It was first introduced by [[Karl Pearson]].<ref>{{Cite journal|last=Pearson|first=Karl|date=1895-01-01|title=X. Contributions to the mathematical theory of evolution.—II. Skew variation in homogeneous material|url=http://rsta.royalsocietypublishing.org/content/186/343|journal=Phil. Trans. R. Soc. Lond. A|language=en|volume=186|pages=343–414|doi=10.1098/rsta.1895.0010|issn=0264-3820|bibcode=1895RSPTA.186..343P|doi-access=free}}</ref>
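
The tabulation behind a histogram with uniform classes can be sketched as follows. The data (hypothetical plant heights in cm), the number of classes and the function name are assumptions made for illustration; the last class is treated as closed on the right so that the maximum value is counted.

```python
def histogram_counts(values, n_classes):
    """Count how many values fall into each of n_classes equal-width classes."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; class width would be zero")
    width = (hi - lo) / n_classes
    counts = [0] * n_classes
    for v in values:
        # The maximum value would land in class n_classes, so clamp it into the last one.
        index = min(int((v - lo) / width), n_classes - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_classes + 1)]
    return edges, counts


heights = [150, 152, 155, 158, 160, 161, 163, 165, 168,
           170, 172, 175, 178, 180, 190]  # hypothetical data
edges, counts = histogram_counts(heights, n_classes=4)

# Print a crude text histogram, one row per class.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {'#' * count}")
```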
 
==== Scatter plot ====
Line 143 ⟶ 142:
 
The [[mode (statistics)|mode]] is the value of a set of data that appears most often.<ref>{{Cite book|title=Econometrics|last=Gujarati|first=Damodar N.|publisher=McGraw-Hill Irwin|year=2006}}</ref>
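
As a small illustration, the mean, median and mode of the data set { 2,3,3,3,3,3,4,4,11 } can be computed with Python's standard <code>statistics</code> module. The single large value 11 pulls the mean above the median and the mode.

```python
import statistics

values = [2, 3, 3, 3, 3, 3, 4, 4, 11]

mean = statistics.mean(values)      # (2 + 3 + 3 + 3 + 3 + 3 + 4 + 4 + 11) / 9
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```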
{| class="wikitable"
|+ Comparison among mean, median and mode<br />
Values = { 2,3,3,3,3,3,4,4,11 }
!Type
Line 150 ⟶ 149:
!Result
|-
| align="center" |[[Arithmetic mean|Mean]]
| align="center" |( 2 + 3 + 3 + 3 + 3 + 3 + 4 + 4 + 11 ) / 9
| align="center" |'''4'''
|-
Line 172:
==== Pearson correlation coefficient ====
 
[[File:Correlation coefficient.png|right|thumb|Scatter diagram that demonstrates the Pearson correlation for different values of ''ρ.'']] The [[Pearson correlation coefficient]] is a measure of association between two variables, X and Y. This coefficient, usually represented by ''ρ'' (rho) for the population and ''r'' for the sample, assumes values between −1 and 1, where ''ρ'' = 1 represents a perfect positive correlation, ''ρ'' = −1 represents a perfect negative correlation, and ''ρ'' = 0 indicates no linear correlation.<ref name=":0" />
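
The sample coefficient ''r'' can be sketched directly from its usual definition, the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations. The function name and the small data sets below are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # increases exactly linearly with x
y_down = [10, 8, 6, 4, 2]   # decreases exactly linearly with x

print(pearson_r(x, y_up))
print(pearson_r(x, y_down))
```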
 
=== Inferential statistics ===
{{Main| Statistical inference}}
 
Inferential statistics is used to make [[inference]]s<ref>{{Cite journal|title=Essentials of Biostatistics in Public Health & Essentials of Biostatistics Workbook: Statistical Computing Using Excel|journal=Australian and New Zealand Journal of Public Health|volume=33|issue=2|pages=196–197|doi=10.1111/j.1753-6405.2009.00372.x|issn=1326-0200|year=2009|doi-access=free |last1=Watson |first1=Lyndsey }}</ref> about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters that describe the population of interest, but since the data are limited, a representative sample must be used to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The [[Standard error| standard error of the mean]] is a measure of variability that is crucial for making inferences.<ref name=":2" />
 
* [[Statistical hypothesis testing|Hypothesis testing]]
 
Hypothesis testing is essential to make inferences about populations in order to answer research questions, as set out in the "Research planning" section. Four steps have been defined:<ref name=":2"/>
 
# ''The hypothesis to be tested'': as stated earlier, we have to define a [[null hypothesis]] (H<sub>0</sub>), which is going to be tested, and an [[alternative hypothesis]]. Both must be defined before the experiment is implemented.
# ''Significance level and decision rule'': A decision rule depends on the [[significance level|level of significance]], or in other words, the acceptable error rate (α). It is easier to think of it as defining a ''critical value'' that determines statistical significance when a [[test statistic]] is compared with it. So, α also has to be defined before the experiment.
# ''Experiment and statistical analysis'': This is when the experiment is actually implemented following the appropriate [[Design of experiments|experimental design]], data are collected and the most suitable statistical tests are applied.
# ''Inference'': This is made when the [[null hypothesis]] is rejected or not rejected, based on the evidence provided by comparing the [[p-value]] with α. Note that failure to reject H<sub>0</sub> just means that there is not enough evidence to support its rejection, not that this hypothesis is true.
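
The four steps above can be sketched with a deliberately simple test. The example assumes a two-sided one-sample ''z''-test with a known population standard deviation, which is a simplifying assumption chosen so that only Python's standard library is needed; the measurements, the hypothesized mean <code>mu0</code>, the value of <code>sigma</code> and the function name are all hypothetical.

```python
import math
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, assuming sigma is known."""
    standard_error = sigma / math.sqrt(len(sample))   # standard error of the mean
    z = (mean(sample) - mu0) / standard_error         # test statistic
    critical = NormalDist().inv_cdf(1 - alpha / 2)    # critical value for alpha
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # two-sided p-value
    return z, critical, p_value, p_value < alpha


# Steps 1 and 2: H0 (mean == 5.0) and alpha are fixed before seeing the data.
mu0, sigma, alpha = 5.0, 0.5, 0.05

# Step 3: hypothetical measurements collected in the experiment.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.9, 6.1, 5.7]
z, critical, p_value, reject = one_sample_z_test(sample, mu0, sigma, alpha)

# Step 4: inference. Not rejecting H0 would not prove that H0 is true.
print("z =", z, "critical value =", critical, "p-value =", p_value)
print("reject H0" if reject else "fail to reject H0")
```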
 
* [[Confidence intervals]]
 
Line 196 ⟶ 194:
=== Power and statistical error ===
 
When testing a hypothesis, there are two types of statistical errors possible: [[Type I error]] and [[Type II error]].

* The type I error or [[False positives and false negatives|false positive]] is the incorrect rejection of a true null hypothesis.
* The type II error or [[False positives and false negatives|false negative]] is the failure to reject a false [[null hypothesis]].

The [[significance level]] denoted by α is the type I error rate and should be chosen before performing the test. The type II error rate is denoted by β and [[Statistical power|statistical power of the test]] is 1 − β.
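
The relationship between α, β and power can be illustrated numerically. The sketch below assumes a one-sided ''z''-test with known standard deviation, where the true mean exceeds the null value by <code>delta</code>; under that assumption β = Φ(z<sub>1−α</sub> − δ√n/σ) and the power is 1 − β. The effect size, standard deviation, sample sizes and function name are hypothetical.

```python
import math
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test when the true shift is delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # critical value under H0
    shift = delta * math.sqrt(n) / sigma        # standardized true effect
    beta = NormalDist().cdf(z_alpha - shift)    # type II error rate
    return 1 - beta


# Hypothetical effect size and standard deviation; power grows with sample size.
for n in (5, 10, 20, 40):
    print(n, round(power_one_sided_z(delta=0.5, sigma=1.0, n=n), 3))
```

When <code>delta</code> is zero the null hypothesis is true, and the probability of rejecting it equals α, the type I error rate.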
 
=== p-value ===
Line 229 ⟶ 232:
=== Bioinformatics advances in databases, data mining, and biological interpretation ===
 
The development of [[biological database]]s enables storage and management of biological data with the possibility of ensuring access for users around the world. They are useful for researchers to deposit data, retrieve information and files (raw or processed) originating from other experiments, or index scientific articles, as in [[PubMed]]. Another possibility is to search for a desired term (a gene, a protein, a disease, an organism, and so on) and check all results related to this search. There are databases dedicated to [[Single-nucleotide polymorphism|SNPs]] ([[dbSNP]]), to knowledge on the characterization of genes and their pathways ([[KEGG]]) and to the description of gene function, classifying it by cellular component, molecular function and biological process ([[Gene ontology|Gene Ontology]]).<ref name=":4">{{cite journal|doi=10.1002/jcp.21218|pmid=17654500|title=Bioinformatics|journal=Journal of Cellular Physiology|volume=213|issue=2|pages=365–9|year=2007|last1=Moore|first1=Jason H|s2cid=221831488|doi-access=free}}</ref> In addition to databases that contain specific molecular information, there are others that are broader in the sense that they store information about an organism or group of organisms. An example of a database directed towards just one organism, but containing much data about it, is the ''[[Arabidopsis thaliana]]'' genetic and molecular database – TAIR.<ref>{{cite web|url=https://www.arabidopsis.org/|title=TAIR - Home Page|website=www.arabidopsis.org}}</ref> Phytozome,<ref>{{cite web|url=https://phytozome.jgi.doe.gov/pz/portal.html|title=Phytozome|website=phytozome.jgi.doe.gov}}</ref> in turn, stores the assemblies and annotation files of dozens of plant genomes, and also contains visualization and analysis tools.
Moreover, there is an interconnection between some databases in the information exchange/sharing and a major initiative was the [[International Nucleotide Sequence Database Collaboration]] (INSDC)<ref>{{cite web|url=http://www.insdc.org/|title=International Nucleotide Sequence Database Collaboration - INSDC|website=www.insdc.org}}</ref> which relates data from DDBJ,<ref>{{cite web|url=https://www.ddbj.nig.ac.jp/index-e.html|title=Top|website=www.ddbj.nig.ac.jp|date=11 January 2024 }}</ref> EMBL-EBI,<ref>{{cite web|url=https://www.ebi.ac.uk/|title=The European Bioinformatics Institute < EMBL-EBI|website=www.ebi.ac.uk}}</ref> and NCBI.<ref>{{cite web|url=https://www.ncbi.nlm.nih.gov/|title=National Center for Biotechnology Information|publisher=U. S. National Library of Medicine – |website=www.ncbi.nlm.nih.gov}}</ref>
 
Nowadays, the increase in size and complexity of molecular datasets leads to the use of powerful statistical methods provided by computer science algorithms developed in the field of [[machine learning]]. Therefore, data mining and machine learning allow the detection of patterns in data with a complex structure, such as biological data, by using methods of [[Supervised learning|supervised]] and [[unsupervised learning]], regression, detection of [[Cluster analysis|clusters]] and [[Association rule learning|association rule mining]], among others.<ref name=":4"/> To name some of them, [[self-organizing map]]s and [[k-means clustering|''k''-means]] are examples of clustering algorithms; [[Artificial neural network|neural network]] implementations and [[support vector machine]] models are examples of common machine learning algorithms.
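
As a toy illustration of unsupervised clustering, the following sketch runs ''k''-means (Lloyd's algorithm) on one-dimensional data: values are assigned to the nearest center, centers are recomputed as cluster means, and the two steps repeat until the centers stop changing. The expression-like values, the initial centers and the function names are hypothetical, and real analyses would normally use a tested library and multidimensional data.

```python
def assign(values, centers):
    """Group each value with the index of its nearest center."""
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, initial_centers, max_iter=100):
    """One-dimensional k-means; returns the final centers and clusters."""
    centers = list(initial_centers)
    for _ in range(max_iter):
        clusters = assign(values, centers)
        # An empty cluster keeps its previous center.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:   # converged: assignments no longer change
            break
        centers = new_centers
    return centers, assign(values, centers)


# Hypothetical expression levels forming a low group and a high group.
expression = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = kmeans_1d(expression, initial_centers=[0.0, 6.0])
print(centers)
print(clusters)
```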
Line 251 ⟶ 254:
=== Quantitative genetics ===
 
Quantitative genetics combines the study of [[population genetics]] and [[statistical genetics]] in order to link variation in [[genotype]] with variation in [[phenotype]]. In other words, it is desirable to discover the genetic basis of a measurable trait, a quantitative trait, that is under polygenic control. A genome region that is responsible for a continuous trait is called a [[quantitative trait locus]] (QTL). The study of QTLs became feasible by using [[molecular marker]]s and measuring traits in populations, but mapping them requires a population obtained from an experimental cross, such as an F2 or [[recombinant inbred strain]]s/lines (RILs). To scan for QTL regions in a genome, a [[gene map]] based on linkage has to be built. Some of the best-known QTL mapping algorithms are Interval Mapping, Composite Interval Mapping, and Multiple Interval Mapping.<ref>{{cite journal|doi=10.1007/s10709-004-2705-0|pmid=15881678|title=QTL mapping and the genetic basis of adaptation: Recent developments|journal=Genetica|volume=123|issue=1–2|pages=25–37|year=2005|last1=Zeng|first1=Zhao-Bang|s2cid=1094152}}</ref>
 
However, QTL mapping resolution is impaired by the amount of recombination assayed, a problem for species in which it is difficult to obtain large offspring. Furthermore, allele diversity is restricted to individuals originating from contrasting parents, which limits studies of allele diversity when we have a panel of individuals representing a natural population.<ref>{{cite journal|doi=10.1186/1746-4811-9-29|pmid=23876160|pmc=3750305|title=The advantages and limitations of trait analysis with GWAS: A review|journal=Plant Methods|volume=9|pages=29|year=2013|last1=Korte|first1=Arthur|last2=Farlow|first2=Ashley |doi-access=free }}</ref> For this reason, the [[genome-wide association study]] was proposed in order to identify QTLs based on [[linkage disequilibrium]], that is, the non-random association between traits and molecular markers. It was facilitated by the development of high-throughput [[SNP genotyping]].<ref>{{cite journal|doi=10.3835/plantgenome2008.02.0089|title=Status and Prospects of Association Mapping in Plants|journal= The Plant Genome|volume=1|pages=5–20|year=2008|last1=Zhu|first1=Chengsong|last2=Gore|first2=Michael|last3=Buckler|first3=Edward S|last4=Yu|first4=Jianming|doi-access=free}}</ref>
 
In [[Animal breeding|animal]] and [[plant breeding]], the use of markers, mainly molecular ones, in [[Selective breeding|selection]] for breeding contributed to the development of [[marker-assisted selection]]. While QTL mapping is limited by its resolution, GWAS does not have enough power to detect rare variants of small effect that are also influenced by the environment. So, the concept of Genomic Selection (GS) arose in order to use all molecular markers in the selection and allow the prediction of the performance of candidates in this selection. The proposal is to genotype and phenotype a training population and develop a model that can obtain the genomic estimated breeding values (GEBVs) of individuals belonging to a genotyped but not phenotyped population, called the testing population.<ref>{{cite journal|doi=10.1016/j.tplants.2017.08.011|pmid=28965742|title=Genomic Selection in Plant Breeding: Methods, Models, and Perspectives|journal=Trends in Plant Science|volume=22|issue=11|pages=961–975|year=2017|last1=Crossa|first1=José|last2=Pérez-Rodríguez|first2=Paulino|last3=Cuevas|first3=Jaime|last4=Montesinos-López|first4=Osval|last5=Jarquín|first5=Diego|last6=De Los Campos|first6=Gustavo|last7=Burgueño|first7=Juan|last8=González-Camacho|first8=Juan M|last9=Pérez-Elizalde|first9=Sergio|last10=Beyene|first10=Yoseph|last11=Dreisigacker|first11=Susanne|last12=Singh|first12=Ravi|last13=Zhang|first13=Xuecai|last14=Gowda|first14=Manje|last15=Roorkiwal|first15=Manish|last16=Rutkoski|first16=Jessica|last17=Varshney|first17=Rajeev K|bibcode=2017TPS....22..961C |url=http://oar.icrisat.org/10280/1/Genomic%20Selection%20in%20Plant%20Breeding%20Methods%2C%20Models%2C%20and%20Perspectives.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://oar.icrisat.org/10280/1/Genomic%20Selection%20in%20Plant%20Breeding%20Methods%2C%20Models%2C%20and%20Perspectives.pdf |archive-date=2022-10-09 |url-status=live}}</ref> This kind of study could also include a validation population, following the concept of [[cross-validation (statistics)|cross-validation]], in which the real phenotype results measured in this population are compared with the phenotype results based on the prediction, which is used to check the accuracy of the model.
 
As a summary, some points about the application of quantitative genetics are:
* Quantitative genetics has been used in agriculture to improve crops ([[Plant breeding]]) and [[livestock]] ([[Animal breeding]]).
* In biomedical research, this work can assist in finding candidate [[gene]] [[allele]]s that can cause or influence predisposition to diseases in [[human genetics]].
 
=== Expression data ===
 
Studies of differential gene expression from [[RNA-Seq]] data, as for [[Real-time polymerase chain reaction|RT-qPCR]] and [[microarrays]], demand comparison of conditions. The goal is to identify genes which have a significant change in abundance between different conditions. Experiments are therefore designed appropriately, with replicates for each condition/treatment, randomization and blocking, when necessary. In RNA-Seq, the quantification of expression uses the information of mapped reads that are summarized in some genetic unit, such as [[exon]]s that are part of a gene sequence. Whereas [[microarray]] results can be approximated by a normal distribution, RNA-Seq count data are better explained by other distributions. The first distribution used was the [[Poisson distribution|Poisson]], but it underestimates the sample error, leading to false positives. Currently, biological variation is accounted for by methods that estimate a dispersion parameter of a [[negative binomial distribution]]. [[Generalized linear model]]s are used to perform the tests for statistical significance and, as the number of genes is high, multiple testing correction has to be considered.<ref>{{cite journal| doi =10.1186/gb-2010-11-12-220| pmid =21176179| pmc =3046478| title =From RNA-seq reads to differential expression results| journal =Genome Biology| volume =11| issue =12| pages =220| year =2010| last1 =Oshlack| first1 =Alicia| last2 =Robinson| first2 =Mark D| last3 =Young| first3 =Matthew D| doi-access =free}}</ref> Other examples of analyses of [[genomics]] data come from microarray or [[proteomics]] experiments,<ref>{{cite book|title=Statistical Analysis of Gene Expression Microarray Data|author1=Helen Causton |author2=John Quackenbush |author3=Alvis Brazma |publisher=Wiley-Blackwell|year=2003}}</ref><ref>{{cite book|title=Microarray Gene Expression Data Analysis: A Beginner's Guide|author=Terry Speed|publisher=Chapman & Hall/CRC|year=2003}}</ref> often concerning diseases or disease stages.<ref>{{cite book|title=Medical Biostatistics for Complex Diseases|author1=Frank Emmert-Streib |author2=Matthias Dehmer |publisher=Wiley-Blackwell|year=2010|isbn= 978-3-527-32585-6}}</ref>
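
The need for multiple testing correction can be illustrated with the Benjamini–Hochberg procedure, which controls the false discovery rate. The text above does not name a specific correction, so this choice is an assumption made for the sketch, and the p-values and function name are hypothetical. The procedure sorts the ''m'' p-values, finds the largest rank ''k'' with p<sub>(k)</sub> ≤ (k/m)·α, and rejects the hypotheses with the ''k'' smallest p-values.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            max_rank = rank          # largest rank that passes its threshold
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[index] = True
    return rejected


# Hypothetical p-values, one per gene.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)
```

In this example five raw p-values are below 0.05, but only the two smallest remain significant after correction.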
 
=== Other studies ===
 
* [[Ecology]], [[ecological forecasting]]
* Biological [[sequence analysis]]<ref>{{cite book|title=Statistical Methods in Bioinformatics: An Introduction|author1=Warren J. Ewens |author2=Gregory R. Grant |publisher=Springer|year=2004}}</ref>
* [[Systems biology]] for gene network inference or pathways analysis.<ref>{{cite book|title=Applied Statistics for Network Biology: Methods in Systems Biology|author1=Matthias Dehmer |author2=Frank Emmert-Streib |author3=Armin Graber |author4=Armindo Salvador |publisher=Wiley-Blackwell|year=2011|isbn= 978-3-527-32750-8}}</ref>
* [[Clinical research]] and pharmaceutical development
* [[Population dynamics]], especially with regard to [[fisheries science]]
* [[Phylogenetics]] and [[evolution]]
* [[Pharmacodynamics]]
* [[Pharmacokinetics]]
* [[Neuroimaging]]
 
== Tools ==
 
There are many tools that can be used for statistical analysis of biological data. Most of them are also useful in other areas of knowledge, covering a large number of applications. Brief descriptions of some of them follow (in alphabetical order):
* [[ASReml]]: Software developed by VSNi<ref name="vsni">{{cite web|url=https://www.vsni.co.uk/|title=Home - VSN International|website=www.vsni.co.uk}}</ref> that can also be used in the R environment as a package. It is designed to estimate variance components under a general linear mixed model using [[restricted maximum likelihood]] (REML). Models with fixed and random effects, nested or crossed, are allowed. It makes it possible to investigate different [[Covariance matrix|variance-covariance]] matrix structures.
 
* CycDesigN:<ref>{{cite web|url=https://www.vsni.co.uk/software/cycdesign/|title=CycDesigN - VSN International|website=www.vsni.co.uk}}</ref> A computer package developed by VSNi<ref name="vsni" /> that helps researchers create experimental designs and analyze data from a design belonging to one of the classes handled by CycDesigN. These classes are resolvable, non-resolvable, partially replicated and [[Crossover study|crossover designs]]. It also includes less commonly used Latinized designs, such as the t-Latinized design.<ref>{{cite journal|last1=Piepho|first1=Hans-Peter|last2=Williams|first2=Emlyn R|last3=Michel|first3=Volker|year=2015|title=Beyond Latin Squares: A Brief Tour of Row-Column Designs|journal=Agronomy Journal|volume=107|issue=6|pages=2263|doi=10.2134/agronj15.0144|bibcode=2015AgrJ..107.2263P }}</ref>
* [[Orange (software)|Orange]]: A programming interface for high-level data processing, data mining and data visualization. It includes tools for gene expression and genomics.<ref name=":4" />
* [[R (programming language)|R]]: An [[open source]] environment and programming language dedicated to statistical computing and graphics. It is an implementation of the [[S (programming language)|S]] language, distributed through the Comprehensive R Archive Network (CRAN).<ref>{{cite web|url=https://cran.r-project.org/|title=The Comprehensive R Archive Network|website=cran.r-project.org}}</ref> In addition to its functions to read data tables, compute descriptive statistics, and develop and evaluate models, its repository contains packages developed by researchers around the world. This allows the development of functions written to deal with the statistical analysis of data from specific applications.<ref>{{cite book|title=Biostatistics explored through R software: An overview|author=Renganathan V|year=2021|publisher=Vinaitheerthan Renganathan |isbn=9789354936586}}</ref> In the case of bioinformatics, for example, there are packages in the main repository (CRAN) and in others, such as [[Bioconductor]]. It is also possible to use packages under development that are shared on hosting services such as [[GitHub]].
* [[SAS (software)|SAS]]: Data analysis software widely used in universities, services and industry. Developed by a company with the same name ([[SAS Institute]]), it uses the [[SAS language]] for programming.
* PLA 3.0:<ref>{{Cite web|url=https://www.bioassay.de/products/pla-30/|title=PLA 3.0|last=Stegmann|first=Dr Ralf|date=2019-07-01|website=PLA 3.0 – Software for Biostatistical Analysis|language=en|access-date=2019-07-02}}</ref> Biostatistical analysis software for regulated environments (e.g. drug testing) that supports Quantitative Response Assays (Parallel-Line, Parallel-Logistics, Slope-Ratio) and Dichotomous Assays (Quantal Response, Binary Assays). It also supports weighting methods for combination calculations and the automatic data aggregation of independent assay data.
* [[Weka (machine learning)|Weka]]: [[Java (programming language)|Java]] software for [[machine learning]] and [[data mining]], including tools and methods for visualization, clustering, regression, association rules, and classification. There are tools for cross-validation and bootstrapping, and a module for algorithm comparison. Weka can also be run from other programming languages such as Perl or R.<ref name=":4" />
* [[Python (programming language)|Python]]: image analysis, deep learning, machine learning
* [[SQL]]: databases
* [[NoSQL]]
* [[NumPy]]: numerical Python
* [[SciPy]]
* [[SageMath]]
* [[LAPACK]]: linear algebra
* [[MATLAB]]
* [[Apache Hadoop]]
* [[Apache Spark]]
* [[Amazon Web Services]]
 
== Scope and training programs ==
* [https://www.biometricsociety.org/ The International Biometric Society]
* [https://web.archive.org/web/20080827161431/http://www.biostatsresearch.com/repository/ The Collection of Biostatistics Research Archive]
* [http://www.medpagetoday.com/lib/content/Medpage-Guide-to-Biostatistics.pdf Guide to Biostatistics (MedPageToday.com)] {{Webarchive|url=https://web.archive.org/web/20120522144801/http://www.medpagetoday.com/lib/content/Medpage-Guide-to-Biostatistics.pdf |date=2012-05-22 }}
* [https://web.archive.org/web/20150402180351/http://www.biostat.katerynakon.in.ua/en/ Biomedical Statistics]