m Section heading change: Biostatistics and Genetics → Biostatistics and genetics using a script |
Reverting edit(s) by Shivani Saini 06 (talk) to rev. 1240687734 by Wburrow: non-constructive (RW 16.1) |
(42 intermediate revisions by 32 users not shown) | |||
Line 3:
{{for|the academic journal|Biostatistics (journal)}}
'''Biostatistics''' (also known as '''biometry''')
== History ==
=== Biostatistics and genetics ===
Biostatistical modeling forms an important part of numerous modern biological theories. [[Genetics]] studies have, since their beginning, used statistical concepts to understand observed experimental results. Some geneticists even contributed statistical advances by developing methods and tools. [[Gregor Mendel]] started genetics studies by investigating genetic segregation patterns in families of peas and used statistics to explain the collected data. In the early 1900s, after the rediscovery of Mendel's work on Mendelian inheritance, there were gaps in understanding between genetics and evolutionary Darwinism. [[Francis Galton]] tried to expand Mendel's discoveries with human data and proposed a different model in which fractions of the heredity come from each ancestor, composing an infinite series. He called this the theory of "[[Francis Galton|Law of Ancestral Heredity]]". His ideas were strongly opposed by [[William Bateson]], who followed Mendel's conclusions that genetic inheritance comes exclusively from the parents, half from each of them. This led to a vigorous debate between the biometricians, who supported Galton's ideas, as [[
Resolving these differences also made it possible to define the concept of population genetics and brought together genetics and evolution. The three leading figures in the establishment of [[population genetics]] and this synthesis all relied on statistics and developed its use in biology.
* [[Ronald Fisher]] developed several basic statistical methods in support of his work studying the crop experiments at [[Rothamsted Research]], including in his books [[Statistical Methods for Research Workers]] (1925) and [[The Genetical Theory of Natural Selection]] (1930). He made many contributions to genetics and statistics. Some of them include the [[ANOVA]], [[p-value]] concepts, [[Ronald Fisher|Fisher's exact test]] and [[Ronald Fisher|Fisher's equation]] for [[population dynamics]]. He is credited with the sentence “Natural selection is a mechanism for generating an exceedingly high degree of improbability”.<ref>{{cite journal|last1=Gunter|first1=Chris |title=Quantitative Genetics|journal=Nature|date=10 December 2008|volume=456|issue=7223 |pages=719|doi=10.1038/456719a|pmid=19079046 |bibcode=2008Natur.456..719G|doi-access=free}}</ref>
* [[Sewall G. Wright]] developed [[F-statistics|''F''-statistics]] and methods of computing them and defined the [[inbreeding coefficient]].
* [[J. B. S. Haldane]]'s book, ''The Causes of Evolution'', reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics. He also developed the theory of [[primordial soup]].
These and other biostatisticians, [[mathematical biology|mathematical biologists]], and statistically inclined geneticists helped bring together [[evolutionary biology]] and [[genetics]] into a consistent, coherent whole that could begin to be [[Statistics|quantitative]]ly modeled.
Line 20 ⟶ 19:
In parallel to this overall development, the pioneering work of [[D'Arcy Thompson]] in ''On Growth and Form'' also helped to add quantitative discipline to biological study.
Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not [[qualitative data|qualitatively]] apparent. One anecdote describes [[Thomas Hunt Morgan]] banning the [[Friden, Inc.|Friden calculator]] from his department at [[Caltech]], saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in [[placer mining]]."<ref>{{cite web|url=http://www.tilsonfunds.com/MungerUCSBspeech.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.tilsonfunds.com/MungerUCSBspeech.pdf |archive-date=2022-10-09 |url-status=live|title=Academic Economics: Strengths and Faults After Considering Interdisciplinary Needs|author=Charles T. Munger|date=2003-10-03}}</ref>
== Research planning ==
Any research in [[life sciences]] is proposed to answer a [[scientific question]] we might have. To answer this question with high certainty, we need [[Accuracy and precision|accurate]] results. A correct definition of the main [[hypothesis]] and of the research plan reduces errors when making decisions about a phenomenon. The research plan might include the research question, the hypothesis to be tested, the [[experimental design]], [[data collection]] methods, [[data analysis]] perspectives and costs
=== Research question ===
The research question will define the objective of a study. The research will be guided by the question, so it needs to be concise while focusing on interesting and novel topics that may advance science and knowledge in that field. To define the way to ask the [[scientific question]], an exhaustive [[literature review]] might be necessary. So
=== Hypothesis definition ===
Line 40 ⟶ 39:
=== Sampling ===
Usually, a study aims to understand the effect of a phenomenon on a [[population]]. In [[biology]], a [[population]] is defined as all the [[
It is usually not possible to take [[Measurement|measurements]] from all the elements of a [[population]]. Because of that, the [[Sampling (statistics)|sampling]] process is very important for [[statistical inference]]. [[Sampling (statistics)|Sampling]] is defined as randomly obtaining a representative part of the entire population in order to make subsequent inferences about the population. So, the [[Sample (statistics)|sample]] should capture as much of the [[Statistical variability|variability]] across the population as possible.<ref name=":2">{{cite journal| doi= 10.1177/0115426507022006629| pmid= 18042950| title= Biostatistics Primer: Part I| journal= Nutrition in Clinical Practice| volume= 22| issue= 6| pages= 629–35| year= 2017| last1= Overholser| first1= Brian R| last2= Sowinski| first2= Kevin M}}</ref> The [[sample size]] is determined by several factors, from the scope of the research to the resources available. In [[clinical research]], the trial type, such as [[inferiority]], [[Equivalence (measure theory)|equivalence]], and [[superior (hierarchy)|superior]]ity, is key in determining sample [[size]].<ref name=":3" />
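The idea of drawing a random, representative sample and using it to estimate a population quantity can be sketched as follows. This is a minimal illustration, assuming a simulated population of plant heights; the values, the seed and the helper name `simple_random_sample` are hypothetical and not taken from any study.

```python
import random
import statistics

# Hypothetical population: 10,000 simulated plant heights in cm (illustrative only).
rng = random.Random(42)
population = [rng.gauss(150.0, 12.0) for _ in range(10_000)]


def simple_random_sample(pop, n, rng):
    """Draw n elements without replacement, each with equal probability."""
    return rng.sample(pop, n)


sample = simple_random_sample(population, 100, rng)

# The sample mean serves as an estimate of the (normally unknown) population mean.
print(f"population mean: {statistics.mean(population):.2f}")
print(f"sample mean:     {statistics.mean(sample):.2f}")
```

In a real study the population mean is not available for comparison; it is printed here only because the population is simulated.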
=== Experimental design ===
[[Experimental designs]] uphold the basic principles of [[design of experiments|experimental statistics]]. There are three basic experimental designs to randomly allocate [[treatment group|treatments]] to all [[Quadrat|plots]] of the [[experiment]]. They are [[completely randomized design]], [[randomized block design]], and [[factorial designs]]. Treatments can be arranged in many ways inside the experiment. In [[agriculture]], the correct [[experimental design]] is the root of a good study and the arrangement of [[treatment group|treatments]] within the study is essential because [[environment (systems)|environment]] largely affects the [[Quadrat|plots]] ([[plants]], [[livestock]], [[
In [[clinical studies]], the [[sample (statistics)|sample]]s are usually smaller than in other biological studies, and in most cases, the [[environment (systems)|environment]] effect can be controlled or measured. It is common to use [[Randomized controlled trial|randomized controlled clinical trials]], where results are usually compared with [[observational study]] designs such as [[case–control]] or [[cohort (statistics)|cohort]].<ref>{{cite journal|last1=Szczech|first1=Lynda Anne|last2=Coladonato|first2=Joseph A.|last3=Owen|first3=William F.|title=Key Concepts in Biostatistics: Using Statistics to Answer the Question "Is There a Difference?"|journal=Seminars in Dialysis|date=4 October 2002|volume=15|issue=5|pages=347–351|doi=10.1046/j.1525-139X.2002.00085.x|pmid=12358639|s2cid=30875225}}</ref>
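Two of the designs named in this section, the completely randomized design and the randomized block design, can be sketched as random allocations of treatments to plots. The function names, treatment labels and seeds below are assumptions made for illustration only.

```python
import random


def completely_randomized_design(treatments, replicates, seed=None):
    """Assign every treatment to `replicates` plots, in a fully random order."""
    rng = random.Random(seed)
    allocation = [t for t in treatments for _ in range(replicates)]
    rng.shuffle(allocation)
    # Plots are numbered 1..N in field order.
    return {plot: t for plot, t in enumerate(allocation, start=1)}


def randomized_block_design(treatments, n_blocks, seed=None):
    """Each block receives every treatment once, in an independent random order."""
    rng = random.Random(seed)
    layout = {}
    for block in range(1, n_blocks + 1):
        order = list(treatments)
        rng.shuffle(order)
        layout[block] = order
    return layout


crd = completely_randomized_design(["A", "B", "C"], replicates=4, seed=1)
rbd = randomized_block_design(["A", "B", "C"], n_blocks=4, seed=1)
print(crd)
print(rbd)
```

Blocking restricts the randomization so that each block contains a complete set of treatments, which is useful when plots within a block share similar environmental conditions.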
Line 53 ⟶ 52:
Data collection methods must be considered in research planning, because they strongly influence the sample size and experimental design.
Data collection varies according to the type of data. For [[qualitative data]], collection can be done with structured questionnaires or by observation, considering presence or intensity of disease, using scoring criteria to categorize levels of occurrence.<ref>{{cite journal|last1=Sandelowski|first1 = Margarete|title=Combining Qualitative and Quantitative Sampling, Data Collection, and Analysis Techniques in Mixed-Method Studies|journal=Research in Nursing & Health |date=2000|volume=23|issue=3|pages=246–255|doi=10.1002/1098-240X(200006)23:3<246::AID-NUR9>3.0.CO;2-H|pmid=10871540|citeseerx=10.1.1.472.7825|s2cid=10733556 }}</ref> For [[quantitative data]], collection is done by measuring numerical information using instruments.
In agriculture and biology studies, yield data and its components can be obtained by [[metric measure]]s. However, pest and disease injuries in plants are obtained by observation, considering score scales for levels of damage. In genetic studies especially, modern methods for data collection in field and laboratory should be considered, such as high-throughput platforms for phenotyping and genotyping. These tools allow larger experiments and make it possible to evaluate many plots in less time than data collection done by humans alone.
Line 67 ⟶ 66:
==== Frequency tables ====
One type of
'''Absolute''': represents the number of times that a given value appears;
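As a minimal sketch, the absolute frequency of each value can be counted as shown below. The observations are hypothetical, and the relative frequency (each count divided by the total) is included only as an assumed companion column.

```python
from collections import Counter

# Hypothetical observations (illustrative values only).
observations = [1, 2, 2, 3, 3, 3, 4, 4, 5, 2]

# Absolute frequency: how many times each value appears.
absolute = Counter(observations)

# Relative frequency: each absolute frequency divided by the number of observations.
total = sum(absolute.values())
relative = {value: count / total for value, count in absolute.items()}

print("value\tabsolute\trelative")
for value in sorted(absolute):
    print(f"{value}\t{absolute[value]}\t{relative[value]:.2f}")
```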
Line 121 ⟶ 120:
==== Histograms ====
[[File:
==== Scatter plot ====
Line 143 ⟶ 142:
The [[mode (statistics)|mode]] is the value of a set of data that appears most often.<ref>{{Cite book|title=Econometrics|last=Gujarati|first=Damodar N.|publisher=McGraw-Hill Irwin|year=2006}}</ref>
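For the values 2, 3, 3, 3, 3, 3, 4, 4, 11 used as the example in this section, the three measures can be checked with a short sketch that uses Python's standard `statistics` module.

```python
import statistics

values = [2, 3, 3, 3, 3, 3, 4, 4, 11]

mean = statistics.mean(values)      # sum of the values divided by their count: 36 / 9
median = statistics.median(values)  # middle element of the sorted values
mode = statistics.mode(values)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
```

The single large value (11) pulls the mean above the median and the mode, which illustrates why the three measures can differ for skewed data.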
{| class="wikitable"
|+
Values = { 2,3,3,3,3,3,4,4,11 }
!Type
Line 150 ⟶ 149:
!Result
|-
| align="center"
| align="center" | align="center" |'''4'''
|-
Line 172:
==== Pearson correlation coefficient ====
[[File:
=== Inferential statistics ===
{{Main| Statistical inference}}
It is used to make [[inference]]s<ref>{{Cite journal|title=Essentials of Biostatistics in Public Health & Essentials of Biostatistics Workbook: Statistical Computing Using Excel|journal=Australian and New Zealand Journal of Public Health|volume=33|issue=2|pages=196–197|doi=10.1111/j.1753-6405.2009.00372.x|issn=1326-0200|year=2009|doi-access=free |last1=Watson |first1=Lyndsey }}</ref> about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters to describe the population of interest, but since the data is limited, it is necessary to make use of a representative sample in order to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The [[Standard error|
* [[Statistical hypothesis testing|Hypothesis testing]]
Hypothesis testing is essential for making inferences about populations in order to answer research questions, as set out in the "Research planning" section. Authors have defined four steps to be followed:<ref name=":2"/>
# ''The hypothesis to be tested'': as stated earlier, we have to work with the definition of a [[null hypothesis]] (H<sub>0</sub>), which is going to be tested, and an [[alternative hypothesis]]. Both must be defined before the experiment is carried out.
# ''Significance level and decision rule'': A decision rule depends on the [[significance level|level of significance]], or in other words, the acceptable error rate (α). It is easier to think that we define a ''critical value'' that determines the statistical significance when a [[test statistic]] is compared with it. So, α also has to be defined before the experiment.
# ''Experiment and statistical analysis'': This is when the experiment is actually carried out following the appropriate [[Design of experiments|experimental design]], data is collected and the most suitable statistical tests are applied.
# ''Inference'': An inference is made when the [[null hypothesis]] is rejected or not rejected, based on the evidence provided by comparing the [[p-value]] with α. It should be noted that failure to reject H<sub>0</sub> just means that there is not enough evidence to support its rejection, not that this hypothesis is true.
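The four steps above can be sketched with a two-sided one-sample z-test. This is only an illustration under stated assumptions: the measurements are hypothetical, the population standard deviation `sigma` is assumed to be known, and the function name `one_sample_z_test` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with known sigma."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Probability of a statistic at least this extreme if H0 is true.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Step 1: H0: mu = 120 versus H1: mu != 120 (fixed before the experiment).
# Step 2: significance level alpha = 0.05 (also fixed beforehand).
# Step 3: collect the data (hypothetical measurements) and compute the statistic.
sample = [126, 131, 118, 125, 129, 122, 133, 127, 124, 130]
z, p_value, reject = one_sample_z_test(sample, mu0=120, sigma=5, alpha=0.05)

# Step 4: inference, comparing the p-value with alpha.
print(f"z = {z:.3f}, p = {p_value:.2g}, reject H0: {reject}")
```

If `reject` were `False`, the conclusion would be that the data do not provide enough evidence against H<sub>0</sub>, not that H<sub>0</sub> is true.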
* [[Confidence intervals]]
Line 196 ⟶ 194:
=== Power and statistical error ===
When testing a hypothesis, two types of statistical error are possible: [[Type I error]] and [[Type II error]].
* The type I error or [[False positives and false negatives|false positive]] is the incorrect rejection of a true null hypothesis.
* The type II error or [[False positives and false negatives|false negative]] is the failure to reject a false null hypothesis.
The [[significance level]] denoted by α is the type I error rate and should be chosen before performing the test. The type II error rate is denoted by β and the [[Statistical power|statistical power of the test]] is 1 − β.
=== p-value ===
Line 229 ⟶ 232:
=== Bioinformatics advances in databases, data mining, and biological interpretation ===
The development of [[biological database]]s enables storage and management of biological data with the possibility of ensuring access for users around the world. They are useful for researchers to deposit data, retrieve information and files (raw or processed) originating from other experiments, or index scientific articles, as [[PubMed]] does. Another possibility is to search for a desired term (a gene, a protein, a disease, an organism, and so on) and check all results related to this search. There are databases dedicated to [[Single-nucleotide polymorphism|SNPs]] ([[dbSNP]]), to knowledge of gene characterization and pathways ([[KEGG]]) and to the description of gene function, classifying it by cellular component, molecular function and biological process ([[Gene ontology|Gene Ontology]]).<ref name=":4">{{cite journal|doi=10.1002/jcp.21218|pmid=17654500|title=Bioinformatics|journal=Journal of Cellular Physiology|volume=213|issue=2|pages=365–9|year=2007|last1=Moore|first1=Jason H|s2cid=221831488|doi-access=free}}</ref> In addition to databases that contain specific molecular information, there are broader ones that store information about an organism or group of organisms. An example of a database directed towards just one organism, but containing much data about it, is the ''[[Arabidopsis thaliana]]'' genetic and molecular database – TAIR.<ref>{{cite web|url=https://www.arabidopsis.org/|title=TAIR - Home Page|website=www.arabidopsis.org}}</ref> Phytozome,<ref>{{cite web|url=https://phytozome.jgi.doe.gov/pz/portal.html|title=Phytozome|website=phytozome.jgi.doe.gov}}</ref> in turn, stores the assemblies and annotation files of dozens of plant genomes, and also contains visualization and analysis tools.
Moreover, there is an interconnection between some databases in the information exchange/sharing and a major initiative was the [[International Nucleotide Sequence Database Collaboration]] (INSDC)<ref>{{cite web|url=http://www.insdc.org/|title=International Nucleotide Sequence Database Collaboration - INSDC|website=www.insdc.org}}</ref> which relates data from DDBJ,<ref>{{cite web|url=https://www.ddbj.nig.ac.jp/index-e.html|title=Top|website=www.ddbj.nig.ac.jp|date=11 January 2024 }}</ref> EMBL-EBI,<ref>{{cite web|url=https://www.ebi.ac.uk/|title=The European Bioinformatics Institute < EMBL-EBI|website=www.ebi.ac.uk}}</ref> and NCBI.<ref>{{cite web|url=https://www.ncbi.nlm.nih.gov/|title=National Center for Biotechnology Information|publisher=U. S. National Library of Medicine – |website=www.ncbi.nlm.nih.gov}}</ref>
Nowadays, the increase in size and complexity of molecular datasets leads to the use of powerful statistical methods provided by computer science algorithms developed in the field of [[machine learning]]. Therefore, data mining and machine learning allow detection of patterns in data with a complex structure, such as biological data, by using methods of [[Supervised learning|supervised]] and [[unsupervised learning]], regression, detection of [[Cluster analysis|clusters]] and [[Association rule learning|association rule mining]], among others.<ref name=":4"/> To name some of them, [[self-organizing map]]s and [[k-means clustering|''k''-means]] are examples of clustering algorithms; [[Artificial neural network|neural network]] implementations and [[support vector machine]] models are examples of common machine learning algorithms.
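As an illustration of one clustering algorithm mentioned above, the following is a minimal sketch of ''k''-means (Lloyd's algorithm) on one-dimensional data. The expression values and starting centroids are hypothetical, and the function name `kmeans_1d` is an assumption for this example; real analyses would normally use an established library implementation on multidimensional data.

```python
def kmeans_1d(values, centroids, n_iter=100):
    """Lloyd's algorithm on scalars; `centroids` holds the starting centers."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical expression levels forming two visible groups.
expression = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centers, groups = kmeans_1d(expression, centroids=[0.0, 6.0])
print(centers)
print(groups)
```

The result depends on the starting centroids, which is why practical implementations usually run the algorithm from several initializations.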
Line 251 ⟶ 254:
=== Quantitative genetics ===
The study of [[
However, QTL mapping resolution is impaired by the amount of recombination assayed, a problem for species in which it is difficult to obtain large numbers of offspring. Furthermore, allele diversity is restricted to individuals originating from contrasting parents, which limits studies of allele diversity when we have a panel of individuals representing a natural population.<ref>{{cite journal|doi=10.1186/1746-4811-9-29|pmid=23876160|pmc=3750305|title=The advantages and limitations of trait analysis with GWAS: A review|journal=Plant Methods|volume=9|pages=29|year=2013|last1=Korte|first1=Arthur|last2=Farlow|first2=Ashley |doi-access=free }}</ref> For this reason, the [[
In [[Animal breeding|animal]] and [[plant breeding]], the use of markers in [[Selective breeding|selection]] aiming for breeding, mainly the molecular ones, contributed to the development of [[marker-assisted selection]]. While QTL mapping is limited by resolution, GWAS does not have enough power to detect rare variants of small effect that are also influenced by the environment. So, the concept of Genomic Selection (GS) arises in order to use all molecular markers in the selection and allow the prediction of the performance of candidates in this selection. The proposal is to genotype and phenotype a training population, then develop a model that can obtain the genomic estimated breeding values (GEBVs) of individuals belonging to a genotyped but not phenotyped population, called the testing population.<ref>{{cite journal|doi=10.1016/j.tplants.2017.08.011|pmid=28965742|title=Genomic Selection in Plant Breeding: Methods, Models, and Perspectives|journal=Trends in Plant Science|volume=22|issue=11|pages=961–975|year=2017|last1=Crossa|first1=José|last2=Pérez-Rodríguez|first2=Paulino|last3=Cuevas|first3=Jaime|last4=Montesinos-López|first4=Osval|last5=Jarquín|first5=Diego|last6=De Los Campos|first6=Gustavo|last7=Burgueño|first7=Juan|last8=González-Camacho|first8=Juan M|last9=Pérez-Elizalde|first9=Sergio|last10=Beyene|first10=Yoseph|last11=Dreisigacker|first11=Susanne|last12=Singh|first12=Ravi|last13=Zhang|first13=Xuecai|last14=Gowda|first14=Manje|last15=Roorkiwal|first15=Manish|last16=Rutkoski|first16=Jessica|last17=Varshney|first17=Rajeev K|bibcode=2017TPS....22..961C |url=http://oar.icrisat.org/10280/1/Genomic%20Selection%20in%20Plant%20Breeding%20Methods%2C%20Models%2C%20and%20Perspectives.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://oar.icrisat.org/10280/1/Genomic%20Selection%20in%20Plant%20Breeding%20Methods%2C%20Models%2C%20and%20Perspectives.pdf |archive-date=2022-10-09 |url-status=live}}</ref> This kind of study could also include a validation population, following the concept of [[cross-validation (statistics)|cross-validation]], in which the real phenotype results measured in this population are compared with the predicted phenotype results in order to check the accuracy of the model.
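The train-then-validate idea can be sketched in a deliberately simplified form. Here each individual is summarized by a single hypothetical marker score (a count of favourable alleles) and the model is an ordinary least-squares line, which is a stand-in assumption and not the multi-marker models used in actual genomic selection. All data are simulated, and the helper names `fit_line` and `pearson` are made up for this example.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Simulated individuals: marker score in 0..20, phenotype = 50 + 2 * score + noise.
rng = random.Random(1)
scores = [rng.randint(0, 20) for _ in range(200)]
phenotypes = [50 + 2 * s + rng.gauss(0, 3) for s in scores]

# Training population (genotyped and phenotyped) and validation population.
train_x, train_y = scores[:150], phenotypes[:150]
valid_x, valid_y = scores[150:], phenotypes[150:]

a, b = fit_line(train_x, train_y)
predicted = [a + b * x for x in valid_x]

# Accuracy: correlation between predicted and measured phenotypes in validation.
accuracy = pearson(predicted, valid_y)
print(f"intercept={a:.2f}, slope={b:.2f}, predictive accuracy={accuracy:.2f}")
```

Only the training individuals are used to fit the model; the validation phenotypes are held back and used solely to judge how well the predictions agree with measurements.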
As a summary, some points about the application of quantitative genetics are:
* This has been used in agriculture to improve crops ([[Plant breeding]]) and [[livestock]] ([[Animal breeding]]).
* In biomedical research, this work can assist in finding candidate [[gene]] [[allele]]s that can cause or influence predisposition to diseases in [[human genetics]].
=== Expression data ===
Studies of differential gene expression from [[RNA-Seq]] data, as with [[Real-time polymerase chain reaction|RT-qPCR]] and [[microarrays]], demand comparison of conditions. The goal is to identify genes which have a significant change in abundance between different conditions. Experiments are therefore designed appropriately, with replicates for each condition/treatment, and with randomization and blocking when necessary. In RNA-Seq, the quantification of expression uses the information of mapped reads that are summarized in some genetic unit, such as [[exon]]s that are part of a gene sequence. While [[microarray]] results can be approximated by a normal distribution, RNA-Seq count data are better explained by other distributions. The first distribution used was the [[Poisson distribution|Poisson]], but it underestimates the sampling error, leading to false positives. Currently, biological variation is accounted for by methods that estimate a dispersion parameter of a [[negative binomial distribution]].
[[Generalized linear model]]s are used to perform the tests for statistical significance and, as the number of genes is high, multiple testing correction has to be considered.<ref>{{cite journal| doi =10.1186/gb-2010-11-12-220| pmid =21176179| pmc =3046478| title =From RNA-seq reads to differential expression results| journal =Genome Biology| volume =11| issue =12| pages =220| year =2010| last1 =Oshlack| first1 =Alicia| last2 =Robinson| first2 =Mark D| last3 =Young| first3 =Matthew D| doi-access =free}}</ref> Other examples of analysis of [[genomics]] data come from microarray or [[proteomics]] experiments,<ref>{{cite book|title=Statistical Analysis of Gene Expression Microarray Data|author1=Helen Causton |author2=John Quackenbush |author3=Alvis Brazma |publisher=Wiley-Blackwell|year=2003}}</ref><ref>{{cite book|title=Microarray Gene Expression Data Analysis: A Beginner's Guide|author=Terry Speed|publisher=Chapman & Hall/CRC|year=2003}}</ref> often concerning diseases or disease stages.<ref>{{cite book|title=Medical Biostatistics for Complex Diseases|author1=Frank Emmert-Streib |author2=Matthias Dehmer |publisher=Wiley-Blackwell|year=2010|isbn= 978-3-527-32585-6}}</ref>
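One widely used multiple testing correction is the Benjamini–Hochberg procedure, which controls the false discovery rate. The text does not name a specific method, so choosing this one is an assumption made for illustration; the p-values below are hypothetical and the function name `benjamini_hochberg` is made up for this sketch.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a differential expression test.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_values)

# Genes declared significant at a 5% false discovery rate.
significant = [q <= 0.05 for q in q_values]
print(q_values)
print(significant)
```

In this example, two p-values that fall below 0.05 before correction (0.039 and 0.041) are no longer significant afterwards, which shows why the correction matters when thousands of genes are tested at once.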
=== Other studies ===
* [[Ecology]], [[ecological forecasting]]
* Biological [[sequence analysis]]<ref>{{cite book|title=Statistical Methods in Bioinformatics: An Introduction|author1=Warren J. Ewens |author2=Gregory R. Grant |publisher=Springer|year=2004}}</ref>
* [[Systems biology]] for gene network inference or pathways analysis.<ref>{{cite book|title=Applied Statistics for Network Biology: Methods in Systems Biology|author1=Matthias Dehmer |author2=Frank Emmert-Streib |author3=Armin Graber |author4=Armindo Salvador |publisher=Wiley-Blackwell|year=2011|isbn= 978-3-527-32750-8}}</ref>
* [[Clinical research]] and pharmaceutical development
* [[Population dynamics]], especially with regard to [[fisheries science]].
* [[Phylogenetics]] and [[evolution]]
* [[Pharmacodynamics]]
* [[Pharmacokinetics]]
* [[Neuroimaging]]
== Tools ==
Many tools can be used for statistical analysis of biological data. Most of them are also useful in other areas of knowledge, covering a large number of applications. Here are brief descriptions of some of them, in alphabetical order:
* [[ASReml]]: Software developed by VSNi<ref name="vsni">{{cite web|url=https://www.vsni.co.uk/|title=Home - VSN International|website=www.vsni.co.uk}}</ref> that can also be used in the R environment as a package. It was developed to estimate variance components under a general linear mixed model using [[restricted maximum likelihood]] (REML). Models with fixed effects and random effects, nested or crossed, are allowed. It makes it possible to investigate different [[Covariance matrix|variance-covariance]] matrix structures.
* CycDesigN:<ref>{{cite web|url=https://www.vsni.co.uk/software/cycdesign/|title=CycDesigN - VSN International|website=www.vsni.co.uk}}</ref> A computer package developed by VSNi<ref name="vsni" /> that helps researchers create experimental designs and analyze data coming from a design in one of the classes handled by CycDesigN. These classes are resolvable, non-resolvable, partially replicated and [[Crossover study|crossover designs]]. It also includes less commonly used designs, the Latinized ones, such as the t-Latinized design.<ref>{{cite journal|last1=Piepho|first1=Hans-Peter|last2=Williams|first2=Emlyn R|last3=Michel|first3=Volker|year=2015|title=Beyond Latin Squares: A Brief Tour of Row-Column Designs|journal=Agronomy Journal|volume=107|issue=6|pages=2263|doi=10.2134/agronj15.0144|bibcode=2015AgrJ..107.2263P }}</ref>
* [[Orange (software)|Orange]]: A programming interface for high-level data processing, data mining and data visualization. It includes tools for gene expression and genomics.<ref name=":4" />
* [[R (programming language)|R]]: An [[open source]] environment and programming language dedicated to statistical computing and graphics. It is an implementation of the [[S (programming language)|S]] language maintained by CRAN.<ref>{{cite web|url=https://cran.r-project.org/|title=The Comprehensive R Archive Network|website=cran.r-project.org}}</ref> In addition to its functions to read data tables, compute descriptive statistics, and develop and evaluate models, its repository contains packages developed by researchers around the world. This allows the development of functions written to deal with the statistical analysis of data that comes from specific applications.<ref>{{cite book|title=Biostatistics explored through R software: An overview|author=Renganathan V|year=2021|ISBN=9789354936586}}</ref> In the case of Bioinformatics, for example, there are packages located in the main repository (CRAN) and in others, such as [[Bioconductor]]. It is also possible to use packages under development that are shared on hosting services such as [[GitHub]].
* [[SAS (software)|SAS]]: Widely used data analysis software, found in universities, services and industry. Developed by a company of the same name ([[SAS Institute]]), it uses the [[SAS language]] for programming.
* PLA 3.0:<ref>{{Cite web|url=https://www.bioassay.de/products/pla-30/|title=PLA 3.0|last=Stegmann|first=Dr Ralf|date=2019-07-01|website=PLA 3.0 – Software for Biostatistical Analysis|language=en|access-date=2019-07-02}}</ref> Biostatistical analysis software for regulated environments (e.g. drug testing) which supports Quantitative Response Assays (Parallel-Line, Parallel-Logistics, Slope-Ratio) and Dichotomous Assays (Quantal Response, Binary Assays). It also supports weighting methods for combination calculations and the automatic data aggregation of independent assay data.
* [[Weka (machine learning)|Weka]]: [[Java (programming language)|Java]] software for [[machine learning]] and [[data mining]], including tools and methods for visualization, clustering, regression, association rules, and classification. There are tools for cross-validation and bootstrapping, and a module for algorithm comparison. Weka can also be run from other programming languages such as Perl or R.<ref name=":4" />
* [[Python (programming language)|Python]]: image analysis, deep learning, machine learning
* [[SQL]]: databases
* [[NoSQL]]
* [[NumPy]]: numerical Python
* [[SciPy]]
* [[SageMath]]
* [[LAPACK]]: linear algebra
* [[MATLAB]]
* [[Apache Hadoop]]
* [[Apache Spark]]
* [[Amazon Web Services]]
== Scope and training programs ==
Line 321 ⟶ 337:
* [https://www.biometricsociety.org/ The International Biometric Society]
* [https://web.archive.org/web/20080827161431/http://www.biostatsresearch.com/repository/ The Collection of Biostatistics Research Archive]
* [http://www.medpagetoday.com/lib/content/Medpage-Guide-to-Biostatistics.pdf Guide to Biostatistics (MedPageToday.com)] {{Webarchive|url=https://web.archive.org/web/20120522144801/http://www.medpagetoday.com/lib/content/Medpage-Guide-to-Biostatistics.pdf |date=2012-05-22 }}
* [https://web.archive.org/web/20150402180351/http://www.biostat.katerynakon.in.ua/en/ Biomedical Statistics]