Materialized view

Materialized view

In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary using an aggregate function. The process of setting up a materialized view is sometimes called materialization. This is a form of caching the results of a query, similar to memoization of the value of a function in functional languages, and it is sometimes described as a form of precomputation. As with other forms of precomputation, database users typically use materialized views for performance reasons, i.e. as a form of optimization. Materialized views that store data based on remote tables were also known as snapshots (deprecated Oracle terminology). In any database management system following the relational model, a view is a virtual table representing the result of a database query. Whenever a query or an update addresses an ordinary view's virtual table, the DBMS converts these into queries or updates against the underlying base tables. A materialized view takes a different approach: the query result is cached as a concrete ("materialized") table (rather than a view as such) that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of extra storage and of some data being potentially out-of-date. Materialized views find use especially in data warehousing scenarios, where frequent queries of the actual base tables can be expensive. In a materialized view, indexes can be built on any column. In contrast, in a normal view, it's typically only possible to exploit indexes on columns that come directly from (or have a mapping to) indexed columns in the base tables; often this functionality is not offered at all. == Implementations == === Oracle === Materialized views were implemented first by the Oracle Database: the Query rewrite feature was added from version 8i. Example syntax to create a materialized view in Oracle: === PostgreSQL === In PostgreSQL, version 9.3 and newer natively support materialized views. In version 9.3, a materialized view is not auto-refreshed, and is populated only at time of creation (unless WITH NO DATA is used). It may be refreshed later manually using REFRESH MATERIALIZED VIEW. In version 9.4, the refresh may be concurrent with selects on the materialized view if CONCURRENTLY is used. Example syntax to create a materialized view in PostgreSQL: === SQL Server === Microsoft SQL Server differs from other RDBMS by the way of implementing materialized view via a concept known as "Indexed Views". The main difference is that such views do not require a refresh because they are in fact always synchronized to the original data of the tables that compound the view. To achieve this, it is necessary that the lines of origin and destination are "deterministic" in their mapping, which limits the types of possible queries to do this. This mechanism has been realised since the 2000 version of SQL Server. Example syntax to create a materialized view in SQL Server: === Stream processing frameworks === Apache Kafka (since v0.10.2), Apache Spark (since v2.0), Apache Flink, Kinetica DB, Materialize, RisingWave, and Epsio all support materialized views on streams of data. === Others === Materialized views are also supported in Sybase SQL Anywhere. In IBM Db2, they are called "materialized query tables". ClickHouse supports materialized views that automatically refresh on merges. MySQL doesn't support materialized views natively, but workarounds can be implemented by using triggers or stored procedures or by using the open-source application Flexviews. Materialized views can be implemented in Amazon DynamoDB using data modification events captured by DynamoDB Streams. Google announced in 8 April 2020 the availability of materialized views for BigQuery as a beta release.

Language model benchmark

A language model benchmark is a standardized test designed to evaluate the performance of language models on various natural language processing tasks. These tests are intended for comparing different models' capabilities in areas such as language understanding, generation, and reasoning. Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like answering questions, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field. In addition to accuracy, the metrics can include throughput, energy efficiency, bias, trust, and sustainability. == Overview == === Types === Benchmarks may be described by the following adjectives, not mutually exclusive: Classical: These tasks are studied in natural language processing, even before the advent of deep learning. Examples include the Penn Treebank for testing syntactic and semantic parsing, as well as bilingual translation benchmarked by BLEU scores. Question answering: These tasks have a text question and a text answer, often multiple-choice. They can be open-book or closed-book. Open-book QA resembles reading comprehension questions, with relevant passages included as annotation in the question, in which the answer appears. Closed-book QA includes no relevant passages. Closed-book QA is also called open-domain question-answering. Before the era of large language models, open-book QA was more common, and understood as testing information retrieval methods. Closed-book QA became common since GPT-2 as a method to measure knowledge stored within model parameters. Omnibus: An omnibus benchmark combines many benchmarks, often previously published. It is intended as an all-in-one benchmarking solution. Reasoning: These tasks are usually in the question-answering format, but are intended to be more difficult than standard question answering. Multimodal: These tasks require processing not only text, but also other modalities, such as images and sound. Examples include OCR and transcription. Agency: These tasks are for a language-model–based software agent that operates a computer for a user, such as editing images, browsing the web, etc. Adversarial: A benchmark is "adversarial" if the items in the benchmark are picked specifically so that certain models do badly on them. Adversarial benchmarks are often constructed after state of the art (SOTA) models have saturated (achieved 100% performance) a benchmark, to renew the benchmark. A benchmark is "adversarial" only at a certain moment in time, since what is adversarial may cease to be adversarial as newer SOTA models appear. Public/Private: A benchmark might be partly or entirely private, meaning that some or all of the questions are not publicly available. The idea is that if a question is publicly available, then it might be used for training, which would be "training on the test set" and invalidate the result of the benchmark. Usually, only the guardians of the benchmark have access to the private subsets, and to score a model on such a benchmark, one must send the model weights, or provide API access, to the guardians. The boundary between a benchmark and a dataset is not sharp. Generally, a dataset contains three "splits": training, test, and validation. Both the test and validation splits are essentially benchmarks. In general, a benchmark is distinguished from a test/validation dataset in that a benchmark is typically intended to be used to measure the performance of many different models that are not trained specifically for doing well on the benchmark, while a test/validation set is intended to be used to measure the performance of models trained specifically on the corresponding training set. In other words, a benchmark may be thought of as a test/validation set without a corresponding training set. Conversely, certain benchmarks may be used as a training set, such as the English Gigaword or the One Billion Word Benchmark, which in modern language is just the negative log-likelihood loss on a pretraining set with 1 billion words. Indeed, the distinction between benchmark and dataset in language models became sharper after the rise of the pretraining paradigm, whereby a model is first trained on massive, unlabeled datasets to learn general language patterns, syntax, and knowledge (pretraining), and the base model is then adapted to specific, downstream tasks using smaller, labeled datasets (fine-tuning). === Lifecycle === Generally, the life cycle of a benchmark consists of the following steps: Inception: A benchmark is published. It can be simply given as a demonstration of the power of a new model (implicitly) that others then picked up as a benchmark, or as a benchmark that others are encouraged to use (explicitly). Growth: More papers and models use the benchmark, and the performance on the benchmark grows. Maturity, degeneration or deprecation: A benchmark may be saturated, after which researchers move on to other benchmarks. Progress on the benchmark may also be neglected as the field moves to focus on other benchmarks. Renewal: A saturated benchmark can be upgraded to make it no longer saturated, allowing further progress. === Construction === Like datasets, benchmarks are typically constructed by several methods, individually or in combination: Web scraping: Ready-made question-answer pairs may be scraped online, such as from websites that teach mathematics and programming. Conversion: Items may be constructed programmatically from scraped web content, such as by blanking out named entities from sentences, and asking the model to fill in the blank. This was used for making the CNN/Daily Mail Reading Comprehension Task. Crowd sourcing: Items may be constructed by paying people to write them, such as on Amazon Mechanical Turk. This was used for making the MCTest. === Evaluation === Generally, benchmarks are fully automated. This limits the questions that can be asked. For example, with mathematical questions, "proving a claim" would be difficult to automatically check, while "calculate an answer with a unique integer answer" would be automatically checkable. With programming tasks, the answer can generally be checked by running unit tests, with an upper limit on runtime. The benchmark scores are of the following kinds: For multiple choice or cloze questions, common scores are accuracy (frequency of correct answer), precision, recall, F1 score, etc. pass@n: The model is given n {\displaystyle n} attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems. k@n: The model makes n {\displaystyle n} attempts to solve each problem, but only k {\displaystyle k} attempts out of them are selected for submission. If any submission is correct, the model earns a point. The k@n score is the model's average score over all problems. cons@n: The model is given n {\displaystyle n} attempts to solve each problem. If the most common answer is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting". The pass@n score can be estimated more accurately by making N > n {\displaystyle N>n} attempts, and use the unbiased estimator 1 − ( N − c n ) ( N n ) {\displaystyle 1-{\frac {\binom {N-c}{n}}{\binom {N}{n}}}} , where c {\displaystyle c} is the number of correct attempts. For less well-formed tasks, where the output can be any sentence, there are the following commonly used scores including BLEU ROUGE, METEOR, NIST, word error rate, LEPOR, CIDEr, and SPICE. === Issues === error: Some benchmark answers may be wrong. ambiguity: Some benchmark questions may be ambiguously worded. subjective: Some benchmark questions may not have an objective answer at all. This problem generally prevents creative writing benchmarks. Similarly, this prevents benchmarking writing proofs in natural language, though benchmarking proofs in a formal language is possible. open-ended: Some benchmark questions may not have a single answer of a fixed size. This problem generally prevents programming benchmarks from using more natural tasks such as "write a program for X", and instead uses tasks such as "write a function that implements specification X". inter-annotator agreement: Some benchmark questions may be not fully objective, such that even people would not agree with 100% on what the answer should be. This is common in natural language processing tasks, such as syntactic annotation. shortcut: Some benchmark questions may be easily solved by an "unintended" shortcut. For example, in the SNLI benchmark, having a negative word like "not" in the second sentence is a strong signal for the "Contradiction" category, regardless of what the se

Bioz

Bioz is a search engine for life science experimentation. == History == Bioz was founded by Karin Lachmi and Daniel Levitt. Lachmi is a scientist who completed her postdoc in molecular and cellular biology at the Stanford University School of Medicine. During her lab work she found little available data regarding preferable lab tools, reagents and related products for experimentation. There are 50,000 vendors selling 300 million scientific products. She decided to start the company in order to provide researchers with adequate information for that purpose. Co-founder Daniel Levitt is an entrepreneur who sold his company WebAppoint to Microsoft in the year 2000. He also co-founded the company StemRad. At Bioz, Lachmi serves as the Chief Scientific Officer and Levitt serves as the chief executive officer. Bioz claims to have over a million researcher-users from 196 countries. Among the investors are Esther Dyson and the Stanford-StartX Fund. The company's advisory board includes Nobel Laureates in Chemistry Michael Levitt, Roger Kornberg, and Ada Yonath. == Technology == The company uses artificial intelligence, machine learning and natural language processing in order to extract experimentation data from scientific articles, such as the products that researchers used, the companies that supply the products, the protocol conditions that researchers selected, and the types of experiments and techniques. The algorithm ranks products based on how frequently they were used by researchers in their experiments, how recently a product was used, and the impact factor of the journal. The algorithm's output is a Bioz stars score for each product that was mentioned in an article. Bioz is a data-driven platform for product recommendations, which is contrary to platforms such as TripAdvisor and OpenTable that are based on user-generated reviews and ratings. The recommendations and scoring system that the company has developed are meant to assist researchers with the process of developing future medications and finding cures for diseases. They are guided towards products and techniques that were previously used by other researchers when planning and performing experiments. The company's revenue is based on selling SaaS subscriptions to researchers in biopharma companies. They also charge product suppliers for content syndication.

Margin classifier

In machine learning (ML), a margin classifier is a type of classification model which is able to give an associated distance from the decision boundary for each data sample. For instance, if a linear classifier is used, the distance (typically Euclidean, though others may be used) of a sample from the separating hyperplane is the margin of that sample. The notion of margins is important in several ML classification algorithms, as it can be used to bound the generalization error of these classifiers. These bounds are frequently shown using the VC dimension. The generalization error bound in boosting algorithms and support vector machines is particularly prominent. == Margin for boosting algorithms == The margin for an iterative boosting algorithm given a dataset with two classes can be defined as follows: the classifier is given a sample pair ( x , y ) {\displaystyle (x,y)} , where x ∈ X {\displaystyle x\in X} is a domain space and y ∈ Y = { − 1 , + 1 } {\displaystyle y\in Y=\{-1,+1\}} is the sample's label. The algorithm then selects a classifier h j ∈ C {\displaystyle h_{j}\in C} at each iteration j {\displaystyle j} where C {\displaystyle C} is a space of possible classifiers that predict real values. This hypothesis is then weighted by α j ∈ R {\displaystyle \alpha _{j}\in R} as selected by the boosting algorithm. At iteration t {\displaystyle t} , the margin of a sample x {\displaystyle x} can thus be defined as y ∑ j t α j h j ( x ) ∑ | α j | . {\displaystyle {\frac {y\sum _{j}^{t}\alpha _{j}h_{j}(x)}{\sum |\alpha _{j}|}}.} By this definition, the margin is positive if the sample is labeled correctly, or negative if the sample is labeled incorrectly. This definition may be modified and is not the only way to define the margin for boosting algorithms. However, there are reasons why this definition may be appealing. == Examples of margin-based algorithms == Many classifiers can give an associated margin for each sample. However, only some classifiers utilize information of the margin while learning from a dataset. Many boosting algorithms rely on the notion of a margin to assign weight to samples. If a convex loss is utilized (as in AdaBoost or LogitBoost, for instance) then a sample with a higher margin will receive less (or equal) weight than a sample with a lower margin. This leads the boosting algorithm to focus weight on low-margin samples. In non-convex algorithms (e.g., BrownBoost), the margin still dictates the weighting of a sample, though the weighting is non-monotone with respect to the margin. == Generalization error bounds == One theoretical motivation behind margin classifiers is that their generalization error may be bound by the algorithm parameters and a margin term. An example of such a bound is for the AdaBoost algorithm. Let S {\displaystyle S} be a set of m {\displaystyle m} data points, sampled independently at random from a distribution D {\displaystyle D} . Assume the VC-dimension of the underlying base classifier is d {\displaystyle d} and m ≥ d ≥ 1 {\displaystyle m\geq d\geq 1} . Then, with probability 1 − δ {\displaystyle 1-\delta } , we have the bound: P D ( y ∑ j t α j h j ( x ) ∑ | α j | ≤ 0 ) ≤ P S ( y ∑ j t α j h j ( x ) ∑ | α j | ≤ θ ) + O ( 1 m d log 2 ⁡ ( m / d ) / θ 2 + log ⁡ ( 1 / δ ) ) {\displaystyle P_{D}\left({\frac {y\sum _{j}^{t}\alpha _{j}h_{j}(x)}{\sum |\alpha _{j}|}}\leq 0\right)\leq P_{S}\left({\frac {y\sum _{j}^{t}\alpha _{j}h_{j}(x)}{\sum |\alpha _{j}|}}\leq \theta \right)+O\left({\frac {1}{\sqrt {m}}}{\sqrt {d\log ^{2}(m/d)/\theta ^{2}+\log(1/\delta )}}\right)} for all θ > 0 {\displaystyle \theta >0} .

Consensus clustering

Consensus clustering is a method of aggregating (potentially conflicting) results from multiple clustering algorithms. Also called cluster ensembles or aggregation of clustering (or partitions), it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings. Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete, even when the number of input clusterings is three. Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning. == Issues with existing clustering techniques == Current clustering techniques do not address all the requirements adequately. Dealing with large number of dimensions and large number of data items can be problematic because of time complexity; Effectiveness of the method depends on the definition of "distance" (for distance-based clustering) If an obvious distance measure doesn't exist, we must "define" it, which is not always easy, especially in multidimensional spaces. The result of the clustering algorithm (that, in many cases, can be arbitrary itself) can be interpreted in different ways. == Justification for using consensus clustering == There are potential shortcomings for all existing clustering techniques. This may cause interpretation of results to become difficult, especially when there is no knowledge about the number of clusters. Clustering methods are also very sensitive to the initial clustering settings, which can cause non-significant data to be amplified in non-reiterative methods. An extremely important issue in cluster analysis is the validation of the clustering results, that is, how to gain confidence about the significance of the clusters provided by the clustering technique (cluster numbers and cluster assignments). Lacking an external objective criterion (the equivalent of a known class label in supervised analysis), this validation becomes somewhat elusive. Iterative descent clustering methods, such as the SOM and k-means clustering circumvent some of the shortcomings of hierarchical clustering by providing for univocally defined clusters and cluster boundaries. Consensus clustering provides a method that represents the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data, and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as K-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. It can provide data for a visualization tool to inspect cluster number, membership, and boundaries. However, they lack the intuitive and visual appeal of hierarchical clustering dendrograms, and the number of clusters must be chosen a priori. == The Monti consensus clustering algorithm == The Monti consensus clustering algorithm is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, K {\displaystyle K} . Given a dataset of N {\displaystyle N} total number of points to cluster, this algorithm works by resampling and clustering the data, for each K {\displaystyle K} and a N × N {\displaystyle N\times N} consensus matrix is calculated, where each element represents the fraction of times two samples clustered together. A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or not together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal K {\displaystyle K} . More specifically, given a set of points to cluster, D = { e 1 , e 2 , . . . e N } {\displaystyle D=\{e_{1},e_{2},...e_{N}\}} , let D 1 , D 2 , . . . , D H {\displaystyle D^{1},D^{2},...,D^{H}} be the list of H {\displaystyle H} perturbed (resampled) datasets of the original dataset D {\displaystyle D} , and let M h {\displaystyle M^{h}} denote the N × N {\displaystyle N\times N} connectivity matrix resulting from applying a clustering algorithm to the dataset D h {\displaystyle D^{h}} . The entries of M h {\displaystyle M^{h}} are defined as follows: M h ( i , j ) = { 1 , if points i and j belong to the same cluster 0 , otherwise {\displaystyle M^{h}(i,j)={\begin{cases}1,&{\text{if}}{\text{ points i and j belong to the same cluster}}\\0,&{\text{otherwise}}\end{cases}}} Let I h {\displaystyle I^{h}} be the N × N {\displaystyle N\times N} identicator matrix where the ( i , j ) {\displaystyle (i,j)} -th entry is equal to 1 if points i {\displaystyle i} and j {\displaystyle j} are in the same perturbed dataset D h {\displaystyle D^{h}} , and 0 otherwise. The indicator matrix is used to keep track of which samples were selected during each resampling iteration for the normalisation step. The consensus matrix C {\displaystyle C} is defined as the normalised sum of all connectivity matrices of all the perturbed datasets and a different one is calculated for every K {\displaystyle K} . C ( i , j ) = ( ∑ h = 1 H M h ( i , j ) ∑ h = 1 H I h ( i , j ) ) {\displaystyle C(i,j)=\left({\frac {\textstyle \sum _{h=1}^{H}M^{h}(i,j)\displaystyle }{\sum _{h=1}^{H}I^{h}(i,j)}}\right)} That is the entry ( i , j ) {\displaystyle (i,j)} in the consensus matrix is the number of times points i {\displaystyle i} and j {\displaystyle j} were clustered together divided by the total number of times they were selected together. The matrix is symmetric and each element is defined within the range [ 0 , 1 ] {\displaystyle [0,1]} . A consensus matrix is calculated for each K {\displaystyle K} to be tested, and the stability of each matrix, that is how far the matrix is towards a matrix of perfect stability (just zeros and ones) is used to determine the optimal K {\displaystyle K} . One way of quantifying the stability of the K {\displaystyle K} th consensus matrix is examining its CDF curve (see below). == Over-interpretation potential of the Monti consensus clustering algorithm == Monti consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution as shown by Şenbabaoğlu et al. It has been shown that the Monti consensus clustering algorithm is able to claim apparent stability of chance partitioning of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study. If clusters are not well separated, consensus clustering could lead one to conclude apparent structure when there is none, or declare cluster stability when it is subtle. Identifying false positive clusters is a common problem throughout cluster research, and has been addressed by methods such as SigClust and the GAP-statistic. However, these methods rely on certain assumptions for the null model that may not always be appropriate. Şenbabaoğlu et al demonstrated the original delta K metric to decide K {\displaystyle K} in the Monti algorithm performed poorly, and proposed a new superior metric for measuring the stability of consensus matrices using their CDF curves. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, whereas the middle segment represent those with ambiguous assignments in different clustering runs. The proportion of ambiguous clustering (PAC) score measure quantifies this middle segment; and is defined as the fraction of sample pairs with consensus indices falling in the interval (u1, u2) ∈ [0, 1] where u1 is a value close to 0 and u2 is a value close to 1 (for instance u1=0.1 and u2=0.9). A low value of PAC indicates a flat middle segment, and a low rate of discordant assignments across permuted clustering runs. One can therefore infer the optimal number of clusters by the K {\displaystyle K} value having the lowest PAC. == Related work == Clustering ensemble (Strehl and Ghosh): They considered various formulations for the problem, most of which reduce the problem to a hyper-graph partitioning problem. In one of their formulations they considered the same graph as in the correlation clustering problem. The solution they proposed is to compute the best k-partition of the graph, which does not take into account the penalty for merging two nodes that are far apart. Clustering aggregation (Fern and Brodley): They applied the clustering aggregation idea to a collection of soft clusterings they obtained by random projections. They used an agglomerative algorithm

Cheekd

Cheekd is a dating app based in New York City. It was founded in 2010 by Lori Cheek. == History == The service debuted with the name "Cheek'd". Founder Lori Cheek appeared on the television program, Shark Tank in February 2014, but did not succeed in obtaining funding from any of the five judges. She said Cheek’d only had 1000 subscribers at that time. === Business card model === Cheek'd offered two plans, paid and free. For $25, subscribers got a set of 50 business cards that could be given out once someone caught their eye. Each card had a phrase, an online code, and a URL to the subscriber's account. Recipients could look up the giver's profile. In addition to purchasing cards, there was a $9.95 monthly membership fee. === Smartphone app === In 2015, the service's name changed from "Cheek'd" to "Cheekd". The new app used Bluetooth technology to alert users whenever a compatible user was within a 30-foot radius, instead of using cards. == Patent lawsuit == The original business card-based model for Cheekd had been claimed as a patented process by Lori Cheek, as U.S. patent 8,543,465. In September 2017, a complaint was filed, alleging that the idea was not original to Lori Cheek. Cheek responded, stating that the complaint was baseless, and a complete fabrication. The lawsuit Pirri v. Cheek was dismissed in a pre-trial conference in New York's Federal Court on April 5, 2018.

Tanagra (machine learning)

Tanagra is a free suite of machine learning software for research and academic purposes developed by Ricco Rakotomalala at the Lumière University Lyon 2, France. Tanagra supports several standard data mining tasks such as: Visualization, Descriptive statistics, Instance selection, feature selection, feature construction, regression, factor analysis, clustering, classification and association rule learning. Tanagra is an academic project. It is widely used in French-speaking universities. Tanagra is frequently used in real studies and in software comparison papers. == History == The development of Tanagra was started in June 2003. The first version was distributed in December 2003. Tanagra is the successor of Sipina, another free data mining tool which is intended only for supervised learning tasks (classification), especially the interactive and visual construction of decision trees. Sipina is still available online and is maintained. Tanagra is an "open source project" as every researcher can access the source code and add their own algorithms, as long as they agree and conform to the software distribution license. The main purpose of the Tanagra project is to give researchers and students a user-friendly data mining software, conforming to the present norms of the software development in this domain (especially in the design of its GUI and the way to use it), and allowing the analyzation of either real or synthetic data. From 2006, Ricco Rakotomalala made an important documentation effort. A large number of tutorials are published on a dedicated website. They describe the statistical and machine learning methods and their implementation with Tanagra on real case studies. The use of other free data mining tools on the same problems is also widely described. The comparison of the tools enables readers to understand the possible differences in the presentation of results. == Description == Tanagra works similarly to current data mining tools. The user can design visually a data mining process in a diagram. Each node is a statistical or machine learning technique, the connection between two nodes represents the data transfer. But unlike the majority of tools which are based on the workflow paradigm, Tanagra is very simplified. The treatments are represented in a tree diagram. The results are displayed in an HTML format. This makes it is easy to export the outputs in order to visualize the results in a browser. It is also possible to copy the result tables to a spreadsheet. Tanagra makes a good compromise between statistical approaches (e.g. parametric and nonparametric statistical tests), multivariate analysis methods (e.g. factor analysis, correspondence analysis, cluster analysis, regression) and machine learning techniques (e.g. neural network, support vector machine, decision trees, random forest).