AI For Business For Dummies

AI For Business For Dummies — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Tandem (app)

    Tandem (app)

    Tandem is a mobile language exchange and language learning app. == History == Tandem was founded in Hannover, Germany in 2014 by Arnd Aschentrup, Tobias Dickmeis, and Matthias Kleimann. Prior to founding Tandem, the trio had launched Vive, a members-only mobile video chat platform. Tandem has been criticised for not accepting members into the community immediately, as opposed to competitors including HelloTalk, Speaky or Cafehub. In some countries, there is a waiting list and applicants can wait up to seven days for their application to be processed by human moderators. In 2015, Tandem completed its first funding round (seed funding) of €600,000. Participating investors included business angels such as Atlantic Labs (Christophe Maire), Hannover Beteiligungsfonds, Marcus Englert (Chairman of the Supervisory Board of Rocket Internet SE ), Catagonia, Ludwig zu Salm, Florian Langenscheidt, Heiko Hubertz, Martin Sinner, and Zehden Enterprises. In 2016, the company received a further €2 million from new investors Rubylight and Faber Ventures, as well as from existing investors Hannover Beteiligungsfonds, Atlantic Labs, and Zehden Enterprises. Since 2018, the premium membership Tandem Pro has been available, which offers members unlimited access to all language learning features of the app as well as the removal of advertising for a monthly fee.

    Read more →
  • FastICA

    FastICA

    FastICA is an efficient and popular algorithm for independent component analysis invented by Aapo Hyvärinen at Helsinki University of Technology. Like most ICA algorithms, FastICA seeks an orthogonal rotation of prewhitened data, through a fixed-point iteration scheme, that maximizes a measure of non-Gaussianity of the rotated components. Non-gaussianity serves as a proxy for statistical independence, which is a very strong condition and requires infinite data to verify. FastICA can also be alternatively derived as an approximative Newton iteration. == Algorithm == === Prewhitening the data === Let the X := ( x i j ) ∈ R N × M {\displaystyle \mathbf {X} :=(x_{ij})\in \mathbb {R} ^{N\times M}} denote the input data matrix, M {\displaystyle M} the number of columns corresponding with the number of samples of mixed signals and N {\displaystyle N} the number of rows corresponding with the number of independent source signals. The input data matrix X {\displaystyle \mathbf {X} } must be prewhitened, or centered and whitened, before applying the FastICA algorithm to it. Centering the data entails demeaning each component of the input data X {\displaystyle \mathbf {X} } , that is, for each i = 1 , … , N {\displaystyle i=1,\ldots ,N} and j = 1 , … , M {\displaystyle j=1,\ldots ,M} . After centering, each row of X {\displaystyle \mathbf {X} } has an expected value of 0 {\displaystyle 0} . Whitening the data requires a linear transformation L : R N × M → R N × M {\displaystyle \mathbf {L} :\mathbb {R} ^{N\times M}\to \mathbb {R} ^{N\times M}} of the centered data so that the components of L ( X ) {\displaystyle \mathbf {L} (\mathbf {X} )} are uncorrelated and have variance one. More precisely, if X {\displaystyle \mathbf {X} } is a centered data matrix, the covariance of L x := L ( X ) {\displaystyle \mathbf {L} _{\mathbf {x} }:=\mathbf {L} (\mathbf {X} )} is the ( N × N ) {\displaystyle (N\times N)} -dimensional identity matrix, that is, A common method for whitening is by performing an eigenvalue decomposition on the covariance matrix of the centered data X {\displaystyle \mathbf {X} } , E { X X T } = E D E T {\displaystyle E\left\{\mathbf {X} \mathbf {X} ^{T}\right\}=\mathbf {E} \mathbf {D} \mathbf {E} ^{T}} , where E {\displaystyle \mathbf {E} } is the matrix of eigenvectors and D {\displaystyle \mathbf {D} } is the diagonal matrix of eigenvalues. The whitened data matrix is defined thus by === Single component extraction === The iterative algorithm finds the direction for the weight vector w ∈ R N {\displaystyle \mathbf {w} \in \mathbb {R} ^{N}} that maximizes a measure of non-Gaussianity of the projection w T X {\displaystyle \mathbf {w} ^{T}\mathbf {X} } , with X ∈ R N × M {\displaystyle \mathbf {X} \in \mathbb {R} ^{N\times M}} denoting a prewhitened data matrix as described above. Note that w {\displaystyle \mathbf {w} } is a column vector. To measure non-Gaussianity, FastICA relies on a nonquadratic nonlinear function f ( u ) {\displaystyle f(u)} , its first derivative g ( u ) {\displaystyle g(u)} , and its second derivative g ′ ( u ) {\displaystyle g^{\prime }(u)} . Hyvärinen states that the functions are useful for general purposes, while may be highly robust. The steps for extracting the weight vector w {\displaystyle \mathbf {w} } for single component in FastICA are the following: Randomize the initial weight vector w {\displaystyle \mathbf {w} } Let w + ← E { X g ( w T X ) T } − E { g ′ ( w T X ) } w {\displaystyle \mathbf {w} ^{+}\leftarrow E\left\{\mathbf {X} g(\mathbf {w} ^{T}\mathbf {X} )^{T}\right\}-E\left\{g'(\mathbf {w} ^{T}\mathbf {X} )\right\}\mathbf {w} } , where E { . . . } {\displaystyle E\left\{...\right\}} means averaging over all column-vectors of matrix X {\displaystyle \mathbf {X} } Let w ← w + / ‖ w + ‖ {\displaystyle \mathbf {w} \leftarrow \mathbf {w} ^{+}/\|\mathbf {w} ^{+}\|} If not converged, go back to 2 === Multiple component extraction === The single unit iterative algorithm estimates only one weight vector which extracts a single component. Estimating additional components that are mutually "independent" requires repeating the algorithm to obtain linearly independent projection vectors - note that the notion of independence here refers to maximizing non-Gaussianity in the estimated components. Hyvärinen provides several ways of extracting multiple components with the simplest being the following. Here, 1 M {\displaystyle \mathbf {1_{M}} } is a column vector of 1's of dimension M {\displaystyle M} . Algorithm FastICA Input: C {\displaystyle C} Number of desired components Input: X ∈ R N × M {\displaystyle \mathbf {X} \in \mathbb {R} ^{N\times M}} Prewhitened matrix, where each column represents an N {\displaystyle N} -dimensional sample, where C <= N {\displaystyle C<=N} Output: W ∈ R N × C {\displaystyle \mathbf {W} \in \mathbb {R} ^{N\times C}} Un-mixing matrix where each column projects X {\displaystyle \mathbf {X} } onto independent component. Output: S ∈ R C × M {\displaystyle \mathbf {S} \in \mathbb {R} ^{C\times M}} Independent components matrix, with M {\displaystyle M} columns representing a sample with C {\displaystyle C} dimensions. for p in 1 to C: w p ← {\displaystyle \mathbf {w_{p}} \leftarrow } Random vector of length N while w p {\displaystyle \mathbf {w_{p}} } changes w p ← 1 M X g ( w p T X ) T − 1 M g ′ ( w p T X ) 1 M w p {\displaystyle \mathbf {w_{p}} \leftarrow {\frac {1}{M}}\mathbf {X} g(\mathbf {w_{p}} ^{T}\mathbf {X} )^{T}-{\frac {1}{M}}g'(\mathbf {w_{p}} ^{T}\mathbf {X} )\mathbf {1_{M}} \mathbf {w_{p}} } w p ← w p − ∑ j = 1 p − 1 ( w p T w j ) w j {\displaystyle \mathbf {w_{p}} \leftarrow \mathbf {w_{p}} -\sum _{j=1}^{p-1}(\mathbf {w_{p}} ^{T}\mathbf {w_{j}} )\mathbf {w_{j}} } w p ← w p ‖ w p ‖ {\displaystyle \mathbf {w_{p}} \leftarrow {\frac {\mathbf {w_{p}} }{\|\mathbf {w_{p}} \|}}} output W ← [ w 1 , … , w C ] {\displaystyle \mathbf {W} \leftarrow {\begin{bmatrix}\mathbf {w_{1}} ,\dots ,\mathbf {w_{C}} \end{bmatrix}}} output S ← W T X {\displaystyle \mathbf {S} \leftarrow \mathbf {W^{T}} \mathbf {X} }

    Read more →
  • Algorithmic learning theory

    Algorithmic learning theory

    Algorithmic learning theory is a mathematical framework for analyzing machine learning problems and algorithms. Synonyms include formal learning theory and algorithmic inductive inference. Algorithmic learning theory is different from statistical learning theory in that it does not make use of statistical assumptions and analysis. Both algorithmic and statistical learning theory are concerned with machine learning and can thus be viewed as branches of computational learning theory. == Distinguishing characteristics == Unlike statistical learning theory and most statistical theory in general, algorithmic learning theory does not assume that data are random samples, that is, that data points are independent of each other. This makes the theory suitable for domains where observations are (relatively) noise-free but not random, such as language learning and automated scientific discovery. The fundamental concept of algorithmic learning theory is learning in the limit: as the number of data points increases, a learning algorithm should converge to a correct hypothesis on every possible data sequence consistent with the problem space. This is a non-probabilistic version of statistical consistency, which also requires convergence to a correct model in the limit, but allows a learner to fail on data sequences with probability measure 0 . Algorithmic learning theory investigates the learning power of Turing machines. Other frameworks consider a much more restricted class of learning algorithms than Turing machines, for example, learners that compute hypotheses more quickly, for instance in polynomial time. An example of such a framework is probably approximately correct learning . == Learning in the limit == The concept was introduced in E. Mark Gold's seminal paper "Language identification in the limit". The objective of language identification is for a machine running one program to be capable of developing another program by which any given sentence can be tested to determine whether it is "grammatical" or "ungrammatical". The language being learned need not be English or any other natural language - in fact the definition of "grammatical" can be absolutely anything known to the tester. In Gold's learning model, the tester gives the learner an example sentence at each step, and the learner responds with a hypothesis, which is a suggested program to determine grammatical correctness. It is required of the tester that every possible sentence (grammatical or not) appears in the list eventually, but no particular order is required. It is required of the learner that at each step the hypothesis must be correct for all the sentences so far. A particular learner is said to be able to "learn a language in the limit" if there is a certain number of steps beyond which its hypothesis no longer changes. At this point it has indeed learned the language, because every possible sentence appears somewhere in the sequence of inputs (past or future), and the hypothesis is correct for all inputs (past or future), so the hypothesis is correct for every sentence. The learner is not required to be able to tell when it has reached a correct hypothesis, all that is required is that it be true. Gold showed that any language which is defined by a Turing machine program can be learned in the limit by another Turing-complete machine using enumeration. This is done by the learner testing all possible Turing machine programs in turn until one is found which is correct so far - this forms the hypothesis for the current step. Eventually, the correct program will be reached, after which the hypothesis will never change again (but note that the learner does not know that it won't need to change). Gold also showed that if the learner is given only positive examples (that is, only grammatical sentences appear in the input, not ungrammatical sentences), then the language can only be guaranteed to be learned in the limit if there are only a finite number of possible sentences in the language (this is possible if, for example, sentences are known to be of limited length). Language identification in the limit is a highly abstract model. It does not allow for limits of runtime or computer memory which can occur in practice, and the enumeration method may fail if there are errors in the input. However the framework is very powerful, because if these strict conditions are maintained, it allows the learning of any program known to be computable. This is because a Turing machine program can be written to mimic any program in any conventional programming language. See Church-Turing thesis. == Other identification criteria == Learning theorists have investigated other learning criteria, such as the following. Efficiency: minimizing the number of data points required before convergence to a correct hypothesis. Mind Changes: minimizing the number of hypothesis changes that occur before convergence. Mind change bounds are closely related to mistake bounds that are studied in statistical learning theory. Kevin Kelly has suggested that minimizing mind changes is closely related to choosing maximally simple hypotheses in the sense of Occam’s Razor. == Annual conference == Since 1990, there is an International Conference on Algorithmic Learning Theory (ALT), called Workshop in its first years (1990–1997). Between 1992 and 2016, proceedings were published in the LNCS series. Starting from 2017, they are published by the Proceedings of Machine Learning Research. The 34th conference will be held in Singapore in Feb 2023. The topics of the conference cover all of theoretical machine learning, including statistical and computational learning theory, online learning, active learning, reinforcement learning, and deep learning.

    Read more →
  • World Programming System

    World Programming System

    The World Programming System, also known as WPS Analytics or WPS, is a software product developed by a company called World Programming (acquired by Altair Engineering). WPS Analytics supports users of mixed ability to access and process data and to perform data science tasks. It has interactive visual programming tools using data workflows, and it has coding tools supporting the use of the SAS language mixed with Python, R and SQL. == About == WPS can use programs written in the language of SAS without the need for translating them into any other language. In this regard WPS is compatible with the SAS system. WPS has a built-in language interpreter able to process the language of SAS and produce similar results. WPS is available to run on z/OS, Windows, macOS, Linux (x86, Armv8 64-bit, IBM Power LE, IBM Z), and AIX. On all supported platforms, programs written in the language of SAS can be executed from a WPS command line interface, often referred to as running in batch mode. WPS can also be used from a graphical user interface known as the WPS Workbench for managing, editing and running programs written in the language of SAS. The WPS Workbench user interface is based on Eclipse. WPS version 4 (released in March 2018) introduced a drag-and-drop workflow canvas providing interactive blocks for data retrieval, blending and preparation, data discovery and profiling, predictive modelling powered by machine learning algorithms, model performance validation and scorecards. WPS version 3 (released in February 2012) provided a new client/server architecture that allows the WPS Workbench GUI to execute SAS programs on remote server installations of WPS in a network or cloud. The resulting output, data sets, logs, etc., can then all be viewed and manipulated from inside the Workbench as if the workloads had been executed locally. SAS programs do not require any special language statements to use this feature. == Summary of main features == Runs on Windows, macOS, z/OS, Linux (x86, Armv8 64-bit, IBM Power LE, IBM Z), and AIX An integrated development environment based on Eclipse for Linux, macOS and Windows. Support for language of SAS elements. Support for the language of SAS Macros. Matrix Programming support using PROC IML. Support for generating band plots, bar charts, box plots, bubble plots, contour plots, dendrogram plots, ellipse plots, fringe plots, heat maps, high-low plots, histograms, loess plots, needle plots, pie charts, penalised b-spline, radar charts, reference lines, scatter plots, series plots, step plots, regression plots and vector plots. Support for statistical procedures ACECLUS, ASSOCRULES, ANOVA, BIN, BOXPLOT, CANCORR, CANDISC, CLUSTER, CORRESP, DISCRIM, DISTANCE, FACTOR, FASTCLUS, FREQ, GAM, GANNO, GENMOD, GLIMMIX, GLM, GLMMOD, GLMSELECT, ICLIFETEST, KDE, LIFEREG, LIFETEST, LOESS, LOGISTIC, MDS, MEANS, MI, MIANALYSE, MIXED, MODECLUS, NESTED, NLIN, NPAR1WAY, PHREG, PLAN, PLS, POWER, PRINCOMP, PROBIT, QUANTREG, RBF, REG, ROBUSTREG, RSREG, SCORE, SEGMENT, SIMNORMAL, STANDARD, STDSIZE, STDRATE, STEPDISC, SUMMARY, SURVEYMEANS, SURVEYSELECT, TPSPLINE, TRANSREG, TREE, TTEST, UNIVARIATE, VARCLUS, VARCOMP Support for time series procedures ARIMA, AUTOREG, ESM, EXPAND, FORECAST, LOAN, SEVERITY, SPECTRA, TIMESERIES, X12 Support for machine learning procedures DECISIONFOREST, DECISIONTREE, GMM, MLP, OPTIMALBIN, SEGMENT, SVM Support for ODS. Reads and writes SAS datasets (compressed or uncompressed). Access: Actian Matrix (previously known as ParAccel), DASD, DB2, Excel, Greenplum, Hadoop, Informix, Kognitio Archived 2012-08-24 at the Wayback Machine, MariaDB, MySQL, Netezza, ODBC, OLEDB, Oracle, PostgreSQL, SAND, Snowflake, SPSS/PSPP, SQL Server, Sybase, Sybase IQ, Teradata, VSAM, Vertica and XML. Support for SAS Tape Format. Direct output of reports to CSV, PDF and HTML. Support to connect WPS systems programmatically, remote submit parts of a program to execute on connected remote servers, upload and download data between the connected systems. Support for Hadoop Support for R Support for Python == Industry recognition == Gartner recognized World Programming in their Cool Vendors in Data Science, 2014 Report. == Lawsuit == In 2010 World Programming defended its use of the language of SAS in the High Court of England and Wales in SAS Institute Inc. v World Programming Ltd. The software was the subject of a lawsuit by SAS Institute. The EU Court of Justice ruled in favor of World Programming, stating that the copyright protection does not extend to the software functionality, the programming language used and the format of the data files used by the program. It stated that there is no copyright infringement when a company which does not have access to the source code of a program studies, observes and tests that program to create another program with the same functionality.

    Read more →
  • Non-photorealistic rendering

    Non-photorealistic rendering

    Non-photorealistic rendering (NPR) is an area of computer graphics that focuses on enabling a wide variety of expressive styles for digital art, in contrast to traditional computer graphics, which focuses on photorealism. NPR is inspired by other artistic modes such as painting, drawing, technical illustration, and animated cartoons. NPR has appeared in movies and video games in the form of cel-shaded animation (also known as "toon" shading) as well as in scientific visualization, architectural illustration and experimental animation. == History and criticism of the term == The term non-photorealistic rendering is believed to have been coined by the SIGGRAPH 1990 papers committee, who held a session entitled "Non Photo Realistic Rendering". The term has received some criticism: The term "photorealism" has different meanings for graphics researchers (see "photorealistic rendering") and artists. For artists—who are the target consumers of NPR techniques—it refers to a school of painting that focuses on reproducing the effect of a camera lens, with all the distortion and hyper-reflections that it creates. For graphics researchers, however, it refers to an image that is visually indistinguishable from reality. In fact, graphics researchers lump the kinds of visual distortions that are used by photorealist painters into "non-photorealism". Describing something by what it is not is problematic. Equivalent (made-up) comparisons might be "non-elephant biology" or "non-geometric mathematics". NPR researchers have stated that they expect the term will disappear eventually and be replaced by the now more general term "computer graphics", with "photorealistic graphics" being the term used to describe "traditional" computer graphics. Many techniques that are used to create 'non-photorealistic' images are not rendering techniques. They are modelling techniques, or post-processing techniques. While the latter are coming to be known as 'image-based rendering', sketch-based modelling techniques, cannot technically be included under this heading, which is very inconvenient for conference organisers. The first conference on non-photorealistic animation and rendering included a discussion of possible alternative names. Among those suggested were "expressive graphics", "artistic rendering", "non-realistic graphics", "art-based rendering", and "psychographics". All of these terms have been used in various research papers on the topic, but the "non-photorealistic" term seems to have nonetheless taken hold. The first technical meeting dedicated to NPR was the ACM-sponsored Symposium on Non-Photorealistic Rendering and Animation(NPAR) in 2000. NPAR is traditionally co-located with the Annecy Animated Film Festival, running on even numbered years. From 2007 onward, NPAR began to also run on odd-numbered years, co-located with ACM SIGGRAPH. == 3D == Three-dimensional NPR is the style that is most commonly seen in video games and movies. The output from this technique is almost always a 3D model that has been modified from the original input model to portray a new artistic style. In many cases, the geometry of the model is identical to the original geometry, and only the material applied to the surface is modified. With increased availability of programmable GPU's, shaders have allowed NPR effects to be applied to the rasterised image that is to be displayed to the screen. The majority of NPR techniques applied to 3D geometry are intended to make the scene appear two-dimensional. NPR techniques for 3D images include cel shading and Gooch shading. Many methods can be used to draw stylized outlines and strokes from 3D models, including occluding contours and Suggestive contours. For enhanced legibility, the most useful technical illustrations for technical communication are not necessarily photorealistic. Non-photorealistic renderings, such as exploded view diagrams, greatly assist in showing placement of parts in a complex system. Cartoon rendering, also called cel shading or toon shading, is a non-photorealistic rendering technique used to give 3D computer graphics a flat, cartoon-like appearance. Its defining feature is the use of distinct shading colors rather than smooth gradients, producing a look reminiscent of comic books or animated films. This technique is often used to blend 3D objects and environments with 2D hand-animated elements while maintaining a consistent look. Treasure Planet movie by Disney is an example of blending these techniques. == 2D == The input to a two dimensional NPR system is typically an image or video. The output is a typically an artistic rendering of that input imagery (for example in a watercolor, painterly or sketched style) although some 2D NPR serves non-artistic purposes e.g. data visualization. The artistic rendering of images and video (often referred to as image stylization) traditionally focused upon heuristic algorithms that seek to simulate the placement of brush strokes on a digital canvas. Arguably, the earliest example of 2D NPR is Paul Haeberli's 'Paint by Numbers' at SIGGRAPH 1990. This (and similar interactive techniques) provide the user with a canvas that they can "paint" on using the cursor — as the user paints, a stylized version of the image is revealed on the canvas. This is especially useful for people who want to simulate different sizes of brush strokes according to different areas of the image. Subsequently, basic image processing operations using gradient operators or statistical moments were used to automate this process and minimize user interaction in the late nineties (although artistic control remains with the user via setting parameters of the algorithms). This automation enabled practical application of 2D NPR to video, for the first time in the living paintings of the movie What Dreams May Come (1998). More sophisticated image abstractions techniques were developed in the early 2000s harnessing computer vision operators e.g. image salience, or segmentation operators to drive stroke placement. Around this time, machine learning began to influence image stylization algorithms notably image analogy that could learn to mimic the style of an existing artwork. The advent of deep learning has re-kindled activity in image stylization, notably with neural style transfer (NST) algorithms that can mimic a wide gamut of artistic styles from single visual examples. These algorithms underpin mobile apps capable of the same e.g. Prisma In addition to the above stylization methods, a related class of techniques in 2D NPR address the simulation of artistic media. These methods include simulating the diffusion of ink through different kinds of paper, and also of pigments through water for simulation of watercolor. == Artistic rendering == Artistic rendering is the application of visual art styles to rendering. For photorealistic rendering styles, the emphasis is on accurate reproduction of light-and-shadow and the surface properties of the depicted objects, composition, or other more generic qualities. When the emphasis is on unique interpretive rendering styles, visual information is interpreted by the artist and displayed accordingly using the chosen art medium and level of abstraction in abstract art. In computer graphics, interpretive rendering styles are known as non-photorealistic rendering styles, but may be used to simplify technical illustrations. Rendering styles that combine photorealism with non-photorealism are known as hyperrealistic rendering styles. == Notable films and games == This section lists some seminal uses of NPR techniques in films, games and software. See cel-shaded animation for a list of uses of toon-shading in games and movies.

    Read more →
  • Variable kernel density estimation

    Variable kernel density estimation

    In statistics, adaptive or "variable-bandwidth" kernel density estimation is a form of kernel density estimation in which the size of the kernels used in the estimate are varied depending upon either the location of the samples or the location of the test point. It is a particularly effective technique when the sample space is multi-dimensional. == Rationale == Given a set of samples, { x → i } {\displaystyle \lbrace {\vec {x}}_{i}\rbrace } , we wish to estimate the density, P ( x → ) {\displaystyle P({\vec {x}})} , at a test point, x → {\displaystyle {\vec {x}}} : P ( x → ) ≈ W n h D {\displaystyle P({\vec {x}})\approx {\frac {W}{nh^{D}}}} W = ∑ i = 1 n w i {\displaystyle W=\sum _{i=1}^{n}w_{i}} w i = K ( x → − x → i h ) {\displaystyle w_{i}=K\left({\frac {{\vec {x}}-{\vec {x}}_{i}}{h}}\right)} where n is the number of samples, K is the "kernel", h is its width and D is the number of dimensions in x → {\displaystyle {\vec {x}}} . The kernel can be thought of as a simple, linear filter. Using a fixed filter width may mean that in regions of low density, all samples will fall in the tails of the filter with very low weighting, while regions of high density will find an excessive number of samples in the central region with weighting close to unity. To fix this problem, we vary the width of the kernel in different regions of the sample space. There are two methods of doing this: balloon and pointwise estimation. In a balloon estimator, the kernel width is varied depending on the location of the test point. In a pointwise estimator, the kernel width is varied depending on the location of the sample. For multivariate estimators, the parameter, h, can be generalized to vary not just the size, but also the shape of the kernel. This more complicated approach will not be covered here. == Balloon estimators == A common method of varying the kernel width is to make it inversely proportional to the density at the test point: h = k [ n P ( x → ) ] 1 / D {\displaystyle h={\frac {k}{\left[nP({\vec {x}})\right]^{1/D}}}} where k is a constant. If we back-substitute the estimated PDF, and assuming a Gaussian kernel function, we can show that W is a constant: W = k D ( 2 π ) D / 2 {\displaystyle W=k^{D}(2\pi )^{D/2}} A similar derivation holds for any kernel whose normalising function is of the order hD, although with a different constant factor in place of the (2 π)D/2 term. This produces a generalization of the k-nearest neighbour algorithm. That is, a uniform kernel function will return the KNN technique. There are two components to the error: a variance term and a bias term. The variance term is given as: e 1 = P ∫ K 2 n h D {\displaystyle e_{1}={\frac {P\int K^{2}}{nh^{D}}}} . The bias term is found by evaluating the approximated function in the limit as the kernel width becomes much larger than the sample spacing. By using a Taylor expansion for the real function, the bias term drops out: e 2 = h 2 n ∇ 2 P {\displaystyle e_{2}={\frac {h^{2}}{n}}\nabla ^{2}P} An optimal kernel width that minimizes the error of each estimate can thus be derived. == Use for statistical classification == The method is particularly effective when applied to statistical classification. There are two ways we can proceed: the first is to compute the PDFs of each class separately, using different bandwidth parameters, and then compare them as in Taylor. Alternatively, we can divide up the sum based on the class of each sample: P ( j , x → ) ≈ 1 n ∑ i = 1 , c i = j n w i {\displaystyle P(j,{\vec {x}})\approx {\frac {1}{n}}\sum _{i=1,c_{i}=j}^{n}w_{i}} where ci is the class of the ith sample. The class of the test point may be estimated through maximum likelihood.

    Read more →
  • XGBoost

    XGBoost

    XGBoost (eXtreme Gradient Boosting) is an open-source software library which provides a regularizing gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It works on Linux, Microsoft Windows, and macOS. From the project description, it aims to provide a "Scalable, Portable and Distributed Gradient Boosting (GBM, GBRT, GBDT) Library". It runs on a single machine, as well as the distributed processing frameworks Apache Hadoop, Apache Spark, Apache Flink, and Dask. XGBoost gained much popularity and attention in the mid-2010s as the algorithm of choice for many winning teams of machine learning competitions. == History == XGBoost initially started as a research project by Tianqi Chen as part of the Distributed (Deep) Machine Learning Community (DMLC) group at the University of Washington. Initially, it began as a terminal application which could be configured using a libsvm configuration file. It became well known in the ML competition circles after its use in the winning solution of the Higgs Machine Learning Challenge. Soon after, the Python and R packages were built, and XGBoost now has package implementations for Java, Scala, Julia, Perl, and other languages. This brought the library to more developers and contributed to its popularity among the Kaggle community, where it has been used for a large number of competitions. It was soon integrated with a number of other packages making it easier to use in their respective communities. It has now been integrated with scikit-learn for Python users and with the caret package for R users. It can also be integrated into Data Flow frameworks like Apache Spark, Apache Hadoop, and Apache Flink using the abstracted Rabit and XGBoost4J. XGBoost is also available on OpenCL for FPGAs. An efficient, scalable implementation of XGBoost has been published by Tianqi Chen and Carlos Guestrin. While the XGBoost model often achieves higher accuracy than a single decision tree, it sacrifices the intrinsic interpretability of decision trees. For example, following the path that a decision tree takes to make its decision is trivial and self-explained, but following the paths of hundreds or thousands of trees is much harder. == Features == Salient features of XGBoost which make it different from other gradient boosting algorithms include: Clever penalization of trees A proportional shrinking of leaf nodes Newton Boosting Extra randomization parameter Implementation on single, distributed systems and out-of-core computation Automatic feature selection Theoretically justified weighted quantile sketching for efficient computation Parallel tree structure boosting with sparsity Efficient cacheable block structure for decision tree training == The algorithm == XGBoost works as Newton–Raphson in function space unlike gradient boosting that works as gradient descent in function space, a second order Taylor approximation is used in the loss function to make the connection to Newton–Raphson method. A generic unregularized XGBoost algorithm is: == Parameters == XGBoost has parameters which can be specified to affect how it functions and performs. Some parameters include: Learning rate (also known as the "step size" or “"shrinkage"), it is a number between 0 and 1 (default is 0.3), which determines the rate the algorithm learns from each iteration. n_estimators, sets the number of trees to be built in the ensemble, where more trees generally increases the complexity of the model, but can lead to overfitting with too many trees. Gamma (also known as Lagrange multiplier or the minimum loss reduction parameter), controls the minimum amount of loss reduction required to make a further split on a leaf node of the tree. The default in XGBoost is 0 . max_depth, represents how deeply each tree in the boosting process can grow during training, where the default is 6. == Awards == John Chambers Award (2016) High Energy Physics meets Machine Learning award (HEP meets ML) (2016)

    Read more →
  • Parity benchmark

    Parity benchmark

    Parity problems are widely used as benchmark problems in genetic programming but inherited from the artificial neural network community. Parity is calculated by summing all the binary inputs and reporting if the sum is odd or even. This is considered difficult because: a very simple artificial neural network cannot solve it, and all inputs need to be considered and a change to any one of them changes the answer.

    Read more →
  • Tea (app)

    Tea (app)

    Tea, officially Tea Dating Advice, is a dating surveillance mobile phone application that allows women to post personal data about men they are interested in or are currently dating. Founded by Sean Cook, the app rose to prominence in July 2025 after it was the subject of three major data leaks in July and August 2025. It was removed from Apple's App Store in October 2025, but remains available on the Google Play Store. == History == The app enables its users to upload, view, and comment on photos of men, check men's public records, and perform image searches. It also provides the ability to rate and review men, as well as a group chat function. The app uses artificial intelligence to verify that the user is a woman through facial analysis and other personal information to preserve the app as a women-only space. Users are required to submit their photo and an ID to access the app. The company that created the app was founded by businessman and tech capitalist Sean Cook, who stated in July 2025 that he was inspired to create the app because of his mother's experiences from online dating. According to the company, users remain anonymous, and the requirement to upload an ID was removed in 2023. An August 2025 investigation by 404 Media suggested that much of the information given by Cook on the historical background of the company was inaccurate. In July 2025, private messages, other personally identifying information, and approximately 72,000 images were leaked via 4chan. A further 1.1 million private messages were subsequently leaked using a separate security vulnerability; these included intimate conversations about controversial topics such as adultery and other forms of infidelity to their partners, discussions of abortion, phone numbers, meeting locations, and other confidential communications. The app's publishers subsequently revoked the ability to private message users in the app. Shortly after, the app was hidden from search on Android and an interactive, unverified map was also created of those in the files. By 7 August 2025, ten class action lawsuits had been filed. A further leak was reported later that month. Proponents have praised the app as an aid for women's safety by helping them check men for adultery, catfishing, criminal convictions and other "red flag" behaviors. Critics have described the app as a doxing tool and a violation of privacy, an opportunity for defamation against innocent individuals, and a witch hunt. Cook has stated that the company's legal team receives about three legal threats per day. Another mobile app, called TeaOnHer, was created in response of the app’s popularity. It was described as the male version of the Tea app. The app also reported a data breach in August 2025. In October 2025, Apple removed the app from their app store, telling journalists that the removal was due to a failure to meet company terms regarding content moderation and user privacy. Apple also mentioned an excessive amount of complaints, including allegations that the personal information of minors was being shared. The app remains on the Google Play Store.

    Read more →
  • Dynamic time warping

    Dynamic time warping

    In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. For instance, similarities in walking could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data — indeed, any data that can be turned into a one-dimensional sequence can be analyzed with DTW. A well-known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restriction and rules: Every index from the first sequence must be matched with one or more indices from the other sequence, and vice versa The first index from the first sequence must be matched with the first index from the other sequence (but it does not have to be its only match) The last index from the first sequence must be matched with the last index from the other sequence (but it does not have to be its only match) The mapping of the indices from the first sequence to indices from the other sequence must be monotonically increasing, and vice versa, i.e. if j > i {\displaystyle j>i} are indices from the first sequence, then there must not be two indices l > k {\displaystyle l>k} in the other sequence, such that index i {\displaystyle i} is matched with index l {\displaystyle l} and index j {\displaystyle j} is matched with index k {\displaystyle k} , and vice versa We can plot each match between the sequences 1 : M {\displaystyle 1:M} and 1 : N {\displaystyle 1:N} as a path in a M × N {\displaystyle M\times N} matrix from ( 1 , 1 ) {\displaystyle (1,1)} to ( M , N ) {\displaystyle (M,N)} , such that each step is one of ( 0 , 1 ) , ( 1 , 0 ) , ( 1 , 1 ) {\displaystyle (0,1),(1,0),(1,1)} . In this formulation, we see that the number of possible matches is the Delannoy number. The optimal match is denoted by the match that satisfies all the restrictions and the rules and that has the minimal cost, where the cost is computed as the sum of absolute differences, for each matched pair of indices, between their values. The sequences are "warped" non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension. This sequence alignment method is often used in time series classification. Although DTW measures a distance-like quantity between two given sequences, it doesn't guarantee the triangle inequality to hold. In addition to a similarity measure between the two sequences (a so called "warping path" is produced), by warping according to this path the two signals may be aligned in time. The signal with an original set of points X(original), Y(original) is transformed to X(warped), Y(warped). This finds applications in genetic sequence and audio synchronisation. In a related technique sequences of varying speed may be averaged using this technique see the average sequence section. This is conceptually very similar to the Needleman–Wunsch algorithm. == Implementation == This example illustrates the implementation of the dynamic time warping algorithm when the two sequences s and t are strings of discrete symbols. For two symbols x and y, d ( x , y ) {\displaystyle d(x,y)} is a distance between the symbols, e.g., d ( x , y ) = | x − y | {\displaystyle d(x,y)=|x-y|} . int DTWDistance(s: array [1..n], t: array [1..m]) { DTW := array [0..n, 0..m] for i := 0 to n for j := 0 to m DTW[i, j] := infinity DTW[0, 0] := 0 for i := 1 to n for j := 1 to m cost := d(s[i], t[j]) DTW[i, j] := cost + minimum(DTW[i-1, j ], // insertion DTW[i , j-1], // deletion DTW[i-1, j-1]) // match return DTW[n, m] } where DTW[i, j] is the distance between s[1:i] and t[1:j] with the best alignment. We sometimes want to add a locality constraint. That is, we require that if s[i] is matched with t[j], then | i − j | {\displaystyle |i-j|} is no larger than w, a window parameter. We can easily modify the above algorithm to add a locality constraint (differences marked). However, the above given modification works only if | n − m | {\displaystyle |n-m|} is no larger than w, i.e. the end point is within the window length from diagonal. In order to make the algorithm work, the window parameter w must be adapted so that | n − m | ≤ w {\displaystyle |n-m|\leq w} (see the line marked with () in the code). int DTWDistance(s: array [1..n], t: array [1..m], w: int) { DTW := array [0..n, 0..m] w := max(w, abs(n-m)) // adapt window size () for i := 0 to n for j:= 0 to m DTW[i, j] := infinity DTW[0, 0] := 0 for i := 1 to n for j := max(1, i-w) to min(m, i+w) DTW[i, j] := 0 for i := 1 to n for j := max(1, i-w) to min(m, i+w) cost := d(s[i], t[j]) DTW[i, j] := cost + minimum(DTW[i-1, j ], // insertion DTW[i , j-1], // deletion DTW[i-1, j-1]) // match return DTW[n, m] } == Warping properties == The DTW algorithm produces a discrete matching between existing elements of one series to another. In other words, it does not allow time-scaling of segments within the sequence. Other methods allow continuous warping. For example, Correlation Optimized Warping (COW) divides the sequence into uniform segments that are scaled in time using linear interpolation, to produce the best matching warping. The segment scaling causes potential creation of new elements, by time-scaling segments either down or up, and thus produces a more sensitive warping than DTW's discrete matching of raw elements. == Complexity == The time complexity of the DTW algorithm is O ( N M ) {\displaystyle O(NM)} , where N {\displaystyle N} and M {\displaystyle M} are the lengths of the two input sequences. The 50 years old quadratic time bound was broken in 2016: an algorithm due to Gold and Sharir enables computing DTW in O ( N 2 / log ⁡ log ⁡ N ) {\displaystyle O({N^{2}}/\log \log N)} time and space for two input sequences of length N {\displaystyle N} . This algorithm can also be adapted to sequences of different lengths. Despite this improvement, it was shown that a strongly subquadratic running time of the form O ( N 2 − ϵ ) {\displaystyle O(N^{2-\epsilon })} for some ϵ > 0 {\displaystyle \epsilon >0} cannot exist unless the Strong exponential time hypothesis fails. While the dynamic programming algorithm for DTW requires O ( N M ) {\displaystyle O(NM)} space in a naive implementation, the space consumption can be reduced to O ( min ( N , M ) ) {\displaystyle O(\min(N,M))} using Hirschberg's algorithm. == Fast computation == Fast techniques for computing DTW include PrunedDTW, SparseDTW, FastDTW, and the MultiscaleDTW. A common task, retrieval of similar time series, can be accelerated by using lower bounds such as LB_Keogh, LB_Improved, or LB_Petitjean. However, the Early Abandon and Pruned DTW algorithm reduces the degree of acceleration that lower bounding provides and sometimes renders it ineffective. In a survey, Wang et al. reported slightly better results with the LB_Improved lower bound than the LB_Keogh bound, and found that other techniques were inefficient. Subsequent to this survey, the LB_Enhanced bound was developed that is always tighter than LB_Keogh while also being more efficient to compute. LB_Petitjean is the tightest known lower bound that can be computed in linear time. == Average sequence == Averaging for dynamic time warping is the problem of finding an average sequence for a set of sequences. NLAAF is an exact method to average two sequences using DTW. For more than two sequences, the problem is related to that of multiple alignment and requires heuristics. DBA is currently a reference method to average a set of sequences consistently with DTW. COMASA efficiently randomizes the search for the average sequence, using DBA as a local optimization process. == Supervised learning == A nearest-neighbour classifier can achieve state-of-the-art performance when using dynamic time warping as a distance measure. == Amerced Dynamic Time Warping == Amerced Dynamic Time Warping (ADTW) is a variant of DTW designed to better control DTW's permissiveness in the alignments that it allows. The windows that classical DTW uses to constrain alignments introduce a step function. Any warping of the path is allowed within the window and none beyond it. In contrast, ADTW employs an additive penalty that is incurred each time that the path is warped. Any amount of warping is allowed, but each warping action incurs a direct penalty. ADTW significantly outperforms DTW with windowing when applied as a nearest neighbor classifier on a set of benchmark time series classification tasks. == Alternative approaches == In functional data analysis, time series are regarde

    Read more →
  • Wolfram Mathematica

    Wolfram Mathematica

    Wolfram Mathematica (also known as Mathematica) is a software system with built-in libraries for several areas of technical computing that allows machine learning, statistics, symbolic computation, data manipulation, network analysis, time series analysis, NLP, optimization, plotting functions and various types of data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other programming languages. It was conceived by Stephen Wolfram, and is developed by Wolfram Research of Champaign, Illinois. The Wolfram Language is the programming language used in Mathematica. Mathematica 1.0 was released on June 23, 1988 in Champaign, Illinois and Santa Clara, California. Mathematica's Wolfram Language is fundamentally based on Lisp; for example, the Mathematica command Most is identically equal to the Lisp command butlast. == Notebook interface == Mathematica is split into two parts: the kernel and the front end. The kernel interprets expressions (Wolfram Language code) and returns result expressions, which can then be displayed by the front end. The original front end, designed by Theodore Gray in 1988, consists of a notebook interface and allows the creation and editing of notebook documents that can contain code, plaintext, images, and graphics. Code development is also supported through support in a range of standard integrated development environment (IDE) including Eclipse, IntelliJ IDEA, Atom, Vim, Visual Studio Code and Git. The Mathematica Kernel also includes a command line front end. Other interfaces include JMath, based on GNU Readline and WolframScript which runs self-contained Mathematica programs (with arguments) from the UNIX command line. == High-performance computing == Capabilities for high-performance computing were extended with the introduction of packed arrays in version 4 (1999) and sparse matrices (version 5, 2003), and by adopting the GNU Multiple Precision Arithmetic Library to evaluate high-precision arithmetic. Version 5.2 (2005) added automatic multi-threading when computations are performed on multi-core computers. This release included CPU-specific optimized libraries. In addition Mathematica is supported by third party specialist acceleration hardware such as ClearSpeed. In 2002, gridMathematica was introduced to allow user level parallel programming on heterogeneous clusters and multiprocessor systems and in 2008 parallel computing technology was included in all Mathematica licenses including support for grid technology such as Windows HPC Server 2008, Microsoft Compute Cluster Server and Sun Grid. Support for CUDA and OpenCL GPU hardware was added in 2010. == Extensions == As of Version 14, there are 6,602 built-in functions and symbols in the Wolfram Language. Stephen Wolfram announced the launch of the Wolfram Function Repository in June 2019 as a way for the public Wolfram community to contribute functionality to the Wolfram Language. There are currently more than 3000 functions contributed as Resource Functions. In addition to the Wolfram Function Repository, there is a Wolfram Data Repository with computable data and the Wolfram Neural Net Repository for machine learning. Wolfram Mathematica is the basis of the Combinatorica package, which adds discrete mathematics functionality in combinatorics and graph theory to the program. == Connections to other applications, programming languages, and services == Communication with other applications can be done using a protocol called Wolfram Symbolic Transfer Protocol (WSTP). It allows communication between the Wolfram Mathematica kernel and the front end and provides a general interface between the kernel and other applications. Wolfram Research freely distributes a developer kit for linking applications written in the programming language C to the Mathematica kernel through WSTP using J/Link., a Java program that can ask Mathematica to perform computations. Similar functionality is achieved with .NET /Link, but with .NET programs instead of Java programs. Other languages that connect to Mathematica include Haskell, AppleScript, Racket, Visual Basic, Python, and Clojure. Mathematica supports the generation and execution of Modelica models for systems modeling and connects with Wolfram System Modeler. Links are also available to many third-party software packages and APIs. Mathematica can also capture real-time data from a variety of sources and can read and write to public blockchains (Bitcoin, Ethereum, and ARK). It supports import and export of over 220 data, image, video, sound, computer-aided design (CAD), geographic information systems (GIS), document, and biomedical formats. In 2019, support was added for compiling Wolfram Language code to LLVM. Version 12.3 of the Wolfram Language added support for Arduino. == Computable data == Mathematica is also integrated with Wolfram Alpha, an online answer engine that provides additional data, some of which is kept updated in real time, for users who use Mathematica with an internet connection. Some of the data sets include astronomical, chemical, geopolitical, language, biomedical, airplane, and weather data, in addition to mathematical data (such as knots and polyhedra). == Reception == BYTE in 1989 listed Mathematica as among the "Distinction" winners of the BYTE Awards, stating that it "is another breakthrough Macintosh application ... it could enable you to absorb the algebra and calculus that seemed impossible to comprehend from a textbook". Mathematica has been criticized for being closed source. Wolfram Research claims keeping Mathematica closed source is central to its business model and the continuity of the software.

    Read more →
  • Local tangent space alignment

    Local tangent space alignment

    Local tangent space alignment (LTSA) is a method for manifold learning, which can efficiently learn a nonlinear embedding into low-dimensional coordinates from high-dimensional data, and can also reconstruct high-dimensional coordinates from embedding coordinates. It is based on the intuition that when a manifold is correctly unfolded, all of the tangent hyperplanes to the manifold will become aligned. It begins by computing the k-nearest neighbors of every point. It computes the tangent space at every point by computing the d-first principal components in each local neighborhood. It then optimizes to find an embedding that aligns the tangent spaces, but it ignores the label information conveyed by data samples, and thus can not be used for classification directly.

    Read more →
  • Text Retrieval Conference

    Text Retrieval Conference

    The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects Activity (part of the office of the Director of National Intelligence), and began in 1992 as part of the TIPSTER Text program. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology. TREC's evaluation protocols have improved many search technologies. A 2010 study estimated that "without TREC, U.S. Internet users would have spent up to 3.15 billion additional hours using web search engines between 1999 and 2009." Hal Varian the Chief Economist at Google wrote that "The TREC data revitalized research on information retrieval. Having a standard, widely available, and carefully constructed set of data laid the groundwork for further innovation in this field." Each track has a challenge wherein NIST provides participating groups with data sets and test problems. Depending on track, test problems might be questions, topics, or target extractable features. Uniform scoring is performed so the systems can be fairly evaluated. After evaluation of the results, a workshop provides a place for participants to collect together thoughts and ideas and present current and future research work.Text Retrieval Conference started in 1992, funded by DARPA (US Defense Advanced Research Project) and run by NIST. Its purpose was to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. == Goals == Encourage retrieval search based on large text collections Increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas Speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements retrieval methodologies on real world problems To increase the availability of appropriate evaluation techniques for use by industry and academia including development of new evaluation techniques more applicable to current systems TREC is overseen by a program committee consisting of representatives from government, industry, and academia. For each TREC, NIST provide a set of documents and questions. Participants run their own retrieval system on the data and return to NIST a list of retrieved top-ranked documents. NIST pools the individual result judges the retrieved documents for correctness and evaluates the results. The TREC cycle ends with a workshop that is a forum for participants to share their experiences. == Relevance judgments in TREC == TREC defines relevance as: "If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant." Most TREC retrieval tasks use binary relevance: a document is either relevant or not relevant. Some TREC tasks use graded relevance, capturing multiple degrees of relevance. Most TREC collections are too large to perform complete relevance assessment; for these collections it is impossible to calculate the absolute recall for each query. To decide which documents to assess, TREC usually uses a method call pooling. In this method, the top-ranked n documents from each contributing run are aggregated, and the resulting document set is judged completely. == Various TRECs == In 1992 TREC-1 was held at NIST. The first conference attracted 28 groups of researchers from academia and industry. It demonstrated a wide range of different approaches to the retrieval of text from large document collections .Finally TREC1 revealed the facts that automatic construction of queries from natural language query statements seems to work. Techniques based on natural language processing were no better no worse than those based on vector or probabilistic approach. TREC2 Took place in August 1993. 31 group of researchers participated in this. Two types of retrieval were examined. Retrieval using an ‘ad hoc’ query and retrieval using a ‘routing' query In TREC-3 a small group experiments worked with Spanish language collection and others dealt with interactive query formulation in multiple databases TREC-4 they made even shorter to investigate the problems with very short user statements TREC-5 includes both short and long versions of the topics with the goal of carrying out deeper investigation into which types of techniques work well on various lengths of topics In TREC-6 Three new tracks speech, cross language, high precision information retrieval were introduced. The goal of cross language information retrieval is to facilitate research on system that are able to retrieve relevant document regardless of language of the source document TREC-7 contained seven tracks out of which two were new Query track and very large corpus track. The goal of the query track was to create a large query collection TREC-8 contain seven tracks out of which two –question answering and web tracks were new. The objective of QA query is to explore the possibilities of providing answers to specific natural language queries TREC-9 Includes seven tracks In TREC-10 Video tracks introduced Video tracks design to promote research in content based retrieval from digital video In TREC-11 Novelty tracks introduced. The goal of novelty track is to investigate systems abilities to locate relevant and new information within the ranked set of documents returned by a traditional document retrieval system TREC-12 held in 2003 added three new tracks; Genome track, robust retrieval track, HARD (Highly Accurate Retrieval from Documents) == Tracks == === Current tracks === New tracks are added as new research needs are identified, this list is current for TREC 2018. CENTRE Track – Goal: run in parallel CLEF 2018, NTCIR-14, TREC 2018 to develop and tune an IR reproducibility evaluation protocol (new track for 2018). Common Core Track – Goal: an ad hoc search task over news documents. Complex Answer Retrieval (CAR) – Goal: to develop systems capable of answering complex information needs by collating information from an entire corpus. Incident Streams Track – Goal: to research technologies to automatically process social media streams during emergency situations (new track for TREC 2018). The News Track – Goal: partnership with The Washington Post to develop test collections in news environment (new for 2018). Precision Medicine Track – Goal: a specialization of the Clinical Decision Support track to focus on linking oncology patient data to clinical trials. Real-Time Summarization Track (RTS) – Goal: to explore techniques for real-time update summaries from social media streams. === Past tracks === Chemical Track – Goal: to develop and evaluate technology for large scale search in chemistry-related documents, including academic papers and patents, to better meet the needs of professional searchers, and specifically patent searchers and chemists. Clinical Decision Support Track – Goal: to investigate techniques for linking medical cases to information relevant for patient care Contextual Suggestion Track – Goal: to investigate search techniques for complex information needs that are highly dependent on context and user interests. Crowdsourcing Track – Goal: to provide a collaborative venue for exploring crowdsourcing methods both for evaluating search and for performing search tasks. Genomics Track – Goal: to study the retrieval of genomic data, not just gene sequences but also supporting documentation such as research papers, lab reports, etc. Last ran on TREC 2007. Dynamic Domain Track – Goal: to investigate domain-specific search algorithms that adapt to the dynamic information needs of professional users as they explore in complex domains. Enterprise Track – Goal: to study search over the data of an organization to complete some task. Last ran on TREC 2008. Entity Track – Goal: to perform entity-related search on Web data. These search tasks (such as finding entities and properties of entities) address common information needs that are not that well modeled as ad hoc document search. Cross-Language Track – Goal: to investigate the ability of retrieval systems to find documents topically regardless of source language. After 1999, this track spun off into CLEF. FedWeb Track – Goal: to select best resources to forward a query to, and merge the results so that most relevant are on the top. Federated Web Search Track – Goal: to investigate techniques for the selection and combination of search results from a large number of real on-line web search services. Filtering Track – Goal: to binarily decide retrieval of new

    Read more →
  • Loss function

    Loss function

    In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy. In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century. In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s. In optimal control, the loss is the penalty for failing to achieve a desired value. In financial risk management, the function is mapped to a monetary loss. == Examples == === Regret === Leonard J. Savage argued that using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret, i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been made under circumstances will be known and the decision that was in fact taken before they were known. === Quadratic loss function === The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t {\displaystyle t} , then a quadratic loss function is λ ( x ) = C ( t − x ) 2 {\displaystyle \lambda (x)=C(t-x)^{2}\;} for some constant C {\displaystyle C} ; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1. This is also known as the squared error loss (SEL). Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratic loss function. The quadratic loss function is also used in linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear first-order conditions. In the context of stochastic control, the expected value of the quadratic form is used. The quadratic loss assigns more importance to outliers than to the true data due to its square nature, so alternatives like the Huber, log-cosh and SMAE losses are used when the data has many large outliers. === 0-1 loss function === In statistics and decision theory, a frequently used loss function is the 0-1 loss function L ( y ^ , y ) = { 0 if y = y ^ 1 if y ≠ y ^ {\displaystyle L({\hat {y}},y)={\begin{cases}0&{\text{if }}y={\hat {y}}\\1&{\text{if }}y\neq {\hat {y}}\end{cases}}} In information theory, this loss function is known as Hamming distortion. == Constructing loss and objective functions == In many applications, objective functions, including loss functions as a particular case, are determined by the problem formulation. In other situations, the decision maker’s preference must be elicited and represented by a scalar-valued function (called also utility function) in a form suitable for optimization — the problem that Ragnar Frisch has highlighted in his Nobel Prize lecture. The existing methods for constructing objective functions are collected in the proceedings of two dedicated conferences. In particular, Andranik Tangian showed that the most usable objective functions — quadratic and additive — are determined by a few indifference points. He used this property in the models for constructing these objective functions from either ordinal or cardinal data that were elicited through computer-assisted interviews with decision makers. Among other things, he constructed objective functions to optimally distribute budgets for 16 Westfalian universities and the European subsidies for equalizing unemployment rates among 271 German regions. == Expected loss == In some contexts, the value of the loss function itself is a random quantity because it depends on the outcome of a random variable X {\displaystyle X} . === Statistics === Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms. ==== Frequentist expected loss ==== We first define the expected loss in the frequentist context. It is obtained by taking the expected value with respect to the probability distribution, P θ {\displaystyle P_{\theta }} , of the observed data, X {\displaystyle X} . This is also referred to as the risk function of the decision rule δ {\displaystyle \delta } and the parameter θ {\displaystyle \theta } . Here the decision rule depends on the outcome of X {\displaystyle X} . The risk function is given by: R ( θ , δ ) = E θ ⁡ L ( θ , δ ( X ) ) = ∫ X L ( θ , δ ( x ) ) d P θ ( x ) . {\displaystyle R(\theta ,\delta )=\operatorname {E} _{\theta }L{\big (}\theta ,\delta (X){\big )}=\int _{X}L{\big (}\theta ,\delta (x){\big )}\,\mathrm {d} P_{\theta }(x).} Here, θ {\displaystyle \theta } is a fixed but possibly unknown state of nature, X {\displaystyle X} is a vector of observations stochastically drawn from a population, E θ {\displaystyle \operatorname {E} _{\theta }} is the expectation over all population values of X {\displaystyle X} , d P θ {\displaystyle \mathrm {d} P_{\theta }} is a probability measure over the event space of X {\displaystyle X} (parametrized by θ {\displaystyle \theta } ) and the integral is evaluated over the entire support of X {\displaystyle X} . ==== Bayes Risk ==== In a Bayesian approach, the expectation is calculated using the prior distribution π ∗ {\displaystyle \pi ^{}} of the parameter θ {\displaystyle \theta } : ρ ( π ∗ , a ) = ∫ Θ ∫ X L ( θ , a ( x ) ) d P ( x | θ ) d π ∗ ( θ ) = ∫ X ∫ Θ L ( θ , a ( x ) ) d π ∗ ( θ | x ) d M ( x ) {\displaystyle \rho (\pi ^{},a)=\int _{\Theta }\int _{\mathbf {X}}L(\theta ,a({\mathbf {x}}))\,\mathrm {d} P({\mathbf {x}}\vert \theta )\,\mathrm {d} \pi ^{}(\theta )=\int _{\mathbf {X}}\int _{\Theta }L(\theta ,a({\mathbf {x}}))\,\mathrm {d} \pi ^{}(\theta \vert {\mathbf {x}})\,\mathrm {d} M({\mathbf {x}})} where M ( x ) {\displaystyle M(\mathbf {x} )} is known as the predictive likelihood wherein θ {\displaystyle \theta } has been "integrated out," π ∗ ( θ | x ) {\displaystyle \pi ^{}(\theta |\mathbf {x} )} is the posterior distribution, and the order of integration has been changed. One then should choose the action a ∗ {\displaystyle a^{}} which minimises this expected loss, which is referred to as Bayes Risk. In the latter equation, the integrand inside d x {\displaystyle \mathrm {d} x} is known as the Posterior Risk, and minimising it with respect to decision a {\displaystyle a} also minimizes the overall Bayes Risk. This optimal decision, a ∗ {\displaystyle a^{}} is known as the Bayes (decision) Rule - it minimises the average loss over all possible states of nature θ {\displaystyle \theta } , over all possible (probability-weighted) data outcomes. One advantage of the Bayesian approach is to that one need only choose the optimal action under the actual observed data to obtain a uniformly optimal one, whereas choosing the actual frequentist optimal decision rule as a function of all possible observations, is a much more difficult problem. Of equal importance though, the Bayes Rule reflects consideration of loss outcomes under different states of nature, θ {\displaystyle \theta } . ==== Examples in statistics ==== For a scalar parameter θ {\displaystyle \theta } , a decision function whose output θ ^ {\displaystyle {\hat {\theta }}} is an estimate of θ {\displaystyle \theta } , and a quadratic loss function (squared error loss) L ( θ , θ ^ ) = ( θ − θ ^ ) 2 , {\displaystyle L(\theta ,{\hat {\theta }})=(\theta -{\hat {\theta }})^{2},} the risk function becomes the mean squared error of the estimate, R ( θ , θ ^ ) = E θ ⁡ [ ( θ − θ ^ ) 2 ] . {\displaystyle R(\theta ,{\hat {\thet

    Read more →
  • Softplus

    Softplus

    In mathematics and machine learning, the softplus function is f ( x ) = ln ⁡ ( 1 + e x ) . {\displaystyle f(x)=\ln(1+e^{x}).} It is a smooth approximation (in fact, an analytic function) to the ramp function, which is known as the rectifier or ReLU (rectified linear unit) in machine learning. For large negative x {\displaystyle x} it is ln ⁡ ( 1 + e x ) = ln ⁡ ( 1 + ϵ ) ⪆ ln ⁡ 1 = 0 {\displaystyle \ln(1+e^{x})=\ln(1+\epsilon )\gtrapprox \ln 1=0} , so just above 0, while for large positive x {\displaystyle x} it is ln ⁡ ( 1 + e x ) ⪆ ln ⁡ ( e x ) = x {\displaystyle \ln(1+e^{x})\gtrapprox \ln(e^{x})=x} , so just above x {\displaystyle x} . The names softplus and SmoothReLU are used in machine learning. The name "softplus" (2000), by analogy with the earlier softmax (1989) is presumably because it is a smooth (soft) approximation of the positive part of x, which is sometimes denoted with a superscript plus, x + := max ( 0 , x ) {\displaystyle x^{+}:=\max(0,x)} . == Alternative forms == This function can be approximated as: ln ⁡ ( 1 + e x ) ≈ { ln ⁡ 2 , x = 0 , x 1 − e − x / ln ⁡ 2 , x ≠ 0 {\displaystyle \ln \left(1+e^{x}\right)\approx {\begin{cases}\ln 2,&x=0,\\[6pt]{\frac {x}{1-e^{-x/\ln 2}}},&x\neq 0\end{cases}}} By making the change of variables x = y ln ⁡ ( 2 ) {\displaystyle x=y\ln(2)} , this is equivalent to log 2 ⁡ ( 1 + 2 y ) ≈ { 1 , y = 0 , y 1 − e − y , y ≠ 0. {\displaystyle \log _{2}(1+2^{y})\approx {\begin{cases}1,&y=0,\\[6pt]{\frac {y}{1-e^{-y}}},&y\neq 0.\end{cases}}} A sharpness parameter k {\displaystyle k} may be included: f ( x ) = ln ⁡ ( 1 + e k x ) k , f ′ ( x ) = e k x 1 + e k x = 1 1 + e − k x . {\displaystyle f(x)={\frac {\ln(1+e^{kx})}{k}},\qquad \qquad f'(x)={\frac {e^{kx}}{1+e^{kx}}}={\frac {1}{1+e^{-kx}}}.} Additionally, the softplus function is equivalent to the log of the sigmoid function in the following way: − ln ⁡ ( sigmoid ( − x ) ) = − ln ⁡ ( 1 1 + e x ) = ln ⁡ ( 1 + e x ) = softplus ( x ) {\displaystyle -\ln({\text{sigmoid}}(-x))=-\ln \left({\frac {1}{1+e^{x}}}\right)=\ln \left(1+e^{x}\right)={\text{softplus}}(x)} == Related functions == The derivative of softplus is the standard logistic function: f ′ ( x ) = e x 1 + e x = 1 1 + e − x {\displaystyle f'(x)={\frac {e^{x}}{1+e^{x}}}={\frac {1}{1+e^{-x}}}} The logistic function or the sigmoid function is a smooth approximation of the rectifier, the Heaviside step function. === LogSumExp === The multivariable generalization of single-variable softplus is the LogSumExp with the first argument set to zero: L S E 0 + ⁡ ( x 1 , … , x n ) := LSE ⁡ ( 0 , x 1 , … , x n ) = ln ⁡ ( 1 + e x 1 + ⋯ + e x n ) . {\displaystyle \operatorname {LSE_{0}} ^{+}(x_{1},\dots ,x_{n}):=\operatorname {LSE} (0,x_{1},\dots ,x_{n})=\ln(1+e^{x_{1}}+\cdots +e^{x_{n}}).} The LogSumExp function is LSE ⁡ ( x 1 , … , x n ) = ln ⁡ ( e x 1 + ⋯ + e x n ) , {\displaystyle \operatorname {LSE} (x_{1},\dots ,x_{n})=\ln(e^{x_{1}}+\cdots +e^{x_{n}}),} and its gradient is the softmax; the softmax with the first argument set to zero is the multivariable generalization of the logistic function. Both LogSumExp and softmax are used in machine learning. === Convex conjugate === The convex conjugate (specifically, the Legendre transformation) of the softplus function is the negative binary entropy function (with base e). This is because (following the definition of the Legendre transformation: the derivatives are inverse functions) the derivative of softplus is the logistic function, whose inverse function is the logit, which is the derivative of negative binary entropy. Softplus can be interpreted as logistic loss (as a positive number), so, by duality, minimizing logistic loss corresponds to maximizing entropy. This justifies the principle of maximum entropy as loss minimization.

    Read more →