AI Content Kaise Banaye Free

AI Content Kaise Banaye Free — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • DryvIQ

    DryvIQ

    DryvIQ is a software application that enables businesses to migrate on-site system files and associated data across storage and content management platforms, as well as create synchronized hybrid storage systems. == History == Before it was DryvIQ, the software SkySync was released in 2013 by Ann Arbor, Michigan based company, Portal Architects, Inc. The company created SkySync, a back-end, administrative application designed to transfer content across storage platforms, after abandoning 18 months of development on a desktop application called SkyBrary in 2011. Between 2014 and 2015, Portal Architects established partnerships with the following companies: Autodesk, Box, Dropbox, Egnyte, EMC, Google, Syncplicity, Huddle, IBM, Microsoft, OpenText, Oracle, Citrix ShareFile, Hightail and Internet2. SkySync (currently DryvIQ) was named a "Cool Vendor in Content Management" by Gartner in 2015. In 2022, SkySync changed its name to DryvIQ, which is now what the company is currently known as. == Overview == DryvIQ is a software application that syncs, migrates or backs up files including their associated properties, metadata, versions, user accounts and permissions across on-premises and Cloud-based storage platforms. The software deploys on a server, virtual machine or within Microsoft Azure, Amazon Web Services or other cloud computing services.

    Read more →
  • Targeted maximum likelihood estimation

    Targeted maximum likelihood estimation

    Targeted Maximum Likelihood Estimation (TMLE) (also more accurately referred to as Targeted Minimum Loss-Based Estimation) is a general statistical estimation framework for causal inference and semiparametric models. TMLE combines ideas from maximum likelihood estimation, semiparametric efficiency theory, and machine learning. It was introduced by Mark J. van der Laan and colleagues in the mid-2000s as a method that yields asymptotically efficient plug-in estimators while allowing the use of flexible, data-adaptive algorithms such as ensemble machine learning for nuisance parameter estimation. TMLE is used in epidemiology, biostatistics, and the social sciences to estimate causal effects in observational and experimental studies. Applications of TMLE include Longitudinal TMLE (LTMLE) for time-varying treatments and confounders. Variations in how the targeting step in TMLE is carried out have resulted in various versions of TMLE such as Collaborative TMLE (CTMLE) and Adaptive TMLE for improved finite-sample performance and automated variable selection. == History == The TMLE framework was first described by van der Laan and Rubin (2006) as a general approach for the construction of efficient plug-in estimators of smooth features of the data density. It was demonstrated in the context of causal inference and missing data problems. It was developed to address limitations of traditional doubly robust methods, such as Augmented Inverse Probability Weighting (AIPW), by respecting the plug-in principle in the sense that it respects that the target parameter is a function of the data density that is an element of the statistical model. TMLE estimates the data density or relevant parts of it with machine learning and targets these machine learning fits before it is plugged in the target parameter mapping. In this manner, a TMLE always respects global knowledge and satisfies known bounds such as that the target parameter is a probability . Since its introduction, TMLE has been developed in a series of theoretical and applied papers, culminating in book-length treatments of the method and its applications to survival analysis, adaptive designs, and longitudinal data. == Methodology == At its core, TMLE is a two-step estimation procedure: Initial estimation: Machine learning methods (such as the Super Learner ensemble) are used to obtain flexible estimates of nuisance parameters, such as outcome regressions and propensity scores. Targeting step: The initial estimate is updated by solving a score equation (the efficient influence function) so that the final estimator is consistent, asymptotically normal, and efficient under mild regularity conditions. The targeted machine learning fit is then mapped into the corresponding estimator of the target parameter by simply plugging it in the target parameter mapping. This approach balances the bias–variance trade-off by combining data-adaptive estimation with semiparametric efficiency theory. TMLE is doubly robust, meaning it remains consistent if either the outcome model or the treatment model is consistently estimated. === Formula === Here we explain the TMLE of the average treatment effect of a binary treatment on an outcome adjusting for baseline covariates. Consider i.i.d. observations O i = ( W i , A i , Y i ) {\displaystyle O_{i}=(W_{i},A_{i},Y_{i})} from a distribution P 0 {\displaystyle P_{0}} , where W {\displaystyle W} are baseline covariates, A {\displaystyle A} is a binary treatment, and Y {\displaystyle Y} is an outcome. Let Q ¯ ( a , w ) = E [ Y ∣ A = a , W = w ] {\displaystyle {\bar {Q}}(a,w)=\mathbb {E} [Y\mid A=a,W=w]} represent the outcome model and g ( a ∣ w ) = P ( A = a ∣ W = w ) {\displaystyle g(a\mid w)=P(A=a\mid W=w)} represent the propensity score. The average treatment effect (ATE) is given by ψ 0 = E { Q ¯ ( 1 , W ) − Q ¯ ( 0 , W ) } . {\displaystyle \psi _{0}=\mathbb {E} \{{\bar {Q}}(1,W)-{\bar {Q}}(0,W)\}.} A basic TMLE for the ATE proceeds as follows: Step 1: Estimate initial models. Obtain estimates Q ¯ ^ ( a , w ) {\displaystyle {\hat {\bar {Q}}}(a,w)} and g ^ ( a ∣ w ) {\displaystyle {\hat {g}}(a\mid w)} , often using flexible methods such as Super Learner. Step 2: Compute the clever covariate. Define: H ( A , W ) = A g ^ ( 1 ∣ W ) − 1 − A g ^ ( 0 ∣ W ) . {\displaystyle H(A,W)={\frac {A}{{\hat {g}}(1\mid W)}}-{\frac {1-A}{{\hat {g}}(0\mid W)}}.} Step 3: Estimate the fluctuation parameter. Fit a logistic regression of Y {\displaystyle Y} on H ( A , W ) {\displaystyle H(A,W)} with logit ⁡ ( Q ¯ ^ ( A , W ) ) {\displaystyle \operatorname {logit} ({\hat {\bar {Q}}}(A,W))} as offset. This yields ε ^ {\displaystyle {\hat {\varepsilon }}} , the MLE that solves the score equation: 1 n ∑ i = 1 n H ( A i , W i ) { Y i − Q ¯ ^ ε ( A i , W i ) } = 0. {\displaystyle {\frac {1}{n}}\sum _{i=1}^{n}H(A_{i},W_{i}){\big \{}Y_{i}-{\hat {\bar {Q}}}^{\varepsilon }(A_{i},W_{i}){\big \}}=0.} Step 4: Update the initial estimate. Apply the "blip" to obtain the targeted estimate: Q ¯ ^ ∗ ( A , W ) = expit ⁡ ( logit ⁡ ( Q ¯ ^ ( A , W ) ) + ε ^ H ( A , W ) ) . {\displaystyle {\hat {\bar {Q}}}^{}(A,W)=\operatorname {expit} {\Big (}\operatorname {logit} {\big (}{\hat {\bar {Q}}}(A,W){\big )}+{\hat {\varepsilon }}\,H(A,W){\Big )}.} Step 5: Compute the TMLE. The ATE estimate is: ψ ^ TMLE = 1 n ∑ i = 1 n [ Q ¯ ^ ∗ ( 1 , W i ) − Q ¯ ^ ∗ ( 0 , W i ) ] . {\displaystyle {\hat {\psi }}_{\text{TMLE}}={\frac {1}{n}}\sum _{i=1}^{n}{\big [}{\hat {\bar {Q}}}^{}(1,W_{i})-{\hat {\bar {Q}}}^{}(0,W_{i}){\big ]}.} Inference. The efficient influence function (EIF) for the ATE is: D ∗ ( O ) = H ( A , W ) { Y − Q ¯ ∗ ( A , W ) } + Q ¯ ∗ ( 1 , W ) − Q ¯ ∗ ( 0 , W ) − ψ . {\displaystyle D^{}(O)=H(A,W)\{Y-{\bar {Q}}^{}(A,W)\}+{\bar {Q}}^{}(1,W)-{\bar {Q}}^{}(0,W)-\psi .} The variance is estimated by σ ^ 2 = n − 1 ∑ i = 1 n ( D ∗ ( O i ) ) 2 {\displaystyle {\hat {\sigma }}^{2}=n^{-1}\sum _{i=1}^{n}{\big (}D^{}(O_{i}){\big )}^{2}} , yielding Wald-type confidence intervals ψ ^ TMLE ± z 1 − α / 2 σ ^ / n {\displaystyle {\hat {\psi }}_{\text{TMLE}}\pm z_{1-\alpha /2}\,{\hat {\sigma }}/{\sqrt {n}}} . Remark. For continuous outcomes, a linear fluctuation Q ¯ ^ ∗ = Q ¯ ^ + ε ^ H {\displaystyle {\hat {\bar {Q}}}^{}={\hat {\bar {Q}}}+{\hat {\varepsilon }}\,H} may be used instead. For bounded continuous outcomes, the logistic fluctuation (after rescaling Y {\displaystyle Y} to [ 0 , 1 ] {\displaystyle [0,1]} ) is often preferred for improved finite-sample performance. == Applications == TMLE has been applied in: Epidemiology: Estimating causal effects of exposures and interventions in observational cohort studies. Clinical trials and real-world evidence: The Targeted Learning roadmap provides a structured framework for generating and validating real-world evidence (RWE), bridging randomized trials and observational data using TMLE and related estimation techniques. This approach enables transparency, sensitivity analysis, and stronger causal inference for regulatory and clinical trial contexts. High-dimensional settings: Integration with ensemble methods for causal effect estimation. TMLE has been successfully applied in pharmacoepidemiology where a large number of covariates are automatically selected to adjust for confounding. In a study of post–myocardial infarction statin use and 1-year mortality, TMLE demonstrated robust performance relative to inverse probability weighting in scenarios with hundreds of potential confounders. == Derivatives and extensions == Longitudinal TMLE (LTMLE): A methodological extension of TMLE for longitudinal data with time-varying treatments, confounders, and censoring. It allows the estimation of dynamic treatment regimes and intervention-specific causal effects over time. This framework was originally introduced by van der Laan & Gruber (2012). Collaborative TMLE (CTMLE): Enhances finite-sample performance and variable selection by collaboratively fitting the treatment mechanism in conjunction with the target parameter. == Software == Several R packages implement TMLE and related methods: tmle: Functions for binary, categorical, and continuous outcomes. ltmle: Implementation for longitudinal data with time-varying treatments and outcomes. ctmle: Algorithms for collaborative TMLE and adaptive variable selection. SuperLearner: A theoretically grounded, cross-validated ensemble learning method that combines predictions from multiple algorithms to minimize predictive risk. Widely used in TMLE for estimating nuisance parameters. The original implementation is available as the R package SuperLearner. Recent machine learning platforms like H2O AutoML implement similar ensemble strategies, combining diverse learners in parallel and leveraging stacking and blending techniques, effectively functioning as a large-scale Super Learner.

    Read more →
  • List of text mining software

    List of text mining software

    Text mining computer programs are available from many commercial and open source companies and sources. == Commercial == Angoss – Angoss Text Analytics provides entity and theme extraction, topic categorization, sentiment analysis and document summarization capabilities via the embedded AUTINDEX – is a commercial text mining software package based on sophisticated linguistics by IAI (Institute for Applied Information Sciences), Saarbrücken. DigitalMR – social media listening & text+image analytics tool for market research. FICO Score – leading provider of analytics. General Sentiment – Social Intelligence platform that uses natural language processing to discover affinities between the fans of brands with the fans of traditional television shows in social media. Stand alone text analytics to capture social knowledge base on billions of topics stored to 2004. IBM LanguageWare – the IBM suite for text analytics (tools and Runtime). IBM SPSS – provider of Modeler Premium (previously called IBM SPSS Modeler and IBM SPSS Text Analytics), which contains advanced NLP-based text analysis capabilities (multi-lingual sentiment, event and fact extraction), that can be used in conjunction with Predictive Modeling. Text Analytics for Surveys provides the ability to categorize survey responses using NLP-based capabilities for further analysis or reporting. Inxight – provider of text analytics, search, and unstructured visualization technologies. (Inxight was bought by Business Objects that was bought by SAP AG in 2008). Language Computer Corporation – text extraction and analysis tools, available in multiple languages. Lexalytics – provider of a text analytics engine used in Social Media Monitoring, Voice of Customer, Survey Analysis, and other applications. Salience Engine. The software provides the unique capability of merging the output of unstructured, text-based analysis with structured data to provide additional predictive variables for improved predictive models and association analysis. Linguamatics – provider of natural language processing (NLP) based enterprise text mining and text analytics software, I2E, for high-value knowledge discovery and decision support. Mathematica – provides built in tools for text alignment, pattern matching, clustering and semantic analysis. See Wolfram Language, the programming language of Mathematica. MATLAB offers Text Analytics Toolbox for importing text data, converting it to numeric form for use in machine and deep learning, sentiment analysis and classification tasks. Medallia – offers one system of record for survey, social, text, written and online feedback. NetMiner – software for network analysis and text mining. Supports social media and bibliographic data collection, NLP for english and chinese, sentiment analysis, work co-occurrence network(text network analysis) and visualization. NetOwl – suite of multilingual text and entity analytics products, including entity extraction, link and event extraction, sentiment analysis, geotagging, name translation, name matching, and identity resolution, among others. PolyAnalyst - text analytics environment. PoolParty Semantic Suite - graph-based text mining platform. RapidMiner with its Text Processing Extension – data and text mining software. SAS – SAS Text Miner and Teragram; commercial text analytics, natural language processing, and taxonomy software used for Information Management. Sketch Engine – a corpus manager and analysis software which providing creating text corpora from uploaded texts or the Web including part-of-speech tagging and lemmatization or detecting a particular website. Sysomos – provider social media analytics software platform, including text analytics and sentiment analysis on online consumer conversations. WordStat – Content analysis and text mining add-on module of QDA Miner for analyzing large amounts of text data. == Open source == Carrot2 – text and search results clustering framework. GATE – general Architecture for Text Engineering, an open-source toolbox for natural language processing and language engineering. Gensim – large-scale topic modelling and extraction of semantic information from unstructured text (Python). KH Coder – for Quantitative Content Analysis or Text Mining The KNIME Text Processing extension. Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language. OpenNLP – natural language processing. Orange with its text mining add-on. The PLOS Text Mining Collection. The programming language R provides a framework for text mining applications in the package tm. The Natural Language Processing task view contains tm and other text mining library packages. spaCy – open-source Natural Language Processing library for Python Stanbol – an open source text mining engine targeted at semantic content management. Voyant Tools – a web-based text analysis environment, created as a scholarly project.

    Read more →
  • Generalized blockmodeling

    Generalized blockmodeling

    In generalized blockmodeling, the blockmodeling is done by "the translation of an equivalence type into a set of permitted block types", which differs from the conventional blockmodeling, which is using the indirect approach. It's a special instance of the direct blockmodeling approach. Generalized blockmodeling was introduced in 1994 by Patrick Doreian, Vladimir Batagelj and Anuška Ferligoj. == Definition == Generalized blockmodeling approach is a direct one, "where the optimal partition(s) is (are) identified based on minimal values of a compatible criterion function defined by the difference between empirical blocks and corresponding ideal blocks". At the same time, the much broader set of block types is introduced (while in conventional blockmodeling only certain types are used). The conventional blockmodeling is inductive due to nonspecification of neither the clusters or the location of block types, while in generalized blockmodeling the blockmodel is specified with more detail than just the permition of certain block types (e.g., prespecification). Further, it's possible to define departures from the permitted (ideal) blocktype, using criterion function. Using local optimization procedure, firstly the initial clustering (with specified number of clusters is done, based on random creation. How the clusters are neighboring to each other, is based on two transformations: 1) a vertex is moved from one to another cluster or 2) a pair of vertices is interchanged between two different clusters. This process of transformation steps is repeated many times, until only the best fitting partitions (with the minimized value of the criterion function) are kept as blockmodels for the future exploration of the network. Different types of generalized blockmodeling are: generalized binary blockmodeling, generalized valued blockmodeling and generalized homogeneity blockmodeling. == Benefits == According to Patrick Doreian, the benefits of generalized blockmodeling, are as follows: usage of explicit criterion function, compatible with a given type of equivalence, results to in-built measure of fit, which is integral to the establishment of the blockmodels (in conventional blockmodeling, there is no compelling and coherent measures of fit); partitions, based on generalized blockmodeling, regularly outperform and never perform less well than the partitions, based on conventional approach; with generalized blockmodeling it's possible to specify new types of blockmodels; this potentially unlimited set of new block types also results in permittion of inclusion of substantively driven blockmodels; in generalized blockmodeling, the specification of the block types and the location of some of them in the blockmodel is possible; researcher can speficy which (pair of) vertices must be (not) clustered together; this approach also allows the imposition of penalties, resulting into identification of empirical null blocks without inconsistencies with a corresponding ideal null block. == Problems == According to Doreian, the problems of generalized blockmodeling, are as follows: unknown sensitivity to particular data features, examination of boundary problems, computationally burdensome, which results in a constraint regarding practical network size (generalized blockmodeling is thus primarily used to analyse smaller networks (below 100 units)), identifying structure from incomplete network information, most of generalized blockmodeling is based on binary networks, but there is also development in the field of valued networks, criterion function is minimized for a specified blockmodel, with results in issues of evaluating statistically, based on the structural data alone, problems regarding three dimensional network data, problems regarding the evolution of fundamental network structure. == Book == The book with the same title, Generalized blockmodeling, written by Patrick Doreian, Vladimir Batagelj and Anuška Ferligoj, was in 2007 awarded the Harrison White Outstanding Book Award by the Mathematical Sociology Section of American Sociological Association.

    Read more →
  • Confusion matrix

    Confusion matrix

    In machine learning, a confusion matrix, also known as error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. In unsupervised learning it is usually called a matching matrix. The term is used specifically in the problem of statistical classification. Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa – both variants are found in the literature. The diagonal of the matrix therefore represents all instances that are correctly predicted. The name stems from the fact that it makes it easy to identify whether the system is confusing two classes (i.e., commonly mislabeling one class as another). The confusion matrix has its origins in human perceptual studies of auditory stimuli. It was adapted for machine learning studies and used by Frank Rosenblatt, among other early researchers, to compare human and machine classifications of visual (and later auditory) stimuli. It is a special kind of contingency table, with two dimensions ("actual" and "predicted"), and identical sets of "classes" in both dimensions (each combination of dimension and class is a variable in the contingency table). == Example == Given a sample of 12 individuals, 8 that have been diagnosed with cancer and 4 that are cancer-free, where individuals with cancer belong to class 1 (positive) and non-cancer individuals belong to class 0 (negative), we can display that data as follows: Assume that we have a classifier that distinguishes between individuals with and without cancer in some way, we can take the 12 individuals and run them through the classifier. The classifier then makes 9 accurate predictions and misses 3: 2 individuals with cancer wrongly predicted as being cancer-free (sample 1 and 2), and 1 person without cancer that is wrongly predicted to have cancer (sample 9). Notice, that if we compare the actual classification set to the predicted classification set, there are 4 different outcomes that could result in any particular column: The actual classification is positive and the predicted classification is positive (1,1). This is called a true positive result because the positive sample was correctly identified by the classifier. The actual classification is positive and the predicted classification is negative (1,0). This is called a false negative result because the positive sample is incorrectly identified by the classifier as being negative. The actual classification is negative and the predicted classification is positive (0,1). This is called a false positive result because the negative sample is incorrectly identified by the classifier as being positive. The actual classification is negative and the predicted classification is negative (0,0). This is called a true negative result because the negative sample gets correctly identified by the classifier. We can then perform the comparison between actual and predicted classifications and add this information to the table, making correct results appear in green so they are more easily identifiable. The template for any binary confusion matrix uses the four kinds of results discussed above (true positives, false negatives, false positives, and true negatives) along with the positive and negative classifications. The four outcomes can be formulated in a 2×2 confusion matrix, as follows: The color convention of the three data tables above were picked to match this confusion matrix, in order to easily differentiate the data. Now, we can simply total up each type of result, substitute into the template, and create a confusion matrix that will concisely summarize the results of testing the classifier: In this confusion matrix, of the 8 samples with cancer, the system judged that 2 were cancer-free, and of the 4 samples without cancer, it predicted that 1 did have cancer. All correct predictions are located in the diagonal of the table (highlighted in green), so it is easy to visually inspect the table for prediction errors, as values outside the diagonal will represent them. By summing up the 2 rows of the confusion matrix, one can also deduce the total number of positive (P) and negative (N) samples in the original dataset, i.e. P = T P + F N {\displaystyle P=TP+FN} and N = F P + T N {\displaystyle N=FP+TN} . == Table of confusion == In predictive analytics, a table of confusion (sometimes also called a confusion matrix) is a table with two rows and two columns that reports the number of true positives, false negatives, false positives, and true negatives. This allows more detailed analysis than simply observing the proportion of correct classifications (accuracy). Accuracy will yield misleading results if the data set is unbalanced; that is, when the numbers of observations in different classes vary greatly. For example, if there were 95 cancer samples and only 5 non-cancer samples in the data, a particular classifier might classify all the observations as having cancer. The overall accuracy would be 95%, but in more detail the classifier would have a 100% recognition rate (sensitivity) for the cancer class but a 0% recognition rate for the non-cancer class. F1 score is even more unreliable in such cases, and here would yield over 97.4%, whereas informedness removes such bias and yields 0 as the probability of an informed decision for any form of guessing (here always guessing cancer). According to Davide Chicco and Giuseppe Jurman, the most informative metric to evaluate a confusion matrix is the Matthews correlation coefficient (MCC). Other metrics can be included in a confusion matrix, each of them having their significance and use. Some researchers have argued that the confusion matrix, and the metrics derived from it, do not truly reflect a model's knowledge. In particular, the confusion matrix cannot show whether correct predictions were reached through sound reasoning or merely by chance (a problem known in philosophy as epistemic luck). It also does not capture situations where the facts used to make a prediction later change or turn out to be wrong (defeasibility). This means that while the confusion matrix is a useful tool for measuring classification performance, it may give an incomplete picture of a model’s true reliability. == Confusion matrices with more than two categories == Confusion matrix is not limited to binary classification and can be used in multi-class classifiers as well. The confusion matrices discussed above have only two conditions: positive and negative. For example, the table below summarizes communication of a whistled language between two speakers, with zero values omitted for clarity. == Confusion matrices in multi-label and soft-label classification == Confusion matrices are not limited to single-label classification (where only one class is present) or hard-label settings (where classes are either fully present, 1, or absent, 0). They can also be extended to Multi-label classification (where multiple classes can be predicted at once) and soft-label classification (where classes can be partially present). One such extension is the Transport-based Confusion Matrix (TCM), which builds on the theory of optimal transport and the principle of maximum entropy. TCM applies to single-label, multi-label, and soft-label settings. It retains the familiar structure of the standard confusion matrix: a square matrix sized by the number of classes, with diagonal entries indicating correct predictions and off-diagonal entries indicating confusion. In the single-label case, TCM is identical to the standard confusion matrix. TCM follows the same reasoning as the standard confusion matrix: if class A is overestimated (its predicted value is greater than its label value) and class B is underestimated (its predicted value is less than its label value), A is considered confused with B, and the entry (B, A) is increased. If a class is both predicted and present, it is correctly identified, and the diagonal entry (A, A) increases. Optimal transport and maximum entropy are used to determine the extent to which these entries are updated. TCM enables clearer comparison between predictions and labels in complex classification tasks, while maintaining a consistent matrix format across settings.

    Read more →
  • Markov blanket

    Markov blanket

    In statistics and machine learning, a Markov blanket of a random variable is a set of variables that renders the variable conditionally independent of all other variables in the system. This concept is central in probabilistic graphical models and feature selection. If a Markov blanket is minimal—meaning that no variable in it can be removed without losing this conditional independence—it is called a Markov boundary. Identifying a Markov blanket or boundary allows for efficient inference and helps isolate relevant variables for prediction or causal reasoning. The terms Markov blanket and Markov boundary were coined by Judea Pearl in 1988. A Markov blanket may be derived from the structure of a probabilistic graphical model such as a Bayesian network or Markov random field. == Definition == A Markov blanket of a random variable Y {\displaystyle Y} in a random variable set S = { X 1 , … , X n } {\displaystyle {\mathcal {S}}=\{X_{1},\ldots ,X_{n}\}} is any subset S 1 {\displaystyle {\mathcal {S}}_{1}} of S {\displaystyle {\mathcal {S}}} , conditioned on which other variables are independent with Y {\displaystyle Y} : Y ⊥ ⊥ S ∖ S 1 ∣ S 1 {\displaystyle Y\perp \!\!\!\perp {\mathcal {S}}\smallsetminus {\mathcal {S}}_{1}\mid {\mathcal {S}}_{1}} It means that S 1 {\displaystyle {\mathcal {S}}_{1}} contains at least all the information one needs to infer Y {\displaystyle Y} , where the variables in S ∖ S 1 {\displaystyle {\mathcal {S}}\smallsetminus {\mathcal {S}}_{1}} are redundant. In general, a given Markov blanket is not unique. Any set in S {\displaystyle {\mathcal {S}}} that contains a Markov blanket is also a Markov blanket itself. Specifically, S {\displaystyle {\mathcal {S}}} is a Markov blanket of Y {\displaystyle Y} in S {\displaystyle {\mathcal {S}}} . === Example === In a Bayesian network, the Markov blanket of a node consists of its parents, its children, and its children's other parents (i.e., co-parents). Knowing the values of these nodes makes the target node conditionally independent of the rest of the network. In a Markov random field, the Markov blanket of a node is simply its immediate neighbors. == Markov condition == The concept of a Markov blanket is rooted in the Markov condition, which states that in a probabilistic graphical model, each variable is conditionally independent of its non-descendants given its parents. This condition implies the existence of a minimal separating set — the Markov blanket — that shields a variable from the rest of the network. For instance, when a person holds an object stationary against gravity, the object’s acceleration is fully determined by its direct causes—namely, the upward force from the hand and the downward gravitational pull. Other variables such as air pressure or temperature are causally irrelevant. == Markov boundary == A Markov boundary of Y {\displaystyle Y} in S {\displaystyle {\mathcal {S}}} is a subset S 2 {\displaystyle {\mathcal {S}}_{2}} of S {\displaystyle {\mathcal {S}}} , such that S 2 {\displaystyle {\mathcal {S}}_{2}} itself is a Markov blanket of Y {\displaystyle Y} , but any proper subset of S 2 {\displaystyle {\mathcal {S}}_{2}} is not a Markov blanket of Y {\displaystyle Y} . In other words, a Markov boundary is a minimal Markov blanket. The Markov boundary of a node A {\displaystyle A} in a Bayesian network is the set of nodes composed of A {\displaystyle A} 's parents, A {\displaystyle A} 's children, and A {\displaystyle A} 's children's other parents. In a Markov random field, the Markov boundary for a node is the set of its neighboring nodes. In a dependency network, the Markov boundary for a node is the set of its parents. === Uniqueness of Markov boundary === The Markov boundary always exists. Under some mild conditions, the Markov boundary is unique. However, for most practical and theoretical scenarios multiple Markov boundaries may provide alternative solutions. When there are multiple Markov boundaries, quantities measuring causal effect could fail. == In cognitive science == In the study of consciousness, brain function, and complex adaptive systems, Markov blankets are proposed as a mathematical mechanism which delimits the extent of cognitive entities, whether it be physical or causal.

    Read more →
  • Sliced inverse regression

    Sliced inverse regression

    Sliced inverse regression (SIR) is a tool for dimensionality reduction in the field of multivariate statistics. In statistics, regression analysis is a method of studying the relationship between a response variable y and its input variable x _ {\displaystyle {\underline {x}}} , which is a p-dimensional vector. There are several approaches in the category of regression. For example, parametric methods include multiple linear regression, and non-parametric methods include local smoothing. As the number of observations needed to use local smoothing methods scales exponentially with high-dimensional data (as p grows), reducing the number of dimensions can make the operation computable. Dimensionality reduction aims to achieve this by showing only the most important dimension of the data. SIR uses the inverse regression curve, E ( x _ | y ) {\displaystyle E({\underline {x}}\,|\,y)} , to perform a weighted principal component analysis. == Model == Given a response variable Y {\displaystyle \,Y} and a (random) vector X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} of explanatory variables, SIR is based on the model Y = f ( β 1 ⊤ X , … , β k ⊤ X , ε ) ( 1 ) {\displaystyle Y=f(\beta _{1}^{\top }X,\ldots ,\beta _{k}^{\top }X,\varepsilon )\quad \quad \quad \quad \quad (1)} where β 1 , … , β k {\displaystyle \beta _{1},\ldots ,\beta _{k}} are unknown projection vectors, k {\displaystyle \,k} is an unknown number smaller than p {\displaystyle \,p} , f {\displaystyle \;f} is an unknown function on R k + 1 {\displaystyle \mathbb {R} ^{k+1}} as it only depends on k {\displaystyle \,k} arguments, and ε {\displaystyle \varepsilon } is a random variable representing error with E [ ε | X ] = 0 {\displaystyle E[\varepsilon |X]=0} and a finite variance of σ 2 {\displaystyle \sigma ^{2}} . The model describes an ideal solution, where Y {\displaystyle \,Y} depends on X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} only through a k {\displaystyle \,k} dimensional subspace; i.e., one can reduce the dimension of the explanatory variables from p {\displaystyle \,p} to a smaller number k {\displaystyle \,k} without losing any information. An equivalent version of ( 1 ) {\displaystyle \,(1)} is: the conditional distribution of Y {\displaystyle \,Y} given X {\displaystyle \,X} depends on X {\displaystyle \,X} only through the k {\displaystyle \,k} dimensional random vector ( β 1 ⊤ X , … , β k ⊤ X ) {\displaystyle (\beta _{1}^{\top }X,\ldots ,\beta _{k}^{\top }X)} . It is assumed that this reduced vector is as informative as the original X {\displaystyle \,X} in explaining Y {\displaystyle \,Y} . The unknown β i ′ s {\displaystyle \,\beta _{i}'s} are called the effective dimension reducing directions (EDR-directions). The space that is spanned by these vectors is denoted by the effective dimension reducing space (EDR-space). == Relevant linear algebra background == Given a _ 1 , … , a _ r ∈ R n {\displaystyle {\underline {a}}_{1},\ldots ,{\underline {a}}_{r}\in \mathbb {R} ^{n}} , then V := L ( a _ 1 , … , a _ r ) {\displaystyle V:=L({\underline {a}}_{1},\ldots ,{\underline {a}}_{r})} , the set of all linear combinations of these vectors is called a linear subspace and is therefore a vector space. The equation says that vectors a _ 1 , … , a _ r {\displaystyle {\underline {a}}_{1},\ldots ,{\underline {a}}_{r}} span V {\displaystyle \,V} , but the vectors that span space V {\displaystyle \,V} are not unique. The dimension of V ( ∈ R n ) {\displaystyle \,V(\in \mathbb {R} ^{n})} is equal to the maximum number of linearly independent vectors in V {\displaystyle \,V} . A set of n {\displaystyle \,n} linear independent vectors of R n {\displaystyle \mathbb {R} ^{n}} makes up a basis of R n {\displaystyle \mathbb {R} ^{n}} . The dimension of a vector space is unique, but the basis itself is not. Several bases can span the same space. Dependent vectors can still span a space, but the linear combinations of the latter are only suitable to a set of vectors lying on a straight line. == Inverse regression == Computing the inverse regression curve (IR) means instead of looking for E [ Y | X = x ] {\displaystyle \,E[Y|X=x]} , which is a curve in R p {\displaystyle \mathbb {R} ^{p}} it is actually E [ X | Y = y ] {\displaystyle \,E[X|Y=y]} , which is also a curve in R p {\displaystyle \mathbb {R} ^{p}} , but consisting of p {\displaystyle \,p} one-dimensional regressions. The center of the inverse regression curve is located at E [ E [ X | Y ] ] = E [ X ] {\displaystyle \,E[E[X|Y]]=E[X]} . Therefore, the centered inverse regression curve is E [ X | Y = y ] − E [ X ] {\displaystyle \,E[X|Y=y]-E[X]} which is a p {\displaystyle \,p} dimensional curve in R p {\displaystyle \mathbb {R} ^{p}} . == Inverse regression versus dimension reduction == The centered inverse regression curve lies on a k {\displaystyle \,k} -dimensional subspace spanned by Σ x x β i ′ s {\displaystyle \,\Sigma _{xx}\beta _{i}\,'s} . This is a connection between the model and inverse regression. Given this condition and ( 1 ) {\displaystyle \,(1)} , the centered inverse regression curve E [ X | Y = y ] − E [ X ] {\displaystyle \,E[X|Y=y]-E[X]} is contained in the linear subspace spanned by Σ x x β k ( k = 1 , … , K ) {\displaystyle \,\Sigma _{xx}\beta _{k}(k=1,\ldots ,K)} , where Σ x x = C o v ( X ) {\displaystyle \,\Sigma _{xx}=Cov(X)} . == Estimation of the EDR-directions == After having had a look at all the theoretical properties, the aim now is to estimate the EDR-directions. For that purpose, weighted principal component analyses are needed. If the sample means m ^ h ′ s {\displaystyle \,{\hat {m}}_{h}\,'s} , X {\displaystyle \,X} would have been standardized to Z = Σ x x − 1 / 2 { X − E ( X ) } {\displaystyle \,Z=\Sigma _{xx}^{-1/2}\{X-E(X)\}} . Corresponding to the theorem above, the IR-curve m 1 ( y ) = E [ Z | Y = y ] {\displaystyle \,m_{1}(y)=E[Z|Y=y]} lies in the space spanned by ( η 1 , … , η k ) {\displaystyle \,(\eta _{1},\ldots ,\eta _{k})} , where η i = Σ x x 1 / 2 β i {\displaystyle \,\eta _{i}=\Sigma _{xx}^{1/2}\beta _{i}} . As a consequence, the covariance matrix c o v [ E [ Z | Y ] ] {\displaystyle \,cov[E[Z|Y]]} is degenerate in any direction orthogonal to the η i ′ s {\displaystyle \,\eta _{i}\,'s} . Therefore, the eigenvectors η k ( k = 1 , … , K ) {\displaystyle \,\eta _{k}(k=1,\ldots ,K)} associated with the largest K {\displaystyle \,K} eigenvalues are the standardized EDR-directions. == Algorithm == === SIR algorithm === The algorithm from Li, K-C. (1991) to estimate the EDR-directions via SIR is as follows. 1. Let Σ x x {\displaystyle \,\Sigma _{xx}} be the covariance matrix of X {\displaystyle \,X} . Standardize X {\displaystyle \,X} to Z = Σ x x − 1 / 2 { X − E ( X ) } {\displaystyle \,Z=\Sigma _{xx}^{-1/2}\{X-E(X)\}} ( 1 ) {\displaystyle \,(1)} can also be rewritten as Y = f ( η 1 ⊤ Z , … , η k ⊤ Z , ε ) {\displaystyle Y=f(\eta _{1}^{\top }Z,\ldots ,\eta _{k}^{\top }Z,\varepsilon )} where η k = β k Σ x x 1 / 2 ∀ k {\displaystyle \,\eta _{k}=\beta _{k}\Sigma _{xx}^{1/2}\quad \forall \;k} .) 2. Divide the range of y i {\displaystyle \,y_{i}} into S {\displaystyle \,S} non-overlapping slices H s ( s = 1 , … , S ) . n s {\displaystyle \,H_{s}(s=1,\ldots ,S).\;n_{s}} is the number of observations within each slice and I H s {\displaystyle \,I_{H_{s}}} is the indicator function for the slice: n s = ∑ i = 1 n I H s ( y i ) {\displaystyle n_{s}=\sum _{i=1}^{n}I_{H_{s}}(y_{i})} 3. Compute the mean of z i {\displaystyle \,z_{i}} over all slices, which is a crude estimate m ^ 1 {\displaystyle \,{\hat {m}}_{1}} of the inverse regression curve m 1 {\displaystyle \,m_{1}} : z ¯ s = n s − 1 ∑ i = 1 n z i I H s ( y i ) {\displaystyle \,{\bar {z}}_{s}=n_{s}^{-1}\sum _{i=1}^{n}z_{i}I_{H_{s}}(y_{i})} 4. Calculate the estimate for C o v { m 1 ( y ) } {\displaystyle \,Cov\{m_{1}(y)\}} : V ^ = n − 1 ∑ i = 1 S n s z ¯ s z ¯ s ⊤ {\displaystyle \,{\hat {V}}=n^{-1}\sum _{i=1}^{S}n_{s}{\bar {z}}_{s}{\bar {z}}_{s}^{\top }} 5. Identify the eigenvalues λ ^ i {\displaystyle \,{\hat {\lambda }}_{i}} and the eigenvectors η ^ i {\displaystyle \,{\hat {\eta }}_{i}} of V ^ {\displaystyle \,{\hat {V}}} , which are the standardized EDR-directions. 6. Transform the standardized EDR-directions back to the original scale. The estimates for the EDR-directions are given by: β ^ i = Σ ^ x x − 1 / 2 η ^ i {\displaystyle \,{\hat {\beta }}_{i}={\hat {\Sigma }}_{xx}^{-1/2}{\hat {\eta }}_{i}} (which are not necessarily orthogonal.)

    Read more →
  • Weighted majority algorithm (machine learning)

    Weighted majority algorithm (machine learning)

    In machine learning, weighted majority algorithm (WMA) is a meta learning algorithm used to construct a compound algorithm from a pool of prediction algorithms, which could be any type of learning algorithms, classifiers, or even real human experts. The algorithm assumes that we have no prior knowledge about the accuracy of the algorithms in the pool, but there are sufficient reasons to believe that one or more will perform well. Assume that the problem is a binary decision problem. To construct the compound algorithm, a positive weight is given to each of the algorithms in the pool. The compound algorithm then collects weighted votes from all the algorithms in the pool, and gives the prediction that has a higher vote. If the compound algorithm makes a mistake, the algorithms in the pool that contributed to the wrong predicting will be discounted by a certain ratio β where 0<β<1. It can be shown that the upper bounds on the number of mistakes made in a given sequence of predictions from a pool of algorithms A {\displaystyle \mathbf {A} } is O ( l o g | A | + m ) {\displaystyle \mathbf {O(log|A|+m)} } if one algorithm in x i {\displaystyle \mathbf {x} _{i}} makes at most m {\displaystyle \mathbf {m} } mistakes. There are many variations of the weighted majority algorithm to handle different situations, like shifting targets, infinite pools, or randomized predictions. The core mechanism remains similar, with the final performances of the compound algorithm bounded by a function of the performance of the specialist (best performing algorithm) in the pool.

    Read more →
  • Endomondo

    Endomondo

    Endomondo is a health and wellness website. It allows users to track their health statistics and provides insights on fitness trends. Originally launched in 2007, Endomondo was acquired by Under Armour in 2015. Under Armour shut down Endomondo in 2020, but, by 2024, Endomondo re-launched as its own entity. == History == Endomondo started in Denmark in 2007 by Mette Lykke, Christian Birk and Jakob Nordenhof Jønck. In 2011, the company opened an office in Silicon Valley, USA, but kept its research and development department in Denmark. In 2013, Endomondo LLC was listed in Red Herring as a European finalists for promising start-ups. The same year, Christian Birk and Jakob Nordenhof Jønck left the daily operation of the company, but kept co-ownership. In February 2015, Endomondo LLC was acquired by athletic apparel maker Under Armour for $85 million. Endomondo, at that time, had over 20 million users. In October 2020, Under Armour announced that Endomondo would be shutting down and selling off MyFitnessPal to the private equity firm Francisco Partners for $345 million. Service stopped on 31 December 2020, giving customers until 15 February 2021 to download an archive of their historic data. In 2024, Endomondo.com was brought back online as a professional fitness guidance website. == Features == Endomondo provides numerous workouts, guidance on exercises, performance-enhancing nutrition, and tips. Previously, Endomondo was able to track numerous fitness attributes such as running routes, distance, duration, and calories. The software helped analyze performance and recommend improvements. There was a free and a paid version available of Endomondo. The free version had advertisements. The paid Premium version was free of advertisements and included additional features such as the possibility to create one's own training plan. The offering of additional features was different between the Android, IOS and Windows platforms, and had significantly better features for tracking performance over time than UnderArmours suggested replacement. Endomondo offered challenges of various types to the user and allowed users to create their own challenges.

    Read more →
  • Correspondence analysis

    Correspondence analysis

    Correspondence analysis (CA) is a multivariate statistical technique proposed by Herman Otto Hartley (Hirschfeld) and later developed by Jean-Paul Benzécri. It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. In a manner similar to principal component analysis, it provides a means of displaying or summarising a set of data in two-dimensional graphical form. Its aim is to display in a biplot any structure hidden in the multivariate setting of the data table. As such it is a technique from the field of multivariate ordination. Since the variant of CA described here can be applied either with a focus on the rows or on the columns it should in fact be called simple (symmetric) correspondence analysis. It is traditionally applied to the contingency table of a pair of nominal variables where each cell contains either a count or a zero value. If more than two categorical variables are to be summarized, a variant called multiple correspondence analysis should be chosen instead. CA may also be applied to binary data given the presence/absence coding represents simplified count data i.e. a 1 describes a positive count and 0 stands for a count of zero. Depending on the scores used CA preserves the chi-square distance between either the rows or the columns of the table. Because CA is a descriptive technique, it can be applied to tables regardless of a significant chi-squared test. Although the χ 2 {\displaystyle \chi ^{2}} statistic used in inferential statistics and the chi-square distance are computationally related they should not be confused since the latter works as a multivariate statistical distance measure in CA while the χ 2 {\displaystyle \chi ^{2}} statistic is in fact a scalar not a metric. == Details == Like principal components analysis, correspondence analysis creates orthogonal components (or axes) and, for each item in a table i.e. for each row, a set of scores (sometimes called factor scores, see Factor analysis). Correspondence analysis is performed on the data table, conceived as matrix C of size m × n where m is the number of rows and n is the number of columns. In the following mathematical description of the method capital letters in italics refer to a matrix while letters in italics refer to vectors. Understanding the following computations requires knowledge of matrix algebra. === Preprocessing === Before proceeding to the central computational step of the algorithm, the values in matrix C have to be transformed. First compute a set of weights for the columns and the rows (sometimes called masses), where row and column weights are given by the row and column vectors, respectively: w m = 1 n C C 1 , w n = 1 n C 1 T C . {\displaystyle w_{m}={\frac {1}{n_{C}}}C\mathbf {1} ,\quad w_{n}={\frac {1}{n_{C}}}\mathbf {1} ^{T}C.} Here n C = ∑ i = 1 n ∑ j = 1 m C i j {\displaystyle n_{C}=\sum _{i=1}^{n}\sum _{j=1}^{m}C_{ij}} is the sum of all cell values in matrix C, or short the sum of C, and 1 {\displaystyle \mathbf {1} } is a column vector of ones with the appropriate dimension. Put in simple words, w m {\displaystyle w_{m}} is just a vector whose elements are the row sums of C divided by the sum of C, and w n {\displaystyle w_{n}} is a vector whose elements are the column sums of C divided by the sum of C. The weights are transformed into diagonal matrices W m = diag ⁡ ( 1 / w m ) {\displaystyle W_{m}=\operatorname {diag} (1/{\sqrt {w_{m}}})} and W n = diag ⁡ ( 1 / w n ) {\displaystyle W_{n}=\operatorname {diag} (1/{\sqrt {w_{n}}})} where the diagonal elements of W n {\displaystyle W_{n}} are 1 / w n {\displaystyle 1/{\sqrt {w_{n}}}} and those of W m {\displaystyle W_{m}} are 1 / w m {\displaystyle 1/{\sqrt {w_{m}}}} respectively i.e. the vector elements are the inverses of the square roots of the masses. The off-diagonal elements are all 0. Next, compute matrix P {\displaystyle P} by dividing C {\displaystyle C} by its sum P = 1 n C C . {\displaystyle P={\frac {1}{n_{C}}}C.} In simple words, Matrix P {\displaystyle P} is just the data matrix (contingency table or binary table) transformed into portions i.e. each cell value is just the cell portion of the sum of the whole table. Finally, compute matrix S {\displaystyle S} , sometimes called the matrix of standardized residuals, by matrix multiplication as S = W m ( P − w m w n ) W n {\displaystyle S=W_{m}(P-w_{m}w_{n})W_{n}} Note, the vectors w m {\displaystyle w_{m}} and w n {\displaystyle w_{n}} are combined in an outer product resulting in a matrix of the same dimensions as P {\displaystyle P} . In words the formula reads: matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} is subtracted from matrix P {\displaystyle P} and the resulting matrix is scaled (weighted) by the diagonal matrices W m {\displaystyle W_{m}} and W n {\displaystyle W_{n}} . Multiplying the resulting matrix by the diagonal matrices is equivalent to multiply the i-th row (or column) of it by the i-th element of the diagonal of W m {\displaystyle W_{m}} or W n {\displaystyle W_{n}} , respectively. === Interpretation of preprocessing === The vectors w m {\displaystyle w_{m}} and w n {\displaystyle w_{n}} are the row and column masses or the marginal probabilities for the rows and columns, respectively. Subtracting matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} from matrix P {\displaystyle P} is the matrix algebra version of double centering the data. Multiplying this difference by the diagonal weighting matrices results in a matrix containing weighted deviations from the origin of a vector space. This origin is defined by matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} . In fact matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} is identical with the matrix of expected frequencies in the chi-squared test. Therefore S {\displaystyle S} is computationally related to the independence model used in that test. But since CA is not an inferential method the term independence model is inappropriate here. === Orthogonal components === The table S {\displaystyle S} is then decomposed by a singular value decomposition as S = U Σ V ∗ {\displaystyle S=U\Sigma V^{}\,} where U {\displaystyle U} and V {\displaystyle V} are the left and right singular vectors of S {\displaystyle S} and Σ {\displaystyle \Sigma } is a square diagonal matrix with the singular values σ i {\displaystyle \sigma _{i}} of S {\displaystyle S} on the diagonal. Σ {\displaystyle \Sigma } is of dimension p ≤ ( min ( m , n ) − 1 ) {\displaystyle p\leq (\min(m,n)-1)} hence U {\displaystyle U} is of dimension m×p and V {\displaystyle V} is of n×p. As orthonormal vectors U {\displaystyle U} and V {\displaystyle V} fulfill U ∗ U = V ∗ V = I {\displaystyle U^{}U=V^{}V=I} . In other words, the multivariate information that is contained in C {\displaystyle C} as well as in S {\displaystyle S} is now distributed across two (coordinate) matrices U {\displaystyle U} and V {\displaystyle V} and a diagonal (scaling) matrix Σ {\displaystyle \Sigma } . The vector space defined by them has as number of dimensions p, that is the smaller of the two values, number of rows and number of columns, minus 1. === Inertia === While a principal component analysis may be said to decompose the (co)variance, and hence its measure of success is the amount of (co-)variance covered by the first few PCA axes - measured in eigenvalue -, a CA works with a weighted (co-)variance which is called inertia. The sum of the squared singular values is the total inertia I {\displaystyle \mathrm {I} } of the data table, computed as I = ∑ i = 1 p σ i 2 . {\displaystyle \mathrm {I} =\sum _{i=1}^{p}\sigma _{i}^{2}.} The total inertia I {\displaystyle \mathrm {I} } of the data table can also computed directly from S {\displaystyle S} as I = ∑ i = 1 n ∑ j = 1 m s i j 2 . {\displaystyle \mathrm {I} =\sum _{i=1}^{n}\sum _{j=1}^{m}s_{ij}^{2}.} The amount of inertia covered by the i-th set of singular vectors is ι i {\displaystyle \iota _{i}} , the principal inertia. The higher the portion of inertia covered by the first few singular vectors i.e. the larger the sum of the principal inertiae in comparison to the total inertia, the more successful a CA is. Therefore, all principal inertia values are expressed as portion ϵ i {\displaystyle \epsilon _{i}} of the total inertia ϵ i = σ i 2 / ∑ i = 1 p σ i 2 {\displaystyle \epsilon _{i}=\sigma _{i}^{2}/\sum _{i=1}^{p}\sigma _{i}^{2}} and are presented in the form of a scree plot. In fact a scree plot is just a bar plot of all principal inertia portions ϵ i {\displaystyle \epsilon _{i}} . === Coordinates === To transform the singular vectors to coordinates which preserve the chi-square distances between rows or columns an additional weighting step is necessary. The resulting coordinates are called principal coordinates in CA text books. If principal coordinates are used for

    Read more →
  • Density-based clustering validation

    Density-based clustering validation

    Density-Based Clustering Validation (DBCV) is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like DBSCAN, Mean shift, and OPTICS. This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the Silhouette coefficient, Davies–Bouldin index, or Calinski–Harabasz index often struggle to provide meaningful evaluations. Unlike traditional validation measures, which often rely on compact and well-separated clusters, DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence. This metric was introduced in 2014 by David Moulavi and colleagues in their work. It utilizes density connectivity principles to quantify clustering structures, making it especially effective at detecting arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable. The DBCV index has been employed for clustering analysis in bioinformatics, ecology, techno-economy, and health informatics , as well as in numerous other fields. == Definition == DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset X = x 1 , x 2 , . . . , x n {\displaystyle X={x_{1},x_{2},...,x_{n}}} , a density-based algorithm partitions it into K clusters C 1 , C 2 , . . . , C K {\displaystyle {C_{1},C_{2},...,C_{K}}} . Each point x i {\displaystyle x_{i}} belongs to a specific cluster, denoted as C c l u s t e r ( x i ) {\displaystyle C_{cluster(x_{i})}} A key concept in DBCV index is the notion of density-connected paths. Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The density-based distance between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory. DBCV index extends the Silhouette coefficient by redefining cluster cohesion and separation using density-based distances: Within-cluster density distance measures how closely a point is related to other members of its cluster: a i = 1 | C c l u s t e r ( x i ) | − 1 ∑ x j ∈ C c l u s t e r ( x i ) , y ≠ x d d e n s i t y ( x j , x i ) {\displaystyle a_{i}={\frac {1}{|C_{cluster(x_{i})}|-1}}\sum _{x_{j}\in C_{cluster(x_{i})},y\neq x}d_{density}(x_{j},x_{i})} Nearest-cluster density distance quantifies how far a point is from the closest external cluster: b i = min C ≠ C cluster ( x i ) C ∈ { C 1 , … , C K } ( 1 | C | ∑ x j ∈ C d density ( x i , x j ) ) . {\displaystyle b_{i}=\min _{C\neq C_{{\text{cluster}}(x_{i})} \atop C\in \{C_{1},\dots ,C_{K}\}}\left({\frac {1}{|C|}}\sum _{x_{j}\in C}d_{\text{density}}(x_{i},x_{j})\right).} Using these measures, the DBCV index is computed as: D B C V = 1 n ∑ i = 1 n b i − a i max ( a i , b i ) {\displaystyle DBCV={\frac {1}{n}}\sum _{i=1}^{n}{\frac {b_{i}-a_{i}}{\max(a_{i},b_{i})}}} == Explanation == DBCV index values range between −1 and +1: +1: Strongly cohesive and well-separated clusters. 0: Ambiguous clustering structure. −1: Poorly formed clusters or incorrect assignments. By leveraging density-based distances instead of traditional Euclidean measures, DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.

    Read more →
  • Abess

    Abess

    abess (Adaptive Best Subset Selection, also ABESS) is a machine learning method designed to address the problem of best subset selection. It aims to determine which features or variables are crucial for optimal model performance when provided with a dataset and a prediction task. abess was introduced by Zhu in 2020 and it dynamically selects the appropriate model size adaptively, eliminating the need for selecting regularization parameters. abess is applicable in various statistical and machine learning tasks, including linear regression, the Single-index model, and other common predictive models. abess can also be applied in biostatistics. == Basic Form == The basic form of abess is employed to address the optimal subset selection problem in general linear regression. abess is an l 0 {\displaystyle l_{0}} method, it is characterized by its polynomial time complexity and the property of providing both unbiased and consistent estimates. In the context of linear regression, assuming we have knowledge of n {\displaystyle n} independent samples ( x i , y i ) , i = 1 , … , n {\displaystyle (x_{i},y_{i}),i=1,\ldots ,n} , where x i ∈ R p × 1 {\displaystyle x_{i}\in \mathbb {R} ^{p\times 1}} and y i ∈ R {\displaystyle y_{i}\in \mathbb {R} } , we define X = ( x 1 , … , x n ) ⊤ {\displaystyle X=(x_{1},\ldots ,x_{n})^{\top }} and y = ( y 1 , … , y n ) ⊤ {\displaystyle y=(y_{1},\ldots ,y_{n})^{\top }} . The following equation represents the general linear regression model: y = X β + ε . {\displaystyle y=X\beta +\varepsilon .} To obtain appropriate parameters β {\displaystyle \beta } , one can consider the loss function for linear regression: L n LR ( β ; X , y ) = 1 2 n ‖ y − X β ‖ 2 2 . {\displaystyle {\mathcal {L}}_{n}^{\text{LR}}(\beta ;X,y)={\frac {1}{2n}}\|y-X\beta \|_{2}^{2}.} In abess, the initial focus is on optimizing the loss function under the l 0 {\displaystyle l_{0}} constraint. That is, we consider the following problem: min β ∈ R p × 1 L n LR ( β ; X , y ) , subject to ‖ β ‖ 0 ≤ s , {\displaystyle \min _{\beta \in \mathbb {R} ^{p\times 1}}{\mathcal {L}}_{n}^{\text{LR}}(\beta ;X,y),{\text{ subject to }}\|\beta \|_{0}\leq s,} where s {\displaystyle s} represents the desired size of the support set, and ‖ β ‖ 0 = ∑ i = 1 p I ( β i ≠ 0 ) {\displaystyle \|\beta \|_{0}=\sum _{i=1}^{p}{\mathcal {I}}_{(\beta _{i}\neq 0)}} is the l 0 {\displaystyle l_{0}} norm of the vector. To address the optimization problem described above, abess iteratively exchanges an equal number of variables between the active set and the inactive set. In each iteration, the concept of sacrifice is introduced as follows: For j in the active set ( j ∈ A ^ {\displaystyle j\in {\hat {\mathcal {A}}}} ): ξ j = L n LR ( β ^ A ∖ { j } ) − L n LR ( β ^ A ) = X j ⊤ X j 2 n ( β ^ j ) 2 {\displaystyle \xi _{j}={\mathcal {L}}_{n}^{\text{LR}}\left({\hat {\boldsymbol {\beta }}}^{{\mathcal {A}}\backslash \{j\}}\right)-{\mathcal {L}}_{n}^{\text{LR}}\left({\hat {\boldsymbol {\beta }}}^{\mathcal {A}}\right)={\frac {{\boldsymbol {X}}_{j}^{\top }{\boldsymbol {X}}_{j}}{2n}}\left({\hat {\beta }}_{j}\right)^{2}} For j in the inactive set ( j ∉ A ^ {\displaystyle j\notin {\hat {\mathcal {A}}}} ): ξ j = L n LR ( β ^ A ) − L n LR ( β ^ A + t ^ { j } ) = X j ⊤ X j 2 n ( d ^ j X j ⊤ X j / n ) 2 {\displaystyle \xi _{j}={\mathcal {L}}_{n}^{\text{LR}}\left({\hat {\boldsymbol {\beta }}}^{\mathcal {A}}\right)-{\mathcal {L}}_{n}^{\text{LR}}\left({\hat {\boldsymbol {\beta }}}^{\mathcal {A}}+{\hat {\boldsymbol {t}}}^{\{j\}}\right)={\frac {{\boldsymbol {X}}_{j}^{\top }{\boldsymbol {X}}_{j}}{2n}}\left({\frac {{\hat {\mathrm {d} }}_{j}}{{\boldsymbol {X}}_{j}^{\top }{\boldsymbol {X}}_{j}/n}}\right)^{2}} Here are the key elements in the above equations: β ^ A {\displaystyle {\hat {\beta }}^{\mathcal {A}}} : This represents the estimate of β {\displaystyle \beta } obtained in the previous iteration. A ^ {\displaystyle {\hat {\mathcal {A}}}} : It denotes the estimated active set from the previous iteration. β ^ A ∖ { j } {\displaystyle {\hat {\boldsymbol {\beta }}}^{{\mathcal {A}}\backslash \{j\}}} : This is a vector where the j-th element is set to 0, while the other elements are the same as β ^ A {\displaystyle {\hat {\beta }}^{\mathcal {A}}} . t ^ { j } = arg ⁡ min t L n LR ( β ^ A + t { j } ) {\displaystyle {\hat {\boldsymbol {t}}}^{\{j\}}=\arg \min _{t}{\mathcal {L}}_{n}^{\text{LR}}\left({\hat {\boldsymbol {\beta }}}^{\mathcal {A}}+{\boldsymbol {t}}^{\{j\}}\right)} : Here, t { j } {\displaystyle t^{\{j\}}} represents a vector where all elements are 0 except the j-th element. d ^ j = X j ⊤ ( y − X β ^ ) / n {\displaystyle {\hat {d}}_{j}={\boldsymbol {X}}_{j}^{\top }({\boldsymbol {y}}-{\boldsymbol {X}}{\hat {\boldsymbol {\beta }}})/n} : This is calculated based on the equation mentioned. The iterative process involves exchanging variables, with the aim of minimizing the sacrifices in the active set while maximizing the sacrifices in the inactive set during each iteration. This approach allows abess to efficiently search for the optimal feature subset. In abess, select an appropriate s max {\displaystyle s_{\max }} and optimize the above problem for active sets size s = 1 , … , s max {\displaystyle s=1,\ldots ,s_{\max }} using the information criterion GIC = n log ⁡ L n LR + s log ⁡ p log ⁡ log ⁡ n , {\displaystyle {\text{GIC}}=n\log {\mathcal {L}}_{n}^{\text{LR}}+s\log p\log \log n,} to adaptively choose the appropriate active set size s {\displaystyle s} and obtain its corresponding abess estimator. == Generalizations == The splicing algorithm in abess can be employed for subset selection in other models. === Distribution-Free Location-Scale Regression === In 2023, Siegfried extends abess to the case of Distribution-Free and Location-Scale. Specifically, it considers the optimization problem max ϑ ∈ R P , β ∈ R J , γ ∈ R J ∑ i = 1 N ℓ i ( ϑ , x i ⊤ β , exp ⁡ ( x i ⊤ γ ) − 1 ) , {\displaystyle \max _{{\boldsymbol {\vartheta }}\in \mathbb {R} ^{P},{\boldsymbol {\beta }}\in \mathbb {R} ^{J},{\boldsymbol {\gamma }}\in \mathbb {R} ^{J}}\sum _{i=1}^{N}\ell _{i}\left({\boldsymbol {\vartheta }},{\boldsymbol {x}}_{i}^{\top }{\boldsymbol {\beta }},{\sqrt {\exp \left({\boldsymbol {x}}_{i}^{\top }{\boldsymbol {\gamma }}\right)}}^{-1}\right),} subject to ‖ ( β ⊤ , γ ⊤ ) ⊤ ‖ 0 ≤ s , {\displaystyle \left\|\left({\boldsymbol {\beta }}^{\top },{\boldsymbol {\gamma }}^{\top }\right)^{\top }\right\|_{0}\leq s,} where ℓ i {\displaystyle \ell _{i}} is a loss function, ϑ {\displaystyle {\boldsymbol {\vartheta }}} is a parameter vector, β {\displaystyle {\boldsymbol {\beta }}} and γ {\displaystyle {\boldsymbol {\gamma }}} are vectors, and x i {\displaystyle {\boldsymbol {x}}_{i}} is a data vector. This approach, demonstrated across various applications, enables parsimonious regression modeling for arbitrary outcomes while maintaining interpretability through innovative subset selection procedures. === Groups Selection === In 2023, Zhang applied the splicing algorithm to group selection, optimizing the following model: min β ∈ R p L n LR ( β ; X , y ) subject to ∑ j = 1 J I ( ‖ β G j ‖ 2 ≠ 0 ) ≤ s {\displaystyle \min _{{\boldsymbol {\beta }}\in \mathbb {R} ^{p}}{\mathcal {L}}_{n}^{\text{LR}}(\beta ;X,y){\text{ subject to }}\sum _{j=1}^{J}I\left(\|{\boldsymbol {\beta }}_{G_{j}}\|_{2}\neq 0\right)\leq s} Here are the symbols involved: J {\displaystyle J} : Total number of feature groups, representing the existence of J {\displaystyle J} non-overlapping feature groups in the dataset. G j {\displaystyle G_{j}} : Index set for the j {\displaystyle j} -th feature group, where j {\displaystyle j} ranges from 1 to J {\displaystyle J} , representing the feature grouping structure in the data. s {\displaystyle s} : Model size, a positive integer determined from the data, limiting the number of selected feature groups. === Regression with Corrupted Data === Zhang applied the splicing algorithm to handle corrupted data. Corrupted data refers to information that has been disrupted or contains errors during the data collection or recording process. This interference may include sensor inaccuracies, recording errors, communication issues, or other external disturbances, leading to inaccurate or distorted observations within the dataset. === Single Index Models === In 2023, Tang applied the splicing algorithm to optimal subset selection in the Single-index model. The form of the Single Index Model (SIM) is given by y i = g ( b ⊤ x i , e i ) , i = 1 , … , n , {\displaystyle y_{i}=g({\boldsymbol {b}}^{\top }{\boldsymbol {x}}_{i},e_{i}),\quad i=1,\ldots ,n,} where b {\displaystyle {\boldsymbol {b}}} is the parameter vector, e i {\displaystyle e_{i}} is the error term. The corresponding loss function is defined as l n ( β ) = ∑ i = 1 n ( r i n − 1 2 − x i ⊤ β ) 2 , {\displaystyle l_{n}({\boldsymbol {\beta }})=\sum _{i=1}^{n}\left({\frac {r_{i}}{n}}-{\frac {1}{2}}-{\boldsymbol {x}}_{i}^{\top }{\boldsymbol {\beta }}\right)^{2},} where r {\disp

    Read more →
  • Semantic analysis (machine learning)

    Semantic analysis (machine learning)

    In machine learning, semantic analysis of a text corpus is the task of building structures that approximate concepts from a large set of documents. It generally does not involve prior semantic understanding of the documents. Semantic analysis strategies include: Metalanguages based on first-order logic, which can analyze the speech of humans. Understanding the semantics of a text is symbol grounding: if language is grounded, it is equal to recognizing a machine-readable meaning. For the restricted domain of spatial analysis, a computer-based language understanding system was demonstrated. Latent semantic analysis (LSA), a class of techniques where documents are represented as vectors in a term space. A prominent example is probabilistic latent semantic analysis (PLSA). Latent Dirichlet allocation, which involves attributing document terms to topics. n-grams and hidden Markov models, which work by representing the term stream as a Markov chain, in which each term is derived from preceding terms. == Stochastic semantic analysis ==

    Read more →
  • Detrended correspondence analysis

    Detrended correspondence analysis

    Detrended correspondence analysis (DCA) is a multivariate statistical technique widely used by ecologists to find the main factors or gradients in large, species-rich but usually sparse data matrices that typify ecological community data. DCA is frequently used to suppress artifacts inherent in most other multivariate analyses when applied to gradient data. == History == DCA was created in 1979 by Mark Hill of the United Kingdom's Institute for Terrestrial Ecology (now merged into Centre for Ecology and Hydrology) and implemented in FORTRAN code package called DECORANA (Detrended Correspondence Analysis), a correspondence analysis method. DCA is sometimes erroneously referred to as DECORANA; however, DCA is the underlying algorithm, while DECORANA is a tool implementing it. == Issues addressed == According to Hill and Gauch, DCA suppresses two artifacts inherent in most other multivariate analyses when applied to gradient data. An example is a time-series of plant species colonising a new habitat; early successional species are replaced by mid-successional species, then by late successional ones (see example below). When such data are analysed by a standard ordination such as a correspondence analysis: the ordination scores of the samples will exhibit the 'edge effect', i.e. the variance of the scores at the beginning and the end of a regular succession of species will be considerably smaller than that in the middle, when presented as a graph the points will be seen to follow a horseshoe shaped curve rather than a straight line ('arch effect'), even though the process under analysis is a steady and continuous change that human intuition would prefer to see as a linear trend. Outside ecology, the same artifacts occur when gradient data are analysed (e.g. soil properties along a transect running between 2 different geologies, or behavioural data over the lifespan of an individual) because the curved projection is an accurate representation of the shape of the data in multivariate space. Ter Braak and Prentice (1987, p. 121) cite a simulation study analysing two-dimensional species packing models resulting in a better performance of DCA compared to CA. == Method == DCA is an iterative algorithm that has shown itself to be a highly reliable and useful tool for data exploration and summary in community ecology (Shaw 2003). It starts by running a standard ordination (CA or reciprocal averaging) on the data, to produce the initial horse-shoe curve in which the 1st ordination axis distorts into the 2nd axis. It then divides the first axis into segments (default = 26), and rescales each segment to have mean value of zero on the 2nd axis - this effectively squashes the curve flat. It also rescales the axis so that the ends are no longer compressed relative to the middle, so that 1 DCA unit approximates to the same rate of turnover all the way through the data: the rule of thumb is that 4 DCA units mean that there has been a total turnover in the community. Ter Braak and Prentice (1987, p. 122) warn against the non-linear rescaling of the axes due to robustness issues and recommend using detrending-by-polynomials only. == Drawbacks == No significance tests are available with DCA, although there is a constrained (canonical) version called DCCA in which the axes are forced by Multiple linear regression to correlate optimally with a linear combination of other (usually environmental) variables; this allows testing of a null model by Monte-Carlo permutation analysis. == Example == The example shows an ideal data set: The species data is in rows, samples in columns. For each sample along the gradient, a new species is introduced but another species is no longer present. The result is a sparse matrix. Ones indicate the presence of a species in a sample. Except at the edges each sample contains five species. The plot of the first two axes of the correspondence analysis result on the right hand side clearly shows the disadvantages of this procedure: the edge effect, i.e. the points are clustered at the edges of the first axis, and the arch effect. == Software == An open source implementation of DCA, based on the original FORTRAN code, is available in the vegan R-package.

    Read more →
  • Latent space

    Latent space

    A latent space, also known as a latent feature space or embedding space, is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another. Position within the latent space can be viewed as being defined by a set of latent variables that emerge from the resemblances between the objects. In most cases, the dimensionality of the latent space is chosen to be lower than the dimensionality of the feature space from which the data points are drawn, making the construction of a latent space an example of dimensionality reduction, which can also be viewed as a form of data compression. Latent spaces are usually fit via machine learning, and they can then be used as feature spaces in machine learning models, including classifiers and other supervised predictors. The interpretation of latent spaces in machine learning models is an ongoing area of research, but achieving clear interpretations remains challenging. The black-box nature of these models often makes the latent space unintuitive, while its high-dimensional, complex, and nonlinear characteristics further complicate the task of understanding it. Analysis of the latent space geometry of diffusion models reveals a fractal structure of phase transitions in the latent space, characterized by abrupt changes in the Fisher information metric. Some visualization techniques have been developed to connect the latent space to the visual world, but there is often not a direct connection between the latent space interpretation and the model itself. Such techniques include t-distributed stochastic neighbor embedding (t-SNE), where the latent space is mapped to two dimensions for visualization. Latent space distances lack physical units, so the interpretation of these distances may depend on the application. == Embedding models == Several embedding models have been developed to perform this transformation to create latent space embeddings given a set of data items and a similarity function. These models learn the embeddings by leveraging statistical techniques and machine learning algorithms. Here are some commonly used embedding models: Word2Vec: Word2Vec is a popular embedding model used in natural language processing (NLP). It learns word embeddings by training a neural network on a large corpus of text. Word2Vec captures semantic and syntactic relationships between words, allowing for meaningful computations like word analogies. GloVe: GloVe (Global Vectors for Word Representation) is another widely used embedding model for NLP. It combines global statistical information from a corpus with local context information to learn word embeddings. GloVe embeddings are known for capturing both semantic and relational similarities between words. Siamese Networks: Siamese networks are a type of neural network architecture commonly used for similarity-based embedding. They consist of two identical subnetworks that process two input samples and produce their respective embeddings. Siamese networks are often used for tasks like image similarity, recommendation systems, and face recognition. Variational Autoencoders (VAEs): VAEs are generative models that simultaneously learn to encode and decode data. The latent space in VAEs acts as an embedding space. By training VAEs on high-dimensional data, such as images or audio, the model learns to encode the data into a compact latent representation. VAEs are known for their ability to generate new data samples from the learned latent space. == Multimodality == Multimodality refers to the integration and analysis of multiple modes or types of data within a single model or framework. Embedding multimodal data involves capturing relationships and interactions between different data types, such as images, text, audio, and structured data. Multimodal embedding models aim to learn joint representations that fuse information from multiple modalities, allowing for cross-modal analysis and tasks. These models enable applications like image captioning, visual question answering, and multimodal sentiment analysis. To embed multimodal data, specialized architectures such as deep multimodal networks or multimodal transformers are employed. These architectures combine different types of neural network modules to process and integrate information from various modalities. The resulting embeddings capture the complex relationships between different data types, facilitating multimodal analysis and understanding. == Applications == Embedding latent space and multimodal embedding models have found numerous applications across various domains: Information retrieval: Embedding techniques enable efficient similarity search and recommendation systems by representing data points in a compact space. Natural language processing: Word embeddings have revolutionized NLP tasks like sentiment analysis, machine translation, and document classification. Computer vision: Image and video embeddings enable tasks like object recognition, image retrieval, and video summarization. Recommendation systems: Embeddings help capture user preferences and item characteristics, enabling personalized recommendations. Healthcare: Embedding techniques have been applied to electronic health records, medical imaging, and genomic data for disease prediction, diagnosis, and treatment. Social systems: Embedding techniques can be used to learn latent representations of social systems such as internal migration systems, academic citation networks, and world trade networks.

    Read more →