AI Coding Godot

AI Coding Godot — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Uncertain data

    Uncertain data

    In computer science, uncertain data is data that contains noise that makes it deviate from the correct, intended or original values. In the age of big data, uncertainty or data veracity is one of the defining characteristics of data. Data is constantly growing in volume, variety, velocity and uncertainty (1/veracity). Uncertain data is found in abundance today on the web, in sensor networks, within enterprises both in their structured and unstructured sources. For example, there may be uncertainty regarding the address of a customer in an enterprise dataset, or the temperature readings captured by a sensor due to aging of the sensor. In 2012 IBM called out managing uncertain data at scale in its global technology outlook report that presents a comprehensive analysis looking three to ten years into the future seeking to identify significant, disruptive technologies that will change the world. In order to make confident business decisions based on real-world data, analyses must necessarily account for many different kinds of uncertainty present in very large amounts of data. Analyses based on uncertain data will have an effect on the quality of subsequent decisions, so the degree and types of inaccuracies in this uncertain data cannot be ignored. Uncertain data is found in the area of sensor networks; text where noisy text is found in abundance on social media, web and within enterprises where the structured and unstructured data may be old, outdated, or plain incorrect; in modeling where the mathematical model may only be an approximation of the actual process. When representing such data in a database, an appropriate uncertain database model needs to be selected. == Example data model for uncertain data == One way to represent uncertain data is through probability distributions. Let us take the example of a relational database. There are three main ways to do represent uncertainty as probability distributions in such a database model. In attribute uncertainty, each uncertain attribute in a tuple is subject to its own independent probability distribution. For example, if readings are taken of temperature and wind speed, each would be described by its own probability distribution, as knowing the reading for one measurement would not provide any information about the other. In correlated uncertainty, multiple attributes may be described by a joint probability distribution. For example, if readings are taken of the position of an object, and the x- and y-coordinates stored, the probability of different values may depend on the distance from the recorded coordinates. As distance depends on both coordinates, it may be appropriate to use a joint distribution for these coordinates, as they are not independent. In tuple uncertainty, all the attributes of a tuple are subject to a joint probability distribution. This covers the case of correlated uncertainty, but also includes the case where there is a probability of a tuple not belonging in the relevant relation, which is indicated by all the probabilities not summing to one. For example, assume we have the following tuple from a probabilistic database: Then, the tuple has 10% chance of not existing in the database.

    Read more →
  • Dominance-based rough set approach

    Dominance-based rough set approach

    The dominance-based rough set approach (DRSA) is an extension of rough set theory for multi-criteria decision analysis (MCDA), introduced by Greco, Matarazzo and Słowiński. The main change compared to the classical rough sets is the substitution for the indiscernibility relation by a dominance relation, which permits one to deal with inconsistencies typical to consideration of criteria and preference-ordered decision classes. == Multicriteria classification (sorting) == Multicriteria classification (sorting) is one of the problems considered within MCDA and can be stated as follows: given a set of objects evaluated by a set of criteria (attributes with preference-order domains), assign these objects to some pre-defined and preference-ordered decision classes, such that each object is assigned to exactly one class. Due to the preference ordering, improvement of evaluations of an object on the criteria should not worsen its class assignment. The sorting problem is very similar to the problem of classification, however, in the latter, the objects are evaluated by regular attributes and the decision classes are not necessarily preference ordered. The problem of multicriteria classification is also referred to as ordinal classification problem with monotonicity constraints and often appears in real-life application when ordinal and monotone properties follow from the domain knowledge about the problem. As an illustrative example, consider the problem of evaluation in a high school. The director of the school wants to assign students (objects) to three classes: bad, medium and good (notice that class good is preferred to medium and medium is preferred to bad). Each student is described by three criteria: level in Physics, Mathematics and Literature, each taking one of three possible values bad, medium and good. Criteria are preference-ordered and improving the level from one of the subjects should not result in worse global evaluation (class). As a more serious example, consider classification of bank clients, from the viewpoint of bankruptcy risk, into classes safe and risky. This may involve such characteristics as "return on equity (ROE)", "return on investment (ROI)" and "return on sales (ROS)". The domains of these attributes are not simply ordered but involve a preference order since, from the viewpoint of bank managers, greater values of ROE, ROI or ROS are better for clients being analysed for bankruptcy risk . Thus, these attributes are criteria. Neglecting this information in knowledge discovery may lead to wrong conclusions. == Data representation == === Decision table === In DRSA, data are often presented using a particular form of decision table. Formally, a DRSA decision table is a 4-tuple S = ⟨ U , Q , V , f ⟩ {\displaystyle S=\langle U,Q,V,f\rangle } , where U {\displaystyle U\,\!} is a finite set of objects, Q {\displaystyle Q\,\!} is a finite set of criteria, V = ⋃ q ∈ Q V q {\displaystyle V=\bigcup {}_{q\in Q}V_{q}} where V q {\displaystyle V_{q}\,\!} is the domain of the criterion q {\displaystyle q\,\!} and f : U × Q → V {\displaystyle f\colon U\times Q\to V} is an information function such that f ( x , q ) ∈ V q {\displaystyle f(x,q)\in V_{q}} for every ( x , q ) ∈ U × Q {\displaystyle (x,q)\in U\times Q} . The set Q {\displaystyle Q\,\!} is divided into condition criteria (set C ≠ ∅ {\displaystyle C\neq \emptyset } ) and the decision criterion (class) d {\displaystyle d\,\!} . Notice, that f ( x , q ) {\displaystyle f(x,q)\,\!} is an evaluation of object x {\displaystyle x\,\!} on criterion q ∈ C {\displaystyle q\in C} , while f ( x , d ) {\displaystyle f(x,d)\,\!} is the class assignment (decision value) of the object. An example of decision table is shown in Table 1 below. === Outranking relation === It is assumed that the domain of a criterion q ∈ Q {\displaystyle q\in Q} is completely preordered by an outranking relation ⪰ q {\displaystyle \succeq _{q}} ; x ⪰ q y {\displaystyle x\succeq _{q}y} means that x {\displaystyle x\,\!} is at least as good as (outranks) y {\displaystyle y\,\!} with respect to the criterion q {\displaystyle q\,\!} . Without loss of generality, we assume that the domain of q {\displaystyle q\,\!} is a subset of reals, V q ⊆ R {\displaystyle V_{q}\subseteq \mathbb {R} } , and that the outranking relation is a simple order between real numbers ≥ {\displaystyle \geq \,\!} such that the following relation holds: x ⪰ q y ⟺ f ( x , q ) ≥ f ( y , q ) {\displaystyle x\succeq _{q}y\iff f(x,q)\geq f(y,q)} . This relation is straightforward for gain-type ("the more, the better") criterion, e.g. company profit. For cost-type ("the less, the better") criterion, e.g. product price, this relation can be satisfied by negating the values from V q {\displaystyle V_{q}\,\!} . === Decision classes and class unions === Let T = { 1 , … , n } {\displaystyle T=\{1,\ldots ,n\}\,\!} . The domain of decision criterion, V d {\displaystyle V_{d}\,\!} consist of n {\displaystyle n\,\!} elements (without loss of generality we assume V d = T {\displaystyle V_{d}=T\,\!} ) and induces a partition of U {\displaystyle U\,\!} into n {\displaystyle n\,\!} classes Cl = { C l t , t ∈ T } {\displaystyle {\textbf {Cl}}=\{Cl_{t},t\in T\}} , where C l t = { x ∈ U : f ( x , d ) = t } {\displaystyle Cl_{t}=\{x\in U\colon f(x,d)=t\}} . Each object x ∈ U {\displaystyle x\in U} is assigned to one and only one class C l t , t ∈ T {\displaystyle Cl_{t},t\in T} . The classes are preference-ordered according to an increasing order of class indices, i.e. for all r , s ∈ T {\displaystyle r,s\in T} such that r ≥ s {\displaystyle r\geq s\,\!} , the objects from C l r {\displaystyle Cl_{r}\,\!} are strictly preferred to the objects from C l s {\displaystyle Cl_{s}\,\!} . For this reason, we can consider the upward and downward unions of classes, defined respectively, as: C l t ≥ = ⋃ s ≥ t C l s C l t ≤ = ⋃ s ≤ t C l s t ∈ T {\displaystyle Cl_{t}^{\geq }=\bigcup _{s\geq t}Cl_{s}\qquad Cl_{t}^{\leq }=\bigcup _{s\leq t}Cl_{s}\qquad t\in T} == Main concepts == === Dominance === We say that x {\displaystyle x\,\!} dominates y {\displaystyle y\,\!} with respect to P ⊆ C {\displaystyle P\subseteq C} , denoted by x D p y {\displaystyle xD_{p}y\,\!} , if x {\displaystyle x\,\!} is better than y {\displaystyle y\,\!} on every criterion from P {\displaystyle P\,\!} , x ⪰ q y , ∀ q ∈ P {\displaystyle x\succeq _{q}y,\,\forall q\in P} . For each P ⊆ C {\displaystyle P\subseteq C} , the dominance relation D P {\displaystyle D_{P}\,\!} is reflexive and transitive, i.e. it is a partial pre-order. Given P ⊆ C {\displaystyle P\subseteq C} and x ∈ U {\displaystyle x\in U} , let D P + ( x ) = { y ∈ U : y D p x } {\displaystyle D_{P}^{+}(x)=\{y\in U\colon yD_{p}x\}} D P − ( x ) = { y ∈ U : x D p y } {\displaystyle D_{P}^{-}(x)=\{y\in U\colon xD_{p}y\}} represent P-dominating set and P-dominated set with respect to x ∈ U {\displaystyle x\in U} , respectively. === Rough approximations === The key idea of the rough set philosophy is approximation of one knowledge by another knowledge. In DRSA, the knowledge being approximated is a collection of upward and downward unions of decision classes and the "granules of knowledge" used for approximation are P-dominating and P-dominated sets. The P-lower and the P-upper approximation of C l t ≥ , t ∈ T {\displaystyle Cl_{t}^{\geq },t\in T} with respect to P ⊆ C {\displaystyle P\subseteq C} , denoted as P _ ( C l t ≥ ) {\displaystyle {\underline {P}}(Cl_{t}^{\geq })} and P ¯ ( C l t ≥ ) {\displaystyle {\overline {P}}(Cl_{t}^{\geq })} , respectively, are defined as: P _ ( C l t ≥ ) = { x ∈ U : D P + ( x ) ⊆ C l t ≥ } {\displaystyle {\underline {P}}(Cl_{t}^{\geq })=\{x\in U\colon D_{P}^{+}(x)\subseteq Cl_{t}^{\geq }\}} P ¯ ( C l t ≥ ) = { x ∈ U : D P − ( x ) ∩ C l t ≥ ≠ ∅ } {\displaystyle {\overline {P}}(Cl_{t}^{\geq })=\{x\in U\colon D_{P}^{-}(x)\cap Cl_{t}^{\geq }\neq \emptyset \}} Analogously, the P-lower and the P-upper approximation of C l t ≤ , t ∈ T {\displaystyle Cl_{t}^{\leq },t\in T} with respect to P ⊆ C {\displaystyle P\subseteq C} , denoted as P _ ( C l t ≤ ) {\displaystyle {\underline {P}}(Cl_{t}^{\leq })} and P ¯ ( C l t ≤ ) {\displaystyle {\overline {P}}(Cl_{t}^{\leq })} , respectively, are defined as: P _ ( C l t ≤ ) = { x ∈ U : D P − ( x ) ⊆ C l t ≤ } {\displaystyle {\underline {P}}(Cl_{t}^{\leq })=\{x\in U\colon D_{P}^{-}(x)\subseteq Cl_{t}^{\leq }\}} P ¯ ( C l t ≤ ) = { x ∈ U : D P + ( x ) ∩ C l t ≤ ≠ ∅ } {\displaystyle {\overline {P}}(Cl_{t}^{\leq })=\{x\in U\colon D_{P}^{+}(x)\cap Cl_{t}^{\leq }\neq \emptyset \}} Lower approximations group the objects which certainly belong to class union C l t ≥ {\displaystyle Cl_{t}^{\geq }} (respectively C l t ≤ {\displaystyle Cl_{t}^{\leq }} ). This certainty comes from the fact, that object x ∈ U {\displaystyle x\in U} belongs to the lower approximation P _ ( C l t ≥ ) {\displaystyle {\underline {P}}(Cl_{t}^{\geq })} (respectively P _ ( C l t ≤ ) {\displaystyle {\underl

    Read more →
  • Apache Mahout

    Apache Mahout

    Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily on linear algebra. In the past, many of the implementations use the Apache Hadoop platform, however today it is primarily focused on Apache Spark. Mahout also provides Java/Scala libraries for common math operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; a number of algorithms have been implemented. == Features == === Samsara === Apache Mahout-Samsara refers to a Scala domain-specific language (DSL) that allows users to use R-like syntax as opposed to traditional Scala-like syntax. This allows user to express algorithms concisely and clearly. === Backend agnostic === Apache Mahout's code abstracts the domain-specific language from the engine where the code is run. While active development is done with the Apache Spark engine, users are free to implement any engine they choose- H2O and Apache Flink have been implemented in the past and examples exist in the code base. === GPU/CPU accelerators === The JVM has notoriously slow computation. To improve speed, "native solvers" were added which move in-core, and by extension, distributed BLAS operations out of the JVM, offloading to off-heap or GPU memory for processing via multiple CPUs and/or CPU cores, or GPUs when built against the ViennaCL library. ViennaCL is a highly optimized C++ library with BLAS operations implemented in OpenMP, and OpenCL. As of release 14.1, the OpenMP build considered to be stable, leaving the OpenCL build is still in its experimental proof-of-concept phase. === Recommenders === Apache Mahout features implementations of Alternating Least Squares, Co-Occurrence, and Correlated Co-Occurrence, a unique-to-Mahout recommender algorithm that extends co-occurrence to be used on multiple dimensions of data. == History == === Transition from Map Reduce to Apache Spark === While Mahout's core algorithms for clustering, classification and batch based collaborative filtering were implemented on top of Apache Hadoop using the map/reduce paradigm, it did not restrict contributions to Hadoop-based implementations. Contributions that run on a single node or on a non-Hadoop cluster were also welcomed. For example, the 'Taste' collaborative-filtering recommender component of Mahout was originally a separate project and can run stand-alone without Hadoop. Starting with the release 0.10.0, the project shifted its focus to building a backend-independent programming environment, code named "Samsara". The environment consists of an algebraic backend-independent optimizer and an algebraic Scala DSL unifying in-memory and distributed algebraic operators. Supported algebraic platforms are Apache Spark, H2O, and Apache Flink. Support for MapReduce algorithms started being gradually phased out in 2014. === Release history === === Developers === Apache Mahout is developed by a community. The project is managed by a group called the "Project Management Committee" (PMC). The current PMC is Andrew Musselman, Andrew Palumbo, Drew Farris, Isabel Drost-Fromm, Jake Mannix, Pat Ferrel, Paritosh Ranjan, Trevor Grant, Robin Anil, Sebastian Schelter, Stevo Slavić.

    Read more →
  • Vapnik–Chervonenkis dimension

    Vapnik–Chervonenkis dimension

    In Vapnik–Chervonenkis theory, the Vapnik–Chervonenkis (VC) dimension is a measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a class of sets. The notion can be extended to classes of binary functions. It is defined as the cardinality of the largest set of points that the function class can shatter—that is, for which all possible binary labelings can be realized by some function in the class. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis. Informally, the capacity of a classification model is related to how complicated it can be. For example, consider the thresholding of a high-degree polynomial: if the polynomial evaluates above zero, that point is classified as positive, otherwise as negative. A high-degree polynomial can be wiggly, so that it can fit a given set of training points well. Such a polynomial has a high capacity. A much simpler alternative is to threshold a linear function. This function may not fit the training set well, because it has a low capacity. This notion of capacity is made rigorous below. == Definitions == === VC dimension of a set-family === Let C = { C } C ∈ C {\displaystyle {\mathcal {C}}=\{C\}_{C\in {\mathcal {C}}}} be a family of sets (also called set family, collection of sets or set of sets) and X {\displaystyle X} a set. Their intersection is defined as the following set family: C ∩ X := { C ∩ X ∣ C ∈ C } . {\displaystyle {\mathcal {C}}\cap X:=\{C\cap X\mid C\in {\mathcal {C}}\}.} Here typically X {\displaystyle X} and each C ∈ C {\displaystyle C\in {\mathcal {C}}} are subsets of a big "universe" of possibilities U {\displaystyle U} where intersection takes place. We say that a set X {\displaystyle X} is shattered by C {\displaystyle {\mathcal {C}}} if P ( X ) = C ∩ X {\displaystyle {\mathcal {P}}(X)={\mathcal {C}}\cap X} i.e. the set of intersections contains (hence is equal to) all the subsets of X {\displaystyle X} . For finite sets X {\displaystyle X} this is equivalent to | C ∩ X | = 2 | X | . {\displaystyle |{\mathcal {C}}\cap X|=2^{|X|}.} The VC dimension D {\displaystyle D} of C {\displaystyle {\mathcal {C}}} is the cardinality of the largest set that is shattered by C {\displaystyle {\mathcal {C}}} . If arbitrarily large sets can be shattered, the VC dimension of C {\displaystyle {\mathcal {C}}} is ∞ {\displaystyle \infty } . === VC dimension of a classification model === A binary classification model f {\displaystyle f} with some parameter vector θ {\displaystyle \theta } is said to shatter a set of generally positioned data points ( x 1 , x 2 , … , x n ) {\displaystyle (x_{1},x_{2},\ldots ,x_{n})} if, for every assignment of labels to those points, there exists a θ {\displaystyle \theta } such that the model f {\displaystyle f} makes no errors when evaluating that set of data points. The VC dimension of a model f {\displaystyle f} is the maximum number of points that can be arranged so that f {\displaystyle f} shatters them. More formally, it is the maximum cardinal D {\displaystyle D} such that there exists a generally positioned data point set of cardinality D {\displaystyle D} that can be shattered by f {\displaystyle f} . == Examples == f {\displaystyle f} is a constant classifier (with no parameters); Its VC dimension is 0 since it cannot shatter even a single point. In general, the VC dimension of a finite classification model, which can return at most 2 d {\displaystyle 2^{d}} different classifiers, is at most d {\displaystyle d} (this is an upper bound on the VC dimension; the Sauer–Shelah lemma gives a lower bound on the dimension). f {\displaystyle f} is a single-parametric threshold classifier on real numbers; i.e., for a certain threshold θ {\displaystyle \theta } , the classifier f θ {\displaystyle f_{\theta }} returns 1 if the input number is larger than θ {\displaystyle \theta } and 0 otherwise. The VC dimension of f {\displaystyle f} is 1 because: (a) It can shatter a single point. For every point x {\displaystyle x} , a classifier f θ {\displaystyle f_{\theta }} labels it as 0 if θ > x {\displaystyle \theta >x} and labels it as 1 if θ < x {\displaystyle \theta x + 2 {\displaystyle \theta >x+2} , as (1,0) if θ ∈ [ x − 4 , x − 2 ) {\displaystyle \theta \in [x-4,x-2)} , as (1,1) if θ ∈ [ x − 2 , x ] {\displaystyle \theta \in [x-2,x]} , and as (0,1) if θ ∈ ( x , x + 2 ] {\displaystyle \theta \in (x,x+2]} . (b) It cannot shatter any set of three points. For every set of three numbers, if the smallest and the largest are labeled 1, then the middle one must also be labeled 1, so not all labelings are possible. f {\displaystyle f} is a straight line as a classification model on points in a two-dimensional plane (this is the model used by a perceptron). The line should separate positive data points from negative data points. There exist sets of 3 points that can indeed be shattered using this model (any 3 points that are not collinear can be shattered). However, no set of 4 points can be shattered: by Radon's theorem, any four points can be partitioned into two subsets with intersecting convex hulls, so it is not possible to separate one of these two subsets from the other. Thus, the VC dimension of this particular classifier is 3. It is important to remember that while one can choose any arrangement of points, the arrangement of those points cannot change when attempting to shatter for some label assignment. Note, only 3 of the 23 = 8 possible label assignments are shown for the three points. f {\displaystyle f} is a single-parametric sine classifier, i.e., for a certain parameter θ {\displaystyle \theta } , the classifier f θ {\displaystyle f_{\theta }} returns 1 if the input number x {\displaystyle x} has sin ⁡ ( θ x ) > 0 {\displaystyle \sin(\theta x)>0} and 0 otherwise. The VC dimension of f {\displaystyle f} is infinite, since it can shatter any finite subset of the set { 2 − m ∣ m ∈ N } {\displaystyle \{2^{-m}\mid m\in \mathbb {N} \}} . == Uses == === In statistical learning theory === The VC dimension can predict a probabilistic upper bound on the test error of a classification model. Vapnik proved that the probability of the test error (i.e., risk with 0–1 loss function) distancing from an upper bound (on data that is drawn i.i.d. from the same distribution as the training set) is given by: Pr ( test error ⩽ training error + 1 N [ D ( log ⁡ ( 2 N D ) + 1 ) − log ⁡ ( η 4 ) ] ) = 1 − η , {\displaystyle \Pr \left({\text{test error}}\leqslant {\text{training error}}+{\sqrt {{\frac {1}{N}}\left[D\left(\log \left({\tfrac {2N}{D}}\right)+1\right)-\log \left({\tfrac {\eta }{4}}\right)\right]}}\,\right)=1-\eta ,} where D {\displaystyle D} is the VC dimension of the classification model, 0 < η ⩽ 1 {\displaystyle 0<\eta \leqslant 1} , and N {\displaystyle N} is the size of the training set (restriction: this formula is valid when D ≪ N {\displaystyle D\ll N} . When D {\displaystyle D} is larger, the test-error may be much higher than the training-error. This is due to overfitting). The VC dimension also appears in sample-complexity bounds. A space of binary functions with VC dimension D {\displaystyle D} can be learned with: N = Θ ( D + ln ⁡ 1 δ ε 2 ) {\displaystyle N=\Theta \left({\frac {D+\ln {1 \over \delta }}{\varepsilon ^{2}}}\right)} samples, where ε {\displaystyle \varepsilon } is the learning error and δ {\displaystyle \delta } is the failure probability. Thus, the sample-complexity is a linear function of the VC dimension of the hypothesis space. === In computational geometry === The VC dimension is one of the critical parameters in the size of ε-nets, which determines the complexity of approximation algorithms based on them; range sets without finite VC dimension may not have finite ε-nets at all. == Bounds == The VC dimension of the dual set-family of C {\displaystyle {\mathcal {C}}} is strictly less than 2 vc ⁡ ( C ) + 1 {\displaystyle 2^{\operatorname {vc} ({\mathcal {C}})+1}} , and this is best possible. The VC dimension of a finite set-family C {\displaystyle {\mathcal {C}}} is at most log 2 ⁡ | C | {\displaystyle \log _{2}|{\mathcal {C}}|} . This is because | C ∩ X | ≤ | X | {\displaystyle |{\mathcal {C}}\cap X|\leq |X|} by definition. Given a set-fa

    Read more →
  • Cloud Security Alliance

    Cloud Security Alliance

    Cloud Security Alliance (CSA) is a not-for-profit organization with the mission to "promote the use of best practices for providing security assurance within cloud computing, artificial intelligence and to provide education on the uses of cloud computing to help secure all other forms of computing." The CSA has over 80,000 individual members worldwide. The CSA gained significant reputability in 2011 when the American Presidential Administration selected the CSA Summit as the venue for announcing the federal government’s cloud computing strategy. == History == The CSA was formed in December 2008 as a coalition by individuals who saw the need to provide objective enterprise user guidance on the adoption and use of cloud computing. Its initial work product, Security Guidance for Critical Areas of Focus in Cloud Computing, was put together in a Wiki-style by dozens of volunteers. In 2014, the Chairman of the Board of the CSA was Dave Cullinane, VP of Global Security and Privacy for Catalina Marketing, St. Petersburg, Florida, and former CISO for eBay. Cullinane has said, "If you have an application exposed to the Internet that will allow people to make money, it will be probed." == Profile == In 2009, the Cloud Security Alliance incorporated in Nevada as a Corporation and achieved US Federal 501(c)6 non-profit status. It is registered as a Foreign Non-Profit Corporation in Washington. == Policy maker support == The CSA works to support a number of global policy makers in their focus on cloud security initiatives including the National Institute of Standards and Technology (NIST), European Commission, Singapore Government, and other data protection authorities. In March 2012, the CSA was selected to partner with three of Europe’s largest research centers (CERN, EMBL and ESA) to launch Helix Nebula – The Science Cloud. == Size == The Cloud Security Alliance employs roughly sixty full-time and contract staff worldwide. It has several thousand active volunteers participating in research, working groups and chapters at any time. == Membership == According to CSA, they are a member-driven organization, chartered with promoting the use of best practices for providing security assurance within Cloud Computing, and providing education on the uses of Cloud Computing to help secure all other forms of computing. === Individuals === Individuals who are interested in cloud computing and have experience to assist in making it more secure receive a complimentary individual membership based on a minimum level of participation. === Chapters === The Cloud Security Alliance has a network of chapters worldwide. Chapters are separate legal entities from the Cloud Security Alliance, but operate within guidelines set down by the Cloud Security Alliance In the United States, Chapters may elect to benefit from the non-profit tax shield that the Cloud Security Alliance has. Chapters are encouraged to hold local meetings and participate in areas of research. Chapter activities are coordinated by the Cloud Security Alliance worldwide. === International scope === There are separate legal entities in Europe and Asia Pacific, called Cloud Security Alliance (Europe), a Scottish company in the United Kingdom, and Cloud Security Alliance Asia Pacific Ltd, in Singapore. Each legal entity is responsible for overseeing all Cloud Security Alliance-related activities in their respective regions. These legal entities operate under an agreement with Cloud Security Alliance that give it oversight power and have separate Boards of Directors. Both are companies Limited By Guarantee. The Managing Directors of each are members of the Executive Team of Cloud Security Alliance. == Areas of research == The Cloud Security Alliance has 25+ active working groups. Key areas of research include cloud standards, certification, education and training, guidance and tools, global reach, and driving innovation. Security Guidance for Critical Areas of Focus in Cloud Computing. Foundational best practices for securing cloud computing. Top Threats to Cloud Computing. Helps organizations make educated risk management decisions regarding their cloud adoption strategies. GRC (Governance, Risk and Compliance) Stack. A toolkit for key stakeholders to instrument and assess clouds against industry established best practices, standards and critical compliance requirements. Cloud Controls Matrix (CCM). Security controls framework for cloud provider and cloud consumers. CloudTrust Protocol. The mechanism by which cloud service consumers ask for and receive information about the elements of transparency as applied to cloud service providers. Consensus Assessments Initiative Research. Tools and processes to perform consistent measurements of cloud providers. Software Defined Perimeter. A proposed security framework that can be deployed to protect application infrastructure from network-based attacks. It will incorporate standards from organizations such as OASIS and NIST and security concepts from organizations like the U.S. DoD into an integrated framework. == Working groups and initiatives == Mobile Working Group Big Data Working Group Security as a Service Working Group Trusted Cloud Initiative CloudAudit CloudCERT CloudSIRT Cloud Metrics Security, Trust and Assurance Registry (STAR) Cloud Data Governance Turbot (business) Blockchain/Distributed Ledger

    Read more →
  • Multispectral pattern recognition

    Multispectral pattern recognition

    Multispectral remote sensing is the collection and analysis of reflected, emitted, or back-scattered energy from an object or an area of interest in multiple bands of regions of the electromagnetic spectrum (Jensen, 2005). Subcategories of multispectral remote sensing include hyperspectral, in which hundreds of bands are collected and analyzed, and ultraspectral remote sensing where many hundreds of bands are used (Logicon, 1997). The main purpose of multispectral imaging is the potential to classify the image using multispectral classification. This is a much faster method of image analysis than is possible by human interpretation. == Multispectral remote sensing systems == Remote sensing systems gather data via instruments typically carried on satellites in orbit around the Earth. The remote sensing scanner detects the energy that radiates from the object or area of interest. This energy is recorded as an analog electrical signal and converted into a digital value though an A-to-D conversion. There are several multispectral remote sensing systems that can be categorized in the following way: === Multispectral imaging using discrete detectors and scanning mirrors === Landsat Multispectral Scanner (MSS) Landsat Thematic Mapper (TM) NOAA Geostationary Operational Environmental Satellite (GOES) NOAA Advanced Very High Resolution Radiometer (AVHRR) NASA and ORBIMAGE, Inc., Sea-viewing Wide field-of-view Sensor (SeaWiFS) Daedalus, Inc., Aircraft Multispectral Scanner (AMS) NASA Airborne Terrestrial Applications Sensor (ATLAS) === Multispectral imaging using linear arrays === SPOT 1, 2, and 3 High Resolution Visible (HRV) sensors and Spot 4 and 5 High Resolution Visible Infrared (HRVIR) and vegetation sensor Indian Remote Sensing System (IRS) Linear Imaging Self-scanning Sensor (LISS) Space Imaging, Inc. (IKONOS) Digital Globe, Inc. (QuickBird) ORBIMAGE, Inc. (OrbView-3) ImageSat International, Inc. (EROS A1) NASA Terra Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) NASA Terra Multiangle Imaging Spectroradiometer (MISR) === Imaging spectrometry using linear and area arrays === NASA Jet Propulsion Laboratory Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Compact Airborne Spectrographic Imager 3 (CASI 3) NASA Terra Moderate Resolution Imaging Spectrometer (MODIS) NASA Earth Observer (EO-1) Advanced Land Imager (ALI), Hyperion, and LEISA Atmospheric Corrector (LAC) === Satellite analog and digital photographic systems === Russian SPIN-2 TK-350, and KVR-1000 NASA Space Shuttle and International Space Station Imagery == Multispectral classification methods == A variety of methods can be used for the multispectral classification of images: Algorithms based on parametric and nonparametric statistics that use ratio-and interval-scaled data and nonmetric methods that can also incorporate nominal scale data (Duda et al., 2001), Supervised or unsupervised classification logic, Hard or soft (fuzzy) set classification logic to create hard or fuzzy thematic output products, Per-pixel or object-oriented classification logic, and Hybrid approaches == Supervised classification == In this classification method, the identity and location of some of the land-cover types are obtained beforehand from a combination of fieldwork, interpretation of aerial photography, map analysis, and personal experience. The analyst would locate sites that have similar characteristics to the known land-cover types. These areas are known as training sites because the known characteristics of these sites are used to train the classification algorithm for eventual land-cover mapping of the remainder of the image. Multivariate statistical parameters (means, standard deviations, covariance matrices, correlation matrices, etc.) are calculated for each training site. All pixels inside and outside of the training sites are evaluated and allocated to the class with the more similar characteristics. === Classification scheme === The first step in the supervised classification method is to identify the land-cover and land-use classes to be used. Land-cover refers to the type of material present on the site (e.g. water, crops, forest, wet land, asphalt, and concrete). Land-use refers to the modifications made by people to the land cover (e.g. agriculture, commerce, settlement). All classes should be selected and defined carefully to properly classify remotely sensed data into the correct land-use and/or land-cover information. To achieve this purpose, it is necessary to use a classification system that contains taxonomically correct definitions of classes. If a hard classification is desired, the following classes should be used: Mutually exclusive: there is not any taxonomic overlap of any classes (i.e., rain forest and evergreen forest are distinct classes). Exhaustive: all land-covers in the area have been included. Hierarchical: sub-level classes (e.g., single-family residential, multiple-family residential) are created, allowing that these classes can be included in a higher category (e.g., residential). Some examples of hard classification schemes are: American Planning Association Land-Based Classification System United States Geological Survey Land-use/Land-cover Classification System for Use with Remote Sensor Data U.S. Department of the Interior Fish and Wildlife Service U.S. National Vegetation and Classification System International Geosphere-Biosphere Program IGBP Land Cover Classification System === Training sites === Once the classification scheme is adopted, the image analyst may select training sites in the image that are representative of the land-cover or land-use of interest. If the environment where the data was collected is relatively homogeneous, the training data can be used. If different conditions are found in the site, it would not be possible to extend the remote sensing training data to the site. To solve this problem, a geographical stratification should be done during the preliminary stages of the project. All differences should be recorded (e.g. soil type, water turbidity, crop species, etc.). These differences should be recorded on the imagery and the selection training sites made based on the geographical stratification of this data. The final classification map would be a composite of the individual stratum classifications. After the data are organized in different training sites, a measurement vector is created. This vector would contain the brightness values for each pixel in each band in each training class. The mean, standard deviation, variance-covariance matrix, and correlation matrix are calculated from the measurement vectors. Once the statistics from each training site are determined, the most effective bands for each class should be selected. The objective of this discrimination is to eliminate the bands that can provide redundant information. Graphical and statistical methods can be used to achieve this objective. Some of the graphic methods are: Bar graph spectral plots Cospectral mean vector plots Feature space plots Cospectral parallelepiped or ellipse plots === Classification algorithm === The last step in supervised classification is selecting an appropriate algorithm. The choice of a specific algorithm depends on the input data and the desired output. Parametric algorithms are based on the fact that the data is normally distributed. If the data is not normally distributed, nonparametric algorithms should be used. The more common nonparametric algorithms are: One-dimensional density slicing Parallelipiped Minimum distance Nearest-neighbor Expert system analysis Convolutional neural network == Unsupervised classification == Unsupervised classification (also known as clustering) is a method of partitioning remote sensor image data in multispectral feature space and extracting land-cover information. Unsupervised classification require less input information from the analyst compared to supervised classification because clustering does not require training data. This process consists in a series of numerical operations to search for the spectral properties of pixels. From this process, a map with m spectral classes is obtained. Using the map, the analyst tries to assign or transform the spectral classes into thematic information of interest (i.e. forest, agriculture, urban). This process may not be easy because some spectral clusters represent mixed classes of surface materials and may not be useful. The analyst has to understand the spectral characteristics of the terrain to be able to label clusters as a specific information class. There are hundreds of clustering algorithms. Two of the most conceptually simple algorithms are the chain method and the ISODATA method. === Chain method === The algorithm used in this method operates in a two-pass mode (it passes through the multispectral dataset two times. In the first pass, the program reads through the dataset and sequentially builds clusters (groups of p

    Read more →
  • Iris flower data set

    Iris flower data set

    The Iris flower data set or Fisher's Iris data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus". The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish each species. Fisher's paper was published in the Annals of Eugenics (today the Annals of Human Genetics). == Use of the data set == Originally used as an example data set on which Fisher's linear discriminant analysis was applied, it became a typical test case for many statistical classification techniques in machine learning such as support vector machines. The use of this data set in cluster analysis however is not common, since the data set only contains two clusters with rather obvious separation. One of the clusters contains Iris setosa, while the other cluster contains both Iris virginica and Iris versicolor and is not separable without the species information Fisher used. This makes the data set a good example to explain the difference between supervised and unsupervised techniques in data mining: Fisher's linear discriminant model can only be obtained when the object species are known: class labels and clusters are not necessarily the same. Nevertheless, all three species of Iris are separable in the projection on the nonlinear and branching principal component. The data set is approximated by the closest tree with some penalty for the excessive number of nodes, bending and stretching. Then the so-called "metro map" is constructed. The data points are projected into the closest node. For each node the pie diagram of the projected points is prepared. The area of the pie is proportional to the number of the projected points. It is clear from the diagram (left) that the absolute majority of the samples of the different Iris species belong to the different nodes. Only a small fraction of Iris-virginica is mixed with Iris-versicolor (the mixed blue-green nodes in the diagram). Therefore, the three species of Iris (Iris setosa, Iris virginica and Iris versicolor) are separable by the unsupervising procedures of nonlinear principal component analysis. To discriminate them, it is sufficient just to select the corresponding nodes on the principal tree. == Data set == The data set contains a set of 150 records under five attributes: sepal length, sepal width, petal length, petal width and species. The iris data set is widely used as a beginner's data set for machine learning purposes. The data set is included in R base and Python in the machine learning library scikit-learn, so that users can access it without having to find a source for it. Several versions of the data set have been published. === R code illustrating usage === The example R code shown below reproduce the scatterplot displayed at the top of this article: === Python code illustrating usage === This code gives:

    Read more →
  • Multimodal learning

    Multimodal learning

    Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning. Multimodal learning was proposed in 2011 at the beginning of the deep learning period. Large multimodal models, such as Google Gemini and GPT-4o, have become increasingly popular since 2023, enabling increased versatility and a broader understanding of real-world phenomena. == Motivation == Data usually comes with different modalities which carry different information. For example, it is very common to caption an image to convey the information not presented in the image itself. Similarly, sometimes it is more straightforward to use an image to describe information which may not be obvious from text. As a result, if different words appear in similar images, then these words likely describe the same thing. Conversely, if a word is used to describe seemingly dissimilar images, then these images may represent the same object. Thus, in cases dealing with multi-modal data, it is important to use a model which is able to jointly represent the information such that the model can capture the combined information from different modalities. == Multimodal transformers == Models such as CLIP (Contrastive Language–Image Pretraining) learn joint representations of images and text by optimizing contrastive objectives, allowing the model to match images with their corresponding textual descriptions. == Multimodal deep Boltzmann machines == A Boltzmann machine is a type of stochastic neural network invented by Geoffrey Hinton and Terry Sejnowski in 1985. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They are named after the Boltzmann distribution in statistical mechanics. The units in Boltzmann machines are divided into two groups: visible units and hidden units. Each unit is like a neuron with a binary output that represents whether it is activated or not. General Boltzmann machines allow connection between any units. However, learning is impractical using general Boltzmann Machines because the computational time is exponential to the size of the machine. A more efficient architecture is called restricted Boltzmann machine where connection is only allowed between hidden unit and visible unit, which is described in the next section. Multimodal deep Boltzmann machines can process and learn from different types of information, such as images and text, simultaneously. This can notably be done by having a separate deep Boltzmann machine for each modality, for example one for images and one for text, joined at an additional top hidden layer. == Applications == Multimodal machine learning has numerous applications across various domains: Cross-modal retrieval: cross-modal retrieval allows users to search for data across different modalities (e.g., retrieving images based on text descriptions), improving multimedia search engines and content recommendation systems. Classification and missing data retrieval: multimodal Deep Boltzmann Machines outperform traditional models like support vector machines and latent Dirichlet allocation in classification tasks and can predict missing data in multimodal datasets, such as images and text. Healthcare diagnostics: multimodal models integrate medical imaging, genomic data, and patient records to improve diagnostic accuracy and early disease detection, especially in cancer screening. Content generation: models like DALL·E generate images from textual descriptions, benefiting creative industries, while cross-modal retrieval enables dynamic multimedia searches. Robotics and human-computer interaction: multimodal learning improves interaction in robotics and AI by integrating sensory inputs like speech, vision, and touch, aiding autonomous systems and human-computer interaction. Emotion recognition: combining visual, audio, and text data, multimodal systems enhance sentiment analysis and emotion recognition, applied in customer service, social media, and marketing.

    Read more →
  • Inception score

    Inception score

    The Inception Score (IS) is an algorithm used to assess the quality of images created by a generative image model such as a generative adversarial network (GAN). The score is calculated based on the output of a separate, pretrained Inception v3 image classification model applied to a sample of (typically around 30,000) images generated by the generative model. The Inception Score is maximized when the following conditions are true: The entropy of the distribution of labels predicted by the Inceptionv3 model for the generated images is minimized. In other words, the classification model confidently predicts a single label for each image. Intuitively, this corresponds to the desideratum of generated images being "sharp" or "distinct". The predictions of the classification model are evenly distributed across all possible labels. This corresponds to the desideratum that the output of the generative model is "diverse". It has been somewhat superseded by the related Fréchet inception distance. While the Inception Score only evaluates the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images ("ground truth"). == Definition == Let there be two spaces, the space of images Ω X {\displaystyle \Omega _{X}} and the space of labels Ω Y {\displaystyle \Omega _{Y}} . The space of labels is finite. Let p g e n {\displaystyle p_{gen}} be a probability distribution over Ω X {\displaystyle \Omega _{X}} that we wish to judge. Let a discriminator be a function of type p d i s : Ω X → M ( Ω Y ) {\displaystyle p_{dis}:\Omega _{X}\to M(\Omega _{Y})} where M ( Ω Y ) {\displaystyle M(\Omega _{Y})} is the set of all probability distributions on Ω Y {\displaystyle \Omega _{Y}} . For any image x {\displaystyle x} , and any label y {\displaystyle y} , let p d i s ( y | x ) {\displaystyle p_{dis}(y|x)} be the probability that image x {\displaystyle x} has label y {\displaystyle y} , according to the discriminator. It is usually implemented as an Inception-v3 network trained on ImageNet. The Inception Score of p g e n {\displaystyle p_{gen}} relative to p d i s {\displaystyle p_{dis}} is I S ( p g e n , p d i s ) := exp ⁡ ( E x ∼ p g e n [ D K L ( p d i s ( ⋅ | x ) ‖ ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x ) ] ) {\displaystyle IS(p_{gen},p_{dis}):=\exp \left(\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\int p_{dis}(\cdot |x)p_{gen}(x)dx\right)\right]\right)} Equivalent rewrites include ln ⁡ I S ( p g e n , p d i s ) := E x ∼ p g e n [ D K L ( p d i s ( ⋅ | x ) ‖ E x ∼ p g e n [ p d i s ( ⋅ | x ) ] ) ] {\displaystyle \ln IS(p_{gen},p_{dis}):=\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]\right)\right]} ln ⁡ I S ( p g e n , p d i s ) := H [ E x ∼ p g e n [ p d i s ( ⋅ | x ) ] ] − E x ∼ p g e n [ H [ p d i s ( ⋅ | x ) ] ] {\displaystyle \ln IS(p_{gen},p_{dis}):=H[\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]]-\mathbb {E} _{x\sim p_{gen}}[H[p_{dis}(\cdot |x)]]} ln ⁡ I S {\displaystyle \ln IS} is nonnegative by Jensen's inequality. Pseudocode:INPUT discriminator p d i s {\displaystyle p_{dis}} . INPUT generator g {\displaystyle g} . Sample images x i {\displaystyle x_{i}} from generator. Compute p d i s ( ⋅ | x i ) {\displaystyle p_{dis}(\cdot |x_{i})} , the probability distribution over labels conditional on image x i {\displaystyle x_{i}} . Sum up the results to obtain p ^ {\displaystyle {\hat {p}}} , an empirical estimate of ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x {\displaystyle \int p_{dis}(\cdot |x)p_{gen}(x)dx} . Sample more images x i {\displaystyle x_{i}} from generator, and for each, compute D K L ( p d i s ( ⋅ | x i ) ‖ p ^ ) {\displaystyle D_{KL}\left(p_{dis}(\cdot |x_{i})\|{\hat {p}}\right)} . Average the results, and take its exponential. RETURN the result. === Interpretation === A higher inception score is interpreted as "better", as it means that p g e n {\displaystyle p_{gen}} is a "sharp and distinct" collection of pictures. ln ⁡ I S ( p g e n , p d i s ) ∈ [ 0 , ln ⁡ N ] {\displaystyle \ln IS(p_{gen},p_{dis})\in [0,\ln N]} , where N {\displaystyle N} is the total number of possible labels. ln ⁡ I S ( p g e n , p d i s ) = 0 {\displaystyle \ln IS(p_{gen},p_{dis})=0} iff for almost all x ∼ p g e n {\displaystyle x\sim p_{gen}} p d i s ( ⋅ | x ) = ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x {\displaystyle p_{dis}(\cdot |x)=\int p_{dis}(\cdot |x)p_{gen}(x)dx} That means p g e n {\displaystyle p_{gen}} is completely "indistinct". That is, for any image x {\displaystyle x} sampled from p g e n {\displaystyle p_{gen}} , discriminator returns exactly the same label predictions p d i s ( ⋅ | x ) {\displaystyle p_{dis}(\cdot |x)} . The highest inception score N {\displaystyle N} is achieved if and only if the two conditions are both true: For almost all x ∼ p g e n {\displaystyle x\sim p_{gen}} , the distribution p d i s ( y | x ) {\displaystyle p_{dis}(y|x)} is concentrated on one label. That is, H y [ p d i s ( y | x ) ] = 0 {\displaystyle H_{y}[p_{dis}(y|x)]=0} . That is, every image sampled from p g e n {\displaystyle p_{gen}} is exactly classified by the discriminator. For every label y {\displaystyle y} , the proportion of generated images labelled as y {\displaystyle y} is exactly E x ∼ p g e n [ p d i s ( y | x ) ] = 1 N {\displaystyle \mathbb {E} _{x\sim p_{gen}}[p_{dis}(y|x)]={\frac {1}{N}}} . That is, the generated images are equally distributed over all labels.

    Read more →
  • Differential evolution

    Differential evolution

    Differential evolution (DE) is an evolutionary algorithm to optimize a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. Such methods are commonly known as metaheuristics as they make few or no assumptions about the optimized problem and can search very large spaces of candidate solutions. However, metaheuristics such as DE do not guarantee an optimal solution is ever found. DE is used for multidimensional real-valued functions but does not use the gradient of the problem being optimized, which means DE does not require the optimization problem to be differentiable, as is required by classic optimization methods such as gradient descent and quasi-newton methods. DE can therefore also be used on optimization problems that are not even continuous, are noisy, change over time, etc. DE optimizes a problem by maintaining a population of candidate solutions and creating new candidate solutions by combining existing ones according to its simple formulae, and then keeping whichever candidate solution has the best score or fitness on the optimization problem at hand. In this way, the optimization problem is treated as a black box that merely provides a measure of quality given a candidate solution and the gradient is therefore not needed. == History == Storn and Price introduced Differential Evolution in 1995. Books have been published on theoretical and practical aspects of using DE in parallel computing, multiobjective optimization, constrained optimization, and the books also contain surveys of application areas. Surveys on the multi-faceted research aspects of DE can be found in journal articles. == Algorithm == A basic variant of the DE algorithm works by having a population of candidate solutions (called agents). These agents are moved around in the search-space by using simple mathematical formulae to combine the positions of existing agents from the population. If the new position of an agent is an improvement then it is accepted and forms part of the population, otherwise the new position is simply discarded. The process is repeated and by doing so it is hoped, but not guaranteed, that a satisfactory solution will eventually be discovered. Formally, let f : R n → R {\displaystyle f:\mathbb {R} ^{n}\to \mathbb {R} } be the fitness function which must be minimized (note that maximization can be performed by considering the function h := − f {\displaystyle h:=-f} instead). The function takes a candidate solution as argument in the form of a vector of real numbers. It produces a real number as output which indicates the fitness of the given candidate solution. The gradient of f {\displaystyle f} is not known. The goal is to find a solution m {\displaystyle \mathbf {m} } for which f ( m ) ≤ f ( p ) {\displaystyle f(\mathbf {m} )\leq f(\mathbf {p} )} for all p {\displaystyle \mathbf {p} } in the search-space, which means that m {\displaystyle \mathbf {m} } is the global minimum. Let x ∈ R n {\displaystyle \mathbf {x} \in \mathbb {R} ^{n}} designate a candidate solution (agent) in the population. The basic DE algorithm can then be described as follows: Choose the parameters NP ≥ 4 {\displaystyle {\text{NP}}\geq 4} , CR ∈ [ 0 , 1 ] {\displaystyle {\text{CR}}\in [0,1]} , and F ∈ [ 0 , 2 ] {\displaystyle F\in [0,2]} . NP : NP {\displaystyle {\text{NP}}} is the population size, i.e. the number of candidate agents or "parents". CR : The parameter CR ∈ [ 0 , 1 ] {\displaystyle {\text{CR}}\in [0,1]} is called the crossover probability. F : The parameter F ∈ [ 0 , 2 ] {\displaystyle F\in [0,2]} is called the differential weight. Typical settings are N P = 10 n {\displaystyle NP=10n} , C R = 0.9 {\displaystyle CR=0.9} and F = 0.8 {\displaystyle F=0.8} . Optimization performance may be greatly impacted by these choices; see below. Initialize all agents x {\displaystyle \mathbf {x} } with random positions in the search-space. Until a termination criterion is met (e.g. number of iterations performed, or adequate fitness reached), repeat the following: For each agent x {\displaystyle \mathbf {x} } in the population do: Pick three agents a , b {\displaystyle \mathbf {a} ,\mathbf {b} } , and c {\displaystyle \mathbf {c} } from the population at random, they must be distinct from each other as well as from agent x {\displaystyle \mathbf {x} } . ( a {\displaystyle \mathbf {a} } is called the "base" vector.) Pick a random index R ∈ { 1 , … , n } {\displaystyle R\in \{1,\ldots ,n\}} where n {\displaystyle n} is the dimensionality of the problem being optimized. Compute the agent's potentially new position y = [ y 1 , … , y n ] {\displaystyle \mathbf {y} =[y_{1},\ldots ,y_{n}]} as follows: For each i ∈ { 1 , … , n } {\displaystyle i\in \{1,\ldots ,n\}} , pick a uniformly distributed random number r i ∼ U ( 0 , 1 ) {\displaystyle r_{i}\sim U(0,1)} If r i < C R {\displaystyle r_{i} Read more →

  • Proximal policy optimization

    Proximal policy optimization

    Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent. Specifically, it is a policy gradient method, often used for deep RL when the policy network is very large. == History == The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015. It addressed the instability issue of another algorithm, the Deep Q-Network (DQN), by using the trust region method to limit the KL divergence between the old and new policies. However, TRPO uses the Hessian matrix (a matrix of second derivatives) to enforce the trust region, but the Hessian is inefficient for large-scale problems. PPO was published in 2017. It was essentially an approximation of TRPO that does not require computing the Hessian. The KL divergence constraint was approximated by simply clipping the policy gradient. Since 2018, PPO was the default RL algorithm at OpenAI. PPO has been applied to many areas, such as controlling a robotic arm, beating professional players at Dota 2 (OpenAI Five), and playing Atari games. == TRPO == TRPO, the predecessor of PPO, is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces. The pseudocode is as follows: Input: initial policy parameters θ 0 {\textstyle \theta _{0}} , initial value function parameters ϕ 0 {\textstyle \phi _{0}} Hyperparameters: KL-divergence limit δ {\textstyle \delta } , backtracking coefficient α {\textstyle \alpha } , maximum number of backtracking steps K {\textstyle K} for k = 0 , 1 , 2 , … {\textstyle k=0,1,2,\ldots } do Collect set of trajectories D k = { τ i } {\textstyle {\mathcal {D}}_{k}=\left\{\tau _{i}\right\}} by running policy π k = π ( θ k ) {\textstyle \pi _{k}=\pi \left(\theta _{k}\right)} in the environment. Compute rewards-to-go R ^ t {\textstyle {\hat {R}}_{t}} . Compute advantage estimates, A ^ t {\textstyle {\hat {A}}_{t}} (using any method of advantage estimation) based on the current value function V ϕ k {\textstyle V_{\phi _{k}}} . Estimate policy gradient as g ^ k = 1 | D k | ∑ τ ∈ D k ∑ t = 0 T ∇ θ log ⁡ π θ ( a t ∣ s t ) | θ k A ^ t {\displaystyle {\hat {g}}_{k}=\left.{\frac {1}{\left|{\mathcal {D}}_{k}\right|}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\nabla _{\theta }\log \pi _{\theta }\left(a_{t}\mid s_{t}\right)\right|_{\theta _{k}}{\hat {A}}_{t}} Use the conjugate gradient algorithm to compute x ^ k ≈ H ^ k − 1 g ^ k {\displaystyle {\hat {x}}_{k}\approx {\hat {H}}_{k}^{-1}{\hat {g}}_{k}} where H ^ k {\textstyle {\hat {H}}_{k}} is the Hessian of the sample average KL-divergence. Update the policy by backtracking line search with θ k + 1 = θ k + α j 2 δ x ^ k T H ^ k x ^ k x ^ k {\displaystyle \theta _{k+1}=\theta _{k}+\alpha ^{j}{\sqrt {\frac {2\delta }{{\hat {x}}_{k}^{T}{\hat {H}}_{k}{\hat {x}}_{k}}}}{\hat {x}}_{k}} where j ∈ { 0 , 1 , 2 , … K } {\textstyle j\in \{0,1,2,\ldots K\}} is the smallest value which improves the sample loss and satisfies the sample KL-divergence constraint. Fit value function by regression on mean-squared error: ϕ k + 1 = arg ⁡ min ϕ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T ( V ϕ ( s t ) − R ^ t ) 2 {\displaystyle \phi _{k+1}=\arg \min _{\phi }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\left(V_{\phi }\left(s_{t}\right)-{\hat {R}}_{t}\right)^{2}} typically via some gradient descent algorithm. == PPO == The pseudocode is as follows: Input: initial policy parameters θ 0 {\textstyle \theta _{0}} , initial value function parameters ϕ 0 {\textstyle \phi _{0}} for k = 0 , 1 , 2 , … {\textstyle k=0,1,2,\ldots } do Collect set of trajectories D k = { τ i } {\textstyle {\mathcal {D}}_{k}=\left\{\tau _{i}\right\}} by running policy π k = π ( θ k ) {\textstyle \pi _{k}=\pi \left(\theta _{k}\right)} in the environment. Compute rewards-to-go R ^ t {\textstyle {\hat {R}}_{t}} . Compute advantage estimates, A ^ t {\textstyle {\hat {A}}_{t}} (using any method of advantage estimation) based on the current value function V ϕ k {\textstyle V_{\phi _{k}}} . Update the policy by maximizing the PPO-Clip objective: θ k + 1 = arg ⁡ max θ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T min ( π θ ( a t ∣ s t ) π θ k ( a t ∣ s t ) A π θ k ( s t , a t ) , g ( ϵ , A π θ k ( s t , a t ) ) ) {\displaystyle \theta _{k+1}=\arg \max _{\theta }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\min \left({\frac {\pi _{\theta }\left(a_{t}\mid s_{t}\right)}{\pi _{\theta _{k}}\left(a_{t}\mid s_{t}\right)}}A^{\pi _{\theta _{k}}}\left(s_{t},a_{t}\right),\quad g\left(\epsilon ,A^{\pi _{\theta _{k}}}\left(s_{t},a_{t}\right)\right)\right)} typically via stochastic gradient ascent with Adam. Fit value function by regression on mean-squared error: ϕ k + 1 = arg ⁡ min ϕ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T ( V ϕ ( s t ) − R ^ t ) 2 {\displaystyle \phi _{k+1}=\arg \min _{\phi }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\left(V_{\phi }\left(s_{t}\right)-{\hat {R}}_{t}\right)^{2}} typically via some gradient descent algorithm. Like all policy gradient methods, PPO is used for training an RL agent whose actions are determined by a differentiable policy function by gradient ascent. Intuitively, a policy gradient method takes small policy update steps, so the agent can reach higher and higher rewards in expectation. Policy gradient methods may be unstable: A step size that is too big may direct the policy in a suboptimal direction, thus having little possibility of recovery; a step size that is too small lowers the overall efficiency. To solve the instability, PPO implements a clip function that constrains the policy update of an agent from being too large, so that larger step sizes may be used without negatively affecting the gradient ascent process. === Basic concepts === To begin the PPO training process, the agent is set in an environment to perform actions based on its current input. In the early phase of training, the agent can freely explore solutions and keep track of the result. Later, with a certain amount of transition samples and policy updates, the agent will select an action to take by randomly sampling from the probability distribution P ( A | S ) {\displaystyle P(A|S)} generated by the policy network. The actions that are most likely to be beneficial will have the highest probability of being selected from the random sample. After an agent arrives at a different scenario (a new state) by acting, it is rewarded with a positive reward or a negative reward. The objective of an agent is to maximize the cumulative reward signal across sequences of states, known as episodes. === Policy gradient laws: the advantage function === The advantage function (denoted as A {\displaystyle A} ) is central to PPO, as it tries to answer the question of whether a specific action of the agent is better or worse than some other possible action in a given state. By definition, the advantage function is an estimate of the relative value for a selected action. If the output of this function is positive, it means that the action in question is better than the average return, so the possibilities of selecting that specific action will increase. The opposite is true for a negative advantage output. The advantage function can be defined as A = Q − V {\displaystyle A=Q-V} , where Q {\displaystyle Q} is the discounted sum of rewards (the total weighted reward for the completion of an episode) and V {\displaystyle V} is the baseline estimate. Since the advantage function is calculated after the completion of an episode, the program records the outcome of the episode. Therefore, calculating advantage is essentially an unsupervised learning problem. The baseline estimate comes from the value function that outputs the expected discounted sum of an episode starting from the current state. In the PPO algorithm, the baseline estimate will be noisy (with some variance), as it also uses a neural network, like the policy function itself. With Q {\displaystyle Q} and V {\displaystyle V} computed, the advantage function is calculated by subtracting the baseline estimate from the actual discounted return. If A > 0 {\displaystyle A>0} , the actual return of the action is better than the expected return from experience; if A < 0 {\displaystyle A<0} , the actual return is worse. === Ratio function === In PPO, the ratio function ( r t {\displaystyle r_{t}} ) calculates the probability of selecting action a {\displaystyle a} in state s {\displaystyle s} given the current policy network, divided by the previous probability under the old policy. In other words: If r t ( θ ) > 1 {\displaystyle r_{t}(\theta )>1} , where θ {\displaystyle \theta } are the policy network parameters, then selecting action a {\displaystyle a} in state s {\displaystyle s} is more likely based on the current policy than the previous policy. If 0 ≤ r t ( θ ) < 1 {\displaystyle 0\leq r_{t}(\theta )<1} , then selecting actio

    Read more →
  • Sigmoid function

    Sigmoid function

    A sigmoid function is any mathematical function whose graph has a characteristic S-shaped or sigmoid curve. A common example of a sigmoid function is the logistic function. Other sigmoid functions are given in the Examples section. In some fields, most notably in the context of artificial neural networks, the term "sigmoid function" is used as a synonym for "logistic function". Special cases of sigmoid functions include the Gompertz curve (used in modeling systems that saturate at large values of x) and the ogee curve (used in the spillway of some dams). Sigmoid functions have domain of all real numbers, with return (response) value commonly monotonically increasing but could be decreasing. Sigmoid functions most often show a return value (y axis) in the range 0 to 1. Another commonly used range is from −1 to 1. There is also the Heaviside step function, which instantaneously transitions between 0 and 1. A wide variety of sigmoid functions including the logistic and hyperbolic tangent functions have been used as the activation function of artificial neurons. Sigmoid curves are also common in statistics as cumulative distribution functions (which go from 0 to 1), such as the integrals of the logistic density, the normal density, and Student's t probability density functions. The logistic sigmoid function is invertible, and its inverse is the logit function. == Theory == In mathematics, a unitary sigmoid function is a bounded sigmoid-type function normalized to the unit range, typically with lower and upper asymptotes at 0 and 1. The theory proposed by Grebenc distinguishes three kinds of unitary sigmoid functions according to their asymptotic behavior and the presence or absence of oscillation near the asymptotes. A general form of a unitary sigmoid function is y = A S ( f ( x ) ) + B , {\displaystyle y=A\,S(f(x))+B,} where S {\displaystyle S} is an increasing sigmoid function, f ( x ) {\displaystyle f(x)} is a transformation of the independent variable, and A {\displaystyle A} and B {\displaystyle B} are constants controlling scaling and translation. === Classification === ==== 1st kind ==== A unitary sigmoid function of the first kind is a bounded increasing function that approaches its lower and upper asymptotes monotonically, without oscillation. This class includes many of the standard sigmoid functions used in statistics, biomathematics, and engineering, such as the logistic function and related generalizations. ==== 2nd kind ==== A unitary sigmoid function of the second kind is a bounded increasing function that oscillates near the upper asymptote while preserving an overall sigmoid transition. ==== 3rd kind ==== A unitary sigmoid function of the third kind is a bounded increasing function that oscillates near both the lower and upper asymptotes. These functions retain the global shape of a sigmoid curve but exhibit oscillatory behavior in the vicinity of both limiting states. === Taxonomy === The tables below show the taxonomy of unitary sigmoid functions of all three kinds. Table 1. Taxonomy matrix with examples of sigmoid functions of the 1st kind Table 2. Taxonomy matrix with examples of sigmoid functions of the 2nd kind on the unbounded interval Table 3. Taxonomy matrix with examples of sigmoid functions of the 3rd kind === Construction methods === The same theory presents a list of 30 methods for constructing sigmoid functions.. These include algebraic transformations, integration and convolution methods, constructions from bell-shaped functions, solutions of ordinary and partial differential equations, recursive schemes, stochastic differential equations, feedback systems, and chaotic systems. M0: Construction method for sigmoid functions not evident or intuitive M1: Inverse of singularity functions M2: Sigmoid functions of embedded positive functions M3: Rising a sigmoid function to the power M4: Exponentiating a sigmoid function M5: Symmetric sigmoid functions derived from asymmetric ones M6: Sigmoid functions of the reciprocal independent variable M7: Embedding a sigmoid function into other function M8: Sum of sigmoid functions M9: Multiplication of sigmoid functions M10: Integral of the product of an increasing and a decreasing function M11: Derivation from lambda (bell-shaped) functions M12: Integration of lambda (bell-shaped) function M13: Integration of the sum of lambda (bell-shaped) functions M14: Integration of the product of two lambda (bell-shaped) functions M15: Integration of the difference of two shifted sigmoid functions M16: Integration of the product of two shifted sigmoid functions M17: Convolution of sigmoid functions M18: Integration of the product of lambda and sigmoid function M19: Solutions of ordinary differential equations M20: Solutions of partial differential equation (PDE) M21: Solutions of functional differential equation (FDE) M22: Sum of a sigmoid function and some derivatives M23: Combination of sigmoid functions, its derivative and integral M24: Filtering sigmoid functions M25: Special cases of Gauss hypergeometric functions M26: Feedback closed-loop systems M27: Recursive functions M28: Recursive time-delayed feed-forward loops M29: Solutions of stochastic differential equation M30: Chaotic sigmoid functions Consult reference for more details. == Definition == A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a positive derivative at each point. == Properties == In general, a sigmoid function is monotonic, and has a first derivative which is bell shaped. Conversely, the integral of any continuous, non-negative, bell-shaped function (with one local maximum and no local minimum, unless degenerate) will be sigmoidal. Thus the cumulative distribution functions for many common probability distributions are sigmoidal. One such example is the error function, which is related to the cumulative distribution function of a normal distribution; another is the arctan function, which is related to the cumulative distribution function of a Cauchy distribution. A sigmoid function is constrained by a pair of horizontal asymptotes as x → ± ∞ {\displaystyle x\rightarrow \pm \infty } . A sigmoid function is convex for values less than a particular point, and it is concave for values greater than that point: in many of the examples here, that point is 0. == Examples == Logistic function f ( x ) = 1 1 + e − x {\displaystyle f(x)={\frac {1}{1+e^{-x}}}} Hyperbolic tangent (shifted and scaled version of the logistic function, above) f ( x ) = tanh ⁡ x = e x − e − x e x + e − x {\displaystyle f(x)=\tanh x={\frac {e^{x}-e^{-x}}{e^{x}+e^{-x}}}} Arctangent function f ( x ) = arctan ⁡ x {\displaystyle f(x)=\arctan x} Gudermannian function f ( x ) = gd ⁡ ( x ) = ∫ 0 x d t cosh ⁡ t = 2 arctan ⁡ ( tanh ⁡ ( x 2 ) ) {\displaystyle f(x)=\operatorname {gd} (x)=\int _{0}^{x}{\frac {dt}{\cosh t}}=2\arctan \left(\tanh \left({\frac {x}{2}}\right)\right)} Error function f ( x ) = erf ⁡ ( x ) = 2 π ∫ 0 x e − t 2 d t {\displaystyle f(x)=\operatorname {erf} (x)={\frac {2}{\sqrt {\pi }}}\int _{0}^{x}e^{-t^{2}}\,dt} Generalised logistic function f ( x ) = ( 1 + e − x ) − α , α > 0 {\displaystyle f(x)=\left(1+e^{-x}\right)^{-\alpha },\quad \alpha >0} Smoothstep function f ( x ) = { ( ∫ 0 1 ( 1 − u 2 ) N d u ) − 1 ∫ 0 x ( 1 − u 2 ) N d u , | x | ≤ 1 sgn ⁡ ( x ) | x | ≥ 1 N ∈ Z ≥ 1 {\displaystyle f(x)={\begin{cases}{\displaystyle \left(\int _{0}^{1}\left(1-u^{2}\right)^{N}du\right)^{-1}\int _{0}^{x}\left(1-u^{2}\right)^{N}\ du},&|x|\leq 1\\\\\operatorname {sgn}(x)&|x|\geq 1\\\end{cases}}\quad N\in \mathbb {Z} \geq 1} Some algebraic functions, for example f ( x ) = x 1 + x 2 {\displaystyle f(x)={\frac {x}{\sqrt {1+x^{2}}}}} and in a more general form f ( x ) = x ( 1 + | x | k ) 1 / k {\displaystyle f(x)={\frac {x}{\left(1+|x|^{k}\right)^{1/k}}}} Up to shifts and scaling, many sigmoids are special cases of f ( x ) = φ ( φ ( x , β ) , α ) , {\displaystyle f(x)=\varphi (\varphi (x,\beta ),\alpha ),} where φ ( x , λ ) = { ( 1 − λ x ) 1 / λ λ ≠ 0 e − x λ = 0 {\displaystyle \varphi (x,\lambda )={\begin{cases}(1-\lambda x)^{1/\lambda }&\lambda \neq 0\\e^{-x}&\lambda =0\\\end{cases}}} is the inverse of the negative Box–Cox transformation, and α < 1 {\displaystyle \alpha <1} and β < 1 {\displaystyle \beta <1} are shape parameters. Smooth transition function normalized to (−1,1): f ( x ) = { 2 1 + e − 2 m x 1 − x 2 − 1 , | x | < 1 sgn ⁡ ( x ) | x | ≥ 1 = { tanh ⁡ ( m x 1 − x 2 ) , | x | < 1 sgn ⁡ ( x ) | x | ≥ 1 {\displaystyle {\begin{aligned}f(x)&={\begin{cases}{\displaystyle {\frac {2}{1+e^{-2m{\frac {x}{1-x^{2}}}}}}-1},&|x|<1\\\\\operatorname {sgn}(x)&|x|\geq 1\\\end{cases}}\\&={\begin{cases}{\displaystyle \tanh \left(m{\frac {x}{1-x^{2}}}\right)},&|x|<1\\\\\operatorname {sgn}(x)&|x|\geq 1\\\end{cases}}\end{aligned}}} using the hyperbolic tangent mentioned above. Here, m {\displaystyle m} is a free parameter encoding the slope at x = 0 {\displaystyle x=0} , which must be great

    Read more →
  • MLOps

    MLOps

    MLOps or ML Ops is a paradigm that aims to deploy and maintain machine learning models in production reliably and efficiently. It bridges the gap between machine learning development and production operations, ensuring that models are robust, scalable, and aligned with business goals. The word is a compound of "machine learning" and the continuous delivery practice (CI/CD) of DevOps in the software field. Machine learning models are tested and developed in isolated experimental systems. When an algorithm is ready to be launched, MLOps is practiced between data scientists, DevOps, and machine learning engineers to transition the algorithm to production systems. Similar to DevOps or DataOps approaches, MLOps seeks to increase automation and improve the quality of production models, while also focusing on business and regulatory requirements. While MLOps started as a set of best practices, it is slowly evolving into an independent approach to ML lifecycle management. MLOps applies to the entire lifecycle - from integrating with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics. == Definition == MLOps is a paradigm, including aspects like best practices, sets of concepts, as well as a development culture when it comes to the end-to-end conceptualization, implementation, monitoring, deployment, and scalability of machine learning products. Most of all, it is an engineering practice that leverages three contributing disciplines: machine learning, software engineering (especially DevOps), and data engineering. MLOps is aimed at productionizing machine learning systems by bridging the gap between development (Dev) and operations (Ops). Essentially, MLOps aims to facilitate the creation of machine learning products by leveraging these principles: CI/CD automation, workflow orchestration, reproducibility; versioning of data, model, and code; collaboration; continuous ML training and evaluation; ML metadata tracking and logging; continuous monitoring; and feedback loops. == History == Interest in operationalizing machine learning systems began to grow in the mid-2010s as ML projects started moving from experimentation to production use. The challenges associated with sustaining such systems were highlighted in a 2015 paper. The predicted growth in machine learning included an estimated doubling of ML pilots and implementations from 2017 to 2018, and again from 2018 to 2020. Reports show a majority (up to 88%) of corporate machine learning initiatives are struggling to move beyond test stages. However, those organizations that actually put machine learning into production saw a 3–15% profit margin increases. The MLOps market size was USD 2,191.8 Million in 2024, and is projected to be USD 16,613.4 Million in 2030. == Architecture == Machine Learning systems can be categorized in eight different categories: data collection, data processing, feature engineering, data labeling, model design, model training and optimization, endpoint deployment, and endpoint monitoring. Each step in the machine learning lifecycle is built in its own system, but requires interconnection. These are the minimum systems that enterprises need to scale machine learning within their organization. == Goals == There are a number of goals enterprises want to achieve through MLOps systems successfully implementing ML across the enterprise, including: Deployment and automation Reproducibility of models and predictions Diagnostics Governance and regulatory compliance Scalability Collaboration Business uses Monitoring and management A standard practice, such as MLOps, takes into account each of the aforementioned areas, which can help enterprises optimize workflows and avoid issues during implementation. Vendors such as Adaptive ML deliver commercial reinforcement learning operations (RLOps) and MLOps-infrastructure, targeting organizations deploying large language models in production. A common architecture of an MLOps system would include data science platforms where models are constructed and the analytical engines where computations are performed, with the MLOps tool orchestrating the movement of machine learning models, data and outcomes between the systems.

    Read more →
  • Generalized blockmodeling of binary networks

    Generalized blockmodeling of binary networks

    Generalized blockmodeling of binary networks (also relational blockmodeling) is an approach of generalized blockmodeling, analysing the binary network(s). As most network analyses deal with binary networks, this approach is also considered as the fundamental approach of blockmodeling. This is especially noted, as the set of ideal blocks, when used for interpretation of blockmodels, have binary link patterns, which precludes them to be compared with valued empirical blocks. When analysing the binary networks, the criterion function is measuring block inconsistencies, while also reporting the possible errors. The ideal block in binary blockmodeling has only three types of conditions: "a certain cell must be (at least) 1, a certain cell must be 0 and the f {\displaystyle f} over each row (or column) must be at least 1". It is also used as a basis for developing the generalized blockmodeling of valued networks.

    Read more →
  • Semidefinite embedding

    Semidefinite embedding

    Maximum Variance Unfolding (MVU), also known as Semidefinite Embedding (SDE), is an algorithm in computer science that uses semidefinite programming to perform non-linear dimensionality reduction of high-dimensional vectorial input data. It is motivated by the observation that kernel Principal Component Analysis (kPCA) does not reduce the data dimensionality, as it leverages the Kernel trick to non-linearly map the original data into an inner-product space. == Algorithm == MVU creates a mapping from the high dimensional input vectors to some low dimensional Euclidean vector space in the following steps: A neighbourhood graph is created. Each input is connected with its k-nearest input vectors (according to Euclidean distance metric) and all k-nearest neighbors are connected with each other. If the data is sampled well enough, the resulting graph is a discrete approximation of the underlying manifold. The neighbourhood graph is "unfolded" with the help of semidefinite programming. Instead of learning the output vectors directly, the semidefinite programming aims to find an inner product matrix that maximizes the pairwise distances between any two inputs that are not connected in the neighbourhood graph while preserving the nearest neighbors distances. The low-dimensional embedding is finally obtained by application of multidimensional scaling on the learned inner product matrix. The steps of applying semidefinite programming followed by a linear dimensionality reduction step to recover a low-dimensional embedding into a Euclidean space were first proposed by Linial, London, and Rabinovich. == Optimization formulation == Let X {\displaystyle X\,\!} be the original input and Y {\displaystyle Y\,\!} be the embedding. If i , j {\displaystyle i,j\,\!} are two neighbors, then the local isometry constraint that needs to be satisfied is: | X i − X j | 2 = | Y i − Y j | 2 {\displaystyle |X_{i}-X_{j}|^{2}=|Y_{i}-Y_{j}|^{2}\,\!} Let G , K {\displaystyle G,K\,\!} be the Gram matrices of X {\displaystyle X\,\!} and Y {\displaystyle Y\,\!} (i.e.: G i j = X i ⋅ X j , K i j = Y i ⋅ Y j {\displaystyle G_{ij}=X_{i}\cdot X_{j},K_{ij}=Y_{i}\cdot Y_{j}\,\!} ). We can express the above constraint for every neighbor points i , j {\displaystyle i,j\,\!} in term of G , K {\displaystyle G,K\,\!} : G i i + G j j − G i j − G j i = K i i + K j j − K i j − K j i {\displaystyle G_{ii}+G_{jj}-G_{ij}-G_{ji}=K_{ii}+K_{jj}-K_{ij}-K_{ji}\,\!} In addition, we also want to constrain the embedding Y {\displaystyle Y\,\!} to center at the origin: 0 = | ∑ i Y i | 2 ⇔ ( ∑ i Y i ) ⋅ ( ∑ i Y i ) ⇔ ∑ i , j Y i ⋅ Y j ⇔ ∑ i , j K i j {\displaystyle 0=|\sum _{i}Y_{i}|^{2}\Leftrightarrow (\sum _{i}Y_{i})\cdot (\sum _{i}Y_{i})\Leftrightarrow \sum _{i,j}Y_{i}\cdot Y_{j}\Leftrightarrow \sum _{i,j}K_{ij}} As described above, except the distances of neighbor points are preserved, the algorithm aims to maximize the pairwise distance of every pair of points. The objective function to be maximized is: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 {\displaystyle T(Y)={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}} Intuitively, maximizing the function above is equivalent to pulling the points as far away from each other as possible and therefore "unfold" the manifold. The local isometry constraint Let τ = m a x { η i j | Y i − Y j | 2 } {\displaystyle \tau =max\{\eta _{ij}|Y_{i}-Y_{j}|^{2}\}\,\!} where η i j := { 1 if i is a neighbour of j 0 otherwise . {\displaystyle \eta _{ij}:={\begin{cases}1&{\mbox{if}}\ i{\mbox{ is a neighbour of }}j\\0&{\mbox{otherwise}}.\end{cases}}} prevents the objective function from diverging (going to infinity). Since the graph has N points, the distance between any two points | Y i − Y j | 2 ≤ N τ {\displaystyle |Y_{i}-Y_{j}|^{2}\leq N\tau \,\!} . We can then bound the objective function as follows: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 ≤ 1 2 N ∑ i , j ( N τ ) 2 = N 3 τ 2 2 {\displaystyle T(Y)={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}\leq {\dfrac {1}{2N}}\sum _{i,j}(N\tau )^{2}={\dfrac {N^{3}\tau ^{2}}{2}}\,\!} The objective function can be rewritten purely in the form of the Gram matrix: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 = 1 2 N ∑ i , j ( Y i 2 + Y j 2 − Y i ⋅ Y j − Y j ⋅ Y i ) = 1 2 N ( ∑ i , j Y i 2 + ∑ i , j Y j 2 − ∑ i , j Y i ⋅ Y j − ∑ i , j Y j ⋅ Y i ) = 1 2 N ( ∑ i , j Y i 2 + ∑ i , j Y j 2 − 0 − 0 ) = 1 N ( ∑ i Y i 2 ) = 1 N ( T r ( K ) ) {\displaystyle {\begin{aligned}T(Y)&{}={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}\\&{}={\dfrac {1}{2N}}\sum _{i,j}(Y_{i}^{2}+Y_{j}^{2}-Y_{i}\cdot Y_{j}-Y_{j}\cdot Y_{i})\\&{}={\dfrac {1}{2N}}(\sum _{i,j}Y_{i}^{2}+\sum _{i,j}Y_{j}^{2}-\sum _{i,j}Y_{i}\cdot Y_{j}-\sum _{i,j}Y_{j}\cdot Y_{i})\\&{}={\dfrac {1}{2N}}(\sum _{i,j}Y_{i}^{2}+\sum _{i,j}Y_{j}^{2}-0-0)\\&{}={\dfrac {1}{N}}(\sum _{i}Y_{i}^{2})={\dfrac {1}{N}}(Tr(K))\\\end{aligned}}\,\!} Finally, the optimization can be formulated as: Maximize T r ( K ) subject to K ⪰ 0 , ∑ i j K i j = 0 and G i i + G j j − G i j − G j i = K i i + K j j − K i j − K j i , ∀ i , j where η i j = 1 , {\displaystyle {\begin{aligned}&{\text{Maximize}}&&Tr(\mathbf {K} )\\&{\text{subject to}}&&\mathbf {K} \succeq 0,\sum _{ij}\mathbf {K} _{ij}=0\\&{\text{and}}&&G_{ii}+G_{jj}-G_{ij}-G_{ji}=K_{ii}+K_{jj}-K_{ij}-K_{ji},\forall i,j{\mbox{ where }}\eta _{ij}=1,\end{aligned}}} After the Gram matrix K {\displaystyle K\,\!} is learned by semidefinite programming, the output Y {\displaystyle Y\,\!} can be obtained via Cholesky decomposition. In particular, the Gram matrix can be written as K i j = ∑ α = 1 N ( λ α V α i V α j ) {\displaystyle K_{ij}=\sum _{\alpha =1}^{N}(\lambda _{\alpha }V_{\alpha i}V_{\alpha j})\,\!} where V α i {\displaystyle V_{\alpha i}\,\!} is the i-th element of eigenvector V α {\displaystyle V_{\alpha }\,\!} of the eigenvalue λ α {\displaystyle \lambda _{\alpha }\,\!} . It follows that the α {\displaystyle \alpha \,\!} -th element of the output Y i {\displaystyle Y_{i}\,\!} is λ α V α i {\displaystyle {\sqrt {\lambda _{\alpha }}}V_{\alpha i}\,\!} .

    Read more →