AI Avatar From Photo


  • Spreading activation

    Spreading activation is a method for searching associative networks, biological and artificial neural networks, or semantic networks. The search process is initiated by labeling a set of source nodes (e.g. concepts in a semantic network) with weights or "activation" and then iteratively propagating or "spreading" that activation out to other nodes linked to the source nodes. Most often these "weights" are real values that decay as activation propagates through the network. When the weights are discrete, this process is often referred to as marker passing. Activation may originate from alternate paths, identified by distinct markers, and terminate when two alternate paths reach the same node. However, brain studies show that several different brain areas play an important role in semantic processing.

    Spreading activation in semantic networks was invented in cognitive psychology as a model of the fan-out effect. Spreading activation can also be applied in information retrieval, by means of a network of nodes representing documents and the terms contained in those documents.

    == Cognitive psychology ==

    As it relates to cognitive psychology, spreading activation is the theory of how the brain iterates through a network of associated ideas to retrieve specific information. The spreading activation theory presents the array of concepts within our memory as cognitive units, each consisting of a node and its associated elements or characteristics, all connected together by edges. A spreading activation network can be represented schematically as a kind of web diagram, in which shorter lines between two nodes mean that the ideas are more closely related and will typically be associated more quickly with the original concept. In memory psychology, the spreading activation model holds that people organize their knowledge of the world based on their personal experiences, which in turn form the network of ideas that is the person's knowledge of the world.
    When a word (the target) is preceded by an associated word (the prime) in word recognition tasks, participants respond more quickly. For instance, subjects respond faster to the word "doctor" when it is preceded by "nurse" than when it is preceded by an unrelated word like "carrot". This semantic priming effect for words that are close in meaning within the cognitive network has been seen in a wide range of experimental tasks, ranging from sentence verification to lexical decision and naming. As another example, if the original concept is "red" and the concept "vehicles" is primed, subjects are much more likely to say "fire engine" than something unrelated to vehicles, such as "cherries". If instead "fruits" was primed, they would likely name "cherries" and continue on from there. The activation of pathways in the network depends on how closely two concepts are linked by meaning, as well as on how a subject is primed.

    == Algorithm ==

    A directed graph is populated by Nodes[1...N], each having an associated activation value A[i], which is a real number in the range [0.0 ... 1.0]. A Link[i, j] connects source node[i] with target node[j]. Each edge has an associated weight W[i, j], usually a real number in the range [0.0 ... 1.0].

    Parameters:

      • Firing threshold F, a real number in the range [0.0 ... 1.0]
      • Decay factor D, a real number in the range [0.0 ... 1.0]

    Steps:

      1. Initialize the graph, setting all activation values A[i] to zero. Set one or more origin nodes to an initial activation value greater than the firing threshold F. A typical initial value is 1.0.
      2. For each unfired node[i] in the graph having an activation value A[i] greater than the node firing threshold F:
      3. For each Link[i, j] connecting the source node[i] with target node[j], adjust A[j] = A[j] + (A[i] * W[i, j] * D), where D is the decay factor.
      4. If a target node receives an adjustment to its activation value so that it would exceed 1.0, then set its new activation value to 1.0. Likewise, maintain 0.0 as a lower bound on the target node's activation value should it receive an adjustment to below 0.0.
      5. Once a node has fired it may not fire again, although variations of the basic algorithm permit repeated firings and loops through the graph.
      6. Nodes receiving a new activation value that exceeds the firing threshold F are marked for firing on the next spreading activation cycle.
      7. If activation originates from more than one node, a variation of the algorithm permits marker passing to distinguish the paths by which activation is spread over the graph.
      8. The procedure terminates when either there are no more nodes to fire or, in the case of marker passing from multiple origins, when a node is reached from more than one path. Variations of the algorithm that permit repeated node firings and activation loops in the graph terminate after a steady activation state, with respect to some delta, is reached, or when a maximum number of iterations is exceeded.

    == Examples ==
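As a small worked example, the basic algorithm above (each node fires at most once, no marker passing) can be sketched in Python. The function name `spread_activation`, the dictionary-based graph layout, and the link weights are assumptions made for illustration; only the update rule A[j] = A[j] + A[i] * W[i, j] * D and the clamping to [0.0, 1.0] come from the steps listed above.

```python
from collections import defaultdict


def spread_activation(links, origins, threshold=0.5, decay=0.8, initial=1.0):
    """Basic spreading activation; every node fires at most once.

    links   -- dict mapping (source, target) -> weight W[i, j] (hypothetical layout)
    origins -- iterable of origin node names
    """
    activation = defaultdict(float)      # A[i], zero for every node at the start
    outgoing = defaultdict(list)         # adjacency list built from the links
    for (src, dst), weight in links.items():
        outgoing[src].append((dst, weight))

    for node in origins:
        activation[node] = initial

    fired = set()
    to_fire = [n for n in dict.fromkeys(origins) if activation[n] > threshold]
    while to_fire:
        fired.update(to_fire)
        next_round = []
        for src in to_fire:
            for dst, weight in outgoing[src]:
                new_value = activation[dst] + activation[src] * weight * decay
                activation[dst] = min(1.0, max(0.0, new_value))  # clamp to [0, 1]
                # Mark unfired nodes above the threshold for the next cycle.
                if (dst not in fired and dst not in next_round
                        and activation[dst] > threshold):
                    next_round.append(dst)
        to_fire = next_round
    return dict(activation)


# Illustrative network loosely based on the "red" / "vehicles" priming example.
example_links = {
    ("red", "fire engine"): 0.9,
    ("red", "cherries"): 0.9,
    ("vehicles", "fire engine"): 0.9,
    ("fire engine", "siren"): 0.5,
}
activation = spread_activation(example_links, ["red", "vehicles"])
for node, value in sorted(activation.items(), key=lambda kv: -kv[1]):
    print(f"{node:12s} {value:.2f}")
```

With both "red" and "vehicles" as origins, "fire engine" receives activation along two links and ends up more strongly activated than "cherries", which mirrors the priming example in the cognitive psychology section.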

  • Types of artificial neural networks

    Artificial neural networks (NNs) comprise a family of techniques. The simplest types have static components, including the number of units, number of layers, unit weights and topology. Dynamic NNs evolve via learning. Some types allow or require learning to be "supervised" by the operator, while others operate independently. Some types operate purely in hardware, while others are purely software and run on general-purpose computers.

    The main types are:

      • Transformers: these use attention to analyze every token in the input stream against every other token in the stream. That technique has enabled neural networks to reach the general public via chatbots, code generators and many other forms.
      • Convolutional neural networks (CNN): a feedforward neural network (FNN) that uses kernels and regularization to evade problems in prior generations of NNs. They are typically used to analyze visual and other two-dimensional data.
      • Generative adversarial networks: these set networks (of varying structure) against each other, each trying to push the other(s) to produce better results, such as winning a game or deceiving the opponent about the authenticity of an input.

    == Feedforward ==

    In feedforward neural networks the information moves from the input to output directly in every layer. There can be hidden layers with or without cycles/loops to sequence inputs. Feedforward networks can be constructed with various types of units, such as binary McCulloch–Pitts neurons, the simplest of which is the perceptron. Continuous neurons, frequently with sigmoidal activation, are used in the context of backpropagation.

    == Group method of data handling ==

    The Group Method of Data Handling (GMDH) features fully automatic structural and parametric model optimization. The node activation functions are Kolmogorov–Gabor polynomials that permit additions and multiplications. It uses a deep multilayer perceptron with eight layers. It is a supervised learning network that grows layer by layer, where each layer is trained by regression analysis.
    Useless items are detected using a validation set and pruned through regularization. The size and depth of the resulting network depend on the task.

    == Autoencoder ==

    An autoencoder, autoassociator or Diabolo network is similar to the multilayer perceptron (MLP), with an input layer, an output layer and one or more hidden layers connecting them. However, the output layer has the same number of units as the input layer. Its purpose is to reconstruct its own inputs (instead of emitting a target value). Therefore, autoencoders are unsupervised learning models. An autoencoder is used for unsupervised learning of efficient codings, typically for the purpose of dimensionality reduction and for learning generative models of data.

    == Probabilistic ==

    A probabilistic neural network (PNN) is a four-layer feedforward neural network. The layers are input, hidden pattern, hidden summation, and output. In the PNN algorithm, the parent probability distribution function (PDF) of each class is approximated by a Parzen window and a non-parametric function. Then, using the PDF of each class, the class probability of a new input is estimated, and Bayes' rule is employed to allocate it to the class with the highest posterior probability. It was derived from the Bayesian network and a statistical algorithm called kernel Fisher discriminant analysis. It is used for classification and pattern recognition.

    == Time delay ==

    A time delay neural network (TDNN) is a feedforward architecture for sequential data that recognizes features independent of sequence position. In order to achieve time-shift invariance, delays are added to the input so that multiple data points (points in time) are analyzed together. It usually forms part of a larger pattern recognition system. It has been implemented using a perceptron network whose connection weights were trained with backpropagation (supervised learning).
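The time-delay idea can be illustrated with a minimal sketch: one unit with shared weights is applied to every window of consecutive time steps, so the same feature produces the same response wherever it occurs. The function name `time_delay_layer`, the rectifier activation, and the toy signals are assumptions for illustration and are not taken from a particular TDNN implementation.

```python
def time_delay_layer(signal, weights, bias=0.0):
    """Apply one shared unit to every window of len(weights) consecutive samples."""
    width = len(weights)
    outputs = []
    for start in range(len(signal) - width + 1):
        window = signal[start:start + width]          # delayed copies of the input
        total = sum(w * x for w, x in zip(weights, window)) + bias
        outputs.append(max(0.0, total))               # rectifier, chosen for illustration
    return outputs


# The unit's weights match a short pattern; the pattern appears early in one
# signal and late in the other.
pattern_weights = [1.0, -1.0, 1.0]
early = [1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
late = [0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 1.0]

early_out = time_delay_layer(early, pattern_weights)
late_out = time_delay_layer(late, pattern_weights)

# Taking the maximum over time gives a response that ignores the position.
print(max(early_out), early_out.index(max(early_out)))
print(max(late_out), late_out.index(max(late_out)))
```

The peak response is the same for both signals and only its position differs, which is the time-shift invariance described above. In a trained TDNN the weights would be learned with backpropagation rather than set by hand.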
    == Convolutional ==

    A convolutional neural network (CNN, also called ConvNet, shift invariant or space invariant) is a class of deep network composed of one or more convolutional layers with fully connected layers (matching those in typical ANNs) on top. It uses tied weights and pooling layers, in particular max-pooling. It is often structured via Fukushima's convolutional architecture. CNNs are variations of multilayer perceptrons that use minimal preprocessing. This architecture allows CNNs to take advantage of the 2D structure of input data.

    Its unit connectivity pattern is inspired by the organization of the visual cortex. Units respond to stimuli in a restricted region of space known as the receptive field. Receptive fields partially overlap, covering the entire visual field. Unit response can be approximated mathematically by a convolution operation.

    CNNs are suitable for processing visual and other two-dimensional data. They have shown superior results in both image and speech applications. They can be trained with standard backpropagation. CNNs are easier to train than other regular, deep, feedforward neural networks and have many fewer parameters to estimate.

    Capsule Neural Networks (CapsNet) add structures called capsules to a CNN and reuse output from several capsules to form more stable (with respect to various perturbations) representations.

    Examples of applications in computer vision include DeepDream and robot navigation. They have wide applications in image and video recognition, recommender systems and natural language processing.

    == Deep stacking network ==

    A deep stacking network (DSN), also called a deep convex network, is based on a hierarchy of blocks of simplified neural network modules. It was introduced in 2011 by Deng and Yu. It formulates the learning as a convex optimization problem with a closed-form solution, emphasizing the mechanism's similarity to stacked generalization.
    Each DSN block is a simple module that is easy to train by itself in a supervised fashion, without backpropagation through the entire stack of blocks.

    Each block consists of a simplified multilayer perceptron (MLP) with a single hidden layer. The hidden layer h has logistic sigmoidal units, and the output layer has linear units. Connections between these layers are represented by weight matrix U; input-to-hidden-layer connections have weight matrix W. Target vectors t form the columns of matrix T, and the input data vectors x form the columns of matrix X. The matrix of hidden units is

    $$\boldsymbol{H} = \sigma(\boldsymbol{W}^{T}\boldsymbol{X}).$$

    Modules are trained in order, so lower-layer weights W are known at each stage. The function $\sigma$ performs the element-wise logistic sigmoid operation. Each block estimates the same final label class y, and its estimate is concatenated with the original input X to form the expanded input for the next block. Thus, the input to the first block contains the original data only, while the input to downstream blocks adds the output of preceding blocks. Learning the upper-layer weight matrix U given the other weights in the network can then be formulated as a convex optimization problem:

    $$\min_{U^{T}} f = \|\boldsymbol{U}^{T}\boldsymbol{H} - \boldsymbol{T}\|_{F}^{2},$$

    which has a closed-form solution.

    Unlike other deep architectures, such as DBNs, the goal is not to discover the transformed feature representation. The structure of the hierarchy of this kind of architecture makes parallel learning straightforward, as a batch-mode optimization problem. In purely discriminative tasks, DSNs outperform conventional DBNs.

    === Tensor deep stacking networks ===

    This architecture is a DSN extension.
    It offers two important improvements: it uses higher-order information from covariance statistics, and it transforms the non-convex problem of a lower layer into a convex sub-problem of an upper layer. TDSNs use covariance statistics in a bilinear mapping from each of two distinct sets of hidden units in the same layer to predictions, via a third-order tensor.

    While parallelization and scalability are not considered seriously in conventional DNNs, all learning for DSNs and TDSNs is done in batch mode to allow parallelization. Parallelization allows scaling the design to larger (deeper) architectures and data sets.

    The basic architecture is suitable for diverse tasks such as classification and regression.

    == Physics-informed ==

    A physics-informed neural network (PINN) is designed for the numerical solution of mathematical equations, such as differential, integral, delay, fractional and other equations. A PINN accepts variables (spatial, temporal, and others) as input parameters and passes them through the network block. At the output, it produces an approximate solution and substitutes it into the mathematical model, taking the initial and boundary conditions into account. If the solution does not satisfy the required accuracy, backpropagation is used to correct the solution.

    Besides PINN, other architectures have been developed to produce surrogate models for scientific comput

  • Wake-sleep algorithm

    The wake-sleep algorithm is an unsupervised learning algorithm for deep generative models, especially Helmholtz machines. The algorithm is similar to the expectation-maximization algorithm and optimizes the model likelihood for observed data. The name of the algorithm derives from its use of two learning phases, the "wake" phase and the "sleep" phase, which are performed alternately. It can be conceived as a model for learning in the brain, but is also applied in machine learning.

    == Description ==

    The goal of the wake-sleep algorithm is to find a hierarchical representation of observed data. In a graphical representation of the algorithm, data is applied to the algorithm at the bottom, while higher layers form gradually more abstract representations. Between each pair of layers are two sets of weights: recognition weights, which define how representations are inferred from data, and generative weights, which define how these representations relate to data.

    == Training ==

    Training consists of two phases: the "wake" phase and the "sleep" phase. It has been proven that this learning algorithm is convergent.

    === The "wake" phase ===

    Neurons are fired by recognition connections (from what would be input to what would be output). Generative connections (leading from outputs to inputs) are then modified to increase the probability that they would recreate the correct activity in the layer below, closer to the actual data from sensory input.

    === The "sleep" phase ===

    The process is reversed in the "sleep" phase: neurons are fired by generative connections, while recognition connections are modified to increase the probability that they would recreate the correct activity in the layer above, further from the actual data from sensory input.

    == Extensions ==

    Since the recognition network is limited in its flexibility, it might not be able to approximate the posterior distribution of latent variables well.
    To better approximate the posterior distribution, it is possible to employ importance sampling, with the recognition network as the proposal distribution. This improved approximation of the posterior distribution also improves the overall performance of the model.
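The two training phases described above can be sketched for the smallest possible case: one visible layer and one hidden layer of binary stochastic units. This is a minimal sketch under stated assumptions, not a reference implementation: the class name `TinyHelmholtzMachine`, the learning rate, the zero initialization, and the simple delta-rule updates are choices made here for illustration, and the importance-sampling extension is not included.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sample(prob, rng):
    """Draw binary units that are 1 with the given probabilities."""
    return (rng.random(prob.shape) < prob).astype(float)


class TinyHelmholtzMachine:
    """One visible and one hidden layer (hypothetical minimal model)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.G = np.zeros((n_visible, n_hidden))   # generative weights (hidden -> visible)
        self.b_x = np.zeros(n_visible)             # generative visible biases
        self.b_h = np.zeros(n_hidden)              # generative hidden biases (prior)
        self.R = np.zeros((n_hidden, n_visible))   # recognition weights (visible -> hidden)
        self.c = np.zeros(n_hidden)                # recognition hidden biases

    def wake(self, x, lr):
        # Recognition connections drive the hidden units from real data ...
        h = sample(sigmoid(self.R @ x + self.c), self.rng)
        # ... and generative connections are adjusted to recreate the data below.
        x_prob = sigmoid(self.G @ h + self.b_x)
        self.G += lr * np.outer(x - x_prob, h)
        self.b_x += lr * (x - x_prob)
        self.b_h += lr * (h - sigmoid(self.b_h))

    def sleep(self, lr):
        # Generative connections produce a "dream": hidden state, then visible state ...
        h = sample(sigmoid(self.b_h), self.rng)
        x = sample(sigmoid(self.G @ h + self.b_x), self.rng)
        # ... and recognition connections are adjusted to recover the hidden state above.
        h_prob = sigmoid(self.R @ x + self.c)
        self.R += lr * np.outer(h - h_prob, x)
        self.c += lr * (h - h_prob)

    def train(self, data, epochs=50, lr=0.1):
        for _ in range(epochs):
            for x in data:
                self.wake(x, lr)
                self.sleep(lr)


# Toy data: the first two visible units are always on, the last two always off.
data = np.array([[1.0, 1.0, 0.0, 0.0]] * 20)
machine = TinyHelmholtzMachine(n_visible=4, n_hidden=2)
machine.train(data)
print(np.round(sigmoid(machine.b_x), 2))
```

On this toy data the generative biases move toward switching the first two visible units on and the last two off, because each wake step nudges the generative model toward the observed pattern.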

  • Online machine learning

    In computer science, online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring out-of-core algorithms. It is also used in situations where the algorithm must dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., prediction of prices in the international financial markets. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches.

    Online machine learning algorithms find applications in a wide variety of fields, such as sponsored search to maximize ad revenue, portfolio optimization, shortest path prediction (with stochastic weights, e.g. traffic on roads for a maps application), spam filtering, real-time fraud detection, and dynamic pricing for e-commerce. There is also growing interest in using online learning paradigms for LLMs to enable continuous, real-time adaptation after the initial training.

    == Introduction ==

    In the setting of supervised learning, a function $f : X \to Y$ is to be learned, where $X$ is thought of as a space of inputs and $Y$ as a space of outputs, that predicts well on instances drawn from a joint probability distribution $p(x, y)$ on $X \times Y$. In reality, the learner never knows the true distribution $p(x, y)$ over instances.
    Instead, the learner usually has access to a training set of examples $(x_1, y_1), \ldots, (x_n, y_n)$. In this setting, the loss function is given as $V : Y \times Y \to \mathbb{R}$, such that $V(f(x), y)$ measures the difference between the predicted value $f(x)$ and the true value $y$. The ideal goal is to select a function $f \in \mathcal{H}$, where $\mathcal{H}$ is a space of functions called a hypothesis space, so that some notion of total loss is minimized. Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms.

    == Statistical view of online learning ==

    In statistical learning models, the training samples $(x_i, y_i)$ are assumed to have been drawn from the true distribution $p(x, y)$, and the objective is to minimize the expected "risk"

    $$I[f] = \mathbb{E}[V(f(x), y)] = \int V(f(x), y)\, dp(x, y).$$

    A common paradigm in this situation is to estimate a function $\hat{f}$ through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms, such as regularized least squares and support vector machines. A purely online model in this category would learn based on just the new input $(x_{t+1}, y_{t+1})$, the current best predictor $f_t$ and some extra stored information (which is usually expected to have storage requirements independent of training data size).
    For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used, where $f_{t+1}$ is permitted to depend on $f_t$ and all previous data points $(x_1, y_1), \ldots, (x_t, y_t)$. In this case, the space requirements are no longer guaranteed to be constant, since all previous data points must be stored, but the solution may take less time to compute with the addition of a new data point than with batch learning techniques.

    A common strategy to overcome the above issues is to learn using mini-batches, which process a small batch of $b \geq 1$ data points at a time; this can be considered pseudo-online learning for $b$ much smaller than the total number of training points. Mini-batch techniques are used with repeated passes over the training data to obtain optimized out-of-core versions of machine learning algorithms, for example stochastic gradient descent. When combined with backpropagation, this is currently the de facto method for training artificial neural networks.

    === Example: linear least squares ===

    The simple example of linear least squares is used to explain a variety of ideas in online learning. The ideas are general enough to be applied to other settings, for example with other convex loss functions.

    === Batch learning ===

    Consider the setting of supervised learning with $f$ being a linear function to be learned:

    $$f(x_j) = \langle w, x_j \rangle = w \cdot x_j$$

    where $x_j \in \mathbb{R}^d$ is a vector of inputs (data points) and $w \in \mathbb{R}^d$ is a linear filter vector. The goal is to compute the filter vector $w$.
    To this end, a square loss function

    $$V(f(x_j), y_j) = (f(x_j) - y_j)^2 = (\langle w, x_j \rangle - y_j)^2$$

    is used to compute the vector $w$ that minimizes the empirical loss

    $$I_n[w] = \sum_{j=1}^{n} V(\langle w, x_j \rangle, y_j) = \sum_{j=1}^{n} (x_j^{\mathsf{T}} w - y_j)^2$$

    where $y_j \in \mathbb{R}$.

    Let $X$ be the $i \times d$ data matrix and $y \in \mathbb{R}^i$ the column vector of target values after the arrival of the first $i$ data points. Assuming that the covariance matrix $\Sigma_i = X^{\mathsf{T}} X$ is invertible (otherwise it is preferable to proceed in a similar fashion with Tikhonov regularization), the best solution $f^{*}(x) = \langle w^{*}, x \rangle$ to the linear least squares problem is given by

    $$w^{*} = (X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} y = \Sigma_i^{-1} \sum_{j=1}^{i} x_j y_j.$$

    Now, calculating the covariance matrix $\Sigma_i = \sum_{j=1}^{i} x_j x_j^{\mathsf{T}}$ takes time $O(i d^2)$, inverting the $d \times d$ matrix takes time $O(d^3)$, while the rest of the multiplication takes time $O(d^2)$, giving a total time of $O(i d^2 + d^3)$. When there are $n$ total points in the dataset, recomputing the solution after the arrival of every data point $i = 1, \ldots, n$ with the naive approach has a total complexity of $O(n^2 d^2 + n d^3)$.
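The batch solution above can be checked numerically with a short NumPy sketch. The function name `batch_least_squares` and the synthetic, noiseless data are assumptions for illustration. The sketch solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same $w^{*}$ when $X^{\mathsf{T}} X$ is invertible.

```python
import numpy as np


def batch_least_squares(X, y):
    """Return w* satisfying (X^T X) w = X^T y; assumes X^T X is invertible."""
    sigma = X.T @ X                          # d x d matrix, costs O(i d^2)
    return np.linalg.solve(sigma, X.T @ y)   # solving the d x d system costs O(d^3)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # i = 50 data points, d = 3 features
w_true = np.array([2.0, -1.0, 0.5])          # hypothetical filter vector
y = X @ w_true                               # noiseless targets for the check

w_star = batch_least_squares(X, y)
print(np.round(w_star, 6))
```

Because the targets are generated without noise, the recovered `w_star` agrees with `w_true` up to floating-point error. Calling this function again after every new data point is the naive approach whose total cost is discussed above.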
    Note that when the matrix $\Sigma_i$ is stored, updating it at each step only requires adding $x_{i+1} x_{i+1}^{\mathsf{T}}$, which takes $O(d^2)$ time, reducing the total time to $O(n d^2 + n d^3) = O(n d^3)$, but with an additional storage space of $O(d^2)$ to store $\Sigma_i$.

    === Online learning: recursive least squares ===

    The recursive least squares (RLS) algorithm considers an online approach to the least squares problem. It can be shown that by initialising $w_0 = 0 \in \mathbb{R}^d$ and $\Gamma_0 = I \in \mathbb{R}^{d \times d}$, the solution of the linear least squares problem given in the previous section can be computed by the following iteration:

    $$\Gamma_i = \Gamma_{i-1} - \frac{\Gamma_{i-1} x_i x_i^{\mathsf{T}} \Gamma_{i-1}}{1 + x_i^{\mathsf{T}} \Gamma_{i-1} x_i}$$

    $$w_i = w_{i-1} - \Gamma_i x_i \left( x_i^{\mathsf{T}} w_{i-1} - y_i \right)$$
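The RLS iteration above translates almost line by line into NumPy. The function name `recursive_least_squares` and the synthetic data are assumptions for illustration. One caveat worth stating: with the initialisation $\Gamma_0 = I$, the iterate after $n$ points equals $(I + X^{\mathsf{T}} X)^{-1} X^{\mathsf{T}} y$, that is, the least squares solution with a unit Tikhonov term, so the sketch compares against that quantity.

```python
import numpy as np


def recursive_least_squares(X, y):
    """Run the RLS iteration over the rows of X, one data point at a time."""
    d = X.shape[1]
    w = np.zeros(d)          # w_0 = 0
    gamma = np.eye(d)        # Gamma_0 = I
    for x_i, y_i in zip(X, y):
        gx = gamma @ x_i     # Gamma_{i-1} x_i, an O(d^2) product
        # Gamma stays symmetric, so Gamma x x^T Gamma equals outer(gx, gx).
        gamma = gamma - np.outer(gx, gx) / (1.0 + x_i @ gx)
        w = w - (gamma @ x_i) * (x_i @ w - y_i)
    return w, gamma


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

w_rls, gamma = recursive_least_squares(X, y)
w_reference = np.linalg.solve(np.eye(3) + X.T @ X, X.T @ y)
print(np.round(w_rls, 4), np.round(w_reference, 4))
```

Each step uses only matrix-vector and outer products, so a new data point costs $O(d^2)$ time and no past data points need to be stored.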

  • Elements (toolchain)

    RemObjects Elements is a toolchain for software development, comprising six programming languages: C#, Swift, Go, Java, Oxygene (a form of modern Object Pascal), and Visual Basic .NET. All languages interoperate, meaning a single project can use any combination of languages, and they can all be compiled to .NET, the JVM, native code, or WebAssembly. Elements supports Microsoft Windows, all Apple Inc. platforms (including iOS, visionOS and watchOS), Android, and Linux. Elements also supports language conversion, allowing source code in one language to be rewritten in another. Elements is supported in Visual Studio, but RemObjects also makes its own IDEs, Fire (on macOS) and Water (on Windows).

    == Background ==

    RemObjects began in 2002, creating software for Delphi. In 2005, in response to the growth of .NET and the fact that Delphi targeted only native Windows, the company released Oxygene (known as Chrome at the time) as a new version of Object Pascal, with more modern syntax and native .NET support. Since then, five other languages have been added to the suite, as well as compilation for the web via WebAssembly and to native architectures (e.g. Intel 32/64 or ARM64).

    Elements is primarily intended for developers who want to pull together libraries and codebases written in multiple languages, including legacy codebases in older languages, while modernizing either with newer syntax and features or by adding newer or more popular languages. Because of the Oxygene flavour of Object Pascal, supporting Delphi apps is a primary focus, including allowing Pascal to be compiled for other architectures and providing language features that match other prominent languages.

    == Approach ==

    New versions of Elements come out approximately every week. RemObjects names its programming languages after chemical elements, sometimes with poetic or musical spelling, rather than referring to them directly.
    They are:

      • C#: Hydrogene
      • Object Pascal: Oxygene
      • Java: Iodine
      • Visual Basic: Mercury
      • Go: Gold
      • Swift: Silver

    == History ==

    The Elements compiler was first introduced with version 1.0 in 2005 under the name "Chrome", with support for only the Oxygene language on the .NET platform, primarily as a response to the then-new and not well-received Delphi .NET compiler from Embarcadero.

    Chrome saw updates to version 1.5 "Floorshow" and Chrome 2.0 "Joyride" over the next few years, moving in parallel with major advancements on the .NET platform for .NET 2.0 (Generics) and .NET 3.x (LINQ), respectively.

    With the release of version 3.0 in 2008 (code-named Oxygène after the Jean-Michel Jarre album of the same name), Chrome was rebranded to Oxygene. It also shipped co-branded by Embarcadero as Delphi Prism (later just Prism) as part of RAD Studio, replacing Embarcadero's own and now-defunct Delphi.NET compiler.

    2010 saw the release of Oxygene 4 ("Echoes"), the last version to focus on just a single language and platform. With Oxygene 5 in 2011 and Oxygene 6 in 2013, RemObjects introduced new platform support for Java/Android (code-name "Cooper") and then Cocoa, the Apple development platform (code-name "Toffee").

    Elements 7.0 was released at the beginning of 2014, adding a second programming language, C#, to the compiler and relegating Oxygene from the product name to the brand of the Object Pascal-based language only.

    Over the subsequent years, Elements gained support for additional languages, with Apple Swift in 2015, Java in 2017, and subsequently Google's Go and Mercury, a revitalized Visual Basic .NET. Elements also gained support for its fourth target platform, "Island", for CPU-native compilation for Windows, Linux, and WebAssembly.
    In addition to the chemical elements-based names for the different languages, the "Elements" concept was carried on with the introduction of dedicated development environments alchemically named Fire (for the Mac, in 2015) and Water (for Windows, in 2018).

    == Fire and Water (IDEs) ==

    Fire and Water are integrated development environments developed by RemObjects Software. They are designed specifically for use with the Elements compiler. Fire is the version developed for macOS, while Water is intended for Microsoft Windows.

    Both IDEs are designed to work closely with the Elements compiler and are primarily intended for developers using the RemObjects language ecosystem. They support software development across multiple platforms, including .NET, Android, iOS, macOS, Windows, Linux, and WebAssembly.

    The IDEs include standard development tools such as syntax highlighting, code completion, debugging, and project navigation. Build operations are managed using a custom system known as EBuild, which is part of the broader Elements toolchain.

    The IDEs are distributed as part of the RemObjects Elements package and are updated in coordination with the compiler itself.

    == In media ==

    Oxygene has been mentioned several times by Verity Stob in their Chronicles of Delphi series, currently hosted at The Register.

  • Low-rank approximation

    In mathematics, low-rank approximation refers to the process of approximating a given matrix by a matrix of lower rank. More precisely, it is a minimization problem, in which the cost function measures the fit between a given matrix (the data) and an approximating matrix (the optimization variable), subject to a constraint that the approximating matrix has reduced rank. The problem is used for mathematical modeling and data compression. The rank constraint is related to a constraint on the complexity of a model that fits the data. In applications, often there are other constraints on the approximating matrix apart from the rank constraint, e.g., non-negativity and Hankel structure. Low-rank approximation is closely related to numerous other techniques, including principal component analysis, factor analysis, total least squares, latent semantic analysis, orthogonal regression, and dynamic mode decomposition. == Definition == Given structure specification S : R n p → R m × n {\displaystyle {\mathcal {S}}:\mathbb {R} ^{n_{p}}\to \mathbb {R} ^{m\times n}} , vector of structure parameters p ∈ R n p {\displaystyle p\in \mathbb {R} ^{n_{p}}} , norm ‖ ⋅ ‖ {\displaystyle \|\cdot \|} , and desired rank r {\displaystyle r} , minimize over p ^ ‖ p − p ^ ‖ subject to rank ⁡ ( S ( p ^ ) ) ≤ r . {\displaystyle {\text{minimize}}\quad {\text{over }}{\widehat {p}}\quad \|p-{\widehat {p}}\|\quad {\text{subject to}}\quad \operatorname {rank} {\big (}{\mathcal {S}}({\widehat {p}}){\big )}\leq r.} == Applications == Linear system identification, in which case the approximating matrix is Hankel structured. Machine learning, in which case the approximating matrix is nonlinearly structured. Recommender systems, in which cases the data matrix has missing values and the approximation is categorical. Distance matrix completion, in which case there is a positive definiteness constraint. Natural language processing, in which case the approximation is nonnegative. 
Computer algebra, in which case the approximation is Sylvester structured. Matrix product states, in which case the approximation is usually rescaled to have fixed Frobenius norm. == Basic low-rank approximation problem == The unstructured problem with fit measured by the Frobenius norm, i.e., $$\text{minimize}\quad\text{over }\widehat{D}\quad\|D-\widehat{D}\|_{\text{F}}\quad\text{subject to}\quad\operatorname{rank}\big(\widehat{D}\big)\leq r$$ has an analytic solution in terms of the singular value decomposition of the data matrix. The result is referred to as the matrix approximation lemma or Eckart–Young–Mirsky theorem. This problem was originally solved by Erhard Schmidt in the infinite-dimensional context of integral operators (although his methods easily generalize to arbitrary compact operators on Hilbert spaces) and later rediscovered by C. Eckart and G. Young. L. Mirsky generalized the result to arbitrary unitarily invariant norms. Let $D=U\Sigma V^{\top}\in\mathbb{R}^{m\times n}$, $m\geq n$, be the singular value decomposition of $D$, where $\Sigma=:\operatorname{diag}(\sigma_1,\ldots,\sigma_r)$, with $r\leq\min\{m,n\}=n$, is the $m\times n$ rectangular diagonal matrix with $r$ non-zero singular values $\sigma_1\geq\ldots\geq\sigma_r>\sigma_{r+1}=\ldots=\sigma_n=0$.
For a given $k\in\{1,\dots,r\}$, partition $U$, $\Sigma$, and $V$ as follows: $$U=:\begin{bmatrix}U_1&U_2\end{bmatrix},\quad\Sigma=:\begin{bmatrix}\Sigma_1&0\\0&\Sigma_2\end{bmatrix},\quad\text{and}\quad V=:\begin{bmatrix}V_1&V_2\end{bmatrix},$$ where $U_1$ is $m\times k$, $\Sigma_1$ is $k\times k$, and $V_1$ is $n\times k$. Then the rank-$k$ matrix $$\widehat{D}^{*}:=U_1\Sigma_1V_1^{\top},$$ obtained from the truncated singular value decomposition, is such that $$\|D-\widehat{D}^{*}\|_{\text{F}}=\min_{\operatorname{rank}(\widehat{D})\leq k}\|D-\widehat{D}\|_{\text{F}}=\sqrt{\sigma_{k+1}^{2}+\cdots+\sigma_{r}^{2}}.$$ The minimizer $\widehat{D}^{*}$ is unique if and only if $\sigma_k>\sigma_{k+1}$. == Proof of Eckart–Young–Mirsky theorem (for spectral norm) == Let $A\in\mathbb{R}^{m\times n}$ be a real (possibly rectangular) matrix with $m\leq n$. Suppose that $A=U\Sigma V^{\top}$ is the singular value decomposition of $A$. Recall that $U$ and $V$ are orthogonal matrices, and $\Sigma$ is an $m\times n$ diagonal matrix with entries $(\sigma_1,\sigma_2,\cdots,\sigma_m)$ such that $\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_m\geq 0$.
We claim that the best rank-$k$ approximation to $A$ in the spectral norm, denoted by $\|\cdot\|_2$, is given by $$A_k:=\sum_{i=1}^{k}\sigma_iu_iv_i^{\top}$$ where $u_i$ and $v_i$ denote the $i$th column of $U$ and $V$, respectively. First, note that, since $\Sigma$ has $m$ diagonal entries, we have $$\|A-A_k\|_2=\left\|\sum_{i=1}^{m}\sigma_iu_iv_i^{\top}-\sum_{i=1}^{k}\sigma_iu_iv_i^{\top}\right\|_2=\left\|\sum_{i=k+1}^{m}\sigma_iu_iv_i^{\top}\right\|_2=\sigma_{k+1}.$$ Therefore, we need to show that if $B_k=XY^{\top}$, where $X$ and $Y$ have $k$ columns, then $\|A-A_k\|_2=\sigma_{k+1}\leq\|A-B_k\|_2$. Since $Y$ has $k$ columns, there must be a nontrivial linear combination of the first $k+1$ columns of $V$, i.e., $$w=\gamma_1v_1+\cdots+\gamma_{k+1}v_{k+1},$$ such that $Y^{\top}w=0$. Without loss of generality, we can scale $w$ so that $\|w\|_2=1$ or (equivalently) $\gamma_1^2+\cdots+\gamma_{k+1}^2=1$. Therefore, $$\|A-B_k\|_2^2\geq\|(A-B_k)w\|_2^2=\|Aw\|_2^2=\gamma_1^2\sigma_1^2+\cdots+\gamma_{k+1}^2\sigma_{k+1}^2\geq\sigma_{k+1}^2.$$ The result follows by taking the square root of both sides of the above inequality.
== Proof of Eckart–Young–Mirsky theorem (for Frobenius norm) == Let $A\in\mathbb{R}^{m\times n}$ be a real (possibly rectangular) matrix with $m\leq n$. Suppose that $A=U\Sigma V^{\top}$ is the singular value decomposition of $A$. We claim that the best rank-$k$ approximation to $A$ in the Frobenius norm, denoted by $\|\cdot\|_F$, is given by $$A_k=\sum_{i=1}^{k}\sigma_iu_iv_i^{\top}$$ where $u_i$ and $v_i$ denote the $i$th column of $U$ and $V$, respectively. First, note that we have $$\|A-A_k\|_F^2=\left\|\sum_{i=k+1}^{m}\sigma_iu_iv_i^{\top}\right\|_F^2=\sum_{i=k+1}^{m}\sigma_i^2.$$ Therefore, we need to show that if $B_k=XY^{\top}$, where $X$ and $Y$ have $k$ columns, then $$\|A-A_k\|_F^2=\sum_{i=k+1}^{m}\sigma_i^2\leq\|A-B_k\|_F^2.$$ By the triangle inequality with the spectral norm
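
As a small worked illustration of the truncated-SVD result stated in the basic problem above, consider a diagonal matrix, for which the singular value decomposition can be read off directly. The matrices below are chosen purely for illustration and are not taken from the article.

```latex
% Illustrative example: U = V = I, so the singular values are the diagonal entries.
\[
D=\begin{bmatrix}3&0&0\\0&2&0\\0&0&1\end{bmatrix},
\qquad \sigma_1=3,\ \sigma_2=2,\ \sigma_3=1 .
\]
% Rank-1 truncation: keep only the largest singular value.
\[
\widehat{D}^{*}_{k=1}=\begin{bmatrix}3&0&0\\0&0&0\\0&0&0\end{bmatrix},
\qquad
\|D-\widehat{D}^{*}_{k=1}\|_{\text{F}}=\sqrt{\sigma_2^{2}+\sigma_3^{2}}=\sqrt{5}.
\]
% Rank-2 truncation: keep the two largest singular values.
\[
\widehat{D}^{*}_{k=2}=\begin{bmatrix}3&0&0\\0&2&0\\0&0&0\end{bmatrix},
\qquad
\|D-\widehat{D}^{*}_{k=2}\|_{\text{F}}=\sqrt{\sigma_3^{2}}=1.
\]
% Non-uniqueness when sigma_k = sigma_{k+1}: for D' = diag(2, 2, 1) and k = 1,
% both diag(2, 0, 0) and diag(0, 2, 0) attain the minimal error sqrt(2^2 + 1^2) = sqrt(5).
```

Both truncations of $D$ are unique because $\sigma_1>\sigma_2$ and $\sigma_2>\sigma_3$; the last comment shows how uniqueness fails once two consecutive singular values coincide.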

    Read more →
  • AdaBoost

    AdaBoost

    AdaBoost (short for Adaptive Boosting) is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many types of learning algorithm to improve performance. The output of multiple weak learners is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals of real values. AdaBoost is adaptive in the sense that subsequent weak learners (models) are adjusted in favor of instances misclassified by previous models. In some problems, it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. Although AdaBoost is typically used to combine weak base learners (such as decision stumps), it has been shown to also effectively combine strong base learners (such as deeper decision trees), producing an even more accurate model. Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm such that later trees tend to focus on harder-to-classify examples. == Training == AdaBoost refers to a particular method of training a boosted classifier. 
A boosted classifier is a classifier of the form $$F_T(x)=\sum_{t=1}^{T}f_t(x)$$ where each $f_t$ is a weak learner that takes an object $x$ as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification. Each weak learner produces an output hypothesis $h$ which fixes a prediction $h(x_i)$ for each sample in the training set. At each iteration $t$, a weak learner is selected and assigned a coefficient $\alpha_t$ such that the total training error $E_t$ of the resulting $t$-stage boosted classifier is minimized: $$E_t=\sum_iE[F_{t-1}(x_i)+\alpha_th(x_i)]$$ Here $F_{t-1}(x)$ is the boosted classifier that has been built up to the previous stage of training and $f_t(x)=\alpha_th(x)$ is the weak learner that is being considered for addition to the final classifier. === Weighting === At each iteration of the training process, a weight $w_{i,t}$ is assigned to each sample in the training set equal to the current error $E(F_{t-1}(x_i))$ on that sample. These weights can be used in the training of the weak learner. For instance, decision trees can be grown which favor the splitting of sets of samples with large weights.
== Derivation == This derivation follows Rojas (2009): Suppose we have a data set $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ where each item $x_i$ has an associated class $y_i\in\{-1,1\}$, and a set of weak classifiers $\{k_1,\ldots,k_L\}$ each of which outputs a classification $k_j(x_i)\in\{-1,1\}$ for each item. After the $(m-1)$-th iteration our boosted classifier is a linear combination of the weak classifiers of the form: $$C_{(m-1)}(x_i)=\alpha_1k_1(x_i)+\cdots+\alpha_{m-1}k_{m-1}(x_i),$$ where the class will be the sign of $C_{(m-1)}(x_i)$. At the $m$-th iteration we want to extend this to a better boosted classifier by adding another weak classifier $k_m$, with another weight $\alpha_m$: $$C_m(x_i)=C_{(m-1)}(x_i)+\alpha_mk_m(x_i)$$ So it remains to determine which weak classifier is the best choice for $k_m$, and what its weight $\alpha_m$ should be.
We define the total error $E$ of $C_m$ as the sum of its exponential loss on each data point, given as follows: $$E=\sum_{i=1}^{N}e^{-y_iC_m(x_i)}=\sum_{i=1}^{N}e^{-y_iC_{(m-1)}(x_i)}e^{-y_i\alpha_mk_m(x_i)}$$ Letting $w_i^{(1)}=1$ and $w_i^{(m)}=e^{-y_iC_{(m-1)}(x_i)}$ for $m>1$, we have: $$E=\sum_{i=1}^{N}w_i^{(m)}e^{-y_i\alpha_mk_m(x_i)}$$ We can split this summation between those data points that are correctly classified by $k_m$ (so $y_ik_m(x_i)=1$) and those that are misclassified (so $y_ik_m(x_i)=-1$): $$\begin{aligned}E&=\sum_{y_i=k_m(x_i)}w_i^{(m)}e^{-\alpha_m}+\sum_{y_i\neq k_m(x_i)}w_i^{(m)}e^{\alpha_m}\\&=\sum_{i=1}^{N}w_i^{(m)}e^{-\alpha_m}+\sum_{y_i\neq k_m(x_i)}w_i^{(m)}\left(e^{\alpha_m}-e^{-\alpha_m}\right)\end{aligned}$$ Since the only part of the right-hand side of this equation that depends on $k_m$ is $\sum_{y_i\neq k_m(x_i)}w_i^{(m)}$, we see that the $k_m$ that minimizes $E$ is the one in the set $\{k_1,\ldots,k_L\}$ that minimizes $\sum_{y_i\neq k_m(x_i)}w_i^{(m)}$ (assuming that $\alpha_m>0$), i.e.
the weak classifier with the lowest weighted error (with weights $w_i^{(m)}=e^{-y_iC_{(m-1)}(x_i)}$). To determine the desired weight $\alpha_m$ that minimizes $E$ with the $k_m$ that we just determined, we differentiate: $$\frac{dE}{d\alpha_m}=\frac{d\left(\sum_{y_i=k_m(x_i)}w_i^{(m)}e^{-\alpha_m}+\sum_{y_i\neq k_m(x_i)}w_i^{(m)}e^{\alpha_m}\right)}{d\alpha_m}$$ Setting this derivative to zero and solving for $\alpha_m$ gives: $$\alpha_m=\frac{1}{2}\ln\left(\frac{\sum_{y_i=k_m(x_i)}w_i^{(m)}}{\sum_{y_i\neq k_m(x_i)}w_i^{(m)}}\right)$$ We calculate the weighted error rate of the weak classifier to be $\epsilon_m=\frac{\sum_{y_i\neq k_m(x_i)}w_i^{(m)}}{\sum_{i=1}^{N}w_i^{(m)}}$, so it follows that: $$\alpha_m=\frac{1}{2}\ln\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$$ which is the negative logit function multiplied by 0.5. Due to the convexity of $E$ as a function of $\alpha_m$, this expression for $\alpha_m$ gives the global minimum of the loss function.
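
The quantities in this derivation can be computed directly. The following is a minimal sketch of a single boosting round for one fixed weak classifier with outputs in {-1, +1}; the function name `adaboost_round` and the toy data are illustrative assumptions, and the weight update simply applies the definition of the weights (the exponential loss of the classifier built so far) to the extended classifier.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One round for a fixed weak classifier k_m (hypothetical helper).

    weights      -- current sample weights w_i^(m)
    labels       -- true classes y_i in {-1, +1}
    predictions  -- weak classifier outputs k_m(x_i) in {-1, +1}
    Returns (epsilon_m, alpha_m, next_weights).
    """
    total = sum(weights)
    wrong = sum(w for w, y, k in zip(weights, labels, predictions) if y != k)
    epsilon = wrong / total  # weighted error rate
    if not 0.0 < epsilon < 1.0:
        raise ValueError("alpha is undefined when the weighted error is 0 or 1")
    alpha = 0.5 * math.log((1.0 - epsilon) / epsilon)
    # w_i^(m+1) = exp(-y_i * C_m(x_i)) = w_i^(m) * exp(-y_i * alpha_m * k_m(x_i))
    next_weights = [
        w * math.exp(-y * alpha * k)
        for w, y, k in zip(weights, labels, predictions)
    ]
    return epsilon, alpha, next_weights


# Toy data (assumed): four samples, the weak classifier gets the second one wrong.
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]
epsilon, alpha, next_weights = adaboost_round([1.0] * 4, labels, predictions)
print(epsilon, alpha, next_weights)
```

In this toy case the misclassified sample ends up carrying the same total weight as the three correctly classified samples combined, which is why the next weak classifier is pushed toward the sample that was previously missed.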
Note: This derivation only applies when $k_m(x_i)\in\{-1,1\}$, though it can be a good starting guess in other cases, such as when the weak learner is biased ($k_m(x)\in\{a,b\}$, $a\neq -b$), has multiple leaves ($k_m(x)\in\{a,b,\dots,n\}$) or is some other function $k_m(x)\in\mathbb{R}$. Thus we have derived the AdaBoost algorithm: At each

    Read more →
  • Physical neural network

    Physical neural network

    A physical neural network is a type of artificial neural network in which an electrically adjustable material is used to emulate the function of a neural synapse or a higher-order (dendritic) neuron model. The term "physical" neural network emphasizes the reliance on physical hardware to emulate neurons, as opposed to software-based approaches. More generally, the term is applicable to other artificial neural networks in which a memristor or other electrically adjustable resistance material is used to emulate a neural synapse. == Types of physical neural networks == === ADALINE === In the 1960s Bernard Widrow and Ted Hoff developed ADALINE (Adaptive Linear Neuron), which used electrochemical cells called memistors (memory resistors) to emulate the synapses of an artificial neuron. The memistors were implemented as 3-terminal devices operating based on the reversible electroplating of copper, such that the resistance between two of the terminals is controlled by the integral of the current applied via the third terminal. The ADALINE circuitry was briefly commercialized by the Memistor Corporation in the 1960s, enabling some applications in pattern recognition. However, since the memistors were not fabricated using integrated circuit fabrication techniques, the technology was not scalable and was eventually abandoned as solid-state electronics matured. === Analog VLSI === In 1989 Carver Mead published his book Analog VLSI and Neural Systems, which spun off perhaps the most common variant of analog neural networks. The physical realization is implemented in analog VLSI, often as field effect transistors in low inversion. Such devices can be modelled as translinear circuits, a technique described by Barrie Gilbert in several papers around the mid-1970s, and in particular in his Translinear Circuits from 1981.
With this method, circuits can be analyzed as a set of well-defined functions in steady state, and such circuits can be assembled into complex networks. === Physical Neural Network === Alex Nugent describes a physical neural network as one or more nonlinear neuron-like nodes used to sum signals, together with nanoconnections formed from nanoparticles, nanowires, or nanotubes, which determine the signal strength input to the nodes. Alignment or self-assembly of the nanoconnections is determined by the history of the applied electric field, performing a function analogous to neural synapses. Numerous applications for such physical neural networks are possible. For example, a temporal summation device can be composed of one or more nanoconnections having an input and an output, wherein an input signal provided to the input causes one or more of the nanoconnections to experience an increase in connection strength over time. Another example of a physical neural network is taught by U.S. Patent No. 7,039,619 entitled "Utilized nanotechnology apparatus using a neural network, a solution and a connection gap," which was issued to Alex Nugent by the U.S. Patent & Trademark Office on May 2, 2006. A further application of physical neural networks is shown in U.S. Patent No. 7,412,428 entitled "Application of hebbian and anti-hebbian learning to nanotechnology-based physical neural networks," which was issued on August 12, 2008. Nugent and Molter have shown that universal computing and general-purpose machine learning are possible from operations available through simple memristive circuits operating the AHaH plasticity rule. More recently, it has been argued that complex networks of purely memristive circuits can also serve as neural networks. === Phase change neural network === In 2002, Stanford Ovshinsky described an analog neural computing medium in which phase-change material has the ability to cumulatively respond to multiple input signals.
An electrical alteration of the resistance of the phase change material is used to control the weighting of the input signals. === Memristive neural network === Greg Snider of HP Labs describes a system of cortical computing with memristive nanodevices. The memristors (memory resistors) are implemented by thin film materials in which the resistance is electrically tuned via the transport of ions or oxygen vacancies within the film. DARPA's SyNAPSE project has funded IBM Research and HP Labs, in collaboration with the Boston University Department of Cognitive and Neural Systems (CNS), to develop neuromorphic architectures which may be based on memristive systems. === Protonic artificial synapses === In 2022, researchers reported the development of nanoscale brain-inspired artificial synapses, using the ion proton (H+), for 'analog deep learning'.

    Read more →
  • List of Ada software and tools

    List of Ada software and tools

    This is a list of software and programming tools for the Ada programming language, including IDEs, compilers, libraries, verification and debugging tools, numerical and scientific computing libraries, and related projects.

    == Compilers ==
    - GNAT: GCC Ada compiler and toolchain, maintained by AdaCore
    - AdaCore GNAT Pro: commercial Ada compiler with advanced tooling for high-integrity and real-time systems
    - Green Hills compiler for Ada: Ada compiler for embedded and safety-critical systems
    - ObjectAda: Ada development environment for safety-critical and embedded systems

    == Integrated development environments (IDEs) and editors ==
    - GNAT Studio: IDE developed by AdaCore
    - Emacs: supports Ada editing with Ada mode and syntax checking
    - Eclipse: supports Ada through the GNATbench plugin
    - Visual Studio Code: Ada support via Ada Language Server extensions

    == Libraries and frameworks ==
    See also: Ada Libraries on Wikibooks
    - Ada.Calendar: date and time library
    - Ada Web Services (AWS): support for RESTful and SOAP web services
    - Ada.Text_IO: standard library for text input/output
    - Florist (POSIX Ada binding): open-source implementation of the POSIX Ada bindings
    - GNAT: Ada compiler, part of GCC, which also provides an extensive runtime and library package hierarchy
    - GtkAda: Ada bindings for the GTK+ graphical user interface toolkit
    - Matreshka: multipurpose Ada framework supporting Unicode, XML, JSON, and more
    - XML/Ada: XML and Unicode processing library

    == Real-time and embedded systems ==
    - Ada tasking: built-in concurrency support with tasks, protected objects, and rendezvous
    - Ada.Real_Time: real-time clocks, delays, and scheduling
    - ARINC 653 Ada profiles: for avionics real-time applications
    - OpenMP Ada bindings: parallel programming for multi-core embedded systems
    - Ravenscar profile: subset of Ada tasking for real-time and deterministic execution

    == Numerical and scientific computing ==
    - Ada.Numerics: libraries for numerical methods, linear algebra, and mathematical functions
    - SPARK math libraries: formal-methods-compliant numerical routines

    == Verification, debugging, and analysis ==
    - GNATprove: formal verification and static analysis tool for Ada and SPARK
    - GNATstack: runtime stack analysis and checking
    - GNATcoverage: code coverage measurement for Ada projects
    - AdaControl: style checking and metrics for Ada

    == Testing frameworks ==
    - AUnit: unit testing framework for Ada
    - GNATtest: automated testing framework for Ada

    == Documentation and code generation ==
    - GNATdoc: generates HTML documentation from Ada source code

    Read more →
  • Rprop

    Rprop

    Rprop, short for resilient backpropagation, is a learning heuristic for supervised learning in feedforward artificial neural networks. This is a first-order optimization algorithm. This algorithm was created by Martin Riedmiller and Heinrich Braun in 1992. Similarly to the Manhattan update rule, Rprop takes into account only the sign of the partial derivative over all patterns (not the magnitude), and acts independently on each "weight". For each weight, if there was a sign change of the partial derivative of the total error function compared to the last iteration, the update value for that weight is multiplied by a factor η−, where η− < 1. If the last iteration produced the same sign, the update value is multiplied by a factor of η+, where η+ > 1. The update values are calculated for each weight in the above manner, and finally each weight is changed by its own update value, in the opposite direction of that weight's partial derivative, so as to minimise the total error function. η+ is empirically set to 1.2 and η− to 0.5. Rprop can result in very large weight increments or decrements if the gradients are large, which is a problem when using mini-batches as opposed to full batches. RMSprop addresses this problem by keeping the moving average of the squared gradients for each weight and dividing the gradient by the square root of the mean square. RPROP is a batch update algorithm. Next to the cascade correlation algorithm and the Levenberg–Marquardt algorithm, Rprop is one of the fastest weight update mechanisms. == Variations == Martin Riedmiller developed three algorithms, all named RPROP. Igel and Hüsken assigned names to them and added a new variant: RPROP+ is defined at A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. RPROP− is defined at Advanced Supervised Learning in Multi-layer Perceptrons – From Backpropagation to Adaptive Learning Algorithms. Backtracking is removed from RPROP+. 
iRPROP− is defined in "Rprop – Description and Implementation Details" and was reinvented by Igel and Hüsken. This variant is very popular and is the simplest of the four. iRPROP+ is defined in "Improving the Rprop Learning Algorithm" and is very robust and typically faster than the other three variants.
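
The sign-based update rule described above can be sketched in a few lines. This is a minimal illustration of the variant without weight backtracking, not a reference implementation: the function name `rprop_step` and the bounds `step_min` and `step_max` are assumptions added here to keep the step sizes in a sane range, while `eta_plus` and `eta_minus` use the empirical values quoted in the text.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """Apply one Rprop update to every weight independently.

    grads and prev_grads are the full-batch partial derivatives from the
    current and the previous iteration; steps holds the per-weight update values.
    Returns (new_weights, new_steps, grads_to_remember).
    """
    new_weights, new_steps = [], []
    for w, g, pg, s in zip(weights, grads, prev_grads, steps):
        if g * pg > 0:        # same sign as last time: grow the update value
            s = min(s * eta_plus, step_max)
        elif g * pg < 0:      # sign change: shrink the update value
            s = max(s * eta_minus, step_min)
        # Only the sign of the gradient is used, never its magnitude.
        new_weights.append(w - sign(g) * s)
        new_steps.append(s)
    return new_weights, new_steps, list(grads)


# Usage on f(w) = w**2 (gradient 2*w), starting from w = 3 with update value 1.
w, step, prev = [3.0], [1.0], [0.0]
for _ in range(2):
    grad = [2.0 * w[0]]
    w, step, prev = rprop_step(w, grad, prev, step)
print(w, step)
```

Because only the sign of the gradient enters the update, the size of each move is governed entirely by the per-weight update value, which is what makes the method insensitive to the scale of the error function.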

    Read more →
  • LIBSVM

    LIBSVM

    LIBSVM and LIBLINEAR are two popular open source machine learning libraries, both developed at the National Taiwan University and both written in C++ though with a C API. LIBSVM implements the sequential minimal optimization (SMO) algorithm for kernelized support vector machines (SVMs), supporting classification and regression. LIBLINEAR implements linear SVMs and logistic regression models trained using a coordinate descent algorithm. The SVM learning code from both libraries is often reused in other open source machine learning toolkits, including GATE, KNIME, Orange and scikit-learn. Bindings and ports exist for programming languages such as Java, MATLAB, R, Julia, and Python. LIBSVM is available through the e1071 library in R and through scikit-learn in Python. Both libraries are free software released under the 3-clause BSD license.

    Read more →
  • NSynth

    NSynth

    NSynth (a portmanteau of "Neural Synthesis") is a WaveNet-based autoencoder for synthesizing audio, outlined in a paper in April 2017. == Overview == The model generates sounds through a neural network based synthesis, employing a WaveNet-style autoencoder to learn its own temporal embeddings from four different sounds. Google then released an open source hardware interface for the algorithm called NSynth Super, used by notable musicians such as Grimes and YACHT to generate experimental music using artificial intelligence. The research and development of the algorithm was part of a collaboration between Google Brain, Magenta and DeepMind. == Technology == === Dataset === The NSynth dataset is composed of 305,979 one-shot instrumental notes featuring a unique pitch, timbre, and envelope, sampled from 1,006 instruments from commercial sample libraries. For each instrument the dataset contains four-second 16 kHz audio snippets by ranging over every pitch of a standard MIDI piano, as well as five different velocities. The dataset is made available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. === Machine learning model === A spectral autoencoder model and a WaveNet autoencoder model are publicly available on GitHub. The baseline model uses a spectrogram with fft_size 1024 and hop_size 256, MSE loss on the magnitudes, and the Griffin-Lim algorithm for reconstruction. The WaveNet model trains on mu-law encoded waveform chunks of size 6144. It learns embeddings with 16 dimensions that are downsampled by 512 in time. == NSynth Super == In 2018 Google released a hardware interface for the NSynth algorithm, called NSynth Super, designed to provide an accessible physical interface to the algorithm for musicians to use in their artistic production. Design files, source code and internal components are released under an open source Apache License 2.0, enabling hobbyists and musicians to freely build and use the instrument. 
At the core of the NSynth Super there is a Raspberry Pi, extended with a custom printed circuit board to accommodate the interface elements. == Influence == Despite not being publicly available as a commercial product, NSynth Super has been used by notable artists, including Grimes and YACHT. Grimes reported using the instrument in her 2020 studio album Miss Anthropocene. YACHT announced an extensive use of NSynth Super in their album Chain Tripping. Claire L. Evans compared the potential influence of the instrument to the Roland TR-808. The NSynth Super design was honored with a D&AD Yellow Pencil award in 2018.

    Read more →
  • Minimum resolvable contrast

    Minimum resolvable contrast

    Minimum resolvable contrast (MRC) is a subjective measure of a visible-spectrum sensor's or camera's sensitivity and ability to resolve data. A snapshot image of a series of three-bar targets of selected spatial frequencies and various contrast coatings, captured by the unit under test (UUT), is used to determine the MRC of the UUT, i.e., the visible-spectrum camera or sensor. A trained observer selects the smallest target resolvable at each contrast level. Typically, specialized computer software records the observer's input and provides a graph of contrast vs. spatial frequency at a given luminance level. A first-order polynomial is fitted to the data and an MRC curve of spatial frequency versus contrast is generated.
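
The final fitting step can be sketched as an ordinary least-squares fit of a straight line. This is only an illustration under stated assumptions: the helper name `fit_line`, the choice of spatial frequency as the independent variable, and the sample readings are all hypothetical, and real MRC software may fit the data differently.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (first-order polynomial)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observer readings: spatial frequency of the smallest resolvable
# target (arbitrary units) and the contrast level at which it was resolved.
spatial_frequency = [1.0, 2.0, 4.0, 8.0]
contrast = [0.02, 0.05, 0.11, 0.23]
slope, intercept = fit_line(spatial_frequency, contrast)
print(slope, intercept)
```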

    Read more →
  • Vapnik–Chervonenkis theory

    Vapnik–Chervonenkis theory

    Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view. == Introduction == VC theory covers at least four parts (as explained in The Nature of Statistical Learning Theory): (1) Theory of consistency of learning processes: What are (necessary and sufficient) conditions for consistency of a learning process based on the empirical risk minimization principle? (2) Nonasymptotic theory of the rate of convergence of learning processes: How fast is the rate of convergence of the learning process? (3) Theory of controlling the generalization ability of learning processes: How can one control the rate of convergence (the generalization ability) of the learning process? (4) Theory of constructing learning machines: How can one construct algorithms that can control the generalization ability? VC theory is a major subbranch of statistical learning theory. One of its main applications in statistical learning theory is to provide generalization conditions for learning algorithms. From this point of view, VC theory is related to stability, which is an alternative approach for characterizing generalization. In addition, VC theory and VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably these are the most important applications of VC theory, and they are employed in proving generalization. Several techniques will be introduced that are widely used in empirical process and VC theory. The discussion is mainly based on the book Weak Convergence and Empirical Processes: With Applications to Statistics. == Overview of VC theory in empirical processes == === Background on empirical processes === Let $(\mathcal{X},\mathcal{A})$ be a measurable space.
For any measure $Q$ on $(\mathcal{X}, \mathcal{A})$ and any measurable function $f : \mathcal{X} \to \mathbf{R}$, define

$$Qf = \int f \, dQ.$$

Measurability issues are ignored here. Let $\mathcal{F}$ be a class of measurable functions $f : \mathcal{X} \to \mathbf{R}$ and define

$$\|Q\|_{\mathcal{F}} = \sup\{ |Qf| \ : \ f \in \mathcal{F} \}.$$

Let $X_1, \ldots, X_n$ be independent, identically distributed random elements of $(\mathcal{X}, \mathcal{A})$. Then define the empirical measure

$$\mathbb{P}_n = n^{-1} \sum_{i=1}^{n} \delta_{X_i},$$

where $\delta_x$ denotes the Dirac measure at $x$. The empirical measure induces a map $\mathcal{F} \to \mathbf{R}$ given by

$$f \mapsto \mathbb{P}_n f = \frac{1}{n}\left(f(X_1) + \cdots + f(X_n)\right).$$

Now suppose $P$ is the underlying true distribution of the data, which is unknown. Empirical process theory aims at identifying classes $\mathcal{F}$ for which statements such as the following hold.

Uniform law of large numbers:

$$\|\mathbb{P}_n - P\|_{\mathcal{F}} \underset{n}{\to} 0.$$

That is, as $n \to \infty$,

$$\left| \frac{1}{n}\left(f(X_1) + \cdots + f(X_n)\right) - \int f \, dP \right| \to 0$$

uniformly for all $f \in \mathcal{F}$.
Uniform central limit theorem:

$$\mathbb{G}_n = \sqrt{n}\,(\mathbb{P}_n - P) \rightsquigarrow \mathbb{G}, \quad \text{in } \ell^{\infty}(\mathcal{F}).$$

In the former case $\mathcal{F}$ is called a Glivenko–Cantelli class, and in the latter case (under the assumption $\forall x, \ \sup_{f \in \mathcal{F}} |f(x) - Pf| < \infty$) the class $\mathcal{F}$ is called Donsker or P-Donsker. A Donsker class is Glivenko–Cantelli in probability by an application of Slutsky's theorem.

These statements are true for a single $f$ by standard LLN and CLT arguments under regularity conditions. The difficulty in empirical process theory arises because joint statements are being made for all $f \in \mathcal{F}$. Intuitively, then, the set $\mathcal{F}$ cannot be too large, and it turns out that the geometry of $\mathcal{F}$ plays a very important role.

One way of measuring the size of the function set $\mathcal{F}$ is to use the so-called covering numbers. The covering number $N(\varepsilon, \mathcal{F}, \|\cdot\|)$ is the minimal number of balls $\{g : \|g - f\| < \varepsilon\}$ needed to cover the set $\mathcal{F}$ (here it is assumed that there is an underlying norm on $\mathcal{F}$). The entropy is the logarithm of the covering number.

Two sufficient conditions are provided below, under which it can be proved that the set $\mathcal{F}$ is Glivenko–Cantelli or Donsker. A class $\mathcal{F}$ is P-Glivenko–Cantelli if it is P-measurable with envelope $F$ such that $P^{\ast}F < \infty$ and satisfies

$$\forall \varepsilon > 0 \quad \sup\nolimits_{Q} N\left(\varepsilon \|F\|_{Q}, \mathcal{F}, L_1(Q)\right) < \infty.$$

The next condition is a version of Dudley's theorem. If $\mathcal{F}$ is a class of functions such that

$$\int_0^{\infty} \sup\nolimits_{Q} \sqrt{\log N\left(\varepsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q)\right)} \, d\varepsilon < \infty,$$

then $\mathcal{F}$ is P-Donsker for every probability measure $P$ such that $P^{\ast}F^{2} < \infty$. In the last integral, the notation means

$$\|f\|_{Q,2} = \left( \int |f|^{2} \, dQ \right)^{\frac{1}{2}}.$$

=== Symmetrization ===

The majority of the arguments about how to bound the empirical process rely on symmetrization, maximal and concentration inequalities, and chaining. Symmetrization is usually the first step of the proofs, and since it is used in many machine learning proofs on bounding empirical loss functions (including the proof of the VC inequality, which is discussed in the next section), it is presented here.

Consider the empirical process

$$f \mapsto (\mathbb{P}_n - P) f = \frac{1}{n} \sum_{i=1}^{n} \left(f(X_i) - Pf\right).$$

It turns out that there is a connection between the empirical process and the following symmetrized process, in which the $\varepsilon_i$ are independent random signs taking the values $\pm 1$ with probability $1/2$ each, independently of the data:

$$f \mapsto \mathbb{P}_n^{0} f = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(X_i).$$

The symmetrized process is a Rademacher process, conditionally on the data $X_i$. Therefore, it is a sub-Gaussian process by Hoeffding's inequality.

Lemma (Symmetrization).
For every nondecreasing, convex $\Phi : \mathbf{R} \to \mathbf{R}$ and class of measurable functions $\mathcal{F}$,

$$\mathbb{E}\,\Phi\left(\|\mathbb{P}_n - P\|_{\mathcal{F}}\right) \leq \mathbb{E}\,\Phi\left(2 \left\|\mathbb{P}_n^{0}\right\|_{\mathcal{F}}\right).$$

The proof of the symmetrization lemma relies on introducing independent copies of the original variables $X_i$ (sometimes referred to as a ghost sample) and replacing the inner expectation of the left-hand side by these copies. After an application of Jensen's inequality, different signs can be introduced (hence the name symmetrization) without changing the expectation. The proof can be found below because of its instructive nature. The same proof method can be used to prove the Glivenko–Cantelli theorem.

A typical way of proving empirical CLTs first uses symmetrization to pass from the empirical process to $\mathbb{P}_n^{0}$ and then argues conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties.

=== VC Connection ===

It turns out that there is a fascinating connection between certain combinatorial properties of the set $\mathcal{F}$ and the entropy numbers. Uniform covering numbers can be controlled by the notion of Vapnik–Chervonenkis classes of sets, or VC sets for short. Consider a collection $\mathcal{C}$ of subsets of the sample space $\mathcal{X}$ …
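To make the uniform law of large numbers and the symmetrization lemma more concrete, the following is a minimal numerical sketch, not part of the original article. It assumes a specific class, the indicators $f_t(x) = 1\{x \le t\}$, and a specific distribution, $P = \mathrm{Uniform}(0, 1)$, so that $Pf_t = t$ and both suprema can be computed exactly from the sorted sample. The function names are illustrative, and the lemma is only checked for $\Phi$ equal to the identity, by Monte Carlo.

```python
import random


def sup_empirical(xs):
    """||P_n - P||_F for F = {1{x <= t}} and P = Uniform(0, 1).

    The empirical CDF only jumps at sample points, so the supremum is
    attained at a sample point or just before one.
    """
    xs = sorted(xs)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def sup_symmetrized(xs, signs):
    """||P_n^0||_F = sup_t |(1/n) * sum_i eps_i * 1{X_i <= t}|.

    After sorting by X_i, the supremum over t is the largest absolute
    partial sum of the signs (the empty sum, 0, is included).
    """
    n = len(xs)
    running, best = 0, 0
    for _, eps in sorted(zip(xs, signs)):
        running += eps
        best = max(best, abs(running))
    return best / n


def estimate_both_sides(n=50, reps=2000, seed=0):
    """Monte Carlo estimates of E||P_n - P||_F and 2 * E||P_n^0||_F."""
    rng = random.Random(seed)
    lhs = rhs = 0.0
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        signs = [rng.choice((-1, 1)) for _ in range(n)]  # Rademacher signs
        lhs += sup_empirical(xs)
        rhs += 2 * sup_symmetrized(xs, signs)
    return lhs / reps, rhs / reps
```

Calling `estimate_both_sides()` returns the two averages, and the first should not exceed the second, in line with the lemma. Calling `sup_empirical` on samples of increasing size shows the uniform deviation shrinking, which is the Glivenko–Cantelli behaviour for this particular class.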

    Read more →
  • Aleph (ILP)

    Aleph (ILP)

    Aleph (A Learning Engine for Proposing Hypotheses) is an inductive logic programming system introduced by Ashwin Srinivasan in 2001. As of 2022 it is still one of the most widely used inductive logic programming systems. It is based on the earlier system Progol.

    == Learning task ==

    The input to Aleph is background knowledge, specified as a logic program, a language bias in the form of mode declarations, as well as positive and negative examples specified as ground facts. As output it returns a logic program which, together with the background knowledge, entails all of the positive examples and none of the negative examples.

    == Basic algorithm ==

    Starting with an empty hypothesis, Aleph proceeds as follows:

    1. It chooses a positive example to generalise; if none are left, it aborts and outputs the current hypothesis.
    2. It constructs the bottom clause, that is, the most specific clause that is allowed by the mode declarations and covers the example.
    3. It searches for a generalisation of the bottom clause that scores best on the chosen metric.
    4. It adds the new clause to the hypothesis program and removes all examples that are covered by the new clause.

    == Search algorithm ==

    Aleph searches for clauses in a top-down manner, using the bottom clause constructed in the preceding step to bound the search from below. It searches the refinement graph in a breadth-first manner, with tunable parameters to bound the maximal clause size and proof depth. It scores each clause using one of 13 different evaluation metrics, as chosen in advance by the user.
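The covering loop described above can be illustrated with a small propositional toy. This is a sketch under strong assumptions and is not Aleph itself. Examples are attribute–value dictionaries rather than ground facts, there is no background knowledge and there are no mode declarations, the "bottom clause" is simply the set of all attribute tests of the chosen example, and the metric (positives covered minus negatives covered, restricted to clauses covering no negatives) is a hypothetical stand-in for Aleph's evaluation functions. All function names are illustrative.

```python
from itertools import combinations


def covers(clause, example):
    """A clause is a set of (attribute, value) tests; it covers an example
    when every test holds for that example."""
    return all(example.get(attr) == value for attr, value in clause)


def score(clause, pos, neg):
    """Toy metric (an assumption): positives covered minus negatives covered."""
    return (sum(covers(clause, e) for e in pos)
            - sum(covers(clause, e) for e in neg))


def best_generalisation(bottom, pos, neg, max_size):
    """Search subsets of the bottom clause level by level, most general
    (smallest) first, keeping the best-scoring clause that covers no negative.
    Falls back to the bottom clause if no such clause is found."""
    best, best_score = None, float("-inf")
    for size in range(min(max_size, len(bottom)) + 1):
        for literals in combinations(sorted(bottom), size):
            clause = frozenset(literals)
            if any(covers(clause, e) for e in neg):
                continue
            s = score(clause, pos, neg)
            if s > best_score:
                best, best_score = clause, s
    return bottom if best is None else best


def learn(pos, neg, max_size=3):
    """Greedy covering loop: pick a seed, build its bottom clause,
    generalise it, add the clause, and drop the positives it covers."""
    hypothesis, remaining = [], list(pos)
    while remaining:
        seed = remaining[0]
        bottom = frozenset(seed.items())  # most specific clause for the seed
        clause = best_generalisation(bottom, remaining, neg, max_size)
        hypothesis.append(clause)
        remaining = [e for e in remaining if not covers(clause, e)]
    return hypothesis
```

Every candidate is a subset of the seed's own tests, so each learned clause covers its seed and the loop always terminates. The `max_size` parameter plays the role of the bound on clause size mentioned in the search description.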

    Read more →