Apache Giraph

Apache Giraph

Apache Giraph is an Apache project to perform graph processing on big data. Giraph utilizes Apache Hadoop's MapReduce implementation to process graphs. Facebook used Giraph with some performance improvements to analyze one trillion edges using 200 machines in 4 minutes. Giraph is based on a paper published by Google about its own graph processing system called Pregel. It can be compared to other Big Graph processing libraries such as Cassovary. As of September 2023, it is no longer actively developed.

Picture Prowler

Picture Prowler was an early piece of photo management software developed around and meant to show off Xing Technology's JPEG image decompression library during the early 1990s. Little known today, it featured thumbnail based picture management, printing, etc. The primary developer was Ray Bunnage from compression / decompression libraries developed by Howard Gordon and Chris Eddy.

Ordinal regression

In statistics, ordinal regression, also called ordinal classification, is a type of regression analysis used for predicting an ordinal variable, i.e. a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. It can be considered an intermediate problem between regression and classification. Examples of ordinal regression are ordered logit and ordered probit. Ordinal regression turns up often in the social sciences, for example in the modeling of human levels of preference (on a scale from, say, 1–5 for "very poor" through "excellent"), as well as in information retrieval. In machine learning, ordinal regression may also be called ranking learning. == Linear models for ordinal regression == Ordinal regression can be performed using a generalized linear model (GLM) that fits both a coefficient vector and a set of thresholds to a dataset. Suppose one has a set of observations, represented by length-p vectors x1 through xn, with associated responses y1 through yn, where each yi is an ordinal variable on a scale 1, ..., K. For simplicity, and without loss of generality, we assume y is a non-decreasing vector, that is, yi ≤ {\displaystyle \leq } yi+1. To this data, one fits a length-p coefficient vector w and a set of thresholds θ1, ..., θK−1 with the property that θ1 < θ2 < ... < θK−1. This set of thresholds divides the real number line into K disjoint segments, corresponding to the K response levels. The model can now be formulated as Pr ( y ≤ i ∣ x ) = σ ( θ i − w ⋅ x ) {\displaystyle \Pr(y\leq i\mid \mathbf {x} )=\sigma (\theta _{i}-\mathbf {w} \cdot \mathbf {x} )} or, the cumulative probability of the response y being at most i is given by a function σ (the inverse link function) applied to a linear function of x. Several choices exist for σ; the logistic function σ ( θ i − w ⋅ x ) = 1 1 + e − ( θ i − w ⋅ x ) {\displaystyle \sigma (\theta _{i}-\mathbf {w} \cdot \mathbf {x} )={\frac {1}{1+e^{-(\theta _{i}-\mathbf {w} \cdot \mathbf {x} )}}}} gives the ordered logit model, while using the CDF of the standard normal distribution gives the ordered probit model. A third option is to use an exponential function σ ( θ i − w ⋅ x ) = 1 − exp ⁡ ( − exp ⁡ ( θ i − w ⋅ x ) ) {\displaystyle \sigma (\theta _{i}-\mathbf {w} \cdot \mathbf {x} )=1-\exp(-\exp(\theta _{i}-\mathbf {w} \cdot \mathbf {x} ))} which gives the proportional hazards model. === Latent variable model === The probit version of the above model can be justified by assuming the existence of a real-valued latent variable (unobserved quantity) y, determined by y ∗ = w ⋅ x + ε {\displaystyle y^{}=\mathbf {w} \cdot \mathbf {x} +\varepsilon } where ε is normally distributed with zero mean and unit variance, conditioned on x. The response variable y results from an "incomplete measurement" of y, where one only determines the interval into which y falls: y = { 1 if y ∗ ≤ θ 1 , 2 if θ 1 < y ∗ ≤ θ 2 , 3 if θ 2 < y ∗ ≤ θ 3 ⋮ K if θ K − 1 < y ∗ . {\displaystyle y={\begin{cases}1&{\text{if}}~~y^{}\leq \theta _{1},\\2&{\text{if}}~~\theta _{1}

Spiking neural network

Spiking neural networks (SNNs) are artificial neural networks (ANN) that mimic natural neural networks. These models leverage timing of discrete spikes as the main information carrier. In addition to neuronal and synaptic state, SNNs incorporate the concept of time into their operating model. The idea is that neurons in the SNN do not transmit information at each propagation cycle (as it happens with typical multi-layer perceptron networks), but rather transmit information only when a membrane potential—an intrinsic quality of the neuron related to its membrane electrical charge—reaches a specific value, called the threshold. When the membrane potential reaches the threshold, the neuron fires, and generates a signal that travels to other neurons which, in turn, increase or decrease their potentials in response to this signal. A neuron model that fires at the moment of threshold crossing is also called a spiking neuron model. While spike rates can be considered the analogue of the variable output of a traditional ANN, neurobiology research indicated that high speed processing cannot be performed solely through a rate-based scheme. For example humans can perform an image recognition task requiring no more than 10ms of processing time per neuron through the successive layers (going from the retina to the temporal lobe). This time window is too short for rate-based encoding. The precise spike timings in a small set of spiking neurons also has a higher information coding capacity compared with a rate-based approach. The most prominent spiking neuron model is the leaky integrate-and-fire model. In that model, the momentary activation level (modeled as a differential equation) is normally considered to be the neuron's state, with incoming spikes pushing this value higher or lower, until the state eventually either decays or—if the firing threshold is reached—the neuron fires. After firing, the state variable is reset to a lower value. Various decoding methods exist for interpreting the outgoing spike train as a real-value number, relying on either the frequency of spikes (rate-code), the time-to-first-spike after stimulation, or the interval between spikes. == History == Many multi-layer artificial neural networks are fully connected, receiving input from every neuron in the previous layer and signalling every neuron in the subsequent layer. Although these networks have achieved breakthroughs, they do not match biological networks and do not mimic neurons. The biology-inspired Hodgkin–Huxley model of a spiking neuron was proposed in 1952. This model described how action potentials are initiated and propagated. Communication between neurons, which requires the exchange of chemical neurotransmitters in the synaptic gap, is described in models such as the integrate-and-fire model, FitzHugh–Nagumo model (1961–1962), and Hindmarsh–Rose model (1984). The leaky integrate-and-fire model (or a derivative) is commonly used as it is easier to compute than Hodgkin–Huxley. While the notion of an artificial spiking neural network became popular only in the twenty-first century, studies between 1980 and 1995 supported the concept. The first models of this type of ANN appeared to simulate non-algorithmic intelligent information processing systems. However, the notion of the spiking neural network as a mathematical model was first worked on in the early 1970s. As of 2019 SNNs lagged behind ANNs in accuracy, but the gap is decreasing, and has vanished on some tasks. == Underpinnings == Information in the brain is represented as action potentials (neuron spikes), which may group into spike trains or coordinated waves. A fundamental question of neuroscience is to determine whether neurons communicate by a rate or temporal code. Temporal coding implies that a single spiking neuron can replace hundreds of hidden units on a conventional neural net. SNNs define a neuron's current state as its potential (possibly modeled as a differential equation). An input pulse causes the potential to rise and then gradually decline. Encoding schemes can interpret these pulse sequences as a number, considering pulse frequency and pulse interval. Using the precise time of pulse occurrence, a neural network can consider more information and offer better computing properties. SNNs compute in the continuous domain. Such neurons test for activation only when their potentials reach a certain value. When a neuron is activated, it produces a signal that is passed to connected neurons, accordingly raising or lowering their potentials. The SNN approach produces a continuous output instead of the binary output of traditional ANNs. Pulse trains are not easily interpretable, hence the need for encoding schemes. However, a pulse train representation may be more suited for processing spatiotemporal data (or real-world sensory data classification). SNNs connect neurons only to nearby neurons so that they process input blocks separately (similar to CNN using filters). They consider time by encoding information as pulse trains so as not to lose information. This avoids the complexity of a recurrent neural network (RNN). Impulse neurons are more powerful computational units than traditional artificial neurons. SNNs are theoretically more powerful than so called "second-generation networks" defined as ANNs "based on computational units that apply activation function with a continuous set of possible output values to a weighted sum (or polynomial) of the inputs"; however, SNN training issues and hardware requirements limit their use. Although unsupervised biologically inspired learning methods are available such as Hebbian learning and STDP, no effective supervised training method is suitable for SNNs that can provide better performance than second-generation networks. Spike-based activation of SNNs is not differentiable, thus gradient descent-based backpropagation (BP) is not available. SNNs have much larger computational costs for simulating realistic neural models than traditional ANNs. Pulse-coupled neural networks (PCNN) are often confused with SNNs. A PCNN can be seen as a kind of SNN. Researchers are actively working on various topics. The first concerns differentiability. The expressions for both the forward- and backward-learning methods contain the derivative of the neural activation function which is not differentiable because a neuron's output is either 1 when it spikes, and 0 otherwise. This all-or-nothing behavior disrupts gradients and makes these neurons unsuitable for gradient-based optimization. Approaches to resolving it include: resorting to entirely biologically inspired local learning rules for the hidden units translating conventionally trained "rate-based" NNs to SNNs smoothing the network model to be continuously differentiable defining an SG (Surrogate Gradient) as a continuous relaxation of the real gradients The second concerns the optimization algorithm. Standard BP can be expensive in terms of computation, memory, and communication and may be poorly suited to the hardware that implements it (e.g., a computer, brain, or neuromorphic device). Incorporating additional neuron dynamics such as Spike Frequency Adaptation (SFA) is a notable advance, enhancing efficiency and computational power. These neurons sit between biological complexity and computational complexity. Originating from biological insights, SFA offers significant computational benefits by reducing power usage, especially in cases of repetitive or intense stimuli. This adaptation improves signal/noise clarity and introduces an elementary short-term memory at the neuron level, which in turn, improves accuracy and efficiency. This was mostly achieved using compartmental neuron models. The simpler versions are of neuron models with adaptive thresholds, are an indirect way of achieving SFA. It equips SNNs with improved learning capabilities, even with constrained synaptic plasticity, and elevates computational efficiency. This feature lessens the demand on network layers by decreasing the need for spike processing, thus lowering computational load and memory access time—essential aspects of neural computation. Moreover, SNNs utilizing neurons capable of SFA achieve levels of accuracy that rival those of conventional ANNs, while also requiring fewer neurons for comparable tasks. This efficiency streamlines the computational workflow and conserves space and energy, while maintaining technical integrity. High-performance deep spiking neural networks can operate with 0.3 spikes per neuron. == Applications == SNNs can in principle be applied to the same applications as traditional ANNs. In addition, SNNs can model the central nervous system of biological organisms, such as an insect seeking food without prior knowledge of the environment. Due to their relative realism, they can be used to study biological neural circuits. Starting with a hypothesis about the topology of a biological neuronal circuit and its functi

Inductive logic programming

Inductive logic programming (ILP) is a subfield of symbolic artificial intelligence which uses logic programming as a uniform representation for examples, background knowledge and hypotheses. The term "inductive" here refers to philosophical (i.e. suggesting a theory to explain observed facts) rather than mathematical (i.e. proving a property for all members of a well-ordered set) induction. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesised logic program which entails all the positive and none of the negative examples. Schema: positive examples + negative examples + background knowledge ⇒ hypothesis. Bioinformatics and drug design have been highlighted as a principal application area of inductive logic programming techniques. == History == Building on earlier work on Inductive inference, Gordon Plotkin was the first to formalise induction in a clausal setting around 1970, adopting an approach of generalising from examples. In 1981, Ehud Shapiro introduced several ideas that would shape the field in his new approach of model inference, an algorithm employing refinement and backtracing to search for a complete axiomatisation of given examples. His first implementation was the Model Inference System in 1981: a Prolog program that inductively inferred Horn clause logic programs from positive and negative examples. The term Inductive Logic Programming was first introduced in a paper by Stephen Muggleton in 1990, defined as the intersection of machine learning and logic programming. Muggleton and Wray Buntine introduced predicate invention and inverse resolution in 1988. Several inductive logic programming systems that proved influential appeared in the early 1990s. FOIL, introduced by Ross Quinlan in 1990 was based on upgrading propositional learning algorithms AQ and ID3. Golem, introduced by Muggleton and Feng in 1990, went back to a restricted form of Plotkin's least generalisation algorithm. The Progol system, introduced by Muggleton in 1995, first implemented inverse entailment, and inspired many later systems. Aleph, a descendant of Progol introduced by Ashwin Srinivasan in 2001, is still one of the most widely used systems as of 2022. At around the same time, the first practical applications emerged, particularly in bioinformatics, where by 2000 inductive logic programming had been successfully applied to drug design, carcinogenicity and mutagenicity prediction, and elucidation of the structure and function of proteins. Unlike the focus on automatic programming inherent in the early work, these fields used inductive logic programming techniques from a viewpoint of relational data mining. The success of those initial applications and the lack of progress in recovering larger traditional logic programs shaped the focus of the field. Recently, classical tasks from automated programming have moved back into focus, as the introduction of meta-interpretative learning makes predicate invention and learning recursive programs more feasible. This technique was pioneered with the Metagol system introduced by Muggleton, Dianhuan Lin, Niels Pahlavi and Alireza Tamaddoni-Nezhad in 2014. This allows ILP systems to work with fewer examples, and brought successes in learning string transformation programs, answer set grammars and general algorithms. == Setting == Inductive logic programming has adopted several different learning settings, the most common of which are learning from entailment and learning from interpretations. In both cases, the input is provided in the form of background knowledge B, a logical theory (commonly in the form of clauses used in logic programming), as well as positive and negative examples, denoted E + {\textstyle E^{+}} and E − {\textstyle E^{-}} respectively. The output is given as a hypothesis H, itself a logical theory that typically consists of one or more clauses. The two settings differ in the format of examples presented. === Learning from entailment === As of 2022, learning from entailment is by far the most popular setting for inductive logic programming. In this setting, the positive and negative examples are given as finite sets E + {\textstyle E^{+}} and E − {\textstyle E^{-}} of positive and negated ground literals, respectively. A correct hypothesis H is a set of clauses satisfying the following requirements, where the turnstile symbol ⊨ {\displaystyle \models } stands for logical entailment: Completeness: B ∪ H ⊨ E + Consistency: B ∪ H ∪ E − ⊭ false {\displaystyle {\begin{array}{llll}{\text{Completeness:}}&B\cup H&\models &E^{+}\\{\text{Consistency: }}&B\cup H\cup E^{-}&\not \models &{\textit {false}}\end{array}}} Completeness requires any generated hypothesis H to explain all positive examples E + {\textstyle E^{+}} , and consistency forbids generation of any hypothesis H that is inconsistent with the negative examples E − {\textstyle E^{-}} , both given the background knowledge B. In Muggleton's setting of concept learning, "completeness" is referred to as "sufficiency", and "consistency" as "strong consistency". Two further conditions are added: "Necessity", which postulates that B does not entail E + {\textstyle E^{+}} , does not impose a restriction on H, but forbids any generation of a hypothesis as long as the positive facts are explainable without it. "Weak consistency", which states that no contradiction can be derived from B ∧ H {\textstyle B\land H} , forbids generation of any hypothesis H that contradicts the background knowledge B. Weak consistency is implied by strong consistency; if no negative examples are given, both requirements coincide. Weak consistency is particularly important in the case of noisy data, where completeness and strong consistency cannot be guaranteed. === Learning from interpretations === In learning from interpretations, the positive and negative examples are given as a set of complete or partial Herbrand structures, each of which are themselves a finite set of ground literals. Such a structure e is said to be a model of the set of clauses B ∪ H {\textstyle B\cup H} if for any substitution θ {\textstyle \theta } and any clause h e a d ← b o d y {\textstyle \mathrm {head} \leftarrow \mathrm {body} } in B ∪ H {\textstyle B\cup H} such that b o d y θ ⊆ e {\textstyle \mathrm {body} \theta \subseteq e} , h e a d θ ⊆ e {\displaystyle \mathrm {head} \theta \subseteq e} also holds. The goal is then to output a hypothesis that is complete, meaning every positive example is a model of B ∪ H {\textstyle B\cup H} , and consistent, meaning that no negative example is a model of B ∪ H {\textstyle B\cup H} . == Approaches to ILP == An inductive logic programming system is a program that takes as an input logic theories B , E + , E − {\displaystyle B,E^{+},E^{-}} and outputs a correct hypothesis H with respect to theories B , E + , E − {\displaystyle B,E^{+},E^{-}} . A system is complete if and only if for any input logic theories B , E + , E − {\displaystyle B,E^{+},E^{-}} any correct hypothesis H with respect to these input theories can be found with its hypothesis search procedure. Inductive logic programming systems can be roughly divided into two classes, search-based and meta-interpretative systems. Search-based systems exploit that the space of possible clauses forms a complete lattice under the subsumption relation, where one clause C 1 {\textstyle C_{1}} subsumes another clause C 2 {\textstyle C_{2}} if there is a substitution θ {\textstyle \theta } such that C 1 θ {\textstyle C_{1}\theta } , the result of applying θ {\textstyle \theta } to C 1 {\textstyle C_{1}} , is a subset of C 2 {\textstyle C_{2}} . This lattice can be traversed either bottom-up or top-down. === Bottom-up search === Bottom-up methods to search the subsumption lattice have been investigated since Plotkin's first work on formalising induction in clausal logic in 1970. Techniques used include least general generalisation, based on anti-unification, and inverse resolution, based on inverting the resolution inference rule. ==== Least general generalisation ==== A least general generalisation algorithm takes as input two clauses C 1 {\textstyle C_{1}} and C 2 {\textstyle C_{2}} and outputs the least general generalisation of C 1 {\textstyle C_{1}} and C 2 {\textstyle C_{2}} , that is, a clause C {\textstyle C} that subsumes C 1 {\textstyle C_{1}} and C 2 {\textstyle C_{2}} , and that is subsumed by every other clause that subsumes C 1 {\textstyle C_{1}} and C 2 {\textstyle C_{2}} . The least general generalisation can be computed by first computing all selections from C 1 {\textstyle C_{1}} and C 2 {\textstyle C_{2}} , which are pairs of literals ( L , M ) ∈ ( C 1 × C 2 ) {\displaystyle (L,M)\in (C_{1}\times C_{2})} sharing the same predicate symbol and negated/unnegated status. Then, the least general generalisation is obtained as the disjunction of the least general generalisations of the indi

Weight initialization

In deep learning, weight initialization or parameter initialization describes the initial step in creating a neural network. A neural network contains trainable parameters that are modified during training: weight initialization is the pre-training step of assigning initial values to these parameters. The choice of weight initialization method affects the speed of convergence, the scale of neural activation within the network, the scale of gradient signals during backpropagation, and the quality of the final model. Proper initialization is necessary for avoiding issues such as vanishing and exploding gradients and activation function saturation. Note that even though this article is titled "weight initialization", both weights and biases are used in a neural network as trainable parameters, so this article describes how both of these are initialized. Similarly, trainable parameters in convolutional neural networks (CNNs) are called kernels and biases, and this article also describes these. == Constant initialization == We discuss the main methods of initialization in the context of a multilayer perceptron (MLP). Specific strategies for initializing other network architectures are discussed in later sections. For an MLP, there are only two kinds of trainable parameters, called weights and biases. Each layer l {\displaystyle l} contains a weight matrix W ( l ) ∈ R n l − 1 × n l {\displaystyle W^{(l)}\in \mathbb {R} ^{n_{l-1}\times n_{l}}} and a bias vector b ( l ) ∈ R n l {\displaystyle b^{(l)}\in \mathbb {R} ^{n_{l}}} , where n l {\displaystyle n_{l}} is the number of neurons in that layer. A weight initialization method is an algorithm for setting the initial values for W ( l ) , b ( l ) {\displaystyle W^{(l)},b^{(l)}} for each layer l {\displaystyle l} . The simplest form is zero initialization: W ( l ) = 0 , b ( l ) = 0 {\displaystyle W^{(l)}=0,b^{(l)}=0} Zero initialization is usually used for initializing biases, but it is not used for initializing weights, as it leads to symmetry in the network, causing all neurons to learn the same features. In this page, we assume b = 0 {\displaystyle b=0} unless otherwise stated. Recurrent neural networks typically use activation functions with bounded range, such as sigmoid and tanh, since unbounded activation may cause exploding values. (Le, Jaitly, Hinton, 2015) suggested initializing weights in the recurrent parts of the network to identity and zero bias, similar to the idea of residual connections and LSTM with no forget gate. In most cases, the biases are initialized to zero, though some situations can use a nonzero initialization. For example, in multiplicative units, such as the forget gate of LSTM, the bias can be initialized to 1 to allow good gradient signal through the gate. For neurons with ReLU activation, one can initialize the bias to a small positive value like 0.1, so that the gradient is likely nonzero at initialization, avoiding the dying ReLU problem. == Random initialization == Random initialization means sampling the weights from a normal distribution or a uniform distribution, usually independently. === LeCun initialization === LeCun initialization, popularized in (LeCun et al., 1998), is designed to preserve the variance of neural activations during the forward pass. It samples each entry in W ( l ) {\displaystyle W^{(l)}} independently from a distribution with mean 0 and variance 1 / n l − 1 {\displaystyle 1/n_{l-1}} . For example, if the distribution is a continuous uniform distribution, then the distribution is U ( ± 3 / n l − 1 ) {\displaystyle {\mathcal {U}}(\pm {\sqrt {3/n_{l-1}}})} . === Glorot initialization === Glorot initialization (or Xavier initialization) was proposed by Xavier Glorot and Yoshua Bengio. It was designed as a compromise between two goals: to preserve activation variance during the forward pass and to preserve gradient variance during the backward pass. For uniform initialization, it samples each entry in W ( l ) {\displaystyle W^{(l)}} independently and identically from U ( ± 6 / ( n l + 1 + n l − 1 ) ) {\displaystyle {\mathcal {U}}(\pm {\sqrt {6/(n_{l+1}+n_{l-1})}})} . In the context, n l − 1 {\displaystyle n_{l-1}} is also called the "fan-in", and n l + 1 {\displaystyle n_{l+1}} the "fan-out". When the fan-in and fan-out are equal, then Glorot initialization is the same as LeCun initialization. === He initialization === As Glorot initialization performs poorly for ReLU activation, He initialization (or Kaiming initialization) was proposed by Kaiming He et al. for networks with ReLU activation. It samples each entry in W ( l ) {\displaystyle W^{(l)}} from N ( 0 , 2 / n l − 1 ) {\displaystyle {\mathcal {N}}(0,2/n_{l-1})} . === Orthogonal initialization === (Saxe et al. 2013) proposed orthogonal initialization: initializing weight matrices as uniformly random (according to the Haar measure) semi-orthogonal matrices, multiplied by a factor that depends on the activation function of the layer. It was designed so that if one initializes a deep linear network this way, then its training time until convergence is independent of depth. Sampling a uniformly random semi-orthogonal matrix can be done by initializing X {\displaystyle X} by IID sampling its entries from a standard normal distribution, then calculate ( X X ⊤ ) − 1 / 2 X {\displaystyle \left(XX^{\top }\right)^{-1/2}X} or its transpose, depending on whether X {\displaystyle X} is tall or wide. For CNN kernels with odd widths and heights, orthogonal initialization is done this way: initialize the central point by a semi-orthogonal matrix, and fill the other entries with zero. As an illustration, a kernel K {\displaystyle K} of shape 3 × 3 × c × c ′ {\displaystyle 3\times 3\times c\times c'} is initialized by filling K [ 2 , 2 , : , : ] {\displaystyle K[2,2,:,:]} with the entries of a random semi-orthogonal matrix of shape c × c ′ {\displaystyle c\times c'} , and the other entries with zero. (Balduzzi et al., 2017) used it with stride 1 and zero-padding. This is sometimes called the Orthogonal Delta initialization. Related to this approach, unitary initialization proposes to parameterize the weight matrices to be unitary matrices, with the result that at initialization they are random unitary matrices (and throughout training, they remain unitary). This is found to improve long-sequence modelling in LSTM. Orthogonal initialization has been generalized to layer-sequential unit-variance (LSUV) initialization. It is a data-dependent initialization method, and can be used in convolutional neural networks. It first initializes weights of each convolution or fully connected layer with orthonormal matrices. Then, proceeding from the first to the last layer, it runs a forward pass on a random minibatch, and divides the layer's weights by the standard deviation of its output, so that its output has variance approximately 1. === Fixup initialization === In 2015, the introduction of residual connections allowed very deep neural networks to be trained, much deeper than the ~20 layers of the previous state of the art (such as the VGG-19). Residual connections gave rise to their own weight initialization problems and strategies. These are sometimes called "normalization-free" methods, since using residual connection could stabilize the training of a deep neural network so much that normalizations become unnecessary. Fixup initialization is designed specifically for networks with residual connections and without batch normalization, as follows: Initialize the classification layer and the last layer of each residual branch to 0. Initialize every other layer using a standard method (such as He initialization), and scale only the weight layers inside residual branches by L − 1 2 m − 2 {\displaystyle L^{-{\frac {1}{2m-2}}}} . Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer. Similarly, T-Fixup initialization is designed for Transformers without layer normalization. === Others === Instead of initializing all weights with random values on the order of O ( 1 / n ) {\displaystyle O(1/{\sqrt {n}})} , sparse initialization initialized only a small subset of the weights with larger random values, and the other weights zero, so that the total variance is still on the order of O ( 1 ) {\displaystyle O(1)} . Random walk initialization was designed for MLP so that during backpropagation, the L2 norm of gradient at each layer performs an unbiased random walk as one moves from the last layer to the first. Looks linear initialization was designed to allow the neural network to behave like a deep linear network at initialization, since W R e L U ( x ) − W R e L U ( − x ) = W x {\displaystyle W\;\mathrm {ReLU} (x)-W\;\mathrm {ReLU} (-x)=Wx} . It initializes a matrix W {\displaystyle W} of shape R n 2 × m {\displaystyle \mathbb {R} ^{{\frac {n}{2}}\times m}} by any method, such as orthogonal initialization, t

Generalized multidimensional scaling

Generalized multidimensional scaling (GMDS) is an extension of metric multidimensional scaling, in which the target space is non-Euclidean. When the dissimilarities are distances on a surface and the target space is another surface, GMDS allows finding the minimum-distortion embedding of one surface into another. GMDS is an emerging research direction. Currently, main applications are recognition of deformable objects (e.g. for three-dimensional face recognition) and texture mapping.