The Constituent Likelihood Automatic Word-tagging System (CLAWS) is a program that performs part-of-speech tagging. It was developed in the 1980s at Lancaster University by the University Centre for Computer Corpus Research on Language. It has an overall accuracy rate of 96–97% with the latest version (CLAWS4) tagging around 100 million words of the British National Corpus. == History == A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'. Developed in the early 1980s, CLAWS was built to fill the ever-growing gap created by always-changing POS necessities. Originally created to add part-of-speech tags to the LOB corpus of British English, the CLAWS tagset has since been adapted to other languages as well, including Urdu and Arabic. Since its inception, CLAWS has been hailed for its functionality and adaptability. Still, it is not without flaws, and though it boasts an error-rate of only 1.5% when judged in major categories, CLAWS still remains with c.3.3% ambiguities unresolved. Ambiguity arises in cases such as with the word flies, and whether it should be classified as a noun or a verb. It's these ambiguities that will require the various upgrades and tagsets that CLAWS will endure. == Rules and processing == CLAWS uses a Hidden Markov model to determine the likelihood of sequences of words in anticipating each part-of-speech label. === Sample output === This excerpt from Bram Stoker's Dracula (1897) has been tagged using both the CLAWS C5 and C7 tagsets. This is what a CLAWS output will generally look like, with the most likely part-of-speech tag following each word. == Tagsets == === CLAWS1 tagset === The first tagset developed in CLAWS, CLAWS1 tagset, has 132 word tags. In terms of form and application, C1 tagset is similar to Brown Corpus tags. See Table of tags in C1 tagset here. === CLAWS2 tagset === From 1983 to 1986, updated versions leading to CLAWS2 were part of a larger attempt to deal with aspects such as recognizing sentence breaks, in order to avoid the need for manual pre-processing of a text before the tags were applied, moving instead to optional manual post-editing to adjust the output of the automatic annotation, if needed. The CLAWS2 tagset has 166 word tags. See Table of tags in C2 tagset here. === CLAWS4 tagset === The CLAWS4 was used for the 100-million-word British National Corpus (BNC). A general-purpose grammatical tagger, it is a successor of the CLAWS1 tagger. In tagging the BNC, the many rounds of work that went into CLAWS4 focused on making the CLAWS program independent from the tagsets. For example, the BNC project used two tagset versions: "a main tagset (C5) with 62 tags with which the whole of the corpus has been tagged, and a larger (C7) tagset with 152 tags, which has been used to make a selected 'core' sample corpus of two million words." The latest version of CLAWS4 is offered by UCREL, a research center of Lancaster University. === CLAWS5 tagset === The CLAWS5 tagset, which was used for BNC, has over 60 tags. See Table of tags in C5 tagset here. === CLAWS6 tagset === The CLAWS6 tagset was used for the BNC sampler corpus and the COLT corpus. It has over 160 tags, including 13 determiner subtypes. See Table of tags in C6 tagset here. === CLAWS7 tagset === The standard CLAWS7 tagset is used currently. It is only different in the punctuation tags when compared to the CLAWS6 tagset. See Table of tags in C7 tagset here. === CLAWS8 tagset === CLAWS8 tagset was extended from C7 tagset with further distinctions in the determiner and pronoun categories, as well as 37 new auxiliary tags for forms of be, do, and have. See Table of tags in C8 tagset here
DexNet
Dex-net is a robotic. It uses a Grasp Quality Convolutional Neural Network to learn how to grasp unusually shaped objects. == History == Dex-net was developed by University of California, Berkeley professor Ken Goldberg and graduate student Jeff Mahler. == Design == Dex-net includes a high-resolution 3-D sensor and two arms, each controlled by a different neural network. One arm is equipped with a conventional robot gripper and another with a suction system. The robot’s software scans an object and then asks both neural networks to decide, on the fly, whether to grab or suck a particular object. It runs on an off-the-shelf industrial machine made by Swiss robotics company ABB. The software learns by attempting to pick up objects in a virtual environment. Dex-Net can generalize from an object it has seen before to a new one. The robot can "nudge" such virtual objects to examine if it is unsure how to grasp them. The trial data set was 6.7 million point clouds, grasps and analytic grasp metrics generated from thousands of 3D models. Grasps are defined as a gripper's planar position, angle and depth relative to an RGB-D sensor. == Mean picks per hour == A metric called mean picks per hour (MPPH) is calculated by multiplying the average time per pick and the average probability of success for a specific set of objects. The new metric allows labs working on picking robots to compare their results. Humans are capable of between 400 and 600 MPPH. In a contest organized by Amazon recently, the best robots were capable of between 70 and 95. Dex-net has achieved 200 to 300.
Quadratic unconstrained binary optimization
Quadratic unconstrained binary optimization (QUBO), also known as unconstrained binary quadratic programming (UBQP), is a combinatorial optimization problem with a wide range of applications from finance and economics to machine learning. QUBO is an NP hard problem, and for many classical problems from theoretical computer science, like maximum cut, graph coloring and the partition problem, embeddings into QUBO have been formulated. Embeddings for machine learning models include support-vector machines, clustering and probabilistic graphical models. Moreover, due to its close connection to Ising models, QUBO constitutes a central problem class for adiabatic quantum computation, where it is solved through a physical process called quantum annealing. == Definition == Let B = { 0 , 1 } {\displaystyle \mathbb {B} =\lbrace 0,1\rbrace } the set of binary digits (or bits), then B n {\displaystyle \mathbb {B} ^{n}} is the set of binary vectors of fixed length n ∈ N {\displaystyle n\in \mathbb {N} } . Given a symmetric or upper triangular matrix Q ∈ R n × n {\displaystyle {\boldsymbol {Q}}\in \mathbb {R} ^{n\times n}} , whose entries Q i j {\displaystyle Q_{ij}} define a weight for each pair of indices i , j ∈ { 1 , … , n } {\displaystyle i,j\in \lbrace 1,\dots ,n\rbrace } , we can define the function f Q : B n → R {\displaystyle f_{\boldsymbol {Q}}:\mathbb {B} ^{n}\rightarrow \mathbb {R} } that assigns a value to each binary vector x {\displaystyle {\boldsymbol {x}}} through f Q ( x ) = x ⊺ Q x = ∑ i = 1 n ∑ j = 1 n Q i j x i x j . {\displaystyle f_{\boldsymbol {Q}}({\boldsymbol {x}})={\boldsymbol {x}}^{\intercal }{\boldsymbol {Qx}}=\sum _{i=1}^{n}\sum _{j=1}^{n}Q_{ij}x_{i}x_{j}.} Alternatively, the linear and quadratic parts can be separated as f Q ′ , q ( x ) = x ⊺ Q ′ x + q ⊺ x , {\displaystyle f_{{\boldsymbol {Q}}',{\boldsymbol {q}}}({\boldsymbol {x}})={\boldsymbol {x}}^{\intercal }{\boldsymbol {Q}}'{\boldsymbol {x}}+{\boldsymbol {q}}^{\intercal }{\boldsymbol {x}},} where Q ′ ∈ R n × n {\displaystyle {\boldsymbol {Q}}'\in \mathbb {R} ^{n\times n}} and q ∈ R n {\displaystyle {\boldsymbol {q}}\in \mathbb {R} ^{n}} . This is equivalent to the previous definition through Q = Q ′ + diag [ q ] {\displaystyle {\boldsymbol {Q}}={\boldsymbol {Q}}'+\operatorname {diag} [{\boldsymbol {q}}]} using the diag operator, exploiting that x = x ⋅ x {\displaystyle x=x\cdot x} for all binary values x {\displaystyle x} . Intuitively, the weight Q i j {\displaystyle Q_{ij}} is added if both x i = 1 {\displaystyle x_{i}=1} and x j = 1 {\displaystyle x_{j}=1} . The QUBO problem consists of finding a binary vector x ∗ {\displaystyle {\boldsymbol {x}}^{}} that minimizes f Q {\displaystyle f_{\boldsymbol {Q}}} , i.e., ∀ x ∈ B n : f Q ( x ∗ ) ≤ f Q ( x ) {\displaystyle \forall {\boldsymbol {x}}\in \mathbb {B} ^{n}:~f_{\boldsymbol {Q}}({\boldsymbol {x}}^{})\leq f_{\boldsymbol {Q}}({\boldsymbol {x}})} . In general, x ∗ {\displaystyle {\boldsymbol {x}}^{}} is not unique, meaning there may be a set of minimizing vectors with equal value w.r.t. f Q {\displaystyle f_{\boldsymbol {Q}}} . The complexity of QUBO arises from the number of candidate binary vectors to be evaluated, as | B n | = 2 n {\displaystyle \left|\mathbb {B} ^{n}\right|=2^{n}} grows exponentially in n {\displaystyle n} . Sometimes, QUBO is defined as the problem of maximizing f Q {\displaystyle f_{\boldsymbol {Q}}} , which is equivalent to minimizing f − Q = − f Q {\displaystyle f_{-{\boldsymbol {Q}}}=-f_{\boldsymbol {Q}}} . == Properties == QUBO is scale invariant for positive factors α > 0 {\displaystyle \alpha >0} , which leave the optimum x ∗ {\displaystyle {\boldsymbol {x}}^{}} unchanged: f α Q ( x ) = x ⊺ ( α Q ) x = α ( x ⊺ Q x ) = α f Q ( x ) {\displaystyle f_{\alpha {\boldsymbol {Q}}}({\boldsymbol {x}})={\boldsymbol {x}}^{\intercal }(\alpha {\boldsymbol {Q}}){\boldsymbol {x}}=\alpha ({\boldsymbol {x}}^{\intercal }{\boldsymbol {Qx}})=\alpha f_{\boldsymbol {Q}}({\boldsymbol {x}})} . In its general form, QUBO is NP-hard and cannot be solved efficiently by any known polynomial-time algorithm. However, there are polynomially-solvable special cases, where Q {\displaystyle {\boldsymbol {Q}}} has certain properties, for example: If all coefficients are positive, the optimum is trivially x ∗ = ( 0 , … , 0 ) ⊺ {\displaystyle {\boldsymbol {x}}^{}=(0,\dots ,0)^{\intercal }} . Similarly, if all coefficients are negative, the optimum is x ∗ = ( 1 , … , 1 ) ⊺ {\displaystyle {\boldsymbol {x}}^{}=(1,\dots ,1)^{\intercal }} . If Q {\displaystyle {\boldsymbol {Q}}} is diagonal, the bits can be optimized independently, and the problem is solvable in O ( n ) {\displaystyle {\mathcal {O}}(n)} . The optimal variable assignments are simply x i ∗ = 1 {\displaystyle x_{i}^{}=1} if Q i i < 0 {\displaystyle Q_{ii}<0} , and x i ∗ = 0 {\displaystyle x_{i}^{}=0} otherwise. If all off-diagonal elements of Q {\displaystyle {\boldsymbol {Q}}} are non-positive, the corresponding QUBO problem is solvable in polynomial time. QUBO can be solved using integer linear programming solvers like CPLEX or Gurobi Optimizer. This is possible since QUBO can be reformulated as a linear constrained binary optimization problem. To achieve this, substitute the product x i x j {\displaystyle x_{i}x_{j}} by an additional binary variable z i j ∈ B {\displaystyle z_{ij}\in \mathbb {B} } and add the constraints x i ≥ z i j {\displaystyle x_{i}\geq z_{ij}} , x j ≥ z i j {\displaystyle x_{j}\geq z_{ij}} and x i + x j − 1 ≤ z i j {\displaystyle x_{i}+x_{j}-1\leq z_{ij}} . Note that z i j {\displaystyle z_{ij}} can also be relaxed to continuous variables within the bounds zero and one. == Applications == QUBO is a structurally simple, yet computationally hard optimization problem. It can be used to encode a wide range of optimization problems from various scientific areas. === Maximum Cut === Given a graph G = ( V , E ) {\displaystyle G=(V,E)} with vertex set V = { 1 , … , n } {\displaystyle V=\lbrace 1,\dots ,n\rbrace } and edges E ⊆ V × V {\displaystyle E\subseteq V\times V} , the maximum cut (max-cut) problem consists of finding two subsets S , T ⊆ V {\displaystyle S,T\subseteq V} with T = V ∖ S {\displaystyle T=V\setminus S} , such that the number of edges between S {\displaystyle S} and T {\displaystyle T} is maximized. The more general weighted max-cut problem assumes edge weights w i j ≥ 0 ∀ i , j ∈ V {\displaystyle w_{ij}\geq 0~\forall i,j\in V} , with ( i , j ) ∉ E ⇒ w i j = 0 {\displaystyle (i,j)\notin E\Rightarrow w_{ij}=0} , and asks for a partition S , T ⊆ V {\displaystyle S,T\subseteq V} that maximizes the sum of edge weights between S {\displaystyle S} and T {\displaystyle T} , i.e., max S ⊆ V ∑ i ∈ S , j ∉ S w i j . {\displaystyle \max _{S\subseteq V}\sum _{i\in S,j\notin S}w_{ij}.} By setting w i j = 1 {\displaystyle w_{ij}=1} for all ( i , j ) ∈ E {\displaystyle (i,j)\in E} this becomes equivalent to the original max-cut problem above, which is why we focus on this more general form in the following. For every vertex in i ∈ V {\displaystyle i\in V} we introduce a binary variable x i {\displaystyle x_{i}} with the interpretation x i = 0 {\displaystyle x_{i}=0} if i ∈ S {\displaystyle i\in S} and x i = 1 {\displaystyle x_{i}=1} if i ∈ T {\displaystyle i\in T} . As T = V ∖ S {\displaystyle T=V\setminus S} , every i {\displaystyle i} is in exactly one set, meaning there is a 1:1 correspondence between binary vectors x ∈ B n {\displaystyle {\boldsymbol {x}}\in \mathbb {B} ^{n}} and partitions of V {\displaystyle V} into two subsets. We observe that, for any i , j ∈ V {\displaystyle i,j\in V} , the expression x i ( 1 − x j ) + ( 1 − x i ) x j {\displaystyle x_{i}(1-x_{j})+(1-x_{i})x_{j}} evaluates to 1 if and only if i {\displaystyle i} and j {\displaystyle j} are in different subsets, equivalent to logical XOR. Let W ∈ R + n × n {\displaystyle {\boldsymbol {W}}\in \mathbb {R} _{+}^{n\times n}} with W i j = w i j ∀ i , j ∈ V {\displaystyle W_{ij}=w_{ij}~\forall i,j\in V} . By extending above expression to matrix-vector form we find that x ⊺ W ( 1 − x ) + ( 1 − x ) ⊺ W x = − 2 x ⊺ W x + ( W 1 + W ⊺ 1 ) ⊺ x {\displaystyle {\boldsymbol {x}}^{\intercal }{\boldsymbol {W}}({\boldsymbol {1}}-{\boldsymbol {x}})+({\boldsymbol {1}}-{\boldsymbol {x}})^{\intercal }{\boldsymbol {Wx}}=-2{\boldsymbol {x}}^{\intercal }{\boldsymbol {Wx}}+({\boldsymbol {W1}}+{\boldsymbol {W}}^{\intercal }{\boldsymbol {1}})^{\intercal }{\boldsymbol {x}}} is the sum of weights of all edges between S {\displaystyle S} and T {\displaystyle T} , where 1 = ( 1 , 1 , … , 1 ) ⊺ ∈ R n {\displaystyle {\boldsymbol {1}}=(1,1,\dots ,1)^{\intercal }\in \mathbb {R} ^{n}} . As this is a quadratic function over x {\displaystyle {\boldsymbol {x}}} , it is a QUBO problem whose parameter matrix we can read from above expression as Q = 2 W − diag [ W 1 + W ⊺ 1 ] , {\displaystyle {\boldsymbol {Q}}=2{\boldsymbol {W}}-\operatorname {diag} [{\boldsymbol {W1}}+{\boldsymbol {W}}^{\intercal }{\bol
Distributional Soft Actor Critic
Distributional Soft Actor Critic (DSAC) is a suite of model-free off-policy reinforcement learning algorithms, tailored for learning decision-making or control policies in complex systems with continuous action spaces. Distinct from traditional methods that focus solely on expected returns, DSAC algorithms are designed to learn a Gaussian distribution over stochastic returns, called value distribution. This focus on Gaussian value distribution learning notably diminishes value overestimations, which in turn boosts policy performance. Additionally, the value distribution learned by DSAC can also be used for risk-aware policy learning. From a technical standpoint, DSAC is essentially a distributional adaptation of the well-established soft actor-critic (SAC) method. To date, the DSAC family comprises two iterations: the original DSAC-v1 and its successor, DSAC-T (also known as DSAC-v2), with the latter demonstrating superior capabilities over the Soft Actor-Critic (SAC) in Mujoco benchmark tasks. The source code for DSAC-T can be found at the following URL: Jingliang-Duan/DSAC-T. Both iterations have been integrated into an advanced, Pytorch-powered reinforcement learning toolkit named GOPS: GOPS (General Optimal control Problem Solver).
Cross-entropy
In information theory, the cross-entropy between two probability distributions p {\displaystyle p} and q {\displaystyle q} , over the same underlying set of events, measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated probability distribution q {\displaystyle q} , rather than the true distribution p {\displaystyle p} . == Definition == The cross-entropy of the distribution q {\displaystyle q} relative to a distribution p {\displaystyle p} over a given set is defined as follows: H ( p , q ) = − E p [ log q ] , {\displaystyle H(p,q)=-\operatorname {E} _{p}[\log q],} where E p [ ⋅ ] {\displaystyle \operatorname {E} _{p}[\cdot ]} is the expected value operator with respect to the distribution p {\displaystyle p} . The definition may be formulated using the Kullback–Leibler divergence D K L ( p ∥ q ) {\displaystyle D_{\mathrm {KL} }(p\parallel q)} , divergence of p {\displaystyle p} from q {\displaystyle q} (also known as the relative entropy of p {\displaystyle p} with respect to q {\displaystyle q} ). H ( p , q ) = H ( p ) + D K L ( p ∥ q ) , {\displaystyle H(p,q)=H(p)+D_{\mathrm {KL} }(p\parallel q),} where H ( p ) {\displaystyle H(p)} is the entropy of p {\displaystyle p} . For discrete probability distributions p {\displaystyle p} and q {\displaystyle q} with the same support X {\displaystyle {\mathcal {X}}} , this means The situation for continuous distributions is analogous. We have to assume that p {\displaystyle p} and q {\displaystyle q} are absolutely continuous with respect to some reference measure r {\displaystyle r} (usually r {\displaystyle r} is a Lebesgue measure on a Borel σ-algebra). Let P {\displaystyle P} and Q {\displaystyle Q} be probability density functions of p {\displaystyle p} and q {\displaystyle q} with respect to r {\displaystyle r} . Then − ∫ X P ( x ) log Q ( x ) d x = E p [ − log Q ] , {\displaystyle -\int _{\mathcal {X}}P(x)\,\log Q(x)\,\mathrm {d} x=\operatorname {E} _{p}[-\log Q],} and therefore NB: The notation H ( p , q ) {\displaystyle H(p,q)} is also used for a different concept, the joint entropy of p {\displaystyle p} and q {\displaystyle q} . == Motivation == In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value x i {\displaystyle x_{i}} out of a set of possibilities { x 1 , … , x n } {\displaystyle \{x_{1},\ldots ,x_{n}\}} can be seen as representing an implicit probability distribution q ( x i ) = ( 1 2 ) ℓ i {\displaystyle q(x_{i})=\left({\frac {1}{2}}\right)^{\ell _{i}}} over { x 1 , … , x n } {\displaystyle \{x_{1},\ldots ,x_{n}\}} , where ℓ i {\displaystyle \ell _{i}} is the length of the code for x i {\displaystyle x_{i}} in bits. Therefore, cross-entropy can be interpreted as the expected message-length per datum when a wrong distribution q {\displaystyle q} is assumed while the data actually follows a distribution p {\displaystyle p} . That is why the expectation is taken over the true probability distribution p {\displaystyle p} and not q . {\displaystyle q.} Indeed the expected message-length under the true distribution p {\displaystyle p} is E p [ ℓ ] = − E p [ ln q ( x ) ln ( 2 ) ] = − E p [ log 2 q ( x ) ] = − ∑ x i p ( x i ) log 2 q ( x i ) = − ∑ x p ( x ) log 2 q ( x ) = H ( p , q ) . {\displaystyle {\begin{aligned}\operatorname {E} _{p}[\ell ]&=-\operatorname {E} _{p}\left[{\frac {\ln {q(x)}}{\ln(2)}}\right]\\[1ex]&=-\operatorname {E} _{p}\left[\log _{2}{q(x)}\right]\\[1ex]&=-\sum _{x_{i}}p(x_{i})\,\log _{2}q(x_{i})\\[1ex]&=-\sum _{x}p(x)\,\log _{2}q(x)=H(p,q).\end{aligned}}} == Estimation == There are many situations where cross-entropy needs to be measured but the distribution of p {\displaystyle p} is unknown. An example is language modeling, where a model is created based on a training set T {\displaystyle T} , and then its cross-entropy is measured on a test set to assess how accurate the model is in predicting the test data. In this example, p {\displaystyle p} is the true distribution of words in any corpus, and q {\displaystyle q} is the distribution of words as predicted by the model. Since the true distribution is unknown, cross-entropy cannot be directly calculated. In these cases, an estimate of cross-entropy is calculated using the following formula: H ( T , q ) = − ∑ i = 1 N 1 N log 2 q ( x i ) {\displaystyle H(T,q)=-\sum _{i=1}^{N}{\frac {1}{N}}\log _{2}q(x_{i})} where N {\displaystyle N} is the size of the test set, and q ( x ) {\displaystyle q(x)} is the probability of event x {\displaystyle x} estimated from the training set. In other words, q ( x i ) {\displaystyle q(x_{i})} is the probability estimate of the model that the i-th word of the text is x i {\displaystyle x_{i}} . The sum is averaged over the N {\displaystyle N} words of the test. This is a Monte Carlo estimate of the true cross-entropy, where the test set is treated as samples from p ( x ) {\displaystyle p(x)} . == Relation to maximum likelihood == The cross entropy arises in classification problems when introducing a logarithm in the guise of the log-likelihood function. This section concerns the estimation of the probabilities of different discrete outcomes. To this end, denote a parametrized family of distributions by q θ {\displaystyle q_{\theta }} , with θ {\displaystyle \theta } subject to the optimization effort. Consider a given finite sequence of N {\displaystyle N} values x i {\displaystyle x_{i}} from a training set, obtained from conditionally independent sampling. The likelihood assigned to any considered parameter θ {\displaystyle \theta } of the model is then given by the product over all probabilities q θ ( X = x i ) {\displaystyle q_{\theta }(X=x_{i})} . Repeated occurrences are possible, leading to equal factors in the product. If the count of occurrences of the value equal to x {\displaystyle x} is denoted by # x {\displaystyle \#x} , then the frequency of that value equals # x / N {\displaystyle \#x/N} . If p ( X = x ) {\displaystyle p(X=x)} is the underlying probability distribution, for large N {\displaystyle N} we expect p ( X = x ) ≈ # x / N {\displaystyle p(X=x)\approx \#x/N} , by the law of large numbers. Writing our likelihood function as the product of observations from the distribution q θ {\displaystyle q_{\theta }} : L ( θ ; x ) = ∏ i q θ ( X = x i ) = ∏ x q θ ( X = x ) # x ≈ ∏ x q θ ( X = x ) N ⋅ p ( X = x ) = exp log [ ∏ x q θ ( X = x ) N ⋅ p ( X = x ) ] = exp ( ∑ x N ⋅ p ( X = x ) log q θ ( X = x ) ) , {\displaystyle {\begin{aligned}{\mathcal {L}}(\theta ;{\mathbf {x} })&=\prod _{i}q_{\theta }(X=x_{i})=\prod _{x}q_{\theta }(X=x)^{\#x}\\&\approx \prod _{x}q_{\theta }(X=x)^{N\cdot p(X=x)}=\exp \log \left[\prod _{x}q_{\theta }(X=x)^{N\cdot p(X=x)}\right]\\&=\exp \left(\sum _{x}N\cdot p(X=x)\log q_{\theta }(X=x)^{}\right),\end{aligned}}} where we have used the calculation rules for the logarithm in the final line. Notice how the exponent contains a − H ( p , q θ ) {\displaystyle -H(p,q_{\theta })} term. Taking the logarithm of both sides gives: log L ( θ ; x ) = − N ⋅ H ( p , q θ ) . {\displaystyle \log {\mathcal {L}}(\theta ;{\mathbf {x} })=-N\cdot H(p,q_{\theta }).} Since the logarithm is a monotonically increasing function, the maximizing value of θ {\displaystyle \theta } is unaffected by this final step. Similarly, the maximizing value of θ {\displaystyle \theta } is unaffected by the factor of N {\displaystyle N} . So we observe that the likelihood maximization amounts to minimization of the cross-entropy. == Cross-entropy minimization == Cross-entropy minimization is frequently used in optimization and rare-event probability estimation. When comparing a distribution q {\displaystyle q} against a fixed reference distribution p {\displaystyle p} , cross-entropy and KL divergence are identical up to an additive constant (since p {\displaystyle p} is fixed): According to the Gibbs' inequality, both take on their minimal values when p = q {\displaystyle p=q} , which is 0 {\displaystyle 0} for KL divergence, and H ( p ) {\displaystyle \mathrm {H} (p)} for cross-entropy. In the engineering literature, the principle of minimizing KL divergence (Kullback's "Principle of Minimum Discrimination Information") is often called the Principle of Minimum Cross-Entropy (MCE), or Minxent. However, as discussed in the article Kullback–Leibler divergence, sometimes the distribution q {\displaystyle q} is the fixed prior reference distribution, and the distribution p {\displaystyle p} is optimized to be as close to q {\displaystyle q} as possible, subject to some constraint. In this case the two minimizations are not equivalent. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by restating cross-entropy to be D K L ( p ∥ q ) {\displaystyle D_{\mathrm {KL} }(p\parallel q)} , rather than H (
Key–value database
A key-value database, or key-value store, is a data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary. Dictionaries contain a collection of objects, or records, which in turn have many different fields within them. These records are stored and retrieved using a key that uniquely identifies the record, and is used to find the data within the database. Key-value databases differ from the better known relational databases (RDB). RDBs pre-define the data structure in the database as a series of tables containing fields with well-defined data types. Exposing the data types to the database program allows it to apply various optimizations. In contrast, key-value systems treat the value as opaque to the database itself, and typically support only simple operations such as storing, retrieving, updating, and deleting a value by its key. This offers considerable flexibility and makes such systems well suited to low-latency, high-throughput workloads dominated by direct key lookups, but less suitable for applications that require complex queries or explicit relationships among records. A lack of standardization, limited transaction support, and relatively simple query interfaces long restricted many key-value systems to specialized uses, but the rapid move to cloud computing after 2010 helped drive renewed interest in them as part of the broader NoSQL movement. Some graph databases, such as ArangoDB, are also key–value databases internally, adding the concept of relationships (pointers) between records as a first-class data type. == Types and examples == Key–value systems span a wide consistency spectrum, from eventually consistent designs to strongly consistent or serializable ones, and some allow the consistency level to be configured as part of the trade-off against latency and availability. Renewed interest in key–value and other NoSQL systems was driven in part by the demands of big data, distributed, and cloud applications. Their scalability and availability made them attractive for cloud data management, although limited transaction support, low-level query interfaces, and the lack of standardization remained obstacles to wider adoption. Some maintain data in memory (RAM), while others employ solid-state drives or rotating disks. Some key–value systems add additional structure to their keys. For example, Oracle NoSQL Database organizes records using composite keys with "major" and "minor" components, an arrangement that Oracle compares to a directory-path structure in a file system. More generally, however, key–value stores are defined by their use of unique keys associated with opaque values and by their emphasis on simple key-based operations. Unix included dbm (database manager), a minimal database library written by Ken Thompson for managing associative arrays with a single key and hash-based access. Later implementations and related libraries included sdbm, GNU dbm (gdbm), and Berkeley DB. A more recent example is RocksDB, a persistent key–value storage engine developed at Facebook and designed for large-scale applications. Other examples include in-memory systems such as Memcached and Redis, and persistent systems such as Berkeley DB, Riak, and Voldemort.
Probit model
In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model. A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function. It is most often estimated using the maximum likelihood procedure, such an estimation being called a probit regression. == Conceptual framework == Suppose a response variable Y is binary, that is it can have only two possible outcomes which we will denote as 1 and 0. For example, Y may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors X, which are assumed to influence the outcome Y. Specifically, we assume that the model takes the form P ( Y = 1 ∣ X ) = Φ ( X T β ) , {\displaystyle P(Y=1\mid X)=\Phi (X^{\operatorname {T} }\beta ),} where P is the probability and Φ {\displaystyle \Phi } is the cumulative distribution function (CDF) of the standard normal distribution. The parameters β are typically estimated by maximum likelihood. It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable Y ∗ = X T β + ε , {\displaystyle Y^{\ast }=X^{T}\beta +\varepsilon ,} where ε ~ N(0, 1). Then Y can be viewed as an indicator for whether this latent variable is positive: Y = { 1 Y ∗ > 0 0 otherwise } = { 1 X T β + ε > 0 0 otherwise } {\displaystyle Y=\left.{\begin{cases}1&Y^{}>0\\0&{\text{otherwise}}\end{cases}}\right\}=\left.{\begin{cases}1&X^{\operatorname {T} }\beta +\varepsilon >0\\0&{\text{otherwise}}\end{cases}}\right\}} The use of the standard normal distribution causes no loss of generality compared with the use of a normal distribution with an arbitrary mean and standard deviation, because adding a fixed amount to the mean can be compensated by subtracting the same amount from the intercept, and multiplying the standard deviation by a fixed amount can be compensated by multiplying the weights by the same amount. To see that the two models are equivalent, note that P ( Y = 1 ∣ X ) = P ( Y ∗ > 0 ) = P ( X T β + ε > 0 ) = P ( ε > − X T β ) = P ( ε < X T β ) by symmetry of the normal distribution = Φ ( X T β ) {\displaystyle {\begin{aligned}P(Y=1\mid X)&=P(Y^{\ast }>0)\\&=P(X^{\operatorname {T} }\beta +\varepsilon >0)\\&=P(\varepsilon >-X^{\operatorname {T} }\beta )\\&=P(\varepsilon