AI News and Guides

Explore the best AI News and Guides — independent reviews, comparisons, pricing and step-by-step how-to guides, curated by Aizhi.

  • Voice user interface

    Voice user interface

    A voice user interface (VUI) enables spoken human interaction with computers, using speech recognition to understand spoken commands and answer questions, and typically text to speech to play a reply. A voice command device is a device controlled with a voice user interface. Voice user interfaces have been added to automobiles, home automation systems, computer operating systems, home appliances like washing machines and microwave ovens, and television remote controls. They are the primary way of interacting with virtual assistants on smartphones and smart speakers. Older automated attendants (which route phone calls to the correct extension) and interactive voice response systems (which conduct more complicated transactions over the phone) can respond to the pressing of keypad buttons via DTMF tones, but those with a full voice user interface allow callers to speak requests and responses without having to press any buttons. Newer voice command devices are speaker-independent, so they can respond to multiple voices, regardless of accent or dialectal influences. They are also capable of responding to several commands at once, separating vocal messages, and providing appropriate feedback, accurately imitating a natural conversation. == Overview == A VUI is the interface to any speech application. Only a short time ago, controlling a machine by simply talking to it was only possible in science fiction. Until recently, this area was considered to be artificial intelligence. However, advances in technologies like text-to-speech, speech-to-text, natural language processing, and cloud services contributed to the mass adoption of these types of interfaces. VUIs have become more commonplace, and people are taking advantage of the value that these hands-free, eyes-free interfaces provide in many situations. VUIs rely on the ability to process input reliably, inconsistent performance often leads to decreased user engagement and negative feedback. Designing a good VUI requires interdisciplinary talents of computer science, linguistics and human factors such as psychology. Even with advanced development tools, constructing an effective VUI requires understanding of both the tasks to be performed, as well as the target audience that will use the final system. The closer the VUI matches the user's mental model of the task, the easier it will be to use with little or no training, resulting in both higher efficiency and higher user satisfaction. A VUI designed for the general public should emphasize ease of use and provide a lot of help and guidance for first-time callers. In contrast, a VUI designed for a small group of power users (including field service workers), should focus more on productivity and less on help and guidance. Such applications should streamline the call flows, minimize prompts, eliminate unnecessary iterations and allow elaborate "mixed initiative dialogs", which enable callers to enter several pieces of information in a single utterance and in any order or combination. In short, speech applications have to be carefully crafted for the specific business process that is being automated. Not all business processes render themselves equally well for speech automation. In general, the more complex the inquiries and transactions are, the more challenging they will be to automate, and the more likely they will be to fail with the general public. In some scenarios, automation is simply not applicable, so live agent assistance is the only option. A legal advice hotline, for example, would be very difficult to automate. On the flip side, speech is perfect for handling quick and routine transactions, like changing the status of a work order, completing a time or expense entry, or transferring funds between accounts. == History == Early applications for VUI included voice-activated dialing of phones, either directly or through a (typically Bluetooth) headset or vehicle audio system. In 2007, a CNN business article reported that voice command was over a billion dollar industry and that companies like Google and Apple were trying to create speech recognition features. In the years since the article was published, the world has witnessed a variety of voice command devices. Additionally, Google has created a speech recognition engine called Pico TTS and Apple released Siri. Voice command devices are becoming more widely available, and innovative ways for using the human voice are always being created. For example, Business Week suggests that the future remote controller is going to be the human voice. Currently Xbox Live allows such features and Jobs hinted at such a feature on the new Apple TV. == Voice command software products on computing devices == Both Apple Mac and Windows PC provide built in speech recognition features for their latest operating systems. === Microsoft Windows === Two Microsoft operating systems, Windows 7 and Windows Vista, provide speech recognition capabilities. Microsoft integrated voice commands into their operating systems to provide a mechanism for people who want to limit their use of the mouse and keyboard, but still want to maintain or increase their overall productivity. ==== Windows Vista ==== With Windows Vista voice control, a user may dictate documents and emails in mainstream applications, start and switch between applications, control the operating system, format documents, save documents, edit files, efficiently correct errors, and fill out forms on the Web. The speech recognition software learns automatically every time a user uses it, and speech recognition is available in English (U.S.), English (U.K.), German (Germany), French (France), Spanish (Spain), Japanese, Chinese (Traditional), and Chinese (Simplified). In addition, the software comes with an interactive tutorial, which can be used to train both the user and the speech recognition engine. ==== Windows 7 ==== In addition to all the features provided in Windows Vista, Windows 7 provides a wizard for setting up the microphone and a tutorial on how to use the feature. ==== Mac OS X ==== All Mac OS X computers come pre-installed with the speech recognition software. The software is user-independent, and it allows for a user to, "navigate menus and enter keyboard shortcuts; speak checkbox names, radio button names, list items, and button names; and open, close, control, and switch among applications." However, the Apple website recommends a user buy a commercial product called Dictate. === Commercial products === If a user is not satisfied with the built in speech recognition software or a user does not have a built speech recognition software for their OS, then a user may experiment with a commercial product such as Braina Pro or DragonNaturallySpeaking for Windows PCs, and Dictate, the name of the same software for Mac OS. == Voice command mobile devices == Any mobile device running Android OS, Microsoft Windows Phone, iOS 9 or later, or Blackberry OS provides voice command capabilities. In addition to the built-in speech recognition software for each mobile phone's operating system, a user may download third party voice command applications from each operating system's application store: Apple App store, Google Play, Windows Phone Marketplace (initially Windows Marketplace for Mobile), or BlackBerry App World. === Android OS === Google has developed an open source operating system called Android, which allows a user to perform voice commands such as: send text messages, listen to music, get directions, call businesses, call contacts, send email, view a map, go to websites, write a note, and search Google. The speech recognition software is available for all devices since Android 2.2 "Froyo", but the settings must be set to English. Google allows for the user to change the language, and the user is prompted when he or she first uses the speech recognition feature if he or she would like their voice data to be attached to their Google account. If a user decides to opt into this service, it allows Google to train the software to the user's voice. Google introduced the Google Assistant with Android 7.0 "Nougat". It is much more advanced than the older version. Amazon.com has the Echo that uses Amazon's custom version of Android to provide a voice interface. === Microsoft Windows === Windows Phone is Microsoft's mobile device's operating system. On Windows Phone 7.5, the speech app is user independent and can be used to: call someone from your contact list, call any phone number, redial the last number, send a text message, call your voice mail, open an application, read appointments, query phone status, and search the web. In addition, speech can also be used during a phone call, and the following actions are possible during a phone call: press a number, turn the speaker phone on, or call someone, which puts the current call on hold. Windows 10 introduces Cortana, a voice control system that replaces the formerly used voice control on Windows

    Read more →
  • Quantum neural network

    Quantum neural network

    Quantum neural networks are computational neural network models which are based on the principles of quantum mechanics. The first ideas on quantum neural computation were published independently in 1995 by Subhash Kak and Ron Chrisley, engaging with the theory of quantum mind, which posits that quantum effects play a role in cognitive function. However, typical research in quantum neural networks involves combining classical artificial neural network models (which are widely used in machine learning for the important task of pattern recognition) with the advantages of quantum information in order to develop more efficient algorithms. One important motivation for these investigations is the difficulty to train classical neural networks, especially in big data applications. The hope is that features of quantum computing such as quantum parallelism or the effects of interference and entanglement can be used as resources. Since the technological implementation of a quantum computer is still in a premature stage, such quantum neural network models are mostly theoretical proposals that await their full implementation in physical experiments. Most Quantum neural networks are developed as feed-forward networks. Similar to their classical counterparts, this structure intakes input from one layer of qubits, and passes that input onto another layer of qubits. This layer of qubits evaluates this information and passes on the output to the next layer. Eventually the path leads to the final layer of qubits. The layers do not have to be of the same width, meaning they don't have to have the same number of qubits as the layer before or after it. This structure is trained on which path to take similar to classical artificial neural networks. This is discussed in a lower section. Quantum neural networks refer to three different categories: Quantum computer with classical data, classical computer with quantum data, and quantum computer with quantum data. == Examples == Quantum neural network research is still in its infancy, and a conglomeration of proposals and ideas of varying scope and mathematical rigor have been put forward. Most of them are based on the idea of replacing classical binary or McCulloch-Pitts neurons with a qubit (which can be called a "quron"), resulting in neural units that can be in a superposition of the state 'firing' and 'resting'. === Quantum perceptrons === A lot of proposals attempt to find a quantum equivalent for the perceptron unit from which neural nets are constructed. A problem is that nonlinear activation functions do not immediately correspond to the mathematical structure of quantum theory, since a quantum evolution is described by linear operations and leads to probabilistic observation. Ideas to imitate the perceptron activation function with a quantum mechanical formalism reach from special measurements to postulating non-linear quantum operators (a mathematical framework that is disputed). A direct implementation of the activation function using the circuit-based model of quantum computation has recently been proposed by Schuld, Sinayskiy and Petruccione based on the quantum phase estimation algorithm. === Quantum networks === At a larger scale, researchers have attempted to generalize neural networks to the quantum setting. One way of constructing a quantum neuron is to first generalise classical neurons and then generalising them further to make unitary gates. Interactions between neurons can be controlled quantumly, with unitary gates, or classically, via measurement of the network states. This high-level theoretical technique can be applied broadly, by taking different types of networks and different implementations of quantum neurons, such as photonically implemented neurons and quantum reservoir processor (quantum version of reservoir computing). Most learning algorithms follow the classical model of training an artificial neural network to learn the input-output function of a given training set and use classical feedback loops to update parameters of the quantum system until they converge to an optimal configuration. Learning as a parameter optimisation problem has also been approached by adiabatic models of quantum computing. Quantum neural networks can be applied to algorithmic design: given qubits with tunable mutual interactions, one can attempt to learn interactions following the classical backpropagation rule from a training set of desired input-output relations, taken to be the desired output algorithm's behavior. The quantum network thus 'learns' an algorithm. === Quantum associative memory === The first quantum associative memory algorithm was introduced by Dan Ventura and Tony Martinez in 1999. The authors do not attempt to translate the structure of artificial neural network models into quantum theory, but propose an algorithm for a circuit-based quantum computer that simulates associative memory. The memory states (in Hopfield neural networks saved in the weights of the neural connections) are written into a superposition, and a Grover-like quantum search algorithm retrieves the memory state closest to a given input. As such, this is not a fully content-addressable memory, since only incomplete patterns can be retrieved. The first truly content-addressable quantum memory, which can retrieve patterns also from corrupted inputs, was proposed by Carlo A. Trugenberger. Both memories can store an exponential (in terms of n qubits) number of patterns but can be used only once due to the no-cloning theorem and their destruction upon measurement. Trugenberger, however, has shown that his probabilistic model of quantum associative memory can be efficiently implemented and re-used multiples times for any polynomial number of stored patterns, a large advantage with respect to classical associative memories. === Classical neural networks inspired by quantum theory === A substantial amount of interest has been given to a "quantum-inspired" model that uses ideas from quantum theory to implement a neural network based on fuzzy logic. == Training == Quantum Neural Networks can be theoretically trained similarly to training classical/artificial neural networks. A key difference lies in communication between the layers of a neural networks. For classical neural networks, at the end of a given operation, the current perceptron copies its output to the next layer of perceptron(s) in the network. However, in a quantum neural network, where each perceptron is a qubit, this would violate the no-cloning theorem. A proposed generalized solution to this is to replace the classical fan-out method with an arbitrary unitary that spreads out, but does not copy, the output of one qubit to the next layer of qubits. Using this fan-out Unitary ( U f {\displaystyle U_{f}} ) with a dummy state qubit in a known state (Ex. | 0 ⟩ {\displaystyle |0\rangle } in the computational basis), also known as an Ancilla bit, the information from the qubit can be transferred to the next layer of qubits. This process adheres to the quantum operation requirement of reversibility. Using this quantum feed-forward network, deep neural networks can be executed and trained efficiently. A deep neural network is essentially a network with many hidden-layers, as seen in the sample model neural network above. Since the Quantum neural network being discussed uses fan-out Unitary operators, and each operator only acts on its respective input, only two layers are used at any given time. In other words, no Unitary operator is acting on the entire network at any given time, meaning the number of qubits required for a given step depends on the number of inputs in a given layer. Since Quantum Computers are notorious for their ability to run multiple iterations in a short period of time, the efficiency of a quantum neural network is solely dependent on the number of qubits in any given layer, and not on the depth of the network. === Cost functions === To determine the effectiveness of a neural network, a cost function is used, which essentially measures the proximity of the network's output to the expected or desired output. In a Classical Neural Network, the weights ( w {\displaystyle w} ) and biases ( b {\displaystyle b} ) at each step determine the outcome of the cost function C ( w , b ) {\displaystyle C(w,b)} . When training a Classical Neural network, the weights and biases are adjusted after each iteration, and given equation 1 below, where y ( x ) {\displaystyle y(x)} is the desired output and a out ( x ) {\displaystyle a^{\text{out}}(x)} is the actual output, the cost function is optimized when C ( w , b ) {\displaystyle C(w,b)} = 0. For a quantum neural network, the cost function is determined by measuring the fidelity of the outcome state ( ρ out {\displaystyle \rho ^{\text{out}}} ) with the desired outcome state ( ϕ out {\displaystyle \phi ^{\text{out}}} ), seen in Equation 2 below. In this case, the Unitary operators are adjusted after each it

    Read more →
  • Deterministic blockmodeling

    Deterministic blockmodeling

    Deterministic blockmodeling is an approach in blockmodeling that does not assume a probabilistic model, and instead relies on the exact or approximate algorithms, which are used to find blockmodel(s). This approach typically minimizes some inconsistency that can occur with the ideal block structure. Such analysis is focused on clustering (grouping) of the network (or adjacency matrix) that is obtained with minimizing an objective function, which measures discrepancy from the ideal block structure. However, some indirect approaches (or methods between direct and indirect approaches, such as CONCOR) do not explicitly minimize inconsistencies or optimize some criterion function. This approach was popularized in the 1970s, due to the presence of two computer packages (CONCOR and STRUCTURE) that were used to "find a permutation of the rows and columns in the adjacency matrix leading to an approximate block structure". The opposite approach to deterministic blockmodeling is a stochastic blockmodeling approach.

    Read more →
  • Probit model

    Probit model

    In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model. A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function. It is most often estimated using the maximum likelihood procedure, such an estimation being called a probit regression. == Conceptual framework == Suppose a response variable Y is binary, that is it can have only two possible outcomes which we will denote as 1 and 0. For example, Y may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors X, which are assumed to influence the outcome Y. Specifically, we assume that the model takes the form P ( Y = 1 ∣ X ) = Φ ( X T β ) , {\displaystyle P(Y=1\mid X)=\Phi (X^{\operatorname {T} }\beta ),} where P is the probability and Φ {\displaystyle \Phi } is the cumulative distribution function (CDF) of the standard normal distribution. The parameters β are typically estimated by maximum likelihood. It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable Y ∗ = X T β + ε , {\displaystyle Y^{\ast }=X^{T}\beta +\varepsilon ,} where ε ~ N(0, 1). Then Y can be viewed as an indicator for whether this latent variable is positive: Y = { 1 Y ∗ > 0 0 otherwise } = { 1 X T β + ε > 0 0 otherwise } {\displaystyle Y=\left.{\begin{cases}1&Y^{}>0\\0&{\text{otherwise}}\end{cases}}\right\}=\left.{\begin{cases}1&X^{\operatorname {T} }\beta +\varepsilon >0\\0&{\text{otherwise}}\end{cases}}\right\}} The use of the standard normal distribution causes no loss of generality compared with the use of a normal distribution with an arbitrary mean and standard deviation, because adding a fixed amount to the mean can be compensated by subtracting the same amount from the intercept, and multiplying the standard deviation by a fixed amount can be compensated by multiplying the weights by the same amount. To see that the two models are equivalent, note that P ( Y = 1 ∣ X ) = P ( Y ∗ > 0 ) = P ( X T β + ε > 0 ) = P ( ε > − X T β ) = P ( ε < X T β ) by symmetry of the normal distribution = Φ ( X T β ) {\displaystyle {\begin{aligned}P(Y=1\mid X)&=P(Y^{\ast }>0)\\&=P(X^{\operatorname {T} }\beta +\varepsilon >0)\\&=P(\varepsilon >-X^{\operatorname {T} }\beta )\\&=P(\varepsilon 0 {\displaystyle t,\lim _{n\rightarrow \infty }n_{t}/n=c_{t}>0} . Denote p ^ t = r t / n t {\displaystyle {\hat {p}}_{t}=r_{t}/n_{t}} σ ^ t 2 = 1 n t p ^ t ( 1 − p ^ t ) φ 2 ( Φ − 1 ( p ^ t ) ) {\displaystyle {\hat {\sigma }}_{t}^{2}={\frac {1}{n_{t}}}{\frac {{\hat {p}}_{t}(1-{\hat {p}}_{t})}{\varphi ^{2}{\big (}\Phi ^{-1}({\hat {p}}_{t}){\big )}}}} Then Berkson's minimum chi-square estimator is a generalized least squares estimator in a regression of Φ − 1 ( p ^ t ) {\displaystyle \Phi ^{-1}({\hat {p}}_{t})} on x ( t ) {\displaystyle x_{(t)}} with weights σ ^ t − 2 {\displaystyle {\hat {\sigma }}_{t}^{-2}} : β ^ = ( ∑ t = 1 T σ ^ t − 2 x ( t ) x ( t ) T ) − 1 ∑ t = 1 T σ ^ t − 2 x ( t ) Φ − 1 ( p ^ t ) {\displaystyle {\hat {\beta }}={\Bigg (}\sum _{t=1}^{T}{\hat {\sigma }}_{t}^{-2}x_{(t)}x_{(t)}^{\operatorname {T} }{\Bigg )}^{-1}\sum _{t=1}^{T}{\hat {\sigma }}_{t}^{-2}x_{(t)}\Phi ^{-1}({\hat {p}}_{t})} It can be shown that this estimator is consistent (as n→∞ and T fixed), asymptotically normal and efficient. Its advantage is the presence of a closed-form formula for the estimator. However, it is only meaningful to carry out this analysis when individual observations are not available, only their aggregated counts r t {\displaystyle r_{t}} , n t {\disp

    Read more →
  • Limnu

    Limnu

    Limnu was an online whiteboarding app founded in 2015 by David DeBry and David Hart. It allowed users to draw on virtual whiteboards and invite others by e-mail or by sharing a link. Invitees see any changes to the board in real time and, if allowed by the owner of the board, can also draw on the board. The service was accessible through a web application in desktop and mobile web browsers, as well as through an iOS application. It was headquartered in San Mateo, California. == History == In 2018, ZipSocket, a maker of online meeting software acquired Limnu. == Staff Directory == Andrew Kunz - CEO & Founder of ZipSocket Jenny Rice - Product Manager Max Requenes - Software Engineer Henry Maguire - Machine Learning Engineer

    Read more →
  • Swish function

    Swish function

    The swish function is a family of mathematical function defined as follows: swish β ⁡ ( x ) = x sigmoid ⁡ ( β x ) = x 1 + e − β x . {\displaystyle \operatorname {swish} _{\beta }(x)=x\operatorname {sigmoid} (\beta x)={\frac {x}{1+e^{-\beta x}}}.} where β {\displaystyle \beta } can be constant (usually set to 1) or trainable and "sigmoid" refers to the logistic function. The swish family was designed to smoothly interpolate between a linear function and the Rectified linear unit (ReLU) function. When considering positive values, Swish is a particular case of doubly parameterized sigmoid shrinkage function defined in . Variants of the swish function include Mish. == Special values == For β = 0, the function is linear: f(x) = x/2. For β = 1, the function is the Sigmoid Linear Unit (SiLU). For β = 1.702, the function approximates GeLU. With β → ∞, the function converges to ReLU. Thus, the swish family smoothly interpolates between a linear function and the ReLU function. Since swish β ⁡ ( x ) = swish 1 ⁡ ( β x ) / β {\displaystyle \operatorname {swish} _{\beta }(x)=\operatorname {swish} _{1}(\beta x)/\beta } , all instances of swish have the same shape as the default swish 1 {\displaystyle \operatorname {swish} _{1}} , zoomed by β {\displaystyle \beta } . One usually sets β > 0 {\displaystyle \beta >0} . When β {\displaystyle \beta } is trainable, this constraint can be enforced by β = e b {\displaystyle \beta =e^{b}} , where b {\displaystyle b} is trainable. swish 1 ⁡ ( x ) = x 2 + x 2 4 − x 4 48 + x 6 480 + O ( x 8 ) {\displaystyle \operatorname {swish} _{1}(x)={\frac {x}{2}}+{\frac {x^{2}}{4}}-{\frac {x^{4}}{48}}+{\frac {x^{6}}{480}}+O\left(x^{8}\right)} swish 1 ⁡ ( x ) = x 2 tanh ⁡ ( x 2 ) + x 2 swish 1 ⁡ ( x ) + swish − 1 ⁡ ( x ) = x tanh ⁡ ( x 2 ) swish 1 ⁡ ( x ) − swish − 1 ⁡ ( x ) = x {\displaystyle {\begin{aligned}\operatorname {swish} _{1}(x)&={\frac {x}{2}}\tanh \left({\frac {x}{2}}\right)+{\frac {x}{2}}\\\operatorname {swish} _{1}(x)+\operatorname {swish} _{-1}(x)&=x\tanh \left({\frac {x}{2}}\right)\\\operatorname {swish} _{1}(x)-\operatorname {swish} _{-1}(x)&=x\end{aligned}}} == Derivatives == Because swish β ⁡ ( x ) = swish 1 ⁡ ( β x ) / β {\displaystyle \operatorname {swish} _{\beta }(x)=\operatorname {swish} _{1}(\beta x)/\beta } , it suffices to calculate its derivatives for the default case. swish 1 ′ ⁡ ( x ) = x + sinh ⁡ ( x ) 4 cosh 2 ⁡ ( x 2 ) + 1 2 {\displaystyle \operatorname {swish} _{1}'(x)={\frac {x+\sinh(x)}{4\cosh ^{2}\left({\frac {x}{2}}\right)}}+{\frac {1}{2}}} so swish 1 ′ ⁡ ( x ) − 1 2 {\displaystyle \operatorname {swish} _{1}'(x)-{\frac {1}{2}}} is odd. swish 1 ″ ⁡ ( x ) = 1 − x 2 tanh ⁡ ( x 2 ) 2 cosh 2 ⁡ ( x 2 ) {\displaystyle \operatorname {swish} _{1}''(x)={\frac {1-{\frac {x}{2}}\tanh \left({\frac {x}{2}}\right)}{2\cosh ^{2}\left({\frac {x}{2}}\right)}}} so swish 1 ″ ⁡ ( x ) {\displaystyle \operatorname {swish} _{1}''(x)} is even. == History == SiLU was first proposed alongside the GELU in 2016, then again proposed in 2017 as the Sigmoid-weighted Linear Unit (SiL) in reinforcement learning. The SiLU/SiL was then again proposed as the SWISH over a year after its initial discovery, originally proposed without the learnable parameter β, so that β implicitly equaled 1. The swish paper was then updated to propose the activation with the learnable parameter β. In 2017, after performing analysis on ImageNet data, researchers from Google indicated that using this function as an activation function in artificial neural networks improves the performance, compared to ReLU and sigmoid functions. It is believed that one reason for the improvement is that the swish function helps alleviate the vanishing gradient problem during backpropagation.

    Read more →
  • Probably approximately correct learning

    Probably approximately correct learning

    In computational learning theory, probably approximately correct (PAC) learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the "probably" part), the selected function will have low generalization error (the "approximately correct" part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples. The model was later extended to treat noise (misclassified samples). An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded to a polynomial of the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded to a polynomial of the concept size, modified by the approximation and likelihood bounds). == Definitions and terminology == In order to give the definition for something that is PAC-learnable, we first have to introduce some terminology. For the following definitions, two examples will be used. The first is the problem of character recognition given an array of n {\displaystyle n} bits encoding a binary-valued image. The other example is the problem of finding an interval that will correctly classify points within the interval as positive and the points outside of the range as negative. Let X {\displaystyle X} be a set called the instance space or the encoding of all the samples. In the character recognition problem, the instance space is X = { 0 , 1 } n {\displaystyle X=\{0,1\}^{n}} . In the interval problem the instance space, X {\displaystyle X} , is the set of all bounded intervals in R {\displaystyle \mathbb {R} } , where R {\displaystyle \mathbb {R} } denotes the set of all real numbers. A concept is a subset c ⊂ X {\displaystyle c\subset X} . One concept is the set of all patterns of bits in X = { 0 , 1 } n {\displaystyle X=\{0,1\}^{n}} that encode a picture of the letter "P". An example concept from the second example is the set of open intervals, { ( a , b ) ∣ 0 ≤ a ≤ π / 2 , π ≤ b ≤ 13 } {\displaystyle \{(a,b)\mid 0\leq a\leq \pi /2,\pi \leq b\leq {\sqrt {13}}\}} , each of which contains only the positive points. A concept class C {\displaystyle C} is a collection of concepts over X {\displaystyle X} . This could be the set of all subsets of the array of bits that are skeletonized 4-connected (width of the font is 1). Let EX ⁡ ( c , D ) {\displaystyle \operatorname {EX} (c,D)} be a procedure that draws an example, x {\displaystyle x} , using a probability distribution D {\displaystyle D} and gives the correct label c ( x ) {\displaystyle c(x)} , that is 1 if x ∈ c {\displaystyle x\in c} and 0 otherwise. Now, given 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} , assume there is an algorithm A {\displaystyle A} and a polynomial p {\displaystyle p} in 1 / ϵ , 1 / δ {\displaystyle 1/\epsilon ,1/\delta } (and other relevant parameters of the class C {\displaystyle C} ) such that, given a sample of size p {\displaystyle p} drawn according to EX ⁡ ( c , D ) {\displaystyle \operatorname {EX} (c,D)} , then, with probability of at least 1 − δ {\displaystyle 1-\delta } , A {\displaystyle A} outputs a hypothesis h ∈ C {\displaystyle h\in C} that has an average error less than or equal to ϵ {\displaystyle \epsilon } on X {\displaystyle X} with the same distribution D {\displaystyle D} . Further if the above statement for algorithm A {\displaystyle A} is true for every concept c ∈ C {\displaystyle c\in C} and for every distribution D {\displaystyle D} over X {\displaystyle X} , and for all 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} then C {\displaystyle C} is (efficiently) PAC learnable (or distribution-free PAC learnable). We can also say that A {\displaystyle A} is a PAC learning algorithm for C {\displaystyle C} . == Equivalence == Under some regularity conditions these conditions are equivalent: The concept class C is PAC learnable. The VC dimension of C is finite. C is a uniformly Glivenko-Cantelli class. C is compressible in the sense of Littlestone and Warmuth

    Read more →
  • Calibration (statistics)

    Calibration (statistics)

    There are two main uses of the term calibration in statistics that denote special types of statistical inference problems. Calibration can mean a reverse process to regression, where instead of a future dependent variable being predicted from known explanatory variables, a known observation of the dependent variables is used to predict a corresponding explanatory variable; procedures in statistical classification to determine class membership probabilities which assess the uncertainty of a given new observation belonging to each of the already established classes. In addition, calibration is used in statistics with the usual general meaning of calibration. For example, model calibration can be also used to refer to Bayesian inference about the value of a model's parameters, given some data set, or more generally to any type of fitting of a statistical model. As Philip Dawid puts it, "a forecaster is well calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent." == In classification == Calibration in classification means transforming classifier scores into class membership probabilities. An overview of calibration methods for two-class and multi-class classification tasks is given by Gebel (2009). A classifier might separate the classes well, but be poorly calibrated, meaning that the estimated class probabilities are far from the true class probabilities. In this case, a calibration step may help improve the estimated probabilities. A variety of metrics exist that are aimed to measure the extent to which a classifier produces well-calibrated probabilities. Foundational work includes the Expected Calibration Error (ECE). Into the 2020s, variants include the Adaptive Calibration Error (ACE) and the Test-based Calibration Error (TCE), which address limitations of the ECE metric that may arise when classifier scores concentrate on narrow subset of the [0,1] range. A 2020s advancement in calibration assessment is the introduction of the Estimated Calibration Index (ECI). The ECI extends the concepts of the Expected Calibration Error (ECE) to provide a more nuanced measure of a model's calibration, particularly addressing overconfidence and underconfidence tendencies. Originally formulated for binary settings, the ECI has been adapted for multiclass settings, offering both local and global insights into model calibration. This framework aims to overcome some of the theoretical and interpretative limitations of existing calibration metrics. Through a series of experiments, Famiglini et al. demonstrate the framework's effectiveness in delivering a more accurate understanding of model calibration levels and discuss strategies for mitigating biases in calibration assessment. An online tool has been proposed to compute both ECE and ECI. The following univariate calibration methods exist for transforming classifier scores into class membership probabilities in the two-class case: Assignment value approach, see Garczarek (2002) Bayes approach, see Bennett (2002) Isotonic regression, see Zadrozny and Elkan (2002) Platt scaling (a form of logistic regression), see Lewis and Gale (1994) and Platt (1999) Bayesian Binning into Quantiles (BBQ) calibration, see Naeini, Cooper, Hauskrecht (2015) Beta calibration, see Kull, Filho, Flach (2017) === In probability prediction and forecasting === In prediction and forecasting, a Brier score is sometimes used to assess prediction accuracy of a set of predictions, specifically that the magnitude of the assigned probabilities track the relative frequency of the observed outcomes. Philip E. Tetlock employs the term "calibration" in this sense in his 2015 book Superforecasting. This differs from accuracy and precision. For example, as expressed by Daniel Kahneman, "if you give all events that happen a probability of .6 and all the events that don't happen a probability of .4, your discrimination is perfect but your calibration is miserable". In meteorology, in particular, as concerns weather forecasting, a related mode of assessment is known as forecast skill. == In regression == The calibration problem in regression is the use of known data on the observed relationship between a dependent variable and an independent variable to make estimates of other values of the independent variable from new observations of the dependent variable. This can be known as "inverse regression"; there is also sliced inverse regression. The following multivariate calibration methods exist for transforming classifier scores into class membership probabilities in the case with classes count greater than two: Reduction to binary tasks and subsequent pairwise coupling, see Hastie and Tibshirani (1998) Dirichlet calibration, see Gebel (2009) === Example === One example is that of dating objects, using observable evidence such as tree rings for dendrochronology or carbon-14 for radiometric dating. The observation is caused by the age of the object being dated, rather than the reverse, and the aim is to use the method for estimating dates based on new observations. The problem is whether the model used for relating known ages with observations should aim to minimise the error in the observation, or minimise the error in the date. The two approaches will produce different results, and the difference will increase if the model is then used for extrapolation at some distance from the known results.

    Read more →
  • Fully probabilistic design

    Fully probabilistic design

    Decision making (DM) can be seen as a purposeful choice of action sequences. It also covers control, a purposeful choice of input sequences. As a rule, it runs under randomness, uncertainty and incomplete knowledge. A range of prescriptive theories have been proposed how to make optimal decisions under these conditions. They optimise sequence of decision rules, mappings of the available knowledge on possible actions. This sequence is called strategy or policy. Among various theories, Bayesian DM is broadly accepted axiomatically based theory that solves the design of optimal decision strategy. It describes random, uncertain or incompletely known quantities as random variables, i.e. by their joint probability expressing belief in their possible values. The strategy that minimises expected loss (or equivalently maximises expected reward) expressing decision-maker's goals is then taken as the optimal strategy. While the probabilistic description of beliefs is uniquely and deductively driven by rules for joint probabilities, the composition and decomposition of the loss function have no such universally applicable formal machinery. Fully probabilistic design (of decision strategies or control, FPD) removes the mentioned drawback and expresses also the DM goals of by the "ideal" probability, which assigns high (small) values to desired (undesired) behaviours of the closed DM loop formed by the influenced world part and by the used strategy. FPD has axiomatic basis and has Bayesian DM as its restricted subpart. FPD has a range of theoretical consequences , and, importantly, has been successfully used to quite diverse application domains.

    Read more →
  • Homogeneity blockmodeling

    Homogeneity blockmodeling

    In mathematics applied to analysis of social structures, homogeneity blockmodeling is an approach in blockmodeling, which is best suited for a preliminary or main approach to valued networks, when a prior knowledge about these networks is not available. This is because homogeneity blockmodeling emphasizes the similarity of link (tie) strengths within the blocks over the pattern of links. In this approach, tie (link) values (or statistical data computed on them) are assumed to be equal (homogenous) within blocks. This approach to the generalized blockmodeling of valued networks was first proposed by Aleš Žiberna in 2007 with the basic idea, "that the inconsistency of an empirical block with its ideal block can be measured by within block variability of appropriate values". The newly–formed ideal blocks, which are appropriate for blockmodeling of valued networks, are then presented together with the definitions of their block inconsistencies. Similar approach to the homogeneity blockmodeling, dealing with direct approach for structural equivalence, was previously suggested by Stephen P. Borgatti and Martin G. Everett (1992).

    Read more →
  • Homogeneity blockmodeling

    Homogeneity blockmodeling

    In mathematics applied to analysis of social structures, homogeneity blockmodeling is an approach in blockmodeling, which is best suited for a preliminary or main approach to valued networks, when a prior knowledge about these networks is not available. This is because homogeneity blockmodeling emphasizes the similarity of link (tie) strengths within the blocks over the pattern of links. In this approach, tie (link) values (or statistical data computed on them) are assumed to be equal (homogenous) within blocks. This approach to the generalized blockmodeling of valued networks was first proposed by Aleš Žiberna in 2007 with the basic idea, "that the inconsistency of an empirical block with its ideal block can be measured by within block variability of appropriate values". The newly–formed ideal blocks, which are appropriate for blockmodeling of valued networks, are then presented together with the definitions of their block inconsistencies. Similar approach to the homogeneity blockmodeling, dealing with direct approach for structural equivalence, was previously suggested by Stephen P. Borgatti and Martin G. Everett (1992).

    Read more →
  • Analogical modeling

    Analogical modeling

    Analogical modeling (AM) is a formal theory of exemplar based analogical reasoning, proposed by Royal Skousen, professor of Linguistics and English language at Brigham Young University in Provo, Utah. It is applicable to language modeling and other categorization tasks. Analogical modeling is related to connectionism and nearest neighbor approaches, in that it is data-based rather than abstraction-based; but it is distinguished by its ability to cope with imperfect datasets (such as caused by simulated short term memory limits) and to base predictions on all relevant segments of the dataset, whether near or far. In language modeling, AM has successfully predicted empirically valid forms for which no theoretical explanation was known (see the discussion of Finnish morphology in Skousen et al. 2002). == Implementation == === Overview === An exemplar-based model consists of a general-purpose modeling engine and a problem-specific dataset. Within the dataset, each exemplar (a case to be reasoned from, or an informative past experience) appears as a feature vector: a row of values for the set of parameters that define the problem. For example, in a spelling-to-sound task, the feature vector might consist of the letters of a word. Each exemplar in the dataset is stored with an outcome, such as a phoneme or phone to be generated. When the model is presented with a novel situation (in the form of an outcome-less feature vector), the engine algorithmically sorts the dataset to find exemplars that helpfully resemble it, and selects one, whose outcome is the model's prediction. The particulars of the algorithm distinguish one exemplar-based modeling system from another. In AM, we think of the feature values as characterizing a context, and the outcome as a behavior that occurs within that context. Accordingly, the novel situation is known as the given context. Given the known features of the context, the AM engine systematically generates all contexts that include it (all of its supracontexts), and extracts from the dataset the exemplars that belong to each. The engine then discards those supracontexts whose outcomes are inconsistent (this measure of consistency will be discussed further below), leaving an analogical set of supracontexts, and probabilistically selects an exemplar from the analogical set with a bias toward those in large supracontexts. This multilevel search exponentially magnifies the likelihood of a behavior's being predicted as it occurs reliably in settings that specifically resemble the given context. === Analogical modeling in detail === AM performs the same process for each case it is asked to evaluate. The given context, consisting of n variables, is used as a template to generate 2 n {\displaystyle 2^{n}} supracontexts. Each supracontext is a set of exemplars in which one or more variables have the same values that they do in the given context, and the other variables are ignored. In effect, each is a view of the data, created by filtering for some criteria of similarity to the given context, and the total set of supracontexts exhausts all such views. Alternatively, each supracontext is a theory of the task or a proposed rule whose predictive power needs to be evaluated. It is important to note that the supracontexts are not equal peers one with another; they are arranged by their distance from the given context, forming a hierarchy. If a supracontext specifies all of the variables that another one does and more, it is a subcontext of that other one, and it lies closer to the given context. (The hierarchy is not strictly branching; each supracontext can itself be a subcontext of several others, and can have several subcontexts.) This hierarchy becomes significant in the next step of the algorithm. The engine now chooses the analogical set from among the supracontexts. A supracontext may contain exemplars that only exhibit one behavior; it is deterministically homogeneous and is included. It is a view of the data that displays regularity, or a relevant theory that has never yet been disproven. A supracontext may exhibit several behaviors, but contain no exemplars that occur in any more specific supracontext (that is, in any of its subcontexts); in this case it is non-deterministically homogeneous and is included. Here there is no great evidence that a systematic behavior occurs, but also no counterargument. Finally, a supracontext may be heterogeneous, meaning that it exhibits behaviors that are found in a subcontext (closer to the given context), and also behaviors that are not. Where the ambiguous behavior of the nondeterministically homogeneous supracontext was accepted, this is rejected because the intervening subcontext demonstrates that there is a better theory to be found. The heterogeneous supracontext is therefore excluded. This guarantees that we see an increase in meaningfully consistent behavior in the analogical set as we approach the given context. With the analogical set chosen, each appearance of an exemplar (for a given exemplar may appear in several of the analogical supracontexts) is given a pointer to every other appearance of an exemplar within its supracontexts. One of these pointers is then selected at random and followed, and the exemplar to which it points provides the outcome. This gives each supracontext an importance proportional to the square of its size, and makes each exemplar likely to be selected in direct proportion to the sum of the sizes of all analogically consistent supracontexts in which it appears. Then, of course, the probability of predicting a particular outcome is proportional to the summed probabilities of all the exemplars that support it. (Skousen 2002, in Skousen et al. 2002, pp. 11–25, and Skousen 2003, both passim) === Formulas === Given a context with n {\displaystyle n} elements: total number of pairings: n 2 {\displaystyle n^{2}} number of agreements for outcome i: n i 2 {\displaystyle n_{i}^{2}} number of disagreements for outcome i: n i ( n − n i ) {\displaystyle n_{i}(n-n_{i})} total number of agreements: ∑ n i 2 {\displaystyle \sum {n_{i}^{2}}} total number of disagreements: ∑ n i ( n − n i ) = n 2 − ∑ n i 2 {\displaystyle \sum {n_{i}(n-n_{i})}=n^{2}-\sum {n_{i}^{2}}} === Example === This terminology is best understood through an example. In the example used in the second chapter of Skousen (1989), each context consists of three variables with potential values 0-3 Variable 1: 0,1,2,3 Variable 2: 0,1,2,3 Variable 3: 0,1,2,3 The two outcomes for the dataset are e and r, and the exemplars are: 3 1 0 e 0 3 2 r 2 1 0 r 2 1 2 r 3 1 1 r We define a network of pointers like so: The solid lines represent pointers between exemplars with matching outcomes; the dotted lines represent pointers between exemplars with non-matching outcomes. The statistics for this example are as follows: n = 5 {\displaystyle n=5} n r = 4 {\displaystyle n_{r}=4} n e = 1 {\displaystyle n_{e}=1} total number of pairings: n 2 = 25 {\displaystyle n^{2}=25} number of agreements for outcome r: n r 2 = 16 {\displaystyle n_{r}^{2}=16} number of agreements for outcome e: n e 2 = 1 {\displaystyle n_{e}^{2}=1} number of disagreements for outcome r: n r ( n − n r ) = 4 {\displaystyle n_{r}(n-n_{r})=4} number of disagreements for outcome e: n e ( n − n e ) = 4 {\displaystyle n_{e}(n-n_{e})=4} total number of agreements: n r 2 + n e 2 = 17 {\displaystyle n_{r}^{2}+n_{e}^{2}=17} total number of disagreements: n r ( n − n r ) + n e ( n − n e ) = n 2 − ( n r 2 + n e 2 ) = 8 {\displaystyle n_{r}(n-n_{r})+n_{e}(n-n_{e})=n^{2}-(n_{r}^{2}+n_{e}^{2})=8} uncertainty or fraction of disagreement: 8 / 25 = .32 {\displaystyle 8/25=.32} Behavior can only be predicted for a given context; in this example, let us predict the outcome for the context "3 1 2". To do this, we first find all of the contexts containing the given context; these contexts are called supracontexts. We find the supracontexts by systematically eliminating the variables in the given context; with m variables, there will generally be 2 m {\displaystyle 2^{m}} supracontexts. The following table lists each of the sub- and supracontexts; x means "not x", and - means "anything". These contexts are shown in the venn diagram below: The next step is to determine which exemplars belong to which contexts in order to determine which of the contexts are homogeneous. The table below shows each of the subcontexts, their behavior in terms of the given exemplars, and the number of disagreements within the behavior: Analyzing the subcontexts in the table above, we see that there is only 1 subcontext with any disagreements: "3 1 2", which in the dataset consists of "3 1 0 e" and "3 1 1 r". There are 2 disagreements in this subcontext; 1 pointing from each of the exemplars to the other (see the pointer network pictured above). Therefore, only supracontexts containing this subcontext will contain any disagreements. We use a simple rule to identify the homogeneous supraco

    Read more →
  • Structural risk minimization

    Structural risk minimization

    Structural risk minimization (SRM) is an inductive principle of use in machine learning. Commonly in machine learning, a generalized model must be selected from a finite data set, with the consequent problem of overfitting – the model becoming too strongly tailored to the particularities of the training set and generalizing poorly to new data. The SRM principle addresses this problem by balancing the model's complexity against its success at fitting the training data. This principle was first set out in a 1974 book by Vladimir Vapnik and Alexey Chervonenkis and uses the VC dimension. In practical terms, Structural Risk Minimization is implemented by minimizing E t r a i n + β H ( W ) {\displaystyle E_{train}+\beta H(W)} , where E t r a i n {\displaystyle E_{train}} is the train error, the function H ( W ) {\displaystyle H(W)} is called a regularization function, and β {\displaystyle \beta } is a constant. H ( W ) {\displaystyle H(W)} is chosen such that it takes large values on parameters W {\displaystyle W} that belong to high-capacity subsets of the parameter space. Minimizing H ( W ) {\displaystyle H(W)} in effect limits the capacity of the accessible subsets of the parameter space, thereby controlling the trade-off between minimizing the training error and minimizing the expected gap between the training error and test error. The SRM problem can be formulated in terms of data. Given n data points consisting of data x and labels y, the objective J ( θ ) {\displaystyle J(\theta )} is often expressed in the following manner: J ( θ ) = 1 2 n ∑ i = 1 n ( h θ ( x i ) − y i ) 2 + λ 2 ∑ j = 1 d θ j 2 {\displaystyle J(\theta )={\frac {1}{2n}}\sum _{i=1}^{n}(h_{\theta }(x^{i})-y^{i})^{2}+{\frac {\lambda }{2}}\sum _{j=1}^{d}\theta _{j}^{2}} The first term is the mean squared error (MSE) term between the value of the learned model, h θ {\displaystyle h_{\theta }} , and the given labels y {\displaystyle y} . This term is the training error, E t r a i n {\displaystyle E_{train}} , that was discussed earlier. The second term, places a prior over the weights, to favor sparsity and penalize larger weights. The trade-off coefficient, λ {\displaystyle \lambda } , is a hyperparameter that places more or less importance on the regularization term. Larger λ {\displaystyle \lambda } encourages sparser weights at the expense of a more optimal MSE, and smaller λ {\displaystyle \lambda } relaxes regularization allowing the model to fit to data. Note that as λ → ∞ {\displaystyle \lambda \to \infty } the weights become zero, and as λ → 0 {\displaystyle \lambda \to 0} , the model typically suffers from overfitting.

    Read more →
  • PVLV

    PVLV

    The primary value learned value (PVLV) model is a possible explanation for the reward-predictive firing properties of dopamine (DA) neurons. It simulates behavioral and neural data on Pavlovian conditioning and the midbrain dopaminergic neurons that fire in proportion to unexpected rewards. It is an alternative to the temporal-differences (TD) algorithm. It is used as part of Leabra.

    Read more →
  • Genetic operator

    Genetic operator

    A genetic operator is an operator used in evolutionary algorithms (EA) to guide the algorithm towards a solution to a given problem. There are three main types of operators (mutation, crossover and selection), which must work in conjunction with one another in order for the algorithm to be successful. Genetic operators are used to create and maintain genetic diversity (mutation operator), combine existing solutions (also known as chromosomes) into new solutions (crossover) and select between solutions (selection). The classic representatives of evolutionary algorithms include genetic algorithms, evolution strategies, genetic programming and evolutionary programming. In his book discussing the use of genetic programming for the optimization of complex problems, computer scientist John Koza has also identified an 'inversion' or 'permutation' operator; however, the effectiveness of this operator has never been conclusively demonstrated and this operator is rarely discussed in the field of genetic programming. For combinatorial problems, however, these and other operators tailored to permutations are frequently used by other EAs. Mutation (or mutation-like) operators are said to be unary operators, as they only operate on one chromosome at a time. In contrast, crossover operators are said to be binary operators, as they operate on two chromosomes at a time, combining two existing chromosomes into one new chromosome. == Operators == Genetic variation is a necessity for the process of evolution. Genetic operators used in evolutionary algorithms are analogous to those in the natural world: survival of the fittest, or selection; reproduction (crossover, also called recombination); and mutation. === Selection === Selection operators give preference to better candidate solutions (chromosomes), allowing them to pass on their 'genes' to the next generation (iteration) of the algorithm. The best solutions are determined using some form of objective function (also known as a 'fitness function' in evolutionary algorithms), before being passed to the crossover operator. Different methods for choosing the best solutions exist, for example, fitness proportionate selection and tournament selection. A further or the same selection operator is used to determine the individuals for being selected to form the next parental generation. The selection operator may also ensure that the best solution(s) from the current generation always become(s) a member of the next generation without being altered; this is known as elitism or elitist selection. === Crossover === Crossover is the process of taking more than one parent solutions (chromosomes) and producing a child solution from them. By recombining portions of good solutions, the evolutionary algorithm is more likely to create a better solution. As with selection, there are a number of different methods for combining the parent solutions, including the edge recombination operator (ERO) and the 'cut and splice crossover' and 'uniform crossover' methods. The crossover method is often chosen to closely match the chromosome's representation of the solution; this may become particularly important when variables are grouped together as building blocks, which might be disrupted by a non-respectful crossover operator. Similarly, crossover methods may be particularly suited to certain problems; the ERO is considered a good option for solving the travelling salesman problem. === Mutation === The mutation operator encourages genetic diversity amongst solutions and attempts to prevent the evolutionary algorithm converging to a local minimum by stopping the solutions becoming too close to one another. In mutating the current pool of solutions, a given solution may change between slightly and entirely from the previous solution. By mutating the solutions, an evolutionary algorithm can reach an improved solution solely through the mutation operator. Again, different methods of mutation may be used; these range from a simple bit mutation (flipping random bits in a binary string chromosome with some low probability) to more complex mutation methods in which genes in the solution are changed, for example by adding a random value from the Gaussian distribution to the current gene value. As with the crossover operator, the mutation method is usually chosen to match the representation of the solution within the chromosome. == Combining operators == While each operator acts to improve the solutions produced by the evolutionary algorithm working individually, the operators must work in conjunction with each other for the algorithm to be successful in finding a good solution. Using the selection operator on its own will tend to fill the solution population with copies of the best solution from the population. If the selection and crossover operators are used without the mutation operator, the algorithm will tend to converge to a local minimum, that is, a good but sub-optimal solution to the problem. Using the mutation operator on its own leads to a random walk through the search space. Only by using all three operators together can the evolutionary algorithm become a noise-tolerant global search algorithm, yielding good solutions to the problem at hand.

    Read more →