Locality-sensitive hashing

Locality-sensitive hashing

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. The number of buckets is much smaller than the universe of possible input items. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items. Hashing-based approximate nearest-neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as locality-preserving hashing (LPH). Locality-preserving hashing was initially devised as a way to facilitate data pipelining in implementations of massively parallel algorithms that use randomized routing and universal hashing to reduce memory contention and network congestion. == Definitions == A finite family F {\displaystyle {\mathcal {F}}} of functions h : M → S {\displaystyle h\colon M\to S} is defined to be an LSH family for a metric space M = ( M , d ) {\displaystyle {\mathcal {M}}=(M,d)} , a threshold r > 0 {\displaystyle r>0} , an approximation factor c > 1 {\displaystyle c>1} , and probabilities p 1 > p 2 {\displaystyle p_{1}>p_{2}} if it satisfies the following condition. For any two points a , b ∈ M {\displaystyle a,b\in M} and a hash function h {\displaystyle h} chosen uniformly at random from F {\displaystyle {\mathcal {F}}} : If d ( a , b ) ≤ r {\displaystyle d(a,b)\leq r} , then h ( a ) = h ( b ) {\displaystyle h(a)=h(b)} (i.e., a and b collide) with probability at least p 1 {\displaystyle p_{1}} , If d ( a , b ) ≥ c r {\displaystyle d(a,b)\geq cr} , then h ( a ) = h ( b ) {\displaystyle h(a)=h(b)} with probability at most p 2 {\displaystyle p_{2}} . Such a family F {\displaystyle {\mathcal {F}}} is called ( r , c r , p 1 , p 2 ) {\displaystyle (r,cr,p_{1},p_{2})} -sensitive. === LSH with respect to a similarity measure === Alternatively it is possible to define an LSH family on a universe of items U endowed with a similarity function ϕ : U × U → [ 0 , 1 ] {\displaystyle \phi \colon U\times U\to [0,1]} . In this setting, a LSH scheme is a family of hash functions H coupled with a probability distribution D over H such that a function h ∈ H {\displaystyle h\in H} chosen according to D satisfies P r [ h ( a ) = h ( b ) ] = ϕ ( a , b ) {\displaystyle Pr[h(a)=h(b)]=\phi (a,b)} for each a , b ∈ U {\displaystyle a,b\in U} . === Amplification === Given a ( d 1 , d 2 , p 1 , p 2 ) {\displaystyle (d_{1},d_{2},p_{1},p_{2})} -sensitive family F {\displaystyle {\mathcal {F}}} , we can construct new families G {\displaystyle {\mathcal {G}}} by either the AND-construction or OR-construction of F {\displaystyle {\mathcal {F}}} . To create an AND-construction, we define a new family G {\displaystyle {\mathcal {G}}} of hash functions g, where each function g is constructed from k random functions h 1 , … , h k {\displaystyle h_{1},\ldots ,h_{k}} from F {\displaystyle {\mathcal {F}}} . We then say that for a hash function g ∈ G {\displaystyle g\in {\mathcal {G}}} , g ( x ) = g ( y ) {\displaystyle g(x)=g(y)} if and only if all h i ( x ) = h i ( y ) {\displaystyle h_{i}(x)=h_{i}(y)} for i = 1 , 2 , … , k {\displaystyle i=1,2,\ldots ,k} . Since the members of F {\displaystyle {\mathcal {F}}} are independently chosen for any g ∈ G {\displaystyle g\in {\mathcal {G}}} , G {\displaystyle {\mathcal {G}}} is a ( d 1 , d 2 , p 1 k , p 2 k ) {\displaystyle (d_{1},d_{2},p_{1}^{k},p_{2}^{k})} -sensitive family. To create an OR-construction, we define a new family G {\displaystyle {\mathcal {G}}} of hash functions g, where each function g is constructed from k random functions h 1 , … , h k {\displaystyle h_{1},\ldots ,h_{k}} from F {\displaystyle {\mathcal {F}}} . We then say that for a hash function g ∈ G {\displaystyle g\in {\mathcal {G}}} , g ( x ) = g ( y ) {\displaystyle g(x)=g(y)} if and only if h i ( x ) = h i ( y ) {\displaystyle h_{i}(x)=h_{i}(y)} for one or more values of i. Since the members of F {\displaystyle {\mathcal {F}}} are independently chosen for any g ∈ G {\displaystyle g\in {\mathcal {G}}} , G {\displaystyle {\mathcal {G}}} is a ( d 1 , d 2 , 1 − ( 1 − p 1 ) k , 1 − ( 1 − p 2 ) k ) {\displaystyle (d_{1},d_{2},1-(1-p_{1})^{k},1-(1-p_{2})^{k})} -sensitive family. == Applications == LSH has been applied to several problem domains, including: Near-duplicate detection Hierarchical clustering Genome-wide association study Image similarity identification VisualRank Gene expression similarity identification Audio similarity identification Nearest neighbor search Audio fingerprint Digital video fingerprinting Shared memory organization in parallel computing Physical data organization in database management systems Training fully connected neural networks Computer security Machine learning == Methods == === Bit sampling for Hamming distance === One of the easiest ways to construct an LSH family is by bit sampling. This approach works for the Hamming distance over d-dimensional vectors { 0 , 1 } d {\displaystyle \{0,1\}^{d}} . Here, the family F {\displaystyle {\mathcal {F}}} of hash functions is simply the family of all the projections of points on one of the d {\displaystyle d} coordinates, i.e., F = { h : { 0 , 1 } d → { 0 , 1 } ∣ h ( x ) = x i for some i ∈ { 1 , … , d } } {\displaystyle {\mathcal {F}}=\{h\colon \{0,1\}^{d}\to \{0,1\}\mid h(x)=x_{i}{\text{ for some }}i\in \{1,\ldots ,d\}\}} , where x i {\displaystyle x_{i}} is the i {\displaystyle i} th coordinate of x {\displaystyle x} . A random function h {\displaystyle h} from F {\displaystyle {\mathcal {F}}} simply selects a random bit from the input point. This family has the following parameters: P 1 = 1 − R / d {\displaystyle P_{1}=1-R/d} , P 2 = 1 − c R / d {\displaystyle P_{2}=1-cR/d} . That is, any two vectors x , y {\displaystyle x,y} with Hamming distance at most R {\displaystyle R} collide under a random h {\displaystyle h} with probability at least P 1 {\displaystyle P_{1}} . Any x , y {\displaystyle x,y} with Hamming distance at least c R {\displaystyle cR} collide with probability at most P 2 {\displaystyle P_{2}} . === Min-wise independent permutations === Suppose U is composed of subsets of some ground set of enumerable items S and the similarity function of interest is the Jaccard index J. If π is a permutation on the indices of S, for A ⊆ S {\displaystyle A\subseteq S} let h ( A ) = min a ∈ A { π ( a ) } {\displaystyle h(A)=\min _{a\in A}\{\pi (a)\}} . Each possible choice of π defines a single hash function h mapping input sets to elements of S. Define the function family H to be the set of all such functions and let D be the uniform distribution. Given two sets A , B ⊆ S {\displaystyle A,B\subseteq S} the event that h ( A ) = h ( B ) {\displaystyle h(A)=h(B)} corresponds exactly to the event that the minimizer of π over A ∪ B {\displaystyle A\cup B} lies inside A ∩ B {\displaystyle A\cap B} . As h was chosen uniformly at random, P r [ h ( A ) = h ( B ) ] = J ( A , B ) {\displaystyle Pr[h(A)=h(B)]=J(A,B)\,} and ( H , D ) {\displaystyle (H,D)\,} define an LSH scheme for the Jaccard index. Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. Because of this fact, there has been significant work on finding a family of permutations that is "min-wise independent" — a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen π. It has been established that a min-wise independent family of permutations is at least of size lcm ⁡ { 1 , 2 , … , n } ≥ e n − o ( n ) {\displaystyle \operatorname {lcm} \{\,1,2,\ldots ,n\,\}\geq e^{n-o(n)}} , and that this bound is tight. Because min-wise independent families are too big for practical applications, two variant notions of min-wise independence are introduced: restricted min-wise independent permutations families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k. Approximate min-wise independence differs from the property by at most a fixed ε. === Open source methods === ==== Nilsimsa Hash ==== Nilsimsa is a locality-sensitive hashing algorithm used in anti-spam efforts. The goal of Nilsimsa is to generate a hash digest of an email message such that the digests of two similar messages are similar to each other. The paper suggests that the Nilsimsa satisfies three requirements: The digest identifying each message should not

Dropbox Paper

Dropbox Paper, or simply Paper, is a collaborative document-editing service developed by Dropbox. Originating from the company's acquisition of document collaboration company Hackpad in April 2014, Dropbox Paper was officially announced in October 2015, and launched in January 2017. It offers a web application, as well as mobile apps for Android and iOS. Dropbox Paper was described in the official announcement post as "a flexible workspace that brings people and ideas together. With Paper, teams can create, review, revise, manage, and organize — all in shared documents". Reception of Dropbox Paper has been mixed. Critics praised collaboration functionality, including content available immediately, the ability to mention specific collaborators, assign tasks, write comments, as well as editing attribution, and revision history. It received particular praise for its support for rich media from a variety of sources, with one reviewer noting that the Paper's support for rich media exceeds the capabilities of most of its competitors. However, it was criticized for a lack of formatting options and editing features. While the user interface was liked for being minimal, reviewers cited the lack of a fixed formatting bar and missing features present in competitors' products as making Dropbox Paper seem like a "light" tool. == History == Dropbox acquired document collaboration company Hackpad in April 2014. A year later, Dropbox launched a Dropbox Notes note-taking product in beta testing phase. Dropbox Paper was officially announced on October 15, 2015, followed by an open beta and release of mobile Android and iOS apps in August 2016. Dropbox Paper was officially released on January 30, 2017. == Reception == In a comparison between Dropbox Paper and Evernote, PC World's Michael Ansaldo wrote that "With its emphasis on document creation, you might expect formatting to be front and center in Dropbox Paper. That's not the case." Ansaldo noted the lack of a "fixed formatting toolbar as you'd find in Evernote or a word processor like Google Docs or Microsoft Word. Instead, the text editor appears as a floating ribbon only when you highlight selected text." The only formatting options available for emphasis were bolding, strikethrough, bulleted and numbered lists, and H1 and H2 tags. Users can also add links, convert text to checklists, and add comments. Ansaldo wrote that "Both Evernote and Dropbox Paper make it easy to add images to a document", but also noted that "Dropbox Paper doesn't support any image editing". Paper supports rich media, and users can "add rich content to your document just by pasting a link to the file. In addition to Dropbox, Paper supports media from a variety of popular services including YouTube, Spotify, Vimeo, SoundCloud, Facebook, and Google's productivity suite. Once the file appears, you can delete the link for a cleaner display." To start working with other people, Paper "allows you to invite people via email from within a document", with sharing options for who can view the link (anyone with the link or just the invited person), and action permissions (edit or only comment). Regarding collaboration, Ansaldo wrote that "Creative collaboration is Paper’s marquee feature, and it provides a variety of ways to work effectively with others in real time". Users can "make any content immediately visible and accessible to a specific collaborator with "@mentions"", and "You can also use @mentions to create and assign task lists within a document." Paper also "boasts essential collaboration tools including comments, editing attribution, and revision history." Writing for TechRadar, John Brandon wrote that Dropbox Paper "might be a 'light' tool for now without the extensive templates of Microsoft Office or the integration with other apps in the Zoho suite, but it does work well with the Dropbox storage service that's so popular with office workers these days." Kyle Wiggers of Digital Trends wrote that Paper is "all about minimizing distractions. Its interface is quite literally a big, blank canvas on which you tap out your agenda. You can organize notes by title and create to-do lists, but even basic formatting tools are obscured from view", noting Paper's "floating box above words and phrases highlighted by your cursor". Wiggers stated that "Paper is not a to-do organizer", but that it's "well suited to the purpose thanks to a bevy of labor-saving conveniences", highlighting that Paper "supports more media than most of its to-do and note-taking counterparts". He praised the collaboration tools, writing that they "are as extensive as you'd hope, and then some", citing its invitation system with permission controls, lists of changes and revision history, comment and chat support, and "perhaps best of all", the ability to assign tasks with a "@" mention. Business Insider's Alex Heath praised that "Paper's interface is spotless and friendly to write in. You don't feel overwhelmed with formatting options", but criticized the available features, writing that "Google Docs is much more full-featured in the formatting department, so Paper has some catching up to do if it wants to be on par with the competition". Writing for The Verge, Casey Newton praised Paper's handling of rich media, complimenting it for being "great", and added that "I imagine that creative types who work on teams will appreciate having rich media embedded in the documents they're working on rather than in a series of infinite tabs".

Least-squares support vector machine

Least-squares support-vector machines (LS-SVM) for statistics and in statistical modeling, are least-squares versions of support-vector machines (SVM), which are a set of related supervised learning methods that analyze data and recognize patterns, and which are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of a convex quadratic programming (QP) problem for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods. == From support-vector machine to least-squares support-vector machine == Given a training set { x i , y i } i = 1 N {\displaystyle \{x_{i},y_{i}\}_{i=1}^{N}} with input data x i ∈ R n {\displaystyle x_{i}\in \mathbb {R} ^{n}} and corresponding binary class labels y i ∈ { − 1 , + 1 } {\displaystyle y_{i}\in \{-1,+1\}} , the SVM classifier, according to Vapnik's original formulation, satisfies the following conditions: { w T ϕ ( x i ) + b ≥ 1 , if y i = + 1 , w T ϕ ( x i ) + b ≤ − 1 , if y i = − 1 , {\displaystyle {\begin{cases}w^{T}\phi (x_{i})+b\geq 1,&{\text{if }}\quad y_{i}=+1,\\w^{T}\phi (x_{i})+b\leq -1,&{\text{if }}\quad y_{i}=-1,\end{cases}}} which is equivalent to y i [ w T ϕ ( x i ) + b ] ≥ 1 , i = 1 , … , N , {\displaystyle y_{i}\left[{w^{T}\phi (x_{i})+b}\right]\geq 1,\quad i=1,\ldots ,N,} where ϕ ( x ) {\displaystyle \phi (x)} is the nonlinear map from original space to the high- or infinite-dimensional space. === Inseparable data === In case such a separating hyperplane does not exist, we introduce so-called slack variables ξ i {\displaystyle \xi _{i}} such that { y i [ w T ϕ ( x i ) + b ] ≥ 1 − ξ i , i = 1 , … , N , ξ i ≥ 0 , i = 1 , … , N . {\displaystyle {\begin{cases}y_{i}\left[{w^{T}\phi (x_{i})+b}\right]\geq 1-\xi _{i},&i=1,\ldots ,N,\\\xi _{i}\geq 0,&i=1,\ldots ,N.\end{cases}}} According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem: min J 1 ( w , ξ ) = 1 2 w T w + c ∑ i = 1 N ξ i , {\displaystyle \min J_{1}(w,\xi )={\frac {1}{2}}w^{T}w+c\sum \limits _{i=1}^{N}\xi _{i},} Subject to { y i [ w T ϕ ( x i ) + b ] ≥ 1 − ξ i , i = 1 , … , N , ξ i ≥ 0 , i = 1 , … , N , {\displaystyle {\text{Subject to }}{\begin{cases}y_{i}\left[{w^{T}\phi (x_{i})+b}\right]\geq 1-\xi _{i},&i=1,\ldots ,N,\\\xi _{i}\geq 0,&i=1,\ldots ,N,\end{cases}}} To solve this problem, we could construct the Lagrangian function: L 1 ( w , b , ξ , α , β ) = 1 2 w T w + c ∑ i = 1 N ξ i − ∑ i = 1 N α i { y i [ w T ϕ ( x i ) + b ] − 1 + ξ i } − ∑ i = 1 N β i ξ i , {\displaystyle L_{1}(w,b,\xi ,\alpha ,\beta )={\frac {1}{2}}w^{T}w+c\sum \limits _{i=1}^{N}{\xi _{i}}-\sum \limits _{i=1}^{N}\alpha _{i}\left\{y_{i}\left[{w^{T}\phi (x_{i})+b}\right]-1+\xi _{i}\right\}-\sum \limits _{i=1}^{N}\beta _{i}\xi _{i},} where α i ≥ 0 , β i ≥ 0 ( i = 1 , … , N ) {\displaystyle \alpha _{i}\geq 0,\ \beta _{i}\geq 0\ (i=1,\ldots ,N)} are the Lagrangian multipliers. The optimal point will be in the saddle point of the Lagrangian function, and then we obtain By substituting w {\displaystyle w} by its expression in the Lagrangian formed from the appropriate objective and constraints, we will get the following quadratic programming problem: max Q 1 ( α ) = − 1 2 ∑ i , j = 1 N α i α j y i y j K ( x i , x j ) + ∑ i = 1 N α i , {\displaystyle \max Q_{1}(\alpha )=-{\frac {1}{2}}\sum \limits _{i,j=1}^{N}{\alpha _{i}\alpha _{j}y_{i}y_{j}K(x_{i},x_{j})}+\sum \limits _{i=1}^{N}\alpha _{i},} where K ( x i , x j ) = ⟨ ϕ ( x i ) , ϕ ( x j ) ⟩ {\displaystyle K(x_{i},x_{j})=\left\langle \phi (x_{i}),\phi (x_{j})\right\rangle } is called the kernel function. Solving this QP problem subject to constraints in (1), we will get the hyperplane in the high-dimensional space and hence the classifier in the original space. === Least-squares SVM formulation === The least-squares version of the SVM classifier is obtained by reformulating the minimization problem as min J 2 ( w , b , e ) = μ 2 w T w + ζ 2 ∑ i = 1 N e i 2 , {\displaystyle \min J_{2}(w,b,e)={\frac {\mu }{2}}w^{T}w+{\frac {\zeta }{2}}\sum \limits _{i=1}^{N}e_{i}^{2},} subject to the equality constraints y i [ w T ϕ ( x i ) + b ] = 1 − e i , i = 1 , … , N . {\displaystyle y_{i}\left[{w^{T}\phi (x_{i})+b}\right]=1-e_{i},\quad i=1,\ldots ,N.} The least-squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets y i = ± 1 {\displaystyle y_{i}=\pm 1} . Using y i 2 = 1 {\displaystyle y_{i}^{2}=1} , we have ∑ i = 1 N e i 2 = ∑ i = 1 N ( y i e i ) 2 = ∑ i = 1 N e i 2 = ∑ i = 1 N ( y i − ( w T ϕ ( x i ) + b ) ) 2 , {\displaystyle \sum \limits _{i=1}^{N}e_{i}^{2}=\sum \limits _{i=1}^{N}(y_{i}e_{i})^{2}=\sum \limits _{i=1}^{N}e_{i}^{2}=\sum \limits _{i=1}^{N}\left(y_{i}-(w^{T}\phi (x_{i})+b)\right)^{2},} with e i = y i − ( w T ϕ ( x i ) + b ) . {\displaystyle e_{i}=y_{i}-(w^{T}\phi (x_{i})+b).} Notice, that this error would also make sense for least-squares data fitting, so that the same end results holds for the regression case. Hence the LS-SVM classifier formulation is equivalent to J 2 ( w , b , e ) = μ E W + ζ E D {\displaystyle J_{2}(w,b,e)=\mu E_{W}+\zeta E_{D}} with E W = 1 2 w T w {\displaystyle E_{W}={\frac {1}{2}}w^{T}w} and E D = 1 2 ∑ i = 1 N e i 2 = 1 2 ∑ i = 1 N ( y i − ( w T ϕ ( x i ) + b ) ) 2 . {\displaystyle E_{D}={\frac {1}{2}}\sum \limits _{i=1}^{N}e_{i}^{2}={\frac {1}{2}}\sum \limits _{i=1}^{N}\left(y_{i}-(w^{T}\phi (x_{i})+b)\right)^{2}.} Both μ {\displaystyle \mu } and ζ {\displaystyle \zeta } should be considered as hyperparameters to tune the amount of regularization versus the sum squared error. The solution does only depend on the ratio γ = ζ / μ {\displaystyle \gamma =\zeta /\mu } , therefore the original formulation uses only γ {\displaystyle \gamma } as tuning parameter. We use both μ {\displaystyle \mu } and ζ {\displaystyle \zeta } as parameters in order to provide a Bayesian interpretation to LS-SVM. The solution of LS-SVM regressor will be obtained after we construct the Lagrangian function: { L 2 ( w , b , e , α ) = J 2 ( w , e ) − ∑ i = 1 N α i { [ w T ϕ ( x i ) + b ] + e i − y i } , = 1 2 w T w + γ 2 ∑ i = 1 N e i 2 − ∑ i = 1 N α i { [ w T ϕ ( x i ) + b ] + e i − y i } , {\displaystyle {\begin{cases}L_{2}(w,b,e,\alpha )\;=J_{2}(w,e)-\sum \limits _{i=1}^{N}\alpha _{i}\left\{{\left[{w^{T}\phi (x_{i})+b}\right]+e_{i}-y_{i}}\right\},\\\quad \quad \quad \quad \quad \;={\frac {1}{2}}w^{T}w+{\frac {\gamma }{2}}\sum \limits _{i=1}^{N}e_{i}^{2}-\sum \limits _{i=1}^{N}\alpha _{i}\left\{\left[w^{T}\phi (x_{i})+b\right]+e_{i}-y_{i}\right\},\end{cases}}} where α i ∈ R {\displaystyle \alpha _{i}\in \mathbb {R} } are the Lagrange multipliers. The conditions for optimality are { ∂ L 2 ∂ w = 0 → w = ∑ i = 1 N α i ϕ ( x i ) , ∂ L 2 ∂ b = 0 → ∑ i = 1 N α i = 0 , ∂ L 2 ∂ e i = 0 → α i = γ e i , i = 1 , … , N , ∂ L 2 ∂ α i = 0 → y i = w T ϕ ( x i ) + b + e i , i = 1 , … , N . {\displaystyle {\begin{cases}{\frac {\partial L_{2}}{\partial w}}=0\quad \to \quad w=\sum \limits _{i=1}^{N}\alpha _{i}\phi (x_{i}),\\{\frac {\partial L_{2}}{\partial b}}=0\quad \to \quad \sum \limits _{i=1}^{N}\alpha _{i}=0,\\{\frac {\partial L_{2}}{\partial e_{i}}}=0\quad \to \quad \alpha _{i}=\gamma e_{i},\;i=1,\ldots ,N,\\{\frac {\partial L_{2}}{\partial \alpha _{i}}}=0\quad \to \quad y_{i}=w^{T}\phi (x_{i})+b+e_{i},\,i=1,\ldots ,N.\end{cases}}} Elimination of w {\displaystyle w} and e {\displaystyle e} will yield a linear system instead of a quadratic programming problem: [ 0 1 N T 1 N Ω + γ − 1 I N ] [ b α ] = [ 0 Y ] , {\displaystyle \left[{\begin{matrix}0&1_{N}^{T}\\1_{N}&\Omega +\gamma ^{-1}I_{N}\end{matrix}}\right]\left[{\begin{matrix}b\\\alpha \end{matrix}}\right]=\left[{\begin{matrix}0\\Y\end{matrix}}\right],} with Y = [ y 1 , … , y N ] T {\displaystyle Y=[y_{1},\ldots ,y_{N}]^{T}} , 1 N = [ 1 , … , 1 ] T {\displaystyle 1_{N}=[1,\ldots ,1]^{T}} and α = [ α 1 , … , α N ] T {\displaystyle \alpha =[\alpha _{1},\ldots ,\alpha _{N}]^{T}} . Here, I N {\displaystyle I_{N}} is an N × N {\displaystyle N\times N} identity matrix, and Ω ∈ R N × N {\displaystyle \Omega \in \mathbb {R} ^{N\times N}} is the kernel matrix defined by Ω i j = ϕ ( x i ) T ϕ ( x j ) = K ( x i , x j ) {\displaystyle \Omega _{ij}=\phi (x_{i})^{T}\phi (x_{j})=K(x_{i},x_{j})} . === Kernel function K === For the kernel function K(•, •) one typically has the following choices: Linear kernel : K ( x , x i ) = x i T x , {\displaystyle K(x,x_{i})=x_{i}^{T}x,} Polynomial kernel of degree d {\displaystyle d} : K ( x , x i ) = ( 1 + x i T x / c ) d , {\displaystyle K(x,x_{i})=\left({1+x_{i}^{T}x/c}\right)^{d},} Radial basis function RBF kernel : K ( x , x i ) = exp ⁡ ( − ‖ x − x i ‖ 2 / σ 2 ) , {\displaystyle K(x,x_{i})=\exp \left({-\left\|{x-x_{i}}\right\|^{2}/\sigma ^{2}}\right),} MLP kernel : K ( x , x i ) = tanh ⁡ ( k x i T x + θ ) , {\displaystyle K(x,x_{i})=\tanh \left({k

Extremal Ensemble Learning

Extremal Ensemble Learning (EEL) is a machine learning algorithmic paradigm for graph partitioning. EEL creates an ensemble of partitions and then uses information contained in the ensemble to find new and improved partitions. The ensemble evolves and learns how to form improved partitions through extremal updating procedure. The final solution is found by achieving consensus among its member partitions about what the optimal partition is. == Reduced-Network Extremal Ensemble Learning (RenEEL) == A particular implementation of the EEL paradigm is the Reduced-Network Extremal Ensemble Learning (RenEEL) scheme for partitioning a graph. RenEEL uses consensus across many partitions in an ensemble to create a reduced network that can be efficiently analyzed to find more accurate partitions. These better quality partitions are subsequently used to update the ensemble. An algorithm that utilizes the RenEEL scheme is currently the best algorithm for finding the graph partition with maximum modularity, which is an NP-hard problem.

Influence diagram

An influence diagram (ID) (also called a relevance diagram, decision diagram or a decision network) is a compact graphical and mathematical representation of a decision situation. It is a generalization of a Bayesian network, in which not only probabilistic inference problems but also decision making problems (following the maximum expected utility criterion) can be modeled and solved. ID was first developed in the mid-1970s by decision analysts with an intuitive semantic that is easy to understand. It is now adopted widely and becoming an alternative to the decision tree which typically suffers from exponential growth in number of branches with each variable modeled. ID is directly applicable in team decision analysis, since it allows incomplete sharing of information among team members to be modeled and solved explicitly. Extensions of ID also find their use in game theory as an alternative representation of the game tree. == Semantics == An ID is a directed acyclic graph with three types (plus one subtype) of node and three types of arc (or arrow) between nodes. Nodes: Decision node (corresponding to each decision to be made) is drawn as a rectangle. Uncertainty node (corresponding to each uncertainty to be modeled) is drawn as an oval. Deterministic node (corresponding to special kind of uncertainty that its outcome is deterministically known whenever the outcome of some other uncertainties are also known) is drawn as a double oval. Value node (corresponding to each component of additively separable Von Neumann-Morgenstern utility function) is drawn as an octagon (or diamond). Arcs: Functional arcs (ending in value node) indicate that one of the components of additively separable utility function is a function of all the nodes at their tails. Conditional arcs (ending in uncertainty node) indicate that the uncertainty at their heads is probabilistically conditioned on all the nodes at their tails. Conditional arcs (ending in deterministic node) indicate that the uncertainty at their heads is deterministically conditioned on all the nodes at their tails. Informational arcs (ending in decision node) indicate that the decision at their heads is made with the outcome of all the nodes at their tails known beforehand. Given a properly structured ID: Decision nodes and incoming information arcs collectively state the alternatives (what can be done when the outcome of certain decisions and/or uncertainties are known beforehand) Uncertainty/deterministic nodes and incoming conditional arcs collectively model the information (what are known and their probabilistic/deterministic relationships) Value nodes and incoming functional arcs collectively quantify the preference (how things are preferred over one another). Alternative, information, and preference are termed decision basis in decision analysis, they represent three required components of any valid decision situation. Formally, the semantic of influence diagram is based on sequential construction of nodes and arcs, which implies a specification of all conditional independencies in the diagram. The specification is defined by the d {\displaystyle d} -separation criterion of Bayesian network. According to this semantic, every node is probabilistically independent on its non-successor nodes given the outcome of its immediate predecessor nodes. Likewise, a missing arc between non-value node X {\displaystyle X} and non-value node Y {\displaystyle Y} implies that there exists a set of non-value nodes Z {\displaystyle Z} , e.g., the parents of Y {\displaystyle Y} , that renders Y {\displaystyle Y} independent of X {\displaystyle X} given the outcome of the nodes in Z {\displaystyle Z} . == Example == Consider the simple influence diagram representing a situation where a decision-maker is planning their vacation. There is 1 decision node (Vacation Activity), 2 uncertainty nodes (Weather Condition, Weather Forecast), and 1 value node (Satisfaction). There are 2 functional arcs (ending in Satisfaction), 1 conditional arc (ending in Weather Forecast), and 1 informational arc (ending in Vacation Activity). Functional arcs ending in Satisfaction indicate that Satisfaction is a utility function of Weather Condition and Vacation Activity. In other words, their satisfaction can be quantified if they know what the weather is like and what their choice of activity is. (Note that they do not value Weather Forecast directly) Conditional arc ending in Weather Forecast indicates their belief that Weather Forecast and Weather Condition can be dependent. Informational arc ending in Vacation Activity indicates that they will only know Weather Forecast, not Weather Condition, when making their choice. In other words, actual weather will be known after they make their choice, and only forecast is what they can count on at this stage. It also follows semantically, for example, that Vacation Activity is independent on (irrelevant to) Weather Condition given Weather Forecast is known. == Applicability to value of information == The above example highlights the power of the influence diagram in representing an extremely important concept in decision analysis known as the value of information. Consider the following three scenarios; Scenario 1: The decision-maker could make their Vacation Activity decision while knowing what Weather Condition will be like. This corresponds to adding extra informational arc from Weather Condition to Vacation Activity in the above influence diagram. Scenario 2: The original influence diagram as shown above. Scenario 3: The decision-maker makes their decision without even knowing the Weather Forecast. This corresponds to removing informational arc from Weather Forecast to Vacation Activity in the above influence diagram. Scenario 1 is the best possible scenario for this decision situation since there is no longer any uncertainty on what they care about (Weather Condition) when making their decision. Scenario 3, however, is the worst possible scenario for this decision situation since they need to make their decision without any hint (Weather Forecast) on what they care about (Weather Condition) will turn out to be. The decision-maker is usually better off (definitely no worse off, on average) to move from scenario 3 to scenario 2 through the acquisition of new information. The most they should be willing to pay for such move is called the value of information on Weather Forecast, which is essentially the value of imperfect information on Weather Condition. The applicability of this simple ID and the value of information concept is tremendous, especially in medical decision making when most decisions have to be made with imperfect information about their patients, diseases, etc. == Related concepts == Influence diagrams are hierarchical and can be defined either in terms of their structure or in greater detail in terms of the functional and numerical relation between diagram elements. An ID that is consistently defined at all levels—structure, function, and number—is a well-defined mathematical representation and is referred to as a well-formed influence diagram (WFID). WFIDs can be evaluated using reversal and removal operations to yield answers to a large class of probabilistic, inferential, and decision questions. More recent techniques have been developed by artificial intelligence researchers concerning Bayesian network inference (belief propagation). An influence diagram having only uncertainty nodes (i.e., a Bayesian network) is also called a relevance diagram. An arc connecting node A to B implies not only that "A is relevant to B", but also that "B is relevant to A" (i.e., relevance is a symmetric relationship).

Data cube

In computer programming, a data cube (or datacube) is a multi-dimensional array of values. Typically, the term "data cube" is applied in contexts where these arrays are massively larger than the hosting computer's main memory; examples include multi-terabyte/petabyte data warehouses and time series of image data. Even though it is called a cube, a data cube generally is a multi-dimensional concept which can be 1-dimensional, 2-dimensional, 3-dimensional, or higher-dimensional. The data cube is used to represent data (sometimes called facts) along some dimensions of interest. In satellite image timeseries, dimensions would be latitude and longitude coordinates and time; a fact (sometimes called measure) would be a pixel at a given space and time as taken by the satellite. For example, in online analytical processing, an OLAP cube about a company would have dimensions that could be the company subsidiaries, the company products, and time; in this setup, a fact would be a sales event where a particular product has been sold in a particular subsidiary at a particular time. In any case, every dimension divides data into groups of cells whereas each cell in the cube represents a single measure of interest. Sometimes cubes hold only a few values with the rest being empty, i.e. undefined, while sometimes most or all cube coordinates hold a cell value. In the first case such data are called sparse, and in the second case they are called dense, although there is no hard delineation between the two. Data cubes may be stored in database management systems (DBMS) as part of array DBMS. Spatio-temporal databases and geospatial databases may also be represented as coverage data. == History == Multi-dimensional arrays have long been familiar in programming languages. Fortran offers arbitrarily-indexed 1-D arrays and arrays of arrays, which allows the construction of higher-dimensional arrays, up to 15 dimensions. APL supports n-D arrays with a rich set of operations. All these have in common that arrays must fit into the main memory and are available only while the particular program maintaining them (such as image processing software) is running. A series of data exchange formats support storage and transmission of data cube-like data, often tailored towards particular application domains. Examples include MDX for statistical (in particular, business) data, Zarr and Hierarchical Data Format for general scientific data, and TIFF for imagery. In 1992, Peter Baumann introduced management of massive data cubes with high-level user functionality combined with an efficient software architecture. Datacube operations include subset extraction, processing, fusion, and in general queries in the spirit of data manipulation languages like SQL. Some years after, the data cube concept was applied to describe time-varying business data as data cubes by Jim Gray, et al., and by Venky Harinarayan, Anand Rajaraman and Jeff Ullman. Around that time, a working group on Multi-Dimensional Databases ("Arbeitskreis Multi-Dimensionale Datenbanken") was established at German Gesellschaft für Informatik. Datacube Inc. was an image processing company selling hardware and software applications for the PC market in 1996, however without addressing data cubes as such. The EarthServer initiative has established geo data cube service requirements. == Standardization == In 2018, the ISO SQL database language was extended with data cube functionality as "SQL – Part 15: Multi-dimensional arrays (SQL/MDA)". Web Coverage Processing Service is a geo data cube analytics language issued by the Open Geospatial Consortium in 2008. In addition to the common data cube operations, the language knows about the semantics of space and time and supports both regular and irregular grid data cubes, based on the concept of coverage data. An industry standard for querying business data cubes, originally developed by Microsoft, is MultiDimensional eXpressions. == Implementation == Many high-level computer languages treat data cubes and other large arrays as single entities distinct from their contents. These languages, of which Fortran, APL, IDL, NumPy, PDL, and S-Lang are examples, allow the programmer to manipulate complete film clips and other data en masse with simple expressions derived from linear algebra and vector mathematics. Some languages (such as PDL) distinguish between a list of images and a data cube, while many (such as IDL) do not. Array DBMSs (Database Management Systems) offer a data model which generically supports definition, management, retrieval, and manipulation of n-dimensional data cubes. This database category has been pioneered by the rasdaman system since 1994. == Applications == Multi-dimensional arrays can meaningfully represent spatio-temporal sensor, image, and simulation data, but also statistics data where the semantics of dimensions is not necessarily of spatial or temporal nature. Generally, any kind of axis can be combined with any other into a data cube. === Mathematics === In mathematics, a one-dimensional array corresponds to a vector, a two-dimensional array resembles a matrix; more generally, a tensor may be represented as an n-dimensional data cube. === Science and engineering === For a time sequence of color images, the array is generally four-dimensional, with the dimensions representing image X and Y coordinates, time, and RGB (or other color space) color plane. For example, the EarthServer initiative unites data centers from different continents offering 3-D x/y/t satellite image timeseries and 4-D x/y/z/t weather data for retrieval and server-side processing through the Open Geospatial Consortium WCPS geo data cube query language standard. A data cube is also used in the field of imaging spectroscopy, since a spectrally-resolved image is represented as a three-dimensional volume. Earth observation data cubes combine satellite imagery such as Landsat 8 and Sentinel-2 with Geographic information system analytics. === Business intelligence === In online analytical processing (OLAP), data cubes are a common arrangement of business data suitable for analysis from different perspectives through operations like slicing, dicing, pivoting, and aggregation.

Multi expression programming

Multi Expression Programming (MEP) is an evolutionary algorithm for generating mathematical functions describing a given set of data. MEP is a Genetic Programming variant encoding multiple solutions in the same chromosome. MEP representation is not specific (multiple representations have been tested). In the simplest variant, MEP chromosomes are linear strings of instructions. This representation was inspired by Three-address code. MEP strength consists in the ability to encode multiple solutions, of a problem, in the same chromosome. In this way, one can explore larger zones of the search space. For most of the problems this advantage comes with no running-time penalty compared with genetic programming variants encoding a single solution in a chromosome. == Representation == MEP chromosomes are arrays of instructions represented in Three-address code format. Each instruction contains a variable, a constant, or a function. If the instruction is a function, then the arguments (given as instruction's addresses) are also present. === Example of MEP program === Here is a simple MEP chromosome (labels on the left side are not a part of the chromosome): 1: a 2: b 3: + 1, 2 4: c 5: d 6: + 4, 5 7: 3, 5 == Fitness computation == When the chromosome is evaluated it is unclear which instruction will provide the output of the program. In many cases, a set of programs is obtained, some of them being completely unrelated (they do not have common instructions). For the above chromosome, here is the list of possible programs obtained during decoding: E1 = a, E2 = b, E4 = c, E5 = d, E3 = a + b. E6 = c + d. E7 = (a + b) d. Each instruction is evaluated as a possible output of the program. The fitness (or error) is computed in a standard manner. For instance, in the case of symbolic regression, the fitness is the sum of differences (in absolute value) between the expected output (called target) and the actual output. == Fitness assignment process == Which expression will represent the chromosome? Which one will give the fitness of the chromosome? In MEP, the best of them (which has the lowest error) will represent the chromosome. This is different from other GP techniques: In Linear genetic programming the last instruction will give the output. In Cartesian Genetic Programming the gene providing the output is evolved like all other genes. Note that, for many problems, this evaluation has the same complexity as in the case of encoding a single solution in each chromosome. Thus, there is no penalty in running time compared to other techniques. == Software == === MEPX === MEPX is a cross-platform (Windows, macOS, and Linux Ubuntu) free software for the automatic generation of computer programs. It can be used for data analysis, particularly for solving symbolic regression, statistical classification and time-series problems. === libmep === Libmep is a free and open source library implementing Multi Expression Programming technique. It is written in C++. === hmep === hmep is a new open source library implementing Multi Expression Programming technique in Haskell programming language.