AI Chat Free No Sign Up

AI Chat Free No Sign Up — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Oculus Medium

    Oculus Medium

    Oculus Medium is a digital sculpting software that works with virtual reality headsets and 6DoF motion controllers. It is used to create and paint digital sculptures. Medium works only on Oculus Rift. It was released on December 5, 2016, following with a major update in 2018 introducing new features and a revamped UI. On December 9, 2019, Oculus Medium was acquired by Adobe and re-named to "Medium by Adobe".

    Read more →
  • Julia (programming language)

    Julia (programming language)

    Julia is a dynamic general-purpose programming language. As a high-level language, distinctive aspects of Julia's design include a type system with parametric polymorphism, the use of multiple dispatch as a core programming paradigm, just-in-time compilation and a parallel garbage collection implementation. Notably, Julia does not support classes with encapsulated methods but instead relies on the types of all of a function's arguments to determine which method will be called. By default, Julia is run similarly to scripting languages, using its runtime, and allows for interactions, but Julia programs can also be compiled to small binary standalone executables (or to small libraries for e.g. Python), with e.g. the JuliaC.jl compiler. Julia programs can reuse libraries from other languages, and vice versa. Julia has interoperability with C, C++, Fortran, Rust, Python, and R. Additionally, some Julia packages have bindings to be used from Python and R as libraries. Julia is supported by programmer tools like IDEs (see below) and by notebooks like Pluto.jl, Jupyter, and since 2025, Google Colab officially supports Julia natively. Julia is sometimes used in embedded systems (e.g. has been used in a satellite in space on a Raspberry Pi Compute Module 4; 64-bit Pis work best with Julia, and Julia is supported in Raspbian). == History == Work on Julia began in 2009, when Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman set out to create a free language that was both high-level and fast. On 14 February 2012, the team launched a website with a blog post explaining the language's mission. In an interview with InfoWorld in April 2012, Karpinski said about the name of the language, Julia: "There's no good reason, really. It just seemed like a pretty name." Bezanson said he chose the name on the recommendation of a friend, then years later wrote: Maybe julia stands for "Jeff's uncommon lisp is automated"? Julia's syntax is stable, since version 1.0 in 2018, and Julia has a backward compatibility guarantee for 1.x and also a stability promise for the documented (stable) API, while in the years before in the early development prior to 0.7 the syntax (and semantics) was changed in new versions. All of the (registered package) ecosystem uses the new and improved syntax, and in most cases relies on new APIs that have been added regularly, and in some cases minor additional syntax added in a forward compatible way e.g. in Julia 1.7. In the 10 years since the 2012 launch of pre-1.0 Julia, the community has grown. The Julia package ecosystem has over 11.8 million lines of code (including docs and tests). The JuliaCon academic conference for Julia users and developers has been held annually since 2014 with JuliaCon2020 welcoming over 28,900 unique viewers, and then JuliaCon2021 breaking all previous records (with more than 300 JuliaCon2021 presentations available for free on YouTube, up from 162 the year before), and 43,000 unique viewers during the conference. Three of the Julia co-creators are the recipients of the 2019 James H. Wilkinson Prize for Numerical Software (awarded every four years) "for the creation of Julia, an innovative environment for the creation of high-performance tools that enable the analysis and solution of computational science problems." Also, Alan Edelman, professor of applied mathematics at MIT, has been selected to receive the 2019 IEEE Computer Society Sidney Fernbach Award "for outstanding breakthroughs in high-performance computing, linear algebra, and computational science and for contributions to the Julia programming language." Version 0.3 was released in August 2014. Both Julia 0.7 and version 1.0 were released on 8 August 2018. Julia 1.4 added syntax for generic array indexing to handle e.g. 0-based arrays. The memory model was also changed. Julia 1.5 released in August 2020 added record and replay debugging support, for Mozilla's rr tool. The release changed the behavior in the REPL (to soft scope) to the one used in Jupyter, but keeps full compatible with non-REPL code (that retains hard scope). Julia 1.6 was the largest release since 1.0, and it was the long-term support (LTS) version for the longest time. Since Julia 1.7 development is back to time-based releases, and it was released in November 2021 with e.g. a new default random-number generator and Julia 1.7.3 fixed at least one security issue. Julia 1.8 added options for hiding source code when compiling Julia source code to executables. Julia 1.9 has added the ability to precompile packages to native machine code, done automatically; to improve precompilation of packages a new package PrecompileTools.jl was introduced, for use by package developers. Julia 1.10 was released on 25 December 2023 with new features such as parallel garbage collection. Julia 1.11 was released on 7 October 2024, and with it 1.10.5 became the next long-term support (LTS) version (i.e. those became the only two supported versions), since replaced by 1.10.10 released on 27 June, and 1.6 is no longer an LTS version. Julia 1.11 adds e.g. the new public keyword to signal safe public API (Julia users are advised to use such API, not internals, of Julia or packages, and package authors advised to use the keyword, generally indirectly, e.g. prefixed with the @compat macro, from Compat.jl, to also support older Julia versions, at least the LTS version). Julia 1.12 was released on 7 October 2025 (and 1.12.5 on 9 February 2026), and with it a JuliaC.jl package including the juliac compiler that works with it, for making rather small binary executables (much smaller than was possible before; through the use of new so-called trimming feature). Julia 1.10 LTS is an officially still-supported branch, but the 1.11 branch has also been maintained after 1.12 release, with 1.11.8 released and then 1.11.9 released on 8 February 2026. === JuliaCon === Since 2014, the Julia Community has hosted an annual Julia Conference focused on developers and users. The first JuliaCon took place in Chicago and kickstarted the annual occurrence of the conference. Since 2014, the conference has taken place across a number of locations including MIT and the University of Maryland, Baltimore. The event audience has grown from a few dozen people to over 28,900 unique attendees during JuliaCon 2020, which took place virtually. JuliaCon 2021 also took place virtually with keynote addresses from professors William Kahan, the primary architect of the IEEE 754 floating-point standard (which virtually all CPUs and languages, including Julia, use), Jan Vitek, Xiaoye Sherry Li, and Soumith Chintala, a co-creator of PyTorch. JuliaCon grew to 43,000 unique attendees and more than 300 presentations (still freely accessible, plus for older years). JuliaCon 2022 will also be virtual held between July 27 and July 29, 2022, for the first time in several languages, not just in English. === Sponsors === The Julia language became a NumFOCUS fiscally sponsored project in 2014 in an effort to ensure the project's long-term sustainability. Jeremy Kepner at MIT Lincoln Laboratory was the founding sponsor of the Julia project in its early days. In addition, funds from the Gordon and Betty Moore Foundation, the Alfred P. Sloan Foundation, Intel, and agencies such as NSF, DARPA, NIH, NASA, and FAA have been essential to the development of Julia. Mozilla, the maker of Firefox web browser, with its research grants for H1 2019, sponsored "a member of the official Julia team" for the project "Bringing Julia to the Browser", meaning to Firefox and other web browsers. The Julia language is also supported by individual donors on GitHub. === The Julia company === JuliaHub, Inc. was founded in 2015 as Julia Computing, Inc. by Viral B. Shah, Deepak Vinchhi, Alan Edelman, Jeff Bezanson, Stefan Karpinski and Keno Fischer. In June 2017, Julia Computing raised US$4.6 million in seed funding from General Catalyst and Founder Collective, the same month was "granted $910,000 by the Alfred P. Sloan Foundation to support open-source Julia development, including $160,000 to promote diversity in the Julia community", and in December 2019 the company got $1.1 million funding from the US government to "develop a neural component machine learning tool to reduce the total energy consumption of heating, ventilation, and air conditioning (HVAC) systems in buildings". In July 2021, Julia Computing announced they raised a $24 million Series A round led by Dorilton Ventures, which also owns Formula One team Williams Racing, that partnered with Julia Computing. Williams' Commercial Director said: "Investing in companies building best-in-class cloud technology is a strategic focus for Dorilton and Julia's versatile platform, with revolutionary capabilities in simulation and modelling, is hugely relevant to our business. We look forward to embedding Julia Computing in the world's most technologically advanced sport". In June 2023, JuliaHub received (again, now

    Read more →
  • Witness set

    Witness set

    In combinatorics and computational learning theory, a witness set is a set of elements that distinguishes a given Boolean function from a given class of other Boolean functions. Let C {\displaystyle C} be a concept class over a domain X {\displaystyle X} (that is, a family of Boolean functions over X {\displaystyle X} ) and c {\displaystyle c} be a concept in X {\displaystyle X} (a single Boolean function). A subset S {\displaystyle S} of X {\displaystyle X} is a witness set for c {\displaystyle c} in X {\displaystyle X} if S {\displaystyle S} distinguishes c {\displaystyle c} from all the other functions in C {\displaystyle C} , in the sense that no other function in C {\displaystyle C} has the same values on S {\displaystyle S} . For a concept class with | C | {\displaystyle |C|} concepts, there exists a concept that has a witness of size at most log 2 ⁡ | C | {\displaystyle \log _{2}|C|} ; this bound is tight when C {\displaystyle C} consists of all Boolean functions over X {\displaystyle X} . By a result of Bondy (1972) there exists a single witness set of size at most | C | − 1 {\displaystyle |C|-1} that is valid for all concepts in C {\displaystyle C} ; this bound is tight when C {\displaystyle C} consists of the indicator functions of the empty set and some singleton sets. One way to construct this set is to interpret the concepts as bitstrings, and the domain elements as positions in these bitstrings. Then the set of positions at which a trie of the bitstrings branches forms the desired witness set. This construction is central to the operation of the fusion tree data structure. The minimum size of a witness set for c {\displaystyle c} is called the witness size or specification number and is denoted by w C ( c ) {\displaystyle w_{C}(c)} . The value max { w C ( c ) : c ∈ C } {\displaystyle \max\{w_{C}(c):c\in C\}} is called the teaching dimension of C {\displaystyle C} . It represents the number of examples of a concept that need to be presented by a teacher to a learner, in the worst case, to enable the learner to determine which concept is being presented. Witness sets have also been called teaching sets, keys, specifying sets, or discriminants. The "witness set" terminology is from Kushilevitz et al. (1996), who trace the concept of witness sets to work by Cover (1965).

    Read more →
  • FERET database

    FERET database

    The Facial Recognition Technology (FERET) database is a dataset used for facial recognition system evaluation as part of the Face Recognition Technology (FERET) program. It was first established in 1993 under a collaborative effort between Harry Wechsler at George Mason University and Jonathon Phillips at the Army Research Laboratory in Adelphi, Maryland. The FERET database serves as a standard database of facial images for researchers to use to develop various algorithms and report results. The use of a common database also allowed one to compare the effectiveness of different approaches in methodology and gauge their strengths and weaknesses. The facial images for the database were collected between December 1993 and August 1996, accumulating a total of 14,126 images pertaining to 1,199 individuals along with 365 duplicate sets of images that were taken on a different day. In 2003, the Defense Advanced Research Projects Agency (DARPA) released a high-resolution, 24-bit color version of these images. The dataset tested includes 2,413 still facial images, representing 856 individuals. The FERET database has been used by more than 460 research groups and is managed by the National Institute of Standards and Technology (NIST).

    Read more →
  • Empowerment (artificial intelligence)

    Empowerment (artificial intelligence)

    Empowerment in the field of artificial intelligence formalises and quantifies (via information theory) the potential an agent perceives that it has to influence its environment. An agent which follows an empowerment maximising policy, acts to maximise future options (typically up to some limited horizon). Empowerment can be used as a (pseudo) utility function that depends only on information gathered from the local environment to guide action, rather than seeking an externally imposed goal, thus is a form of intrinsic motivation. The empowerment formalism depends on a probabilistic model commonly used in artificial intelligence. An autonomous agent operates in the world by taking in sensory information and acting to change its state, or that of the environment, in a cycle of perceiving and acting known as the perception-action loop. Agent state and actions are modelled by random variables ( S : s ∈ S , A : a ∈ A {\displaystyle S:s\in {\mathcal {S}},A:a\in {\mathcal {A}}} ) and time ( t {\displaystyle t} ). The choice of action depends on the current state, and the future state depends on the choice of action, thus the perception-action loop unrolled in time forms a causal bayesian network. == Definition == Empowerment ( E {\displaystyle {\mathfrak {E}}} ) is defined as the channel capacity ( C {\displaystyle C} ) of the actuation channel of the agent, and is formalised as the maximal possible information flow between the actions of the agent and the effect of those actions some time later. Empowerment can be thought of as the future potential of the agent to affect its environment, as measured by its sensors. E := C ( A t ⟶ S t + 1 ) ≡ max p ( a t ) I ( A t ; S t + 1 ) {\displaystyle {\mathfrak {E}}:=C(A_{t}\longrightarrow S_{t+1})\equiv \max _{p(a_{t})}I(A_{t};S_{t+1})} In a discrete time model, Empowerment can be computed for a given number of cycles into the future, which is referred to in the literature as 'n-step' empowerment. E ( A t n ⟶ S t + n ) = max p ( a t , . . . , a t + n − 1 ) I ( A t , . . . , A t + n − 1 ; S t + n ) {\displaystyle {\mathfrak {E}}(A_{t}^{n}\longrightarrow S_{t+n})=\max _{p(a_{t},...,a_{t+n-1})}I(A_{t},...,A_{t+n-1};S_{t+n})} The unit of empowerment depends on the logarithm base. Base 2 is commonly used in which case the unit is bits. === Contextual Empowerment === In general the choice of action (action distribution) that maximises empowerment varies from state to state. Knowing the empowerment of an agent in a specific state is useful, for example to construct an empowerment maximising policy. State-specific empowerment can be found using the more general formalism for 'contextual empowerment'. C {\displaystyle C} is a random variable describing the context (e.g. state). E ( A t n ⟶ S t + n ∣ C ) = ∑ c ∈ C p ( c ) E ( A t n ⟶ S t + n ∣ C = c ) {\displaystyle {\mathfrak {E}}(A_{t}^{n}\longrightarrow S_{t+n}{\mid }C)=\sum _{c{\in }C}p(c){\mathfrak {E}}(A_{t}^{n}\longrightarrow S_{t+n}{\mid }C=c)} == Application == Empowerment maximisation can be used as a pseudo-utility function to enable agents to exhibit intelligent behaviour without requiring the definition of external goals, for example balancing a pole in a cart-pole balancing scenario where no indication of the task is provided to the agent. Empowerment has been applied in studies of collective behaviour and in continuous domains. As is the case with Bayesian methods in general, computation of empowerment becomes computationally expensive as the number of actions and time horizon extends, but approaches to improve efficiency have led to usage in real-time control. Empowerment has been used for intrinsically motivated reinforcement learning agents playing video games, and in the control of underwater vehicles.

    Read more →
  • Sum of absolute transformed differences

    Sum of absolute transformed differences

    The sum of absolute transformed differences (SATD) is a block matching criterion widely used in fractional motion estimation for video compression. It works by taking a frequency transform, usually a Hadamard transform, of the differences between the pixels in the original block and the corresponding pixels in the block being used for comparison. The transform itself is often of a small block rather than the entire macroblock. For example, in x264, a series of 4×4 blocks are transformed rather than doing the more processor-intensive 16×16 transform. == Comparison to other metrics == SATD is slower than the sum of absolute differences (SAD), both due to its increased complexity and the fact that SAD-specific MMX and SSE2 instructions exist, while there are no such instructions for SATD. However, SATD can still be optimized considerably with SIMD instructions on most modern CPUs. The benefit of SATD is that it more accurately models the number of bits required to transmit the residual error signal. As such, it is often used in video compressors, either as a way to drive and estimate rate explicitly, such as in the Theora encoder (since 1.1 alpha2), as an optional metric used in wide motion searches, such as in the Microsoft VC-1 encoder, or as a metric used in sub-pixel refinement, such as in x264.

    Read more →
  • Bondy's theorem

    Bondy's theorem

    In mathematics, Bondy's theorem is a bound on the number of elements needed to distinguish the sets in a family of sets from each other. It belongs to the field of combinatorics, and is named after John Adrian Bondy, who published it in 1972. == Statement == The theorem is as follows: Let X be a set with n elements and let A1, A2, ..., An be distinct subsets of X. Then there exists a subset S of X with n − 1 elements such that the sets Ai ∩ S are all distinct. In other words, if we have a 0-1 matrix with n rows and n columns such that each row is distinct, we can remove one column such that the rows of the resulting n × (n − 1) matrix are distinct. == Example == Consider the 4 × 4 matrix [ 1 1 0 1 0 1 0 1 0 0 1 1 0 1 1 0 ] {\displaystyle {\begin{bmatrix}1&1&0&1\\0&1&0&1\\0&0&1&1\\0&1&1&0\end{bmatrix}}} where all rows are pairwise distinct. If we delete, for example, the first column, the resulting matrix [ 1 0 1 1 0 1 0 1 1 1 1 0 ] {\displaystyle {\begin{bmatrix}1&0&1\\1&0&1\\0&1&1\\1&1&0\end{bmatrix}}} no longer has this property: the first row is identical to the second row. Nevertheless, by Bondy's theorem we know that we can always find a column that can be deleted without introducing any identical rows. In this case, we can delete the third column: all rows of the 3 × 4 matrix [ 1 1 1 0 1 1 0 0 1 0 1 0 ] {\displaystyle {\begin{bmatrix}1&1&1\\0&1&1\\0&0&1\\0&1&0\end{bmatrix}}} are distinct. Another possibility would have been deleting the fourth column. == Learning theory application == From the perspective of computational learning theory, Bondy's theorem can be rephrased as follows: Let C be a concept class over a finite domain X. Then there exists a subset S of X with the size at most |C| − 1 such that S is a witness set for every concept in C. This implies that every finite concept class C has its teaching dimension bounded by |C| − 1.

    Read more →
  • Semidefinite embedding

    Semidefinite embedding

    Maximum Variance Unfolding (MVU), also known as Semidefinite Embedding (SDE), is an algorithm in computer science that uses semidefinite programming to perform non-linear dimensionality reduction of high-dimensional vectorial input data. It is motivated by the observation that kernel Principal Component Analysis (kPCA) does not reduce the data dimensionality, as it leverages the Kernel trick to non-linearly map the original data into an inner-product space. == Algorithm == MVU creates a mapping from the high dimensional input vectors to some low dimensional Euclidean vector space in the following steps: A neighbourhood graph is created. Each input is connected with its k-nearest input vectors (according to Euclidean distance metric) and all k-nearest neighbors are connected with each other. If the data is sampled well enough, the resulting graph is a discrete approximation of the underlying manifold. The neighbourhood graph is "unfolded" with the help of semidefinite programming. Instead of learning the output vectors directly, the semidefinite programming aims to find an inner product matrix that maximizes the pairwise distances between any two inputs that are not connected in the neighbourhood graph while preserving the nearest neighbors distances. The low-dimensional embedding is finally obtained by application of multidimensional scaling on the learned inner product matrix. The steps of applying semidefinite programming followed by a linear dimensionality reduction step to recover a low-dimensional embedding into a Euclidean space were first proposed by Linial, London, and Rabinovich. == Optimization formulation == Let X {\displaystyle X\,\!} be the original input and Y {\displaystyle Y\,\!} be the embedding. If i , j {\displaystyle i,j\,\!} are two neighbors, then the local isometry constraint that needs to be satisfied is: | X i − X j | 2 = | Y i − Y j | 2 {\displaystyle |X_{i}-X_{j}|^{2}=|Y_{i}-Y_{j}|^{2}\,\!} Let G , K {\displaystyle G,K\,\!} be the Gram matrices of X {\displaystyle X\,\!} and Y {\displaystyle Y\,\!} (i.e.: G i j = X i ⋅ X j , K i j = Y i ⋅ Y j {\displaystyle G_{ij}=X_{i}\cdot X_{j},K_{ij}=Y_{i}\cdot Y_{j}\,\!} ). We can express the above constraint for every neighbor points i , j {\displaystyle i,j\,\!} in term of G , K {\displaystyle G,K\,\!} : G i i + G j j − G i j − G j i = K i i + K j j − K i j − K j i {\displaystyle G_{ii}+G_{jj}-G_{ij}-G_{ji}=K_{ii}+K_{jj}-K_{ij}-K_{ji}\,\!} In addition, we also want to constrain the embedding Y {\displaystyle Y\,\!} to center at the origin: 0 = | ∑ i Y i | 2 ⇔ ( ∑ i Y i ) ⋅ ( ∑ i Y i ) ⇔ ∑ i , j Y i ⋅ Y j ⇔ ∑ i , j K i j {\displaystyle 0=|\sum _{i}Y_{i}|^{2}\Leftrightarrow (\sum _{i}Y_{i})\cdot (\sum _{i}Y_{i})\Leftrightarrow \sum _{i,j}Y_{i}\cdot Y_{j}\Leftrightarrow \sum _{i,j}K_{ij}} As described above, except the distances of neighbor points are preserved, the algorithm aims to maximize the pairwise distance of every pair of points. The objective function to be maximized is: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 {\displaystyle T(Y)={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}} Intuitively, maximizing the function above is equivalent to pulling the points as far away from each other as possible and therefore "unfold" the manifold. The local isometry constraint Let τ = m a x { η i j | Y i − Y j | 2 } {\displaystyle \tau =max\{\eta _{ij}|Y_{i}-Y_{j}|^{2}\}\,\!} where η i j := { 1 if i is a neighbour of j 0 otherwise . {\displaystyle \eta _{ij}:={\begin{cases}1&{\mbox{if}}\ i{\mbox{ is a neighbour of }}j\\0&{\mbox{otherwise}}.\end{cases}}} prevents the objective function from diverging (going to infinity). Since the graph has N points, the distance between any two points | Y i − Y j | 2 ≤ N τ {\displaystyle |Y_{i}-Y_{j}|^{2}\leq N\tau \,\!} . We can then bound the objective function as follows: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 ≤ 1 2 N ∑ i , j ( N τ ) 2 = N 3 τ 2 2 {\displaystyle T(Y)={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}\leq {\dfrac {1}{2N}}\sum _{i,j}(N\tau )^{2}={\dfrac {N^{3}\tau ^{2}}{2}}\,\!} The objective function can be rewritten purely in the form of the Gram matrix: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 = 1 2 N ∑ i , j ( Y i 2 + Y j 2 − Y i ⋅ Y j − Y j ⋅ Y i ) = 1 2 N ( ∑ i , j Y i 2 + ∑ i , j Y j 2 − ∑ i , j Y i ⋅ Y j − ∑ i , j Y j ⋅ Y i ) = 1 2 N ( ∑ i , j Y i 2 + ∑ i , j Y j 2 − 0 − 0 ) = 1 N ( ∑ i Y i 2 ) = 1 N ( T r ( K ) ) {\displaystyle {\begin{aligned}T(Y)&{}={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}\\&{}={\dfrac {1}{2N}}\sum _{i,j}(Y_{i}^{2}+Y_{j}^{2}-Y_{i}\cdot Y_{j}-Y_{j}\cdot Y_{i})\\&{}={\dfrac {1}{2N}}(\sum _{i,j}Y_{i}^{2}+\sum _{i,j}Y_{j}^{2}-\sum _{i,j}Y_{i}\cdot Y_{j}-\sum _{i,j}Y_{j}\cdot Y_{i})\\&{}={\dfrac {1}{2N}}(\sum _{i,j}Y_{i}^{2}+\sum _{i,j}Y_{j}^{2}-0-0)\\&{}={\dfrac {1}{N}}(\sum _{i}Y_{i}^{2})={\dfrac {1}{N}}(Tr(K))\\\end{aligned}}\,\!} Finally, the optimization can be formulated as: Maximize T r ( K ) subject to K ⪰ 0 , ∑ i j K i j = 0 and G i i + G j j − G i j − G j i = K i i + K j j − K i j − K j i , ∀ i , j where η i j = 1 , {\displaystyle {\begin{aligned}&{\text{Maximize}}&&Tr(\mathbf {K} )\\&{\text{subject to}}&&\mathbf {K} \succeq 0,\sum _{ij}\mathbf {K} _{ij}=0\\&{\text{and}}&&G_{ii}+G_{jj}-G_{ij}-G_{ji}=K_{ii}+K_{jj}-K_{ij}-K_{ji},\forall i,j{\mbox{ where }}\eta _{ij}=1,\end{aligned}}} After the Gram matrix K {\displaystyle K\,\!} is learned by semidefinite programming, the output Y {\displaystyle Y\,\!} can be obtained via Cholesky decomposition. In particular, the Gram matrix can be written as K i j = ∑ α = 1 N ( λ α V α i V α j ) {\displaystyle K_{ij}=\sum _{\alpha =1}^{N}(\lambda _{\alpha }V_{\alpha i}V_{\alpha j})\,\!} where V α i {\displaystyle V_{\alpha i}\,\!} is the i-th element of eigenvector V α {\displaystyle V_{\alpha }\,\!} of the eigenvalue λ α {\displaystyle \lambda _{\alpha }\,\!} . It follows that the α {\displaystyle \alpha \,\!} -th element of the output Y i {\displaystyle Y_{i}\,\!} is λ α V α i {\displaystyle {\sqrt {\lambda _{\alpha }}}V_{\alpha i}\,\!} .

    Read more →
  • DigitaltMuseum

    DigitaltMuseum

    DigitaltMuseum (lit. 'The Digital Museum') is a website database in Norwegian and Swedish for art, images and cultural history museums. The service was established in 2009 after a trial period. The database is developed and operated by KulturIT. KulturIT ANS was established by the Norwegian Museum of Cultural History and Maihaugen in consultation with the Norwegian Archive, Library and Museum Authority (ABM) in 2007. In 2015, the company underwent a corporate transformation and KulturIT AS was established on 12 February. The website has per 2025 around 2,548,022 images. Many of the images are in the public domain or under Creative Commons licenses and are being imported into Wikimedia Commons. The website's API was developed in 2012. == Institutions == As of 2025, there are 223 collaborating museums. == Mission == DigitaltMuseum aims to make the museums' collections accessible to all interested parties, regardless of time and place. The website aims to facilitate easy use of the collections through various methods including image searches, research, teaching and joint knowledge development. DigitaltMuseum contains collections from several hundred Norwegian and Swedish museums, totalling around five million objects. The website contains both historical images from the areas and themes covered by the museums, as well as images of artefacts from the collections. Parts of the collection have previously only been shown in the museums' exhibitions and books and have therefore rarely or never been shown to the public.

    Read more →
  • Diffusion map

    Diffusion map

    Diffusion maps is a dimensionality reduction or feature extraction algorithm introduced by Coifman and Lafon which computes a family of embeddings of a data set into Euclidean space (often low-dimensional) whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. The Euclidean distance between points in the embedded space is equal to the "diffusion distance" between probability distributions centered at those points. Different from linear dimensionality reduction methods such as principal component analysis (PCA), diffusion maps are part of the family of nonlinear dimensionality reduction methods which focus on discovering the underlying manifold that the data has been sampled from. By integrating local similarities at different scales, diffusion maps give a global description of the data-set. Compared with other methods, the diffusion map algorithm is robust to noise perturbation and computationally inexpensive. == Definition of diffusion maps == Following and , diffusion maps can be defined in four steps. === Connectivity === Diffusion maps exploit the relationship between heat diffusion and random walk Markov chain. The basic observation is that if we take a random walk on the data, walking to a nearby data-point is more likely than walking to another that is far away. Let ( X , A , μ ) {\displaystyle (X,{\mathcal {A}},\mu )} be a measure space, where X {\displaystyle X} is the data set and μ {\displaystyle \mu } represents the distribution of the points on X {\displaystyle X} . Based on this, the connectivity k {\displaystyle k} between two data points, x {\displaystyle x} and y {\displaystyle y} , can be defined as the probability of walking from x {\displaystyle x} to y {\displaystyle y} in one step of the random walk. Usually, this probability is specified in terms of a kernel function of the two points: k : X × X → R {\displaystyle k:X\times X\rightarrow \mathbb {R} } . For example, the popular Gaussian kernel: k ( x , y ) = exp ⁡ ( − | | x − y | | 2 ϵ ) {\displaystyle k(x,y)=\exp \left(-{\frac {||x-y||^{2}}{\epsilon }}\right)} More generally, the kernel function has the following properties k ( x , y ) = k ( y , x ) {\displaystyle k(x,y)=k(y,x)} ( k {\displaystyle k} is symmetric) k ( x , y ) ≥ 0 ∀ x , y {\displaystyle k(x,y)\geq 0\,\,\forall x,y} ( k {\displaystyle k} is positivity preserving). The kernel constitutes the prior definition of the local geometry of the data-set. Since a given kernel will capture a specific feature of the data set, its choice should be guided by the application that one has in mind. This is a major difference with methods such as principal component analysis, where correlations between all data points are taken into account at once. Given ( X , k ) {\displaystyle (X,k)} , we can then construct a reversible discrete-time Markov chain on X {\displaystyle X} (a process known as the normalized graph Laplacian construction): d ( x ) = ∫ X k ( x , y ) d μ ( y ) {\displaystyle d(x)=\int _{X}k(x,y)d\mu (y)} and define: p ( x , y ) = k ( x , y ) d ( x ) {\displaystyle p(x,y)={\frac {k(x,y)}{d(x)}}} Although the new normalized kernel does not inherit the symmetric property, it does inherit the positivity-preserving property and gains a conservation property: ∫ X p ( x , y ) d μ ( y ) = 1 {\displaystyle \int _{X}p(x,y)d\mu (y)=1} === Diffusion process === From p ( x , y ) {\displaystyle p(x,y)} we can construct a transition matrix of a Markov chain ( M {\displaystyle M} ) on X {\displaystyle X} . In other words, p ( x , y ) {\displaystyle p(x,y)} represents the one-step transition probability from x {\displaystyle x} to y {\displaystyle y} , and M t {\displaystyle M^{t}} gives the t-step transition matrix. We define the diffusion matrix L {\displaystyle L} (it is also a version of graph Laplacian matrix) L i , j = k ( x i , x j ) {\displaystyle L_{i,j}=k(x_{i},x_{j})\,} We then define the new kernel L i , j ( α ) = k ( α ) ( x i , x j ) = L i , j ( d ( x i ) d ( x j ) ) α {\displaystyle L_{i,j}^{(\alpha )}=k^{(\alpha )}(x_{i},x_{j})={\frac {L_{i,j}}{(d(x_{i})d(x_{j}))^{\alpha }}}\,} or equivalently, L ( α ) = D − α L D − α {\displaystyle L^{(\alpha )}=D^{-\alpha }LD^{-\alpha }\,} where D is a diagonal matrix and D i , i = ∑ j L i , j . {\displaystyle D_{i,i}=\sum _{j}L_{i,j}.} We apply the graph Laplacian normalization to this new kernel: M = ( D ( α ) ) − 1 L ( α ) , {\displaystyle M=({D}^{(\alpha )})^{-1}L^{(\alpha )},\,} where D ( α ) {\displaystyle D^{(\alpha )}} is a diagonal matrix and D i , i ( α ) = ∑ j L i , j ( α ) . {\displaystyle {D}_{i,i}^{(\alpha )}=\sum _{j}L_{i,j}^{(\alpha )}.} p ( x j , t | x i ) = M i , j t {\displaystyle p(x_{j},t|x_{i})=M_{i,j}^{t}\,} One of the main ideas of the diffusion framework is that running the chain forward in time (taking larger and larger powers of M {\displaystyle M} ) reveals the geometric structure of X {\displaystyle X} at larger and larger scales (the diffusion process). Specifically, the notion of a cluster in the data set is quantified as a region in which the probability of escaping this region is low (within a certain time t). Therefore, t not only serves as a time parameter, but it also has the dual role of scale parameter. The eigendecomposition of the matrix M t {\displaystyle M^{t}} yields M i , j t = ∑ l λ l t ψ l ( x i ) ϕ l ( x j ) {\displaystyle M_{i,j}^{t}=\sum _{l}\lambda _{l}^{t}\psi _{l}(x_{i})\phi _{l}(x_{j})\,} where { λ l } {\displaystyle \{\lambda _{l}\}} is the sequence of eigenvalues of M {\displaystyle M} and { ψ l } {\displaystyle \{\psi _{l}\}} and { ϕ l } {\displaystyle \{\phi _{l}\}} are the biorthogonal left and right eigenvectors respectively. Due to the spectrum decay of the eigenvalues, only a few terms are necessary to achieve a given relative accuracy in this sum. ==== Parameter α and the diffusion operator ==== The reason to introduce the normalization step involving α {\displaystyle \alpha } is to tune the influence of the data point density on the infinitesimal transition of the diffusion. In some applications, the sampling of the data is generally not related to the geometry of the manifold we are interested in describing. In this case, we can set α = 1 {\displaystyle \alpha =1} and the diffusion operator approximates the Laplace–Beltrami operator. We then recover the Riemannian geometry of the data set regardless of the distribution of the points. To describe the long-term behavior of the point distribution of a system of stochastic differential equations, we can use α = 0.5 {\displaystyle \alpha =0.5} and the resulting Markov chain approximates the Fokker–Planck diffusion. With α = 0 {\displaystyle \alpha =0} , it reduces to the classical graph Laplacian normalization. === Diffusion distance === The diffusion distance at time t {\displaystyle t} between two points can be measured as the similarity of two points in the observation space with the connectivity between them. It is given by D t ( x i , x j ) 2 = ∑ y ( p ( y , t | x i ) − p ( y , t | x j ) ) 2 ϕ 0 ( y ) {\displaystyle D_{t}(x_{i},x_{j})^{2}=\sum _{y}{\frac {(p(y,t|x_{i})-p(y,t|x_{j}))^{2}}{\phi _{0}(y)}}} where ϕ 0 ( y ) {\displaystyle \phi _{0}(y)} is the stationary distribution of the Markov chain, given by the first left eigenvector of M {\displaystyle M} . Explicitly: ϕ 0 ( y ) = d ( y ) ∑ z ∈ X d ( z ) {\displaystyle \phi _{0}(y)={\frac {d(y)}{\sum _{z\in X}d(z)}}} Intuitively, D t ( x i , x j ) {\displaystyle D_{t}(x_{i},x_{j})} is small if there is a large number of short paths connecting x i {\displaystyle x_{i}} and x j {\displaystyle x_{j}} . There are several interesting features associated with the diffusion distance, based on our previous discussion that t {\displaystyle t} also serves as a scale parameter: Points are closer at a given scale (as specified by D t ( x i , x j ) {\displaystyle D_{t}(x_{i},x_{j})} ) if they are highly connected in the graph, therefore emphasizing the concept of a cluster. This distance is robust to noise, since the distance between two points depends on all possible paths of length t {\displaystyle t} between the points. From a machine learning point of view, the distance takes into account all evidences linking x i {\displaystyle x_{i}} to x j {\displaystyle x_{j}} , allowing us to conclude that this distance is appropriate for the design of inference algorithms based on the majority of preponderance. === Diffusion process and low-dimensional embedding === The diffusion distance can be calculated using the eigenvectors by D t ( x i , x j ) 2 = ∑ l λ l 2 t ( ψ l ( x i ) − ψ l ( x j ) ) 2 {\displaystyle D_{t}(x_{i},x_{j})^{2}=\sum _{l}\lambda _{l}^{2t}(\psi _{l}(x_{i})-\psi _{l}(x_{j}))^{2}\,} So the eigenvectors can be used as a new set of coordinates for the data. The diffusion map is defined as: Ψ t ( x ) = ( λ 1 t ψ 1 ( x ) , λ 2 t ψ 2 ( x ) , … , λ k t ψ k ( x ) ) {\displaystyle \Psi _{t}(x)=(\lambda _{1}^{t}\psi _{1}(x),\lambda _{2}^{t}\psi _{2}(x),\ld

    Read more →
  • Probably approximately correct learning

    Probably approximately correct learning

    In computational learning theory, probably approximately correct (PAC) learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the "probably" part), the selected function will have low generalization error (the "approximately correct" part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples. The model was later extended to treat noise (misclassified samples). An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded to a polynomial of the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded to a polynomial of the concept size, modified by the approximation and likelihood bounds). == Definitions and terminology == In order to give the definition for something that is PAC-learnable, we first have to introduce some terminology. For the following definitions, two examples will be used. The first is the problem of character recognition given an array of n {\displaystyle n} bits encoding a binary-valued image. The other example is the problem of finding an interval that will correctly classify points within the interval as positive and the points outside of the range as negative. Let X {\displaystyle X} be a set called the instance space or the encoding of all the samples. In the character recognition problem, the instance space is X = { 0 , 1 } n {\displaystyle X=\{0,1\}^{n}} . In the interval problem the instance space, X {\displaystyle X} , is the set of all bounded intervals in R {\displaystyle \mathbb {R} } , where R {\displaystyle \mathbb {R} } denotes the set of all real numbers. A concept is a subset c ⊂ X {\displaystyle c\subset X} . One concept is the set of all patterns of bits in X = { 0 , 1 } n {\displaystyle X=\{0,1\}^{n}} that encode a picture of the letter "P". An example concept from the second example is the set of open intervals, { ( a , b ) ∣ 0 ≤ a ≤ π / 2 , π ≤ b ≤ 13 } {\displaystyle \{(a,b)\mid 0\leq a\leq \pi /2,\pi \leq b\leq {\sqrt {13}}\}} , each of which contains only the positive points. A concept class C {\displaystyle C} is a collection of concepts over X {\displaystyle X} . This could be the set of all subsets of the array of bits that are skeletonized 4-connected (width of the font is 1). Let EX ⁡ ( c , D ) {\displaystyle \operatorname {EX} (c,D)} be a procedure that draws an example, x {\displaystyle x} , using a probability distribution D {\displaystyle D} and gives the correct label c ( x ) {\displaystyle c(x)} , that is 1 if x ∈ c {\displaystyle x\in c} and 0 otherwise. Now, given 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} , assume there is an algorithm A {\displaystyle A} and a polynomial p {\displaystyle p} in 1 / ϵ , 1 / δ {\displaystyle 1/\epsilon ,1/\delta } (and other relevant parameters of the class C {\displaystyle C} ) such that, given a sample of size p {\displaystyle p} drawn according to EX ⁡ ( c , D ) {\displaystyle \operatorname {EX} (c,D)} , then, with probability of at least 1 − δ {\displaystyle 1-\delta } , A {\displaystyle A} outputs a hypothesis h ∈ C {\displaystyle h\in C} that has an average error less than or equal to ϵ {\displaystyle \epsilon } on X {\displaystyle X} with the same distribution D {\displaystyle D} . Further if the above statement for algorithm A {\displaystyle A} is true for every concept c ∈ C {\displaystyle c\in C} and for every distribution D {\displaystyle D} over X {\displaystyle X} , and for all 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} then C {\displaystyle C} is (efficiently) PAC learnable (or distribution-free PAC learnable). We can also say that A {\displaystyle A} is a PAC learning algorithm for C {\displaystyle C} . == Equivalence == Under some regularity conditions these conditions are equivalent: The concept class C is PAC learnable. The VC dimension of C is finite. C is a uniformly Glivenko-Cantelli class. C is compressible in the sense of Littlestone and Warmuth

    Read more →
  • Density-based clustering validation

    Density-based clustering validation

    Density-Based Clustering Validation (DBCV) is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like DBSCAN, Mean shift, and OPTICS. This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the Silhouette coefficient, Davies–Bouldin index, or Calinski–Harabasz index often struggle to provide meaningful evaluations. Unlike traditional validation measures, which often rely on compact and well-separated clusters, DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence. This metric was introduced in 2014 by David Moulavi and colleagues in their work. It utilizes density connectivity principles to quantify clustering structures, making it especially effective at detecting arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable. The DBCV index has been employed for clustering analysis in bioinformatics, ecology, techno-economy, and health informatics , as well as in numerous other fields. == Definition == DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset X = x 1 , x 2 , . . . , x n {\displaystyle X={x_{1},x_{2},...,x_{n}}} , a density-based algorithm partitions it into K clusters C 1 , C 2 , . . . , C K {\displaystyle {C_{1},C_{2},...,C_{K}}} . Each point x i {\displaystyle x_{i}} belongs to a specific cluster, denoted as C c l u s t e r ( x i ) {\displaystyle C_{cluster(x_{i})}} A key concept in DBCV index is the notion of density-connected paths. Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The density-based distance between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory. DBCV index extends the Silhouette coefficient by redefining cluster cohesion and separation using density-based distances: Within-cluster density distance measures how closely a point is related to other members of its cluster: a i = 1 | C c l u s t e r ( x i ) | − 1 ∑ x j ∈ C c l u s t e r ( x i ) , y ≠ x d d e n s i t y ( x j , x i ) {\displaystyle a_{i}={\frac {1}{|C_{cluster(x_{i})}|-1}}\sum _{x_{j}\in C_{cluster(x_{i})},y\neq x}d_{density}(x_{j},x_{i})} Nearest-cluster density distance quantifies how far a point is from the closest external cluster: b i = min C ≠ C cluster ( x i ) C ∈ { C 1 , … , C K } ( 1 | C | ∑ x j ∈ C d density ( x i , x j ) ) . {\displaystyle b_{i}=\min _{C\neq C_{{\text{cluster}}(x_{i})} \atop C\in \{C_{1},\dots ,C_{K}\}}\left({\frac {1}{|C|}}\sum _{x_{j}\in C}d_{\text{density}}(x_{i},x_{j})\right).} Using these measures, the DBCV index is computed as: D B C V = 1 n ∑ i = 1 n b i − a i max ( a i , b i ) {\displaystyle DBCV={\frac {1}{n}}\sum _{i=1}^{n}{\frac {b_{i}-a_{i}}{\max(a_{i},b_{i})}}} == Explanation == DBCV index values range between −1 and +1: +1: Strongly cohesive and well-separated clusters. 0: Ambiguous clustering structure. −1: Poorly formed clusters or incorrect assignments. By leveraging density-based distances instead of traditional Euclidean measures, DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.

    Read more →
  • Lesk algorithm

    Lesk algorithm

    The Lesk algorithm is a classical algorithm for word sense disambiguation introduced by Michael E. Lesk in 1986. It operates on the premise that words within a given context are likely to share a common meaning. This algorithm compares the dictionary definitions of an ambiguous word with the words in its surrounding context to determine the most appropriate sense. Variations, such as the Simplified Lesk algorithm, have demonstrated improved precision and efficiency. However, the Lesk algorithm has faced criticism for its sensitivity to definition wording and its reliance on brief glosses. Researchers have sought to enhance its accuracy by incorporating additional resources like thesauruses and syntactic models. == Overview == The Lesk algorithm is based on the assumption that words in a given "neighborhood" (section of text) will tend to share a common topic. A simplified version of the Lesk algorithm is to compare the dictionary definition of an ambiguous word with the terms contained in its neighborhood. Versions have been adapted to use WordNet. An implementation might look like this: for every sense of the word being disambiguated one should count the number of words that are in both the neighborhood of that word and in the dictionary definition of that sense the sense that is to be chosen is the sense that has the largest number of this count. A frequently used example illustrating this algorithm is for the context "pine cone". The following dictionary definitions are used: PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness CONE 1. solid body which narrows to a point 2. something of this shape whether solid or hollow 3. fruit of certain evergreen trees As can be seen, the best intersection is Pine #1 ⋂ Cone #3 = 2. == Simplified Lesk algorithm == In Simplified Lesk algorithm, the correct meaning of each word in a given context is determined individually by locating the sense that overlaps the most between its dictionary definition and the given context. Rather than simultaneously determining the meanings of all words in a given context, this approach tackles each word individually, independent of the meaning of the other words occurring in the same context. "A comparative evaluation performed by Vasilescu et al. (2004) has shown that the simplified Lesk algorithm can significantly outperform the original definition of the algorithm, both in terms of precision and efficiency. By evaluating the disambiguation algorithms on the Senseval-2 English all words data, they measure a 58% precision using the simplified Lesk algorithm compared to the only 42% under the original algorithm. Note: Vasilescu et al. implementation considers a back-off strategy for words not covered by the algorithm, consisting of the most frequent sense defined in WordNet. This means that words for which all their possible meanings lead to zero overlap with current context or with other word definitions are by default assigned sense number one in WordNet." Simplified LESK Algorithm with smart default word sense (Vasilescu et al., 2004) The COMPUTEOVERLAP function returns the number of words in common between two sets, ignoring function words or other words on a stop list. The original Lesk algorithm defines the context in a more complex way. == Criticisms == Unfortunately, Lesk’s approach is very sensitive to the exact wording of definitions, so the absence of a certain word can radically change the results. Further, the algorithm determines overlaps only among the glosses of the senses being considered. This is a significant limitation in that dictionary glosses tend to be fairly short and do not provide sufficient vocabulary to relate fine-grained sense distinctions. A lot of work has appeared offering different modifications of this algorithm. These works use other resources for analysis (thesauruses, synonyms dictionaries or morphological and syntactic models): for instance, it may use such information as synonyms, different derivatives, or words from definitions of words from definitions. == Lesk variants == Original Lesk (Lesk, 1986) Adapted/Extended Lesk (Banerjee and Pederson, 2002/2003): In the adaptive lesk algorithm, a word vector is created corresponds to every content word in the wordnet gloss. Concatenating glosses of related concepts in WordNet can be used to augment this vector. The vector contains the co-occurrence counts of words co-occurring with w in a large corpus. Adding all the word vectors for all the content words in its gloss creates the Gloss vector g for a concept. Relatedness is determined by comparing the gloss vector using the Cosine similarity measure. There are a lot of studies concerning Lesk and its extensions: Wilks and Stevenson, 1998, 1999; Mahesh et al., 1997; Cowie et al., 1992; Yarowsky, 1992; Pook and Catlett, 1988; Kilgarriff and Rosensweig, 2000; Kwong, 2001; Nastase and Szpakowicz, 2001; Gelbukh and Sidorov, 2004.

    Read more →
  • One-shot learning (computer vision)

    One-shot learning (computer vision)

    One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed. == Motivation == The ability to learn object categories from few examples, and at a rapid pace, has been demonstrated in humans. It is estimated that a child learns almost all of the 10 ~ 30 thousand object categories in the world by age six. This is due not only to the human mind's computational power, but also to its ability to synthesize and learn new object categories from existing information about different, previously learned categories. Given two examples from two object categories: one, an unknown object composed of familiar shapes, the second, an unknown, amorphous shape; it is much easier for humans to recognize the former than the latter, suggesting that humans make use of previously learned categories when learning new ones. The key motivation for solving one-shot learning is that systems, like humans, can use knowledge about object categories to classify new objects. == Background == As with most classification schemes, one-shot learning involves three main challenges: Representation: How should objects and categories be described? Learning: How can such descriptions be created? Recognition: How can a known object be filtered from enveloping clutter, irrespective of occlusion, viewpoint, and lighting? One-shot learning differs from single object recognition and standard category recognition algorithms in its emphasis on knowledge transfer, which makes use of previously learned categories. Model parameters: Reuses model parameters, based on the similarity between old and new categories. Categories are first learned on numerous training examples, then new categories are learned using transformations of model parameters from those initial categories or selecting relevant parameters for a classifier. Feature sharing: Shares parts or features of objects across categories. One algorithm extracts "diagnostic information" in patches from already learned categories by maximizing the patches' mutual information, and then applies these features to the learning of a new category. A dog category, for example, may be learned in one shot from previous knowledge of horse and cow categories, because dog objects may contain similar distinguishing patches. Contextual information: Appeals to global knowledge of the scene in which the object appears. Such global information can be used as frequency distributions in a conditional random field framework to recognize objects. Alternatively context can consider camera height and scene geometry. Algorithms of this type have two advantages. First, they learn object categories that are relatively dissimilar; and second, they perform well in ad hoc situations where an image has not been hand-cropped and aligned. == Theory == The Bayesian one-shot learning algorithm represents the foreground and background of images as parametrized by a mixture of constellation models. During the learning phase, the parameters of these models are learned using a conjugate density parameter posterior and variational Bayesian expectation–maximization (VBEM). In this stage the previously learned object categories inform the choice of model parameters via transfer by contextual information. For object recognition on new images, the posterior obtained during the learning phase is used in a Bayesian decision framework to estimate the ratio of p(object | test, train) to p(background clutter | test, train) where p is the probability of the outcome. === Bayesian framework === Given the task of finding a particular object in a query image, the overall objective of the Bayesian one-shot learning algorithm is to compare the probability that object is present vs the probability that only background clutter is present. If the former probability is higher, the algorithm reports the object's presence, otherwise the algorithm reports its absence. To compute these probabilities, the object class must be modeled from a set of (1 ~ 5) training images containing examples. To formalize these ideas, let I {\displaystyle I} be the query image, which contains either an example of the foreground category O f g {\displaystyle O_{fg}} or only background clutter of a generic background category O b g {\displaystyle O_{bg}} . Also let I t {\displaystyle I_{t}} be the set of training images used as the foreground category. The decision of whether I {\displaystyle I} contains an object from the foreground category, or only clutter from the background category is: R = p ( O f g | I , I t ) p ( O b g | I , I t ) = p ( I | I t , O f g ) p ( O f g ) p ( I | I t , O b g ) p ( O b g ) , {\displaystyle R={\frac {p(O_{fg}|I,I_{t})}{p(O_{bg}|I,I_{t})}}={\frac {p(I|I_{t},O_{fg})p(O_{fg})}{p(I|I_{t},O_{bg})p(O_{bg})}},} where the class posteriors p ( O f g | I , I t ) {\displaystyle p(O_{fg}|I,I_{t})} and p ( O b g | I , I t ) {\displaystyle p(O_{bg}|I,I_{t})} have been expanded by Bayes' theorem, yielding a ratio of likelihoods and a ratio of object category priors. We decide that the image I {\displaystyle I} contains an object from the foreground class if R {\displaystyle R} exceeds a certain threshold T {\displaystyle T} . We next introduce parametric models for the foreground and background categories with parameters θ {\displaystyle \theta } and θ b g {\displaystyle \theta _{bg}} respectively. This foreground parametric model is learned during the learning stage from I t {\displaystyle I_{t}} , as well as prior information of learned categories. The background model we assume to be uniform across images. Omitting the constant ratio of category priors, p ( O f g ) p ( O b g ) {\displaystyle {\frac {p(O_{fg})}{p(O_{bg})}}} , and parametrizing over θ {\displaystyle \theta } and θ b g {\displaystyle \theta _{bg}} yields R ∝ ∫ p ( I | θ , O f g ) p ( θ | I t , O f g ) d θ ∫ p ( I | θ b g , O b g ) p ( θ b g | I t , O b g ) d θ b g = ∫ p ( I | θ ) p ( θ | I t , O f g ) d θ ∫ p ( I | θ b g ) p ( θ b g | I t , O b g ) d θ b g {\displaystyle R\propto {\frac {\int {p(I|\theta ,O_{fg})p(\theta |I_{t},O_{fg})}d\theta }{\int {p(I|\theta _{bg},O_{bg})p(\theta _{bg}|I_{t},O_{bg})}d\theta _{bg}}}={\frac {\int {p(I|\theta )p(\theta |I_{t},O_{fg})}d\theta }{\int {p(I|\theta _{bg})p(\theta _{bg}|I_{t},O_{bg})}d\theta _{bg}}}} , having simplified p ( I | θ , O f g ) {\displaystyle p(I|\theta ,O_{fg})} and p ( I | θ , O b g ) {\displaystyle p(I|\theta ,O_{bg})} to p ( I | θ f g ) {\displaystyle p(I|\theta _{fg})} and p ( I | θ b g ) . {\displaystyle p(I|\theta _{bg}).} The posterior distribution of model parameters given the training images, p ( θ | I t , O f g ) {\displaystyle p(\theta |I_{t},O_{fg})} is estimated in the learning phase. In this estimation, one-shot learning differs sharply from more traditional Bayesian estimation models that approximate the integral as δ ( θ M L ) {\displaystyle \delta (\theta ^{ML})} . Instead, it uses a variational approach using prior information from previously learned categories. However, the traditional maximum likelihood estimation of the model parameters is used for the background model and the categories learned in advance through training. === Object category model === For each query image I {\displaystyle I} and training images I t {\displaystyle I_{t}} , a constellation model is used for representation. To obtain this model for a given image I {\displaystyle I} , first a set of N interesting regions is detected in the image using the Kadir–Brady saliency detector. Each region selected is represented by a location in the image, X i {\displaystyle X_{i}} and a description of its appearance, A i {\displaystyle A_{i}} . Letting X = ∑ i = 1 N X i , A = ∑ i = 1 N A i {\displaystyle X=\sum _{i=1}^{N}X_{i},A=\sum _{i=1}^{N}A_{i}} and X t {\displaystyle X_{t}} and A t {\displaystyle A_{t}} the analogous representations for training images, the expression for R becomes: R ∝ ∫ p ( X , A | θ , O f g ) p ( θ | X t , A t , O f g ) d θ ∫ p ( X , A | θ b g , O b g ) p ( θ b g | X t , A t , O b g ) d θ b g = ∫ p ( X , A | θ ) p ( θ | X t , A t , O f g ) d θ ∫ p ( X , A | θ b g ) p ( θ b g | X t , A t , O b g ) d θ b g {\displaystyle R\propto {\frac {\int {p(X,A|\theta ,O_{fg})p(\theta |X_{t},A_{t},O_{fg})}d\theta }{\int {p(X,A|\theta _{bg},O_{bg})p(\theta _{bg}|X_{t},A_{t},O_{bg})}d\theta _{bg}}}={\frac {\int {p(X,A|\theta )p(\theta |X_{t},A_{t},O_{fg})}d\theta }{\int {p(X,A|\theta _{bg})p(\theta _{bg}|X_{t},A_{t},O_{bg})}\,d\theta _{bg}}}} The likelihoods p ( X , A | θ ) {\displaystyle p(X,A|\theta )} and p ( X , A | θ b g ) {\displaystyle p(X,A|\theta _{bg})} are represented as mixtures of constellation models. A typical constellation model has

    Read more →
  • Proper generalized decomposition

    Proper generalized decomposition

    The proper generalized decomposition (PGD) is an iterative numerical method for solving boundary value problems (BVPs), that is, partial differential equations constrained by a set of boundary conditions, such as the Poisson's equation or the Laplace's equation. The PGD algorithm computes an approximation of the solution of the BVP by successive enrichment. This means that, in each iteration, a new component (or mode) is computed and added to the approximation. In principle, the more modes obtained, the closer the approximation is to its theoretical solution. Unlike POD principal components, PGD modes are not necessarily orthogonal to each other. By selecting only the most relevant PGD modes, a reduced order model of the solution is obtained. Because of this, PGD is considered a dimensionality reduction algorithm. == Description == The proper generalized decomposition is a method characterized by a variational formulation of the problem, a discretization of the domain in the style of the finite element method, the assumption that the solution can be approximated as a separate representation and a numerical greedy algorithm to find the solution. === Variational formulation === In the Proper Generalized Decomposition method, the variational formulation involves translating the problem into a format where the solution can be approximated by minimizing (or sometimes maximizing) a functional. A functional is a scalar quantity that depends on a function, which in this case, represents our problem. The most commonly implemented variational formulation in PGD is the Bubnov-Galerkin method. This method is chosen for its ability to provide an approximate solution to complex problems, such as those described by partial differential equations (PDEs). In the Bubnov-Galerkin approach, the idea is to project the problem onto a space spanned by a finite number of basis functions. These basis functions are chosen to approximate the solution space of the problem. In the Bubnov-Galerkin method, we seek an approximate solution that satisfies the integral form of the PDEs over the domain of the problem. This is different from directly solving the differential equations. By doing so, the method transforms the problem into finding the coefficients that best fit this integral equation in the chosen function space. While the Bubnov-Galerkin method is prevalent, other variational formulations are also used in PGD, depending on the specific requirements and characteristics of the problem, such as: Petrov-Galerkin Method: This method is similar to the Bubnov-Galerkin approach but differs in the choice of test functions. In the Petrov-Galerkin method, the test functions (used to project the residual of the differential equation) are different from the trial functions (used to approximate the solution). This can lead to improved stability and accuracy for certain types of problems. Collocation Method: In collocation methods, the differential equation is satisfied at a finite number of points in the domain, known as collocation points. This approach can be simpler and more direct than the integral-based methods like Galerkin's, but it may also be less stable for some problems. Least Squares Method: This approach involves minimizing the square of the residual of the differential equation over the domain. It is particularly useful when dealing with problems where traditional methods struggle with stability or convergence. Mixed Finite Element Method: In mixed methods, additional variables (such as fluxes or gradients) are introduced and approximated along with the primary variable of interest. This can lead to more accurate and stable solutions for certain problems, especially those involving incompressibility or conservation laws. Discontinuous Galerkin Method: This is a variant of the Galerkin method where the solution is allowed to be discontinuous across element boundaries. This method is particularly useful for problems with sharp gradients or discontinuities. === Domain discretization === The discretization of the domain is a well defined set of procedures that cover (a) the creation of finite element meshes, (b) the definition of basis function on reference elements (also called shape functions) and (c) the mapping of reference elements onto the elements of the mesh. === Separate representation === PGD assumes that the solution u of a (multidimensional) problem can be approximated as a separate representation of the form u ≈ u N ( x 1 , x 2 , … , x d ) = ∑ i = 1 N X 1 i ( x 1 ) ⋅ X 2 i ( x 2 ) ⋯ X d i ( x d ) , {\displaystyle \mathbf {u} \approx \mathbf {u} ^{N}(x_{1},x_{2},\ldots ,x_{d})=\sum _{i=1}^{N}\mathbf {X_{1}} _{i}(x_{1})\cdot \mathbf {X_{2}} _{i}(x_{2})\cdots \mathbf {X_{d}} _{i}(x_{d}),} where the number of addends N and the functional products X1(x1), X2(x2), ..., Xd(xd), each depending on a variable (or variables), are unknown beforehand. === Greedy algorithm === The solution is sought by applying a greedy algorithm, usually the fixed point algorithm, to the weak formulation of the problem. For each iteration i of the algorithm, a mode of the solution is computed. Each mode consists of a set of numerical values of the functional products X1(x1), ..., Xd(xd), which enrich the approximation of the solution. Due to the greedy nature of the algorithm, the term 'enrich' is used rather than 'improve', since some modes may actually worsen the approach. The number of computed modes required to obtain an approximation of the solution below a certain error threshold depends on the stopping criterion of the iterative algorithm. == Features == PGD is suitable for solving high-dimensional problems, since it overcomes the limitations of classical approaches. In particular, PGD avoids the curse of dimensionality, as solving decoupled problems is computationally much less expensive than solving multidimensional problems. Therefore, PGD enables to re-adapt parametric problems into a multidimensional framework by setting the parameters of the problem as extra coordinates: u ≈ u N ( x 1 , … , x d ; k 1 , … , k p ) = ∑ i = 1 N X 1 i ( x 1 ) ⋯ X d i ( x d ) ⋅ K 1 i ( k 1 ) ⋯ K p i ( k p ) , {\displaystyle \mathbf {u} \approx \mathbf {u} ^{N}(x_{1},\ldots ,x_{d};k_{1},\ldots ,k_{p})=\sum _{i=1}^{N}\mathbf {X_{1}} _{i}(x_{1})\cdots \mathbf {X_{d}} _{i}(x_{d})\cdot \mathbf {K_{1}} _{i}(k_{1})\cdots \mathbf {K_{p}} _{i}(k_{p}),} where a series of functional products K1(k1), K2(k2), ..., Kp(kp), each depending on a parameter (or parameters), has been incorporated to the equation. In this case, the obtained approximation of the solution is called computational vademecum: a general meta-model containing all the particular solutions for every possible value of the involved parameters. == Sparse Subspace Learning == The Sparse Subspace Learning (SSL) method leverages the use of hierarchical collocation to approximate the numerical solution of parametric models. With respect to traditional projection-based reduced order modeling, the use of a collocation enables non-intrusive approach based on sparse adaptive sampling of the parametric space. This allows to recover the lowdimensional structure of the parametric solution subspace while also learning the functional dependency from the parameters in explicit form. A sparse low-rank approximate tensor representation of the parametric solution can be built through an incremental strategy that only needs to have access to the output of a deterministic solver. Non-intrusiveness makes this approach straightforwardly applicable to challenging problems characterized by nonlinearity or non affine weak forms.

    Read more →