AI Detector Zero Chatgpt

AI Detector Zero Chatgpt — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • CloudPassage

    CloudPassage

    CloudPassage is a company that provides an automation platform, delivered via software as a service, that improves security for private, public, and hybrid cloud computing environments. CloudPassage is headquartered in San Francisco. == History == CloudPassage was founded by Carson Sweet, Talli Somekh, and Vitaliy Geraymovych in 2010. The company used cloud computing and big data analytics to implement security monitoring and control in a platform called Halo. CloudPassage spent a year in stealth developing the Halo technology, coming out of stealth mode to a closed beta in January 2011. In June 2012, the company launched the commercial product that included configuration security monitoring, network microsegmentation, and two-factor authentication for privileged access management. By 2013, CloudPassage expanded Halo to support large enterprises with advanced security and compliance requirements with a product called Halo Enterprise. The first round of venture funding for the company raised $6.5 million. In April 2012, CloudPassage raised $14 million. The financing round was led by Tenaya Capital. In February 2014, CloudPassage announced that it had raised $25.5 million in funding led by Shasta Ventures. In total, the company has invested over $30 million in its technology and raised approximately $88 million in capital. == Product == The CloudPassage platform provides cloud workload security and compliance for systems hosted in public or private cloud infrastructure environments, including hybrid cloud and multi-cloud workload hosting models. The flagship product the company offers is called Halo. Halo secures virtual servers in public, private, and hybrid cloud infrastructures and provides file integrity monitoring (FIM) while also administering firewall automation, vulnerability monitoring, network access control, security event alerting, and assessment. The Halo platform also provides security applications such as privileged access management, software vulnerability scanning, multifactor authentication, and log-based IDS. In December 2013, CloudPassage set up six servers with Microsoft Windows and Linux operating systems and combinations of popular programs and invited hackers to attempt to hack into the servers. The top prize was $5,000 and the winning hacker was a novice that completed the task in four hours. CloudPassage programmed the servers to use basic default security settings to show how vulnerable cloud computing programs can be to security threats. == Awards and recognition == In May 2011, Gigaom named CloudPassage in its list of the Top 50 Cloud Innovators. That same month, eWeek recognized CloudPassage as one of 16 Hot Startup Companies Flying Under the Radar. SC Magazine named CloudPassage an Industry Innovator in the Virtualization and Cloud Security category in 2012. Also in 2012, The Wall Street Journal named CloudPassage a runner-up in the Information Security category of its Technology Innovation Awards. The CloudPassage large-scale security program, Halo, won Best Security Solution in 2014 at the SIIA Codie awards.

    Read more →
  • Conference on Computer Vision and Pattern Recognition

    Conference on Computer Vision and Pattern Recognition

    The Conference on Computer Vision and Pattern Recognition is an annual conference on computer vision and pattern recognition. == Affiliations == The conference was first held in 1983 in Washington, DC, organized by Takeo Kanade and Dana H. Ballard. From 1985 to 2010 it was sponsored by the IEEE Computer Society. In 2011 it was also co-sponsored by University of Colorado Colorado Springs. Since 2012 it has been co-sponsored by the IEEE Computer Society and the Computer Vision Foundation, which provides open access to the conference papers. == Scope == The conference considers a wide range of topics related to computer vision and pattern recognition—basically any topic that is extracting structures or answers from images or video or applying mathematical methods to data to extract or recognize patterns. Common topics include object recognition, image segmentation, motion estimation, 3D reconstruction, and deep learning. The conference generally has less than 30% acceptance rates for all papers and less than 5% for oral presentations. It is managed by a rotating group of volunteers who are chosen in a public election at the Pattern Analysis and Machine Intelligence-Technical Community (PAMI-TC) meeting four years before the meeting. The conference uses a multi-tier double-blind peer review process. The program chairs, who cannot submit papers, select area chairs who manage the reviewers for their subset of submissions. == Location and time == The conference is usually held in June in North America. == Awards == === Best Paper Award === These awards are picked by committees delegated by the program chairs of the conference. === Longuet-Higgins Prize === The Longuet-Higgins Prize recognizes papers from ten years ago that have made a significant impact on computer vision research. === PAMI Young Researcher Award === The Pattern Analysis and Machine Intelligence Young Researcher Award is an award given by the Technical Committee on Pattern Analysis and Machine Intelligence of the IEEE Computer Society to a researcher within 7 years of completing their Ph.D. for outstanding early career research contributions. Candidates are nominated by the computer vision community, with winners selected by a committee of senior researchers in the field. This award was originally instituted in 2012 by the journal Image and Vision Computing, also presented at the conference, and the journal continues to sponsor the award. === PAMI Thomas S. Huang Memorial Prize === The Thomas Huang Memorial Prize was established at the 2020 conference and is awarded annually starting from 2021 to honor researchers who are recognized as examples in research, teaching/mentoring, and service to the computer vision community.

    Read more →
  • FERET database

    FERET database

    The Facial Recognition Technology (FERET) database is a dataset used for facial recognition system evaluation as part of the Face Recognition Technology (FERET) program. It was first established in 1993 under a collaborative effort between Harry Wechsler at George Mason University and Jonathon Phillips at the Army Research Laboratory in Adelphi, Maryland. The FERET database serves as a standard database of facial images for researchers to use to develop various algorithms and report results. The use of a common database also allowed one to compare the effectiveness of different approaches in methodology and gauge their strengths and weaknesses. The facial images for the database were collected between December 1993 and August 1996, accumulating a total of 14,126 images pertaining to 1,199 individuals along with 365 duplicate sets of images that were taken on a different day. In 2003, the Defense Advanced Research Projects Agency (DARPA) released a high-resolution, 24-bit color version of these images. The dataset tested includes 2,413 still facial images, representing 856 individuals. The FERET database has been used by more than 460 research groups and is managed by the National Institute of Standards and Technology (NIST).

    Read more →
  • Neural Networks (journal)

    Neural Networks (journal)

    Neural Networks is a monthly peer-reviewed scientific journal and an official journal of the International Neural Network Society, European Neural Network Society, and Japanese Neural Network Society. == History == The journal was established in 1988 and is published by Elsevier. It covers all aspects of research on artificial neural networks. The founding editor-in-chief was Stephen Grossberg (Boston University). The current editors-in-chief are DeLiang Wang (Ohio State University) and Taro Toyoizumi (RIKEN Center for Brain Science). == Abstracting and indexing == The journal is abstracted and indexed in Scopus and the Science Citation Index Expanded. According to the Journal Citation Reports, the journal has a 2022 impact factor of 7.8.

    Read more →
  • Mountain car problem

    Mountain car problem

    Mountain Car, a standard testing domain in Reinforcement learning, is a problem in which an under-powered car must drive up a steep hill. Since gravity is stronger than the car's engine, even at full throttle, the car cannot simply accelerate up the steep slope. The car is situated in a valley and must learn to leverage potential energy by driving up the opposite hill before the car is able to make it to the goal at the top of the rightmost hill. The domain has been used as a test bed in various reinforcement learning papers. == Introduction == The mountain car problem, although fairly simple, is commonly applied because it requires a reinforcement learning agent to learn on two continuous variables: position and velocity. For any given state (position and velocity) of the car, the agent is given the possibility of driving left, driving right, or not using the engine at all. In the standard version of the problem, the agent receives a negative reward at every time step when the goal is not reached; the agent has no information about the goal until an initial success. == History == The mountain car problem appeared first in Andrew Moore's PhD thesis (1990). It was later more strictly defined in Singh and Sutton's reinforcement learning paper with eligibility traces. The problem became more widely studied when Sutton and Barto added it to their book Reinforcement Learning: An Introduction (1998). Throughout the years many versions of the problem have been used, such as those which modify the reward function, termination condition, and the start state. == Techniques used to solve mountain car == Q-learning and similar techniques for mapping discrete states to discrete actions need to be extended to be able to deal with the continuous state space of the problem. Approaches often fall into one of two categories, state space discretization or function approximation. === Discretization === In this approach, two continuous state variables are pushed into discrete states by bucketing each continuous variable into multiple discrete states. This approach works with properly tuned parameters but a disadvantage is information gathered from one state is not used to evaluate another state. Tile coding can be used to improve discretization and involves continuous variables mapping into sets of buckets offset from one another. Each step of training has a wider impact on the value function approximation because when the offset grids are summed, the information is diffused. === Function approximation === Function approximation is another way to solve the mountain car. By choosing a set of basis functions beforehand, or by generating them as the car drives, the agent can approximate the value function at each state. Unlike the step-wise version of the value function created with discretization, function approximation can more cleanly estimate the true smooth function of the mountain car domain. === Eligibility traces === One aspect of the problem involves the delay of actual reward. The agent is not able to learn about the goal until a successful completion. Given a naive approach for each trial the car can only backup the reward of the goal slightly. This is a problem for naive discretization because each discrete state will only be backed up once, taking a larger number of episodes to learn the problem. This problem can be alleviated via the mechanism of eligibility traces, which will automatically backup the reward given to states before, dramatically increasing the speed of learning. Eligibility traces can be viewed as a bridge from temporal difference learning methods to Monte Carlo methods. == Technical details == The mountain car problem has undergone many iterations. This section focuses on the standard well-defined version from Sutton (2008). === State variables === Two-dimensional continuous state space. V e l o c i t y = ( − 0.07 , 0.07 ) {\displaystyle Velocity=(-0.07,0.07)} P o s i t i o n = ( − 1.2 , 0.6 ) {\displaystyle Position=(-1.2,0.6)} === Actions === One-dimensional discrete action space. m o t o r = ( l e f t , n e u t r a l , r i g h t ) {\displaystyle motor=(left,neutral,right)} === Reward === For every time step: r e w a r d = − 1 {\displaystyle reward=-1} === Update function === For every time step: A c t i o n = [ − 1 , 0 , 1 ] {\displaystyle Action=[-1,0,1]} V e l o c i t y = V e l o c i t y + ( A c t i o n ) ∗ 0.001 + cos ⁡ ( 3 ∗ P o s i t i o n ) ∗ ( − 0.0025 ) {\displaystyle Velocity=Velocity+(Action)0.001+\cos(3Position)(-0.0025)} P o s i t i o n = P o s i t i o n + V e l o c i t y {\displaystyle Position=Position+Velocity} === Starting condition === Optionally, many implementations include randomness in both parameters to show better generalized learning. P o s i t i o n = − 0.5 {\displaystyle Position=-0.5} V e l o c i t y = 0.0 {\displaystyle Velocity=0.0} === Termination condition === End the simulation when: P o s i t i o n ≥ 0.6 {\displaystyle Position\geq 0.6} == Variations == There are many versions of the mountain car which deviate in different ways from the standard model. Variables that vary include but are not limited to changing the constants (gravity and steepness) of the problem so specific tuning for specific policies become irrelevant and altering the reward function to affect the agent's ability to learn in a different manner. An example is changing the reward to be equal to the distance from the goal, or changing the reward to zero everywhere and one at the goal. Additionally, a 3D mountain car can be used, with a 4D continuous state space.

    Read more →
  • Physical neural network

    Physical neural network

    A physical neural network is a type of artificial neural network in which an electrically adjustable material is used to emulate the function of a neural synapse or a higher-order (dendritic) neuron model. "Physical" neural network is used to emphasize the reliance on physical hardware used to emulate neurons as opposed to software-based approaches. More generally the term is applicable to other artificial neural networks in which a memristor or other electrically adjustable resistance material is used to emulate a neural synapse. == Types of physical neural networks == === ADALINE === In the 1960s Bernard Widrow and Ted Hoff developed ADALINE (Adaptive Linear Neuron) which used electrochemical cells called memistors (memory resistors) to emulate synapses of an artificial neuron. The memistors were implemented as 3-terminal devices operating based on the reversible electroplating of copper such that the resistance between two of the terminals is controlled by the integral of the current applied via the third terminal. The ADALINE circuitry was briefly commercialized by the Memistor Corporation in the 1960s enabling some applications in pattern recognition. However, since the memistors were not fabricated using integrated circuit fabrication techniques the technology was not scalable and was eventually abandoned as solid-state electronics became mature. === Analog VLSI === In 1989 Carver Mead published his book Analog VLSI and Neural Systems, which spun off perhaps the most common variant of analog neural networks. The physical realization is implemented in analog VLSI. This is often implemented as field effect transistors in low inversion. Such devices can be modelled as translinear circuits. This is a technique described by Barrie Gilbert in several papers around mid 1970th, and in particular his Translinear Circuits from 1981. With this method circuits can be analyzed as a set of well-defined functions in steady-state, and such circuits assembled into complex networks. === Physical Neural Network === Alex Nugent describes a physical neural network as one or more nonlinear neuron-like nodes used to sum signals and nanoconnections formed from nanoparticles, nanowires, or nanotubes which determine the signal strength input to the nodes. Alignment or self-assembly of the nanoconnections is determined by the history of the applied electric field performing a function analogous to neural synapses. Numerous applications for such physical neural networks are possible. For example, a temporal summation device can be composed of one or more nanoconnections having an input and an output thereof, wherein an input signal provided to the input causes one or more of the nanoconnection to experience an increase in connection strength thereof over time. Another example of a physical neural network is taught by U.S. Patent No. 7,039,619 entitled "Utilized nanotechnology apparatus using a neural network, a solution and a connection gap," which issued to Alex Nugent by the U.S. Patent & Trademark Office on May 2, 2006. A further application of physical neural network is shown in U.S. Patent No. 7,412,428 entitled "Application of hebbian and anti-hebbian learning to nanotechnology-based physical neural networks," which issued on August 12, 2008. Nugent and Molter have shown that universal computing and general-purpose machine learning are possible from operations available through simple memristive circuits operating the AHaH plasticity rule. More recently, it has been argued that also complex networks of purely memristive circuits can serve as neural networks. === Phase change neural network === In 2002, Stanford Ovshinsky described an analog neural computing medium in which phase-change material has the ability to cumulatively respond to multiple input signals. An electrical alteration of the resistance of the phase change material is used to control the weighting of the input signals. === Memristive neural network === Greg Snider of HP Labs describes a system of cortical computing with memristive nanodevices. The memristors (memory resistors) are implemented by thin film materials in which the resistance is electrically tuned via the transport of ions or oxygen vacancies within the film. DARPA's SyNAPSE project has funded IBM Research and HP Labs, in collaboration with the Boston University Department of Cognitive and Neural Systems (CNS), to develop neuromorphic architectures which may be based on memristive systems. === Protonic artificial synapses === In 2022, researchers reported the development of nanoscale brain-inspired artificial synapses, using the ion proton (H+), for 'analog deep learning'.

    Read more →
  • Extremal Ensemble Learning

    Extremal Ensemble Learning

    Extremal Ensemble Learning (EEL) is a machine learning algorithmic paradigm for graph partitioning. EEL creates an ensemble of partitions and then uses information contained in the ensemble to find new and improved partitions. The ensemble evolves and learns how to form improved partitions through extremal updating procedure. The final solution is found by achieving consensus among its member partitions about what the optimal partition is. == Reduced-Network Extremal Ensemble Learning (RenEEL) == A particular implementation of the EEL paradigm is the Reduced-Network Extremal Ensemble Learning (RenEEL) scheme for partitioning a graph. RenEEL uses consensus across many partitions in an ensemble to create a reduced network that can be efficiently analyzed to find more accurate partitions. These better quality partitions are subsequently used to update the ensemble. An algorithm that utilizes the RenEEL scheme is currently the best algorithm for finding the graph partition with maximum modularity, which is an NP-hard problem.

    Read more →
  • Huber loss

    Huber loss

    In statistics, the Huber loss is a loss function used in robust regression, that is less sensitive to outliers in data than the squared error loss. A variant for classification is also sometimes used. == Definition == The Huber loss function describes the penalty incurred by an estimation procedure f. Huber (1964) defines the loss function piecewise by L δ ( a ) = { 1 2 a 2 for | a | ≤ δ , δ ⋅ ( | a | − 1 2 δ ) , otherwise. {\displaystyle L_{\delta }(a)={\begin{cases}{\frac {1}{2}}{a^{2}}&{\text{for }}|a|\leq \delta ,\\[4pt]\delta \cdot \left(|a|-{\frac {1}{2}}\delta \right),&{\text{otherwise.}}\end{cases}}} This function is quadratic for small values of a, and linear for large values, with equal values and slopes of the different sections at the two points where | a | = δ {\displaystyle |a|=\delta } . The variable a often refers to the residuals, that is to the difference between the observed and predicted values a = y − f ( x ) {\displaystyle a=y-f(x)} , so the former can be expanded to L δ ( y , f ( x ) ) = { 1 2 ( y − f ( x ) ) 2 for | y − f ( x ) | ≤ δ , δ ⋅ ( | y − f ( x ) | − 1 2 δ ) , otherwise. {\displaystyle L_{\delta }(y,f(x))={\begin{cases}{\frac {1}{2}}{\left(y-f(x)\right)}^{2}&{\text{for }}\left|y-f(x)\right|\leq \delta ,\\[4pt]\delta \ \cdot \left(\left|y-f(x)\right|-{\frac {1}{2}}\delta \right),&{\text{otherwise.}}\end{cases}}} The Huber loss is the convolution of the absolute value function with the rectangular function, scaled and translated. Thus it "smoothens out" the former's corner at the origin. == Motivation == Two very commonly used loss functions are the squared loss, L ( a ) = a 2 {\displaystyle L(a)=a^{2}} , and the absolute loss, L ( a ) = | a | {\displaystyle L(a)=|a|} . The squared loss function results in an arithmetic mean-unbiased estimator, and the absolute-value loss function results in a median-unbiased estimator (in the one-dimensional case, and a geometric median-unbiased estimator for the multi-dimensional case). The squared loss has the disadvantage that it has the tendency to be dominated by outliers—when summing over a set of a {\displaystyle a} 's (as in ∑ i = 1 n L ( a i ) {\textstyle \sum _{i=1}^{n}L(a_{i})} ), the sample mean is influenced too much by a few particularly large a {\displaystyle a} -values when the distribution is heavy tailed: in terms of estimation theory, the asymptotic relative efficiency of the mean is poor for heavy-tailed distributions. As defined above, the Huber loss function is strongly convex in a uniform neighborhood of its minimum a = 0 {\displaystyle a=0} ; at the boundary of this uniform neighborhood, the Huber loss function has a differentiable extension to an affine function at points a = − δ {\displaystyle a=-\delta } and a = δ {\displaystyle a=\delta } . These properties allow it to combine much of the sensitivity of the mean-unbiased, minimum-variance estimator of the mean (using the quadratic loss function) and the robustness of the median-unbiased estimator (using the absolute value function). == Pseudo-Huber loss function == The Pseudo-Huber loss function can be used as a smooth approximation of the Huber loss function. It combines the best properties of L2 squared loss and L1 absolute loss by being strongly convex when close to the target/minimum and less steep for extreme values. The scale at which the Pseudo-Huber loss function transitions from L2 loss for values close to the minimum to L1 loss for extreme values and the steepness at extreme values can be controlled by the δ {\displaystyle \delta } value. The Pseudo-Huber loss function ensures that derivatives are continuous for all degrees. It is defined as L δ ( a ) = δ 2 ( 1 + ( a / δ ) 2 − 1 ) . {\displaystyle L_{\delta }(a)=\delta ^{2}\left({\sqrt {1+(a/\delta )^{2}}}-1\right).} As such, this function approximates a 2 / 2 {\displaystyle a^{2}/2} for small values of a {\displaystyle a} , and approximates a straight line with slope δ {\displaystyle \delta } for large values of a {\displaystyle a} . While the above is the most common form, other smooth approximations of the Huber loss function also exist. == Variant for classification == For classification purposes, a variant of the Huber loss called modified Huber is sometimes used. Given a prediction f ( x ) {\displaystyle f(x)} (a real-valued classifier score) and a true binary class label y ∈ { + 1 , − 1 } {\displaystyle y\in \{+1,-1\}} , the modified Huber loss is defined as L ( y , f ( x ) ) = { max ( 0 , 1 − y f ( x ) ) 2 for y f ( x ) > − 1 , − 4 y f ( x ) otherwise. {\displaystyle L(y,f(x))={\begin{cases}\max(0,1-y\,f(x))^{2}&{\text{for }}\,\,y\,f(x)>-1,\\[4pt]-4y\,f(x)&{\text{otherwise.}}\end{cases}}} The term max ( 0 , 1 − y f ( x ) ) {\displaystyle \max(0,1-y\,f(x))} is the hinge loss used by support vector machines; the quadratically smoothed hinge loss is a generalization of L {\displaystyle L} . == Applications == The Huber loss function is used in robust statistics, M-estimation and additive modelling.

    Read more →
  • Pydio

    Pydio

    Pydio Cells, previously known as just Pydio and formerly known as AjaXplorer, is an open-source file-sharing and synchronisation software that runs on the user's own server or in the cloud. == Presentation == The project was created by musician Charles Du Jeu (current CEO and CTO) in 2007 under the name AjaXplorer. The name was changed in 2013 and became Pydio (an acronym for Put Your Data in Orbit). In May 2018, Pydio switched from PHP to Go with the release of Pydio Cells. The PHP version reached end-of-life state on 31 December 2019. Pydio Cells runs on any server supporting a recent Go version. Windows/Linux/macOS on the Intel architecture are directly supported; a fully functional working ARM implementation is under active development. Pydio Cells has been developed from scratch using the Go programming language; release 4.0.0 introduced code refactoring to fully support the Go modular structure as well as grid computing. Nevertheless, the web-based interface of Cells is very similar to the one from Pydio 8 (in PHP), and it successfully replicates most of its features, while adding a few more. There is also a new synchronisation client (also written in Go). The PHP version has been phased out as the company's focus is moving to Pydio Cells, with community feedback on the new features. According to the company, the switch to the new environment was made "to overcome inherent PHP limitations and provide you with a future-proof and modern solution for collaborating on documents". From a technical point of view, Pydio differs from solutions such as Google Drive or Dropbox. Pydio is not based on a public cloud; instead, the software connects to the user's existing storage (such as SAN / Local FS, SAMBA / CIFS, (s)FTP, NFS, S3-compatible cloud storage, Azure Blob Storage, Google Cloud Storage) as well as to the existing user directories (LDAP / AD, OAuth2 / OIDC SSO, SAML / Azure ADFS SSO, RADIUS, Shibboleth...), which allows companies to keep their data inside their infrastructure, according to their data security policy and user rights management. The software is built in a modular perspective; up to Pydio 8, various plugins allowed administrators to implement extra features. On the server side, Pydio Cells is deployed as a collection of independent microservices communicating among themselves using gRPC and logging user actions via Activity Streams 2.0 (AS2). Pydio Cells microservices are built with the Go Micro framework (using an embedded NATS server). A standard installation will deploy all required services on the same physical server, but for the purposes of performance, reliability and high availability, these can now be spread across several different servers (even in geographically separate locations) according to the 12-factors architecture pattern. Pydio Cells is available either through a free and open-source community distribution (Pydio Cells Home), or a commercially-licensed enterprise distribution (in two variants, Pydio Cells Connect and Pydio Cells Enterprise), which add features not available in the community distribution as well as additional levels of support beyond the community forums. == Features == File sharing between different internal users and across other Pydio instances SSL/TLS Encryption WebDAV file server Creation of dedicated workspaces, for each line of business / project / client, with a dedicated user rights management for each workspace. File-sharing with external users (private links, public links, password protection, download limitation, etc.) Online viewing and editing of documents with Collabora Office (Pydio Cells Enterprise also offers OnlyOffice integration) Preview and editing of image files Integrated audio and video reader Activity stream ('timeline') for all actions taken by users Integrated chat platform Client applications are available for all major desktop and mobile platforms.

    Read more →
  • Correspondence analysis

    Correspondence analysis

    Correspondence analysis (CA) is a multivariate statistical technique proposed by Herman Otto Hartley (Hirschfeld) and later developed by Jean-Paul Benzécri. It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. In a manner similar to principal component analysis, it provides a means of displaying or summarising a set of data in two-dimensional graphical form. Its aim is to display in a biplot any structure hidden in the multivariate setting of the data table. As such it is a technique from the field of multivariate ordination. Since the variant of CA described here can be applied either with a focus on the rows or on the columns it should in fact be called simple (symmetric) correspondence analysis. It is traditionally applied to the contingency table of a pair of nominal variables where each cell contains either a count or a zero value. If more than two categorical variables are to be summarized, a variant called multiple correspondence analysis should be chosen instead. CA may also be applied to binary data given the presence/absence coding represents simplified count data i.e. a 1 describes a positive count and 0 stands for a count of zero. Depending on the scores used CA preserves the chi-square distance between either the rows or the columns of the table. Because CA is a descriptive technique, it can be applied to tables regardless of a significant chi-squared test. Although the χ 2 {\displaystyle \chi ^{2}} statistic used in inferential statistics and the chi-square distance are computationally related they should not be confused since the latter works as a multivariate statistical distance measure in CA while the χ 2 {\displaystyle \chi ^{2}} statistic is in fact a scalar not a metric. == Details == Like principal components analysis, correspondence analysis creates orthogonal components (or axes) and, for each item in a table i.e. for each row, a set of scores (sometimes called factor scores, see Factor analysis). Correspondence analysis is performed on the data table, conceived as matrix C of size m × n where m is the number of rows and n is the number of columns. In the following mathematical description of the method capital letters in italics refer to a matrix while letters in italics refer to vectors. Understanding the following computations requires knowledge of matrix algebra. === Preprocessing === Before proceeding to the central computational step of the algorithm, the values in matrix C have to be transformed. First compute a set of weights for the columns and the rows (sometimes called masses), where row and column weights are given by the row and column vectors, respectively: w m = 1 n C C 1 , w n = 1 n C 1 T C . {\displaystyle w_{m}={\frac {1}{n_{C}}}C\mathbf {1} ,\quad w_{n}={\frac {1}{n_{C}}}\mathbf {1} ^{T}C.} Here n C = ∑ i = 1 n ∑ j = 1 m C i j {\displaystyle n_{C}=\sum _{i=1}^{n}\sum _{j=1}^{m}C_{ij}} is the sum of all cell values in matrix C, or short the sum of C, and 1 {\displaystyle \mathbf {1} } is a column vector of ones with the appropriate dimension. Put in simple words, w m {\displaystyle w_{m}} is just a vector whose elements are the row sums of C divided by the sum of C, and w n {\displaystyle w_{n}} is a vector whose elements are the column sums of C divided by the sum of C. The weights are transformed into diagonal matrices W m = diag ⁡ ( 1 / w m ) {\displaystyle W_{m}=\operatorname {diag} (1/{\sqrt {w_{m}}})} and W n = diag ⁡ ( 1 / w n ) {\displaystyle W_{n}=\operatorname {diag} (1/{\sqrt {w_{n}}})} where the diagonal elements of W n {\displaystyle W_{n}} are 1 / w n {\displaystyle 1/{\sqrt {w_{n}}}} and those of W m {\displaystyle W_{m}} are 1 / w m {\displaystyle 1/{\sqrt {w_{m}}}} respectively i.e. the vector elements are the inverses of the square roots of the masses. The off-diagonal elements are all 0. Next, compute matrix P {\displaystyle P} by dividing C {\displaystyle C} by its sum P = 1 n C C . {\displaystyle P={\frac {1}{n_{C}}}C.} In simple words, Matrix P {\displaystyle P} is just the data matrix (contingency table or binary table) transformed into portions i.e. each cell value is just the cell portion of the sum of the whole table. Finally, compute matrix S {\displaystyle S} , sometimes called the matrix of standardized residuals, by matrix multiplication as S = W m ( P − w m w n ) W n {\displaystyle S=W_{m}(P-w_{m}w_{n})W_{n}} Note, the vectors w m {\displaystyle w_{m}} and w n {\displaystyle w_{n}} are combined in an outer product resulting in a matrix of the same dimensions as P {\displaystyle P} . In words the formula reads: matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} is subtracted from matrix P {\displaystyle P} and the resulting matrix is scaled (weighted) by the diagonal matrices W m {\displaystyle W_{m}} and W n {\displaystyle W_{n}} . Multiplying the resulting matrix by the diagonal matrices is equivalent to multiply the i-th row (or column) of it by the i-th element of the diagonal of W m {\displaystyle W_{m}} or W n {\displaystyle W_{n}} , respectively. === Interpretation of preprocessing === The vectors w m {\displaystyle w_{m}} and w n {\displaystyle w_{n}} are the row and column masses or the marginal probabilities for the rows and columns, respectively. Subtracting matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} from matrix P {\displaystyle P} is the matrix algebra version of double centering the data. Multiplying this difference by the diagonal weighting matrices results in a matrix containing weighted deviations from the origin of a vector space. This origin is defined by matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} . In fact matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} is identical with the matrix of expected frequencies in the chi-squared test. Therefore S {\displaystyle S} is computationally related to the independence model used in that test. But since CA is not an inferential method the term independence model is inappropriate here. === Orthogonal components === The table S {\displaystyle S} is then decomposed by a singular value decomposition as S = U Σ V ∗ {\displaystyle S=U\Sigma V^{}\,} where U {\displaystyle U} and V {\displaystyle V} are the left and right singular vectors of S {\displaystyle S} and Σ {\displaystyle \Sigma } is a square diagonal matrix with the singular values σ i {\displaystyle \sigma _{i}} of S {\displaystyle S} on the diagonal. Σ {\displaystyle \Sigma } is of dimension p ≤ ( min ( m , n ) − 1 ) {\displaystyle p\leq (\min(m,n)-1)} hence U {\displaystyle U} is of dimension m×p and V {\displaystyle V} is of n×p. As orthonormal vectors U {\displaystyle U} and V {\displaystyle V} fulfill U ∗ U = V ∗ V = I {\displaystyle U^{}U=V^{}V=I} . In other words, the multivariate information that is contained in C {\displaystyle C} as well as in S {\displaystyle S} is now distributed across two (coordinate) matrices U {\displaystyle U} and V {\displaystyle V} and a diagonal (scaling) matrix Σ {\displaystyle \Sigma } . The vector space defined by them has as number of dimensions p, that is the smaller of the two values, number of rows and number of columns, minus 1. === Inertia === While a principal component analysis may be said to decompose the (co)variance, and hence its measure of success is the amount of (co-)variance covered by the first few PCA axes - measured in eigenvalue -, a CA works with a weighted (co-)variance which is called inertia. The sum of the squared singular values is the total inertia I {\displaystyle \mathrm {I} } of the data table, computed as I = ∑ i = 1 p σ i 2 . {\displaystyle \mathrm {I} =\sum _{i=1}^{p}\sigma _{i}^{2}.} The total inertia I {\displaystyle \mathrm {I} } of the data table can also computed directly from S {\displaystyle S} as I = ∑ i = 1 n ∑ j = 1 m s i j 2 . {\displaystyle \mathrm {I} =\sum _{i=1}^{n}\sum _{j=1}^{m}s_{ij}^{2}.} The amount of inertia covered by the i-th set of singular vectors is ι i {\displaystyle \iota _{i}} , the principal inertia. The higher the portion of inertia covered by the first few singular vectors i.e. the larger the sum of the principal inertiae in comparison to the total inertia, the more successful a CA is. Therefore, all principal inertia values are expressed as portion ϵ i {\displaystyle \epsilon _{i}} of the total inertia ϵ i = σ i 2 / ∑ i = 1 p σ i 2 {\displaystyle \epsilon _{i}=\sigma _{i}^{2}/\sum _{i=1}^{p}\sigma _{i}^{2}} and are presented in the form of a scree plot. In fact a scree plot is just a bar plot of all principal inertia portions ϵ i {\displaystyle \epsilon _{i}} . === Coordinates === To transform the singular vectors to coordinates which preserve the chi-square distances between rows or columns an additional weighting step is necessary. The resulting coordinates are called principal coordinates in CA text books. If principal coordinates are used for

    Read more →
  • ID3 algorithm

    ID3 algorithm

    In decision tree learning, ID3 (Iterative Dichotomiser 3) is a greedy algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm. The 3 in the name is meant to signify that this was Quinlan's third attempt at a model based on entropy-based splitting, and the term dichotimser is a misnomer as it implies a binary split, but the ID3 algorithm can split on multi-valued attributes. == Algorithm == The ID3 algorithm begins with the original set S {\displaystyle S} as the root node. On each iteration of the algorithm, it iterates through every unused attribute of the set S {\displaystyle S} and calculates the entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} or the information gain I G ( S ) {\displaystyle IG(S)} of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set S {\displaystyle S} is then split or partitioned by the selected attribute to produce subsets of the data. (For example, a node can be split into child nodes based upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100.) The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases: every element in the subset belongs to the same class; in which case the node is turned into a leaf node and labelled with the class of the examples. there are no more attributes to be selected, but the examples still do not belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset. there are no examples in the subset, which happens when no example in the parent set was found to match a specific value of the selected attribute. An example could be the absence of a person among the population with age over 100 years. Then a leaf node is created and labelled with the most common class of the examples in the parent node's set. Throughout the algorithm, the decision tree is constructed with each non-terminal node (internal node) representing the selected attribute on which the data was split, and terminal nodes (leaf nodes) representing the class label of the final subset of this branch. === Summary === Calculate the entropy of every attribute a {\displaystyle a} of the data set S {\displaystyle S} . Partition ("split") the set S {\displaystyle S} into subsets using the attribute for which the resulting entropy after splitting is minimized; or, equivalently, information gain is maximum. Make a decision tree node containing that attribute. Recurse on subsets using the remaining attributes. === Properties === ID3 does not guarantee an optimal solution. It can converge upon local optima. It uses a greedy strategy by selecting the locally best attribute to split the dataset on each iteration. The algorithm's optimality can be improved by using backtracking during the search for the optimal decision tree at the cost of possibly taking longer. ID3 can overfit the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible decision tree. ID3 is harder to use on continuous data than on factored data (factored data has a discrete number of possible values, thus reducing the possible branch points). If the values of any given attribute are continuous, then there are many more places to split the data on this attribute, and searching for the best value to split by can be time-consuming. === Usage === The ID3 algorithm is used by training on a data set S {\displaystyle S} to produce a decision tree which is stored in memory. At runtime, this decision tree is used to classify new test cases (feature vectors) by traversing the decision tree using the features of the datum to arrive at a leaf node. == The ID3 metrics == === Entropy === Entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} is a measure of the amount of uncertainty in the (data) set S {\displaystyle S} (i.e. entropy characterizes the (data) set S {\displaystyle S} ). H ( S ) = ∑ x ∈ X − p ( x ) log 2 ⁡ p ( x ) {\displaystyle \mathrm {H} {(S)}=\sum _{x\in X}{-p(x)\log _{2}p(x)}} Where, S {\displaystyle S} – The current dataset for which entropy is being calculated This changes at each step of the ID3 algorithm, either to a subset of the previous set in the case of splitting on an attribute or to a "sibling" partition of the parent in case the recursion terminated previously. X {\displaystyle X} – The set of classes in S {\displaystyle S} p ( x ) {\displaystyle p(x)} – The proportion of the number of elements in class x {\displaystyle x} to the number of elements in set S {\displaystyle S} When H ( S ) = 0 {\displaystyle \mathrm {H} {(S)}=0} , the set S {\displaystyle S} is perfectly classified (i.e. all elements in S {\displaystyle S} are of the same class). In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest entropy is used to split the set S {\displaystyle S} on this iteration. Entropy in information theory measures how much information is expected to be gained upon measuring a random variable; as such, it can also be used to quantify the amount to which the distribution of the quantity's values is unknown. A constant quantity has zero entropy, as its distribution is perfectly known. In contrast, a uniformly distributed random variable (discretely or continuously uniform) maximizes entropy. Therefore, the greater the entropy at a node, the less information is known about the classification of data at this stage of the tree; and therefore, the greater the potential to improve the classification here. As such, ID3 is a greedy heuristic performing a best-first search for locally optimal entropy values. Its accuracy can be improved by preprocessing the data. === Information gain === Information gain I G ( A ) {\displaystyle IG(A)} is the measure of the difference in entropy from before to after the set S {\displaystyle S} is split on an attribute A {\displaystyle A} . In other words, how much uncertainty in S {\displaystyle S} was reduced after splitting set S {\displaystyle S} on attribute A {\displaystyle A} . I G ( S , A ) = H ( S ) − ∑ t ∈ T p ( t ) H ( t ) = H ( S ) − H ( S | A ) . {\displaystyle IG(S,A)=\mathrm {H} {(S)}-\sum _{t\in T}p(t)\mathrm {H} {(t)}=\mathrm {H} {(S)}-\mathrm {H} {(S|A)}.} Where, H ( S ) {\displaystyle \mathrm {H} (S)} – Entropy of set S {\displaystyle S} T {\displaystyle T} – The subsets created from splitting set S {\displaystyle S} by attribute A {\displaystyle A} such that S = ⋃ t ∈ T t {\displaystyle S=\bigcup _{t\in T}t} p ( t ) {\displaystyle p(t)} – The proportion of the number of elements in t {\displaystyle t} to the number of elements in set S {\displaystyle S} H ( t ) {\displaystyle \mathrm {H} (t)} – Entropy of subset t {\displaystyle t} In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the largest information gain is used to split the set S {\displaystyle S} on this iteration.

    Read more →
  • Algorithmic learning theory

    Algorithmic learning theory

    Algorithmic learning theory is a mathematical framework for analyzing machine learning problems and algorithms. Synonyms include formal learning theory and algorithmic inductive inference. Algorithmic learning theory is different from statistical learning theory in that it does not make use of statistical assumptions and analysis. Both algorithmic and statistical learning theory are concerned with machine learning and can thus be viewed as branches of computational learning theory. == Distinguishing characteristics == Unlike statistical learning theory and most statistical theory in general, algorithmic learning theory does not assume that data are random samples, that is, that data points are independent of each other. This makes the theory suitable for domains where observations are (relatively) noise-free but not random, such as language learning and automated scientific discovery. The fundamental concept of algorithmic learning theory is learning in the limit: as the number of data points increases, a learning algorithm should converge to a correct hypothesis on every possible data sequence consistent with the problem space. This is a non-probabilistic version of statistical consistency, which also requires convergence to a correct model in the limit, but allows a learner to fail on data sequences with probability measure 0 . Algorithmic learning theory investigates the learning power of Turing machines. Other frameworks consider a much more restricted class of learning algorithms than Turing machines, for example, learners that compute hypotheses more quickly, for instance in polynomial time. An example of such a framework is probably approximately correct learning . == Learning in the limit == The concept was introduced in E. Mark Gold's seminal paper "Language identification in the limit". The objective of language identification is for a machine running one program to be capable of developing another program by which any given sentence can be tested to determine whether it is "grammatical" or "ungrammatical". The language being learned need not be English or any other natural language - in fact the definition of "grammatical" can be absolutely anything known to the tester. In Gold's learning model, the tester gives the learner an example sentence at each step, and the learner responds with a hypothesis, which is a suggested program to determine grammatical correctness. It is required of the tester that every possible sentence (grammatical or not) appears in the list eventually, but no particular order is required. It is required of the learner that at each step the hypothesis must be correct for all the sentences so far. A particular learner is said to be able to "learn a language in the limit" if there is a certain number of steps beyond which its hypothesis no longer changes. At this point it has indeed learned the language, because every possible sentence appears somewhere in the sequence of inputs (past or future), and the hypothesis is correct for all inputs (past or future), so the hypothesis is correct for every sentence. The learner is not required to be able to tell when it has reached a correct hypothesis, all that is required is that it be true. Gold showed that any language which is defined by a Turing machine program can be learned in the limit by another Turing-complete machine using enumeration. This is done by the learner testing all possible Turing machine programs in turn until one is found which is correct so far - this forms the hypothesis for the current step. Eventually, the correct program will be reached, after which the hypothesis will never change again (but note that the learner does not know that it won't need to change). Gold also showed that if the learner is given only positive examples (that is, only grammatical sentences appear in the input, not ungrammatical sentences), then the language can only be guaranteed to be learned in the limit if there are only a finite number of possible sentences in the language (this is possible if, for example, sentences are known to be of limited length). Language identification in the limit is a highly abstract model. It does not allow for limits of runtime or computer memory which can occur in practice, and the enumeration method may fail if there are errors in the input. However the framework is very powerful, because if these strict conditions are maintained, it allows the learning of any program known to be computable. This is because a Turing machine program can be written to mimic any program in any conventional programming language. See Church-Turing thesis. == Other identification criteria == Learning theorists have investigated other learning criteria, such as the following. Efficiency: minimizing the number of data points required before convergence to a correct hypothesis. Mind Changes: minimizing the number of hypothesis changes that occur before convergence. Mind change bounds are closely related to mistake bounds that are studied in statistical learning theory. Kevin Kelly has suggested that minimizing mind changes is closely related to choosing maximally simple hypotheses in the sense of Occam’s Razor. == Annual conference == Since 1990, there is an International Conference on Algorithmic Learning Theory (ALT), called Workshop in its first years (1990–1997). Between 1992 and 2016, proceedings were published in the LNCS series. Starting from 2017, they are published by the Proceedings of Machine Learning Research. The 34th conference will be held in Singapore in Feb 2023. The topics of the conference cover all of theoretical machine learning, including statistical and computational learning theory, online learning, active learning, reinforcement learning, and deep learning.

    Read more →
  • Multiple buffering

    Multiple buffering

    In computer science, multiple buffering is the use of more than one buffer to hold a block of data, so that a "reader" will see a complete (though perhaps old) version of the data instead of a partially updated version of the data being created by a "writer". It is very commonly used for computer display images. It is also used to avoid the need to use dual-ported RAM (DPRAM) when the readers and writers are different devices. == Description == === Double buffering Petri net === The Petri net in the illustration shows double buffering. Transitions W1 and W2 represent writing to buffer 1 and 2 respectively while R1 and R2 represent reading from buffer 1 and 2 respectively. At the beginning, only the transition W1 is enabled. After W1 fires, R1 and W2 are both enabled and can proceed in parallel. When they finish, R2 and W1 proceed in parallel and so on. After the initial transient where W1 fires alone, this system is periodic and the transitions are enabled – always in pairs (R1 with W2 and R2 with W1 respectively). == Double buffering in computer graphics == In computer graphics, double buffering is a technique for drawing graphics that shows less stutter, tearing, and other artifacts. It is difficult for a program to draw a display so that pixels do not change more than once. For instance, when updating a page of text, it is much easier to clear the entire page and then draw the letters than to somehow erase only the pixels that are used in old letters but not in new ones. However, this intermediate image is seen by the user as flickering. In addition, computer monitors constantly redraw the visible video page (traditionally at around 60 times a second), so even a perfect update may be visible momentarily as a horizontal divider between the "new" image and the un-redrawn "old" image, known as tearing. === Software double buffering === A software implementation of double buffering has all drawing operations store their results in some region of system RAM; any such region is often called a "back buffer". When all drawing operations are considered complete, the whole region (or only the changed portion) is copied into the video RAM (the "front buffer"); this copying is usually synchronized with the monitor's raster beam in order to avoid tearing. Software implementations of double buffering necessarily require more memory and CPU time than single buffering because of the system memory allocated for the back buffer, the time for the copy operation, and the time waiting for synchronization. Compositing window managers often combine the "copying" operation with "compositing" used to position windows, transform them with scale or warping effects, and make portions transparent. Thus, the "front buffer" may contain only the composite image seen on the screen, while there is a different "back buffer" for every window containing the non-composited image of the entire window contents. === Page flipping === In the page-flip method, instead of copying the data, both buffers are capable of being displayed. At any one time, one buffer is actively being displayed by the monitor, while the other, background buffer is being drawn. When the background buffer is complete, the roles of the two are switched. The page-flip is typically accomplished by modifying a hardware register in the video display controller—the value of a pointer to the beginning of the display data in the video memory. The page-flip is much faster than copying the data and can guarantee that tearing will not be seen as long as the pages are switched over during the monitor's vertical blanking interval—the blank period when no video data is being drawn. The currently active and visible buffer is called the front buffer, while the background page is called the back buffer. == Triple buffering == In computer graphics, triple buffering is similar to double buffering but can provide improved performance. In double buffering, the program must wait until the finished drawing is copied or swapped before starting the next drawing. This waiting period could be several milliseconds during which neither buffer can be touched. In triple buffering, the program has two back buffers and can immediately start drawing in the one that is not involved in such copying. The third buffer, the front buffer, is read by the graphics card to display the image on the monitor. Once the image has been sent to the monitor, the front buffer is flipped with (or copied from) the back buffer holding the most recent complete image. Since one of the back buffers is always complete, the graphics card never has to wait for the software to complete. Consequently, the software and the graphics card are completely independent and can run at their own pace. Finally, the displayed image was started without waiting for synchronization and thus with minimum lag. Due to the software algorithm not polling the graphics hardware for monitor refresh events, the algorithm may continuously draw additional frames as fast as the hardware can render them. For frames that are completed much faster than interval between refreshes, it is possible to replace a back buffers' frames with newer iterations multiple times before copying. This means frames may be written to the back buffer that are never used at all before being overwritten by successive frames. Nvidia has implemented this method under the name "Fast Sync". An alternative method sometimes referred to as triple buffering is a swap chain three buffers long. After the program has drawn both back buffers, it waits until the first one is placed on the screen, before drawing another back buffer (i.e. it is a 3-long first in, first out queue). Most Windows games seem to refer to this method when enabling triple buffering. == Quad buffering == The term quad buffering is the use of double buffering for each of the left and right eye images in stereoscopic implementations, thus four buffers total (if triple buffering was used then there would be six buffers). The command to swap or copy the buffer typically applies to both pairs at once, so at no time does one eye see an older image than the other eye. Quad buffering requires special support in the graphics card drivers which is disabled for most consumer cards. AMD's Radeon HD 6000 Series and newer support it. 3D standards like OpenGL and Direct3D support quad buffering. == Double buffering for DMA == The term double buffering is used for copying data between two buffers for direct memory access (DMA) transfers, not for enhancing performance, but to meet specific addressing requirements of a device (particularly 32-bit devices on systems with wider addressing provided via Physical Address Extension). Windows device drivers are a place where the term "double buffering" is likely to be used. Linux and BSD source code calls these "bounce buffers". Some programmers try to avoid this kind of double buffering with zero-copy techniques. == Other uses == Double buffering is also used as a technique to facilitate interlacing or deinterlacing of video signals.

    Read more →
  • Neural cryptography

    Neural cryptography

    Neural cryptography is a branch of cryptography dedicated to analyzing the application of stochastic algorithms, especially artificial neural network algorithms, for use in encryption and cryptanalysis. == Definition == Artificial neural networks are well known for their ability to selectively explore the solution space of a given problem. This feature finds a natural niche of application in the field of cryptanalysis. At the same time, neural networks offer a new approach to attack ciphering algorithms based on the principle that any function could be reproduced by a neural network, which is a powerful proven computational tool that can be used to find the inverse-function of any cryptographic algorithm. The ideas of mutual learning, self learning, and stochastic behavior of neural networks and similar algorithms can be used for different aspects of cryptography, like public-key cryptography, solving the key distribution problem using neural network mutual synchronization, hashing or generation of pseudo-random numbers. Another idea is the ability of a neural network to separate space in non-linear pieces using "bias". It gives different probabilities of activating the neural network or not. This is very useful in the case of Cryptanalysis. Two names are used to design the same domain of research: Neuro-Cryptography and Neural Cryptography. The first work that it is known on this topic can be traced back to 1995 in an IT Master Thesis. == Applications == In 1995, Sebastien Dourlens applied neural networks to cryptanalyze DES by allowing the networks to learn how to invert the S-tables of the DES. The bias in DES studied through Differential Cryptanalysis by Adi Shamir is highlighted. The experiment shows about 50% of the key bits can be found, allowing the complete key to be found in a short time. Hardware application with multi micro-controllers have been proposed due to the easy implementation of multilayer neural networks in hardware. One example of a public-key protocol is given by Khalil Shihab . He describes the decryption scheme and the public key creation that are based on a backpropagation neural network. The encryption scheme and the private key creation process are based on Boolean algebra. This technique has the advantage of small time and memory complexities. A disadvantage is the property of backpropagation algorithms: because of huge training sets, the learning phase of a neural network is very long. Therefore, the use of this protocol is only theoretical so far. == Neural key exchange protocol == The most used protocol for key exchange between two parties A and B in the practice is Diffie–Hellman key exchange protocol. Neural key exchange, which is based on the synchronization of two tree parity machines, should be a secure replacement for this method. Synchronizing these two machines is similar to synchronizing two chaotic oscillators in chaos communications. === Tree parity machine === The tree parity machine is a special type of multi-layer feedforward neural network. It consists of one output neuron, K hidden neurons and K×N input neurons. Inputs to the network take three values: x i j ∈ { − 1 , 0 , + 1 } {\displaystyle x_{ij}\in \left\{-1,0,+1\right\}} The weights between input and hidden neurons take the values: w i j ∈ { − L , . . . , 0 , . . . , + L } {\displaystyle w_{ij}\in \left\{-L,...,0,...,+L\right\}} Output value of each hidden neuron is calculated as a sum of all multiplications of input neurons and these weights: σ i = sgn ⁡ ( ∑ j = 1 N w i j x i j ) {\displaystyle \sigma _{i}=\operatorname {sgn}(\sum _{j=1}^{N}w_{ij}x_{ij})} Signum is a simple function, which returns −1,0 or 1: sgn ⁡ ( x ) = { − 1 if x < 0 , 0 if x = 0 , 1 if x > 0. {\displaystyle \operatorname {sgn}(x)={\begin{cases}-1&{\text{if }}x<0,\\0&{\text{if }}x=0,\\1&{\text{if }}x>0.\end{cases}}} If the scalar product is 0, the output of the hidden neuron is mapped to −1 in order to ensure a binary output value. The output of neural network is then computed as the multiplication of all values produced by hidden elements: τ = ∏ i = 1 K σ i {\displaystyle \tau =\prod _{i=1}^{K}\sigma _{i}} Output of the tree parity machine is binary. === Protocol === Each party (A and B) uses its own tree parity machine. Synchronization of the tree parity machines is achieved in these steps Initialize random weight values Execute these steps until the full synchronization is achieved Generate random input vector X Compute the values of the hidden neurons Compute the value of the output neuron Compare the values of both tree parity machines Outputs are the same: one of the suitable learning rules is applied to the weights Outputs are different: go to 2.1 After the full synchronization is achieved (the weights wij of both tree parity machines are same), A and B can use their weights as keys. This method is known as a bidirectional learning. One of the following learning rules can be used for the synchronization: Hebbian learning rule: w i + = g ( w i + σ i x i Θ ( σ i τ ) Θ ( τ A τ B ) ) {\displaystyle w_{i}^{+}=g(w_{i}+\sigma _{i}x_{i}\Theta (\sigma _{i}\tau )\Theta (\tau ^{A}\tau ^{B}))} Anti-Hebbian learning rule: w i + = g ( w i − σ i x i Θ ( σ i τ ) Θ ( τ A τ B ) ) {\displaystyle w_{i}^{+}=g(w_{i}-\sigma _{i}x_{i}\Theta (\sigma _{i}\tau )\Theta (\tau ^{A}\tau ^{B}))} Random walk: w i + = g ( w i + x i Θ ( σ i τ ) Θ ( τ A τ B ) ) {\displaystyle w_{i}^{+}=g(w_{i}+x_{i}\Theta (\sigma _{i}\tau )\Theta (\tau ^{A}\tau ^{B}))} Where: Θ ( a , b ) = 0 {\displaystyle \Theta (a,b)=0} if a ≠ b {\displaystyle a\neq b} otherwise Θ ( a , b ) = 1 {\displaystyle \Theta (a,b)=1} And: g ( x ) {\displaystyle g(x)} is a function that keeps the w i {\displaystyle w_{i}} in the range { − L , − L + 1 , . . . , 0 , . . . , L − 1 , L } {\displaystyle \{-L,-L+1,...,0,...,L-1,L\}} === Attacks and security of this protocol === In every attack it is considered, that the attacker E can eavesdrop messages between the parties A and B, but does not have an opportunity to change them. ==== Brute force ==== To provide a brute force attack, an attacker has to test all possible keys (all possible values of weights wij). By K hidden neurons, K×N input neurons and boundary of weights L, this gives (2L+1)KN possibilities. For example, the configuration K = 3, L = 3 and N = 100 gives us 310253 key possibilities, making the attack impossible with today's computer power. ==== Learning with own tree parity machine ==== One of the basic attacks can be provided by an attacker, who owns the same tree parity machine as the parties A and B. He wants to synchronize his tree parity machine with these two parties. In each step there are three situations possible: Output(A) ≠ Output(B): None of the parties updates its weights. Output(A) = Output(B) = Output(E): All the three parties update weights in their tree parity machines. Output(A) = Output(B) ≠ Output(E): Parties A and B update their tree parity machines, but the attacker can not do that. Because of this situation his learning is slower than the synchronization of parties A and B. It has been proven, that the synchronization of two parties is faster than learning of an attacker. It can be improved by increasing of the synaptic depth L of the neural network. That gives this protocol enough security and an attacker can find out the key only with small probability. ==== Other attacks ==== For conventional cryptographic systems, we can improve the security of the protocol by increasing of the key length. In the case of neural cryptography, we improve it by increasing of the synaptic depth L of the neural networks. Changing this parameter increases the cost of a successful attack exponentially, while the effort for the users grows polynomially. Therefore, breaking the security of neural key exchange belongs to the complexity class NP. Alexander Klimov, Anton Mityaguine, and Adi Shamir say that the original neural synchronization scheme can be broken by at least three different attacks—geometric, probabilistic analysis, and using genetic algorithms. Even though this particular implementation is insecure, the ideas behind chaotic synchronization could potentially lead to a secure implementation. === Permutation parity machine === The permutation parity machine is a binary variant of the tree parity machine. It consists of one input layer, one hidden layer and one output layer. The number of neurons in the output layer depends on the number of hidden units K. Each hidden neuron has N binary input neurons: x i j ∈ { 0 , 1 } {\displaystyle x_{ij}\in \left\{0,1\right\}} The weights between input and hidden neurons are also binary: w i j ∈ { 0 , 1 } {\displaystyle w_{ij}\in \left\{0,1\right\}} Output value of each hidden neuron is calculated as a sum of all exclusive disjunctions (exclusive or) of input neurons and these weights: σ i = θ N ( ∑ j = 1 N w i j ⊕ x i j ) {\displaystyle \sigma _{i}=\theta _{N}(\sum _{j=1}^{N}w_{ij}\oplus x_{ij})} (⊕ means XOR). Th

    Read more →
  • Mathematics of neural networks in machine learning

    Mathematics of neural networks in machine learning

    An artificial neural network (ANN) or neural network combines biological principles with advanced statistics to solve problems in domains such as pattern recognition and game-play. ANNs adopt the basic model of neuron analogues connected to each other in a variety of ways. == Structure == === Neuron === A neuron with label j {\displaystyle j} receiving an input p j ( t ) {\displaystyle p_{j}(t)} from predecessor neurons consists of the following components: an activation a j ( t ) {\displaystyle a_{j}(t)} , the neuron's state, depending on a discrete time parameter, an optional threshold θ j {\displaystyle \theta _{j}} , which stays fixed unless changed by learning, an activation function f {\displaystyle f} that computes the new activation at a given time t + 1 {\displaystyle t+1} from a j ( t ) {\displaystyle a_{j}(t)} , θ j {\displaystyle \theta _{j}} and the net input p j ( t ) {\displaystyle p_{j}(t)} giving rise to the relation a j ( t + 1 ) = f ( a j ( t ) , p j ( t ) , θ j ) , {\displaystyle a_{j}(t+1)=f(a_{j}(t),p_{j}(t),\theta _{j}),} and an output function f out {\displaystyle f_{\text{out}}} computing the output from the activation o j ( t ) = f out ( a j ( t ) ) . {\displaystyle o_{j}(t)=f_{\text{out}}(a_{j}(t)).} Often the output function is simply the identity function. An input neuron has no predecessor but serves as input interface for the whole network. Similarly an output neuron has no successor and thus serves as output interface of the whole network. === Propagation function === The propagation function computes the input p j ( t ) {\displaystyle p_{j}(t)} to the neuron j {\displaystyle j} from the outputs o i ( t ) {\displaystyle o_{i}(t)} and typically has the form p j ( t ) = ∑ i o i ( t ) w i j . {\displaystyle p_{j}(t)=\sum _{i}o_{i}(t)w_{ij}.} === Bias === A bias term can be added, changing the form to the following: p j ( t ) = ∑ i o i ( t ) w i j + w 0 j , {\displaystyle p_{j}(t)=\sum _{i}o_{i}(t)w_{ij}+w_{0j},} where w 0 j {\displaystyle w_{0j}} is a bias. == Neural networks as functions == Neural network models can be viewed as defining a function that takes an input (observation) and produces an output (decision) f : X → Y {\displaystyle \textstyle f:X\rightarrow Y} or a distribution over X {\displaystyle \textstyle X} or both X {\displaystyle \textstyle X} and Y {\displaystyle \textstyle Y} . Sometimes models are intimately associated with a particular learning rule. A common use of the phrase "ANN model" is really the definition of a class of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons, number of layers or their connectivity). Mathematically, a neuron's network function f ( x ) {\displaystyle \textstyle f(x)} is defined as a composition of other functions g i ( x ) {\displaystyle \textstyle g_{i}(x)} , that can further be decomposed into other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between functions. A widely used type of composition is the nonlinear weighted sum, where f ( x ) = K ( ∑ i w i g i ( x ) ) {\displaystyle \textstyle f(x)=K\left(\sum _{i}w_{i}g_{i}(x)\right)} , where K {\displaystyle \textstyle K} (commonly referred to as the activation function) is some predefined function, such as the hyperbolic tangent, sigmoid function, softmax function, or rectifier function. The important characteristic of the activation function is that it provides a smooth transition as input values change, i.e. a small change in input produces a small change in output. The following refers to a collection of functions g i {\displaystyle \textstyle g_{i}} as a vector g = ( g 1 , g 2 , … , g n ) {\displaystyle \textstyle g=(g_{1},g_{2},\ldots ,g_{n})} . This figure depicts such a decomposition of f {\displaystyle \textstyle f} , with dependencies between variables indicated by arrows. These can be interpreted in two ways. The first view is the functional view: the input x {\displaystyle \textstyle x} is transformed into a 3-dimensional vector h {\displaystyle \textstyle h} , which is then transformed into a 2-dimensional vector g {\displaystyle \textstyle g} , which is finally transformed into f {\displaystyle \textstyle f} . This view is most commonly encountered in the context of optimization. The second view is the probabilistic view: the random variable F = f ( G ) {\displaystyle \textstyle F=f(G)} depends upon the random variable G = g ( H ) {\displaystyle \textstyle G=g(H)} , which depends upon H = h ( X ) {\displaystyle \textstyle H=h(X)} , which depends upon the random variable X {\displaystyle \textstyle X} . This view is most commonly encountered in the context of graphical models. The two views are largely equivalent. In either case, for this particular architecture, the components of individual layers are independent of each other (e.g., the components of g {\displaystyle \textstyle g} are independent of each other given their input h {\displaystyle \textstyle h} ). This naturally enables a degree of parallelism in the implementation. Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph. Networks with cycles are commonly called recurrent. Such networks are commonly depicted in the manner shown at the top of the figure, where f {\displaystyle \textstyle f} is shown as dependent upon itself. However, an implied temporal dependence is not shown. == Backpropagation == Backpropagation training algorithms fall into three categories: steepest descent (with variable learning rate and momentum, resilient backpropagation); quasi-Newton (Broyden–Fletcher–Goldfarb–Shanno, one step secant); Levenberg–Marquardt and conjugate gradient (Fletcher–Reeves update, Polak–Ribiére update, Powell–Beale restart, scaled conjugate gradient). === Algorithm === Let N {\displaystyle N} be a network with e {\displaystyle e} connections, m {\displaystyle m} inputs and n {\displaystyle n} outputs. Below, x 1 , x 2 , … {\displaystyle x_{1},x_{2},\dots } denote vectors in R m {\displaystyle \mathbb {R} ^{m}} , y 1 , y 2 , … {\displaystyle y_{1},y_{2},\dots } vectors in R n {\displaystyle \mathbb {R} ^{n}} , and w 0 , w 1 , w 2 , … {\displaystyle w_{0},w_{1},w_{2},\ldots } vectors in R e {\displaystyle \mathbb {R} ^{e}} . These are called inputs, outputs and weights, respectively. The network corresponds to a function y = f N ( w , x ) {\displaystyle y=f_{N}(w,x)} which, given a weight w {\displaystyle w} , maps an input x {\displaystyle x} to an output y {\displaystyle y} . In supervised learning, a sequence of training examples ( x 1 , y 1 ) , … , ( x p , y p ) {\displaystyle (x_{1},y_{1}),\dots ,(x_{p},y_{p})} produces a sequence of weights w 0 , w 1 , … , w p {\displaystyle w_{0},w_{1},\dots ,w_{p}} starting from some initial weight w 0 {\displaystyle w_{0}} , usually chosen at random. These weights are computed in turn: first compute w i {\displaystyle w_{i}} using only ( x i , y i , w i − 1 ) {\displaystyle (x_{i},y_{i},w_{i-1})} for i = 1 , … , p {\displaystyle i=1,\dots ,p} . The output of the algorithm is then w p {\displaystyle w_{p}} , giving a new function x ↦ f N ( w p , x ) {\displaystyle x\mapsto f_{N}(w_{p},x)} . The computation is the same in each step, hence only the case i = 1 {\displaystyle i=1} is described. w 1 {\displaystyle w_{1}} is calculated from ( x 1 , y 1 , w 0 ) {\displaystyle (x_{1},y_{1},w_{0})} by considering a variable weight w {\displaystyle w} and applying gradient descent to the function w ↦ E ( f N ( w , x 1 ) , y 1 ) {\displaystyle w\mapsto E(f_{N}(w,x_{1}),y_{1})} to find a local minimum, starting at w = w 0 {\displaystyle w=w_{0}} . This makes w 1 {\displaystyle w_{1}} the minimizing weight found by gradient descent. == Learning pseudocode == To implement the algorithm above, explicit formulas are required for the gradient of the function w ↦ E ( f N ( w , x ) , y ) {\displaystyle w\mapsto E(f_{N}(w,x),y)} where the function is E ( y , y ′ ) = | y − y ′ | 2 {\displaystyle E(y,y')=|y-y'|^{2}} . The learning algorithm can be divided into two phases: propagation and weight update. === Propagation === Propagation involves the following steps: Propagation forward through the network to generate the output value(s) Calculation of the cost (error term) Propagation of the output activations back through the network using the training pattern target to generate the deltas (the difference between the targeted and actual output values) of all output and hidden neurons. === Weight update === For each weight: Multiply the weight's output delta and input activation to find the gradient of the weight. Subtract the ratio (percentage) of the weight's gradient from the weight. The learning rate is the ratio (percentage) that influences the speed and quality of learning. The greater the ratio, the faster the neuron trains, but the lower the ratio, the more accurat

    Read more →