AI Content Update Google

AI Content Update Google — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • AI anthropomorphism

    AI anthropomorphism

    AI anthropomorphism is the attribution of human-like feelings, mental states, and behavioral characteristics to artificial intelligence systems. Factors related to the user of the AI – such as culture, age, education, gender, and personality traits – are also important determinants of the strength of anthropomorphic effects. Since the earliest days of AI development, humans have interpreted machine outputs through anthropomorphic frameworks, but the recent emergence of generative AI has amplified these tendencies. In research and engineering, there is a distinction between anthropomorphism and anthropomorphic design. The former is an innate human tendency toward non-human entities. The latter is the scientific community effort to “design anthropomorphism”. Such a design can involve the manipulation of cues, including AI appearance, behaviour and language. Contemporary AI systems today can generate extremely human-like outputs and are often designed specifically to do so, meaning that their anthropomorphic effects can be especially powerful. In some cases, anthropomorphism is accompanied with explicit beliefs that AI systems are capable of empathy, goodwill, understanding, or consciousness. == Background == === In early AIs === Views of artificial agents possessing a human-like intelligence have existed since the early development of computers in the mid-1900s. The use of the human mind as a metaphor for understanding the workings of machine systems was prevalent among researchers in the early days of computer science, with multiple influential works widely distributing the idea of intelligent machines. Among the most widely cited papers of this period was Alan Turing's "Computing Machinery and Intelligence" in which he introduced the Turing Test, stating that a machine was intelligent if it could produce conversation that was indistinguishable from that of a human. These academic works in the 1940s and 1950s gave early credibility to the idea that machine workings could be thought of similarly to human minds. The public quickly came to view artificial systems similarly, with often exaggerated conceptions of the capabilities of early machines. Among the most well-known demonstrations of this was through the chatbot ELIZA designed by Joseph Weizenbaum in 1966. ELIZA responded to user inputs with a rudimentary text-processing approach that could not be considered anything resembling true understanding of the inputs, yet users, even when operating with full conscious knowledge of ELIZA's limitations, often began to ascribe motivation and understanding to the program's output. Weizenbaum later wrote, "I had not realized ... that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people." Comparisons between the intellectual capabilities of artificial intelligence and human intelligence were continually intensified by the attempts of computer scientists to develop machines that could perform human tasks at a level equal to or better than humans. A symbolic turning point was achieved in 1997, when IBM's chess supercomputer Deep Blue defeated then-world champion Garry Kasparov in a highly publicized six-game match. The defeat of a human by a machine for the first time in chess – a game viewed as a canonical example of human intellect – and the media attention surrounding the match led to a significant shift, where views of parallels between human and artificial intelligence moved from abstract speculation to being concretely demonstrated. A similar achievement was reached in the board game Go in 2017, when the program AlphaGo defeated world top-ranked Ke Jie. === Large language models === The AI boom of the 2020s brought about the widespread emergence of generative AI; in particular, chatbots such as ChatGPT, Gemini, and Claude based on large language models (LLMs) have become increasingly pervasive in everyday society. These systems are notable for the fact that they are able to respond to a wide range of prompts across contexts while producing strikingly human-like outputs – research has shown that humans are often unable to distinguish human-generated text from AI-generated text, and modern AI chatbots have formally been shown to pass the Turing test. As such, the anthropomorphic effects of AI are more powerful than ever. Given that LLMs have brought AI into the technological mainstream, considerable scientific effort has been devoted in recent years to understand existing and potential ramifications of AI in the public sphere; the prevalence and effects of anthropomorphism is one of those domains where much of this effort has been directed. == Current anthropomorphic attributions == === In the general public === Surveys have shown that a substantial portion of the public attributes human-like qualities to AI. In one sample of U.S. adults from 2024, two-thirds of people believed that ChatGPT is possibly conscious on some level, though other research has shown that the public still views the likelihood itself of AI consciousness as comparatively low. Another study conducted in 2025 found that women, people of color, and older individuals were most likely to anthropomorphize AI, as well as that – in general – humans view AIs as warm and competent, and anthropomorphic attributions to AI had increased by 34% in the past year. A YouGov poll reported that 46% of Americans believe that people should display politeness to AI chatbots by saying "please" and "thank you", demonstrating the application of social norms to AI. These beliefs extend to behavior, where majorities of AI users claim to always be polite to chatbots; of those who behave politely, most say they do so simply because it is the "nice" thing to do. In many recent cases, humans have developed robust interpersonal bonds with AI systems. For example: users of social chatbots like Replika and Character.ai have been documented to fall in love with the AIs, or to otherwise treat the AIs as intimate companions, and it has become increasingly common for individuals to use LLMs like ChatGPT as therapists. Chatbots are able to produce responses deeply attuned to users, as they are often designed to maximize agreeableness and mirror users' emotions; this can create compelling illusions of intimacy. === In the research community === In many cases, even AI researchers anthropomorphize AI systems in some capacity. Among the most extreme and well-publicized of these instances occurred in 2022, when engineer Blake Lemoine publicly claimed that Google's LLM LaMDA was conscious. Lemoine published the transcript of a conversation he had had with LaMDA regarding self identity and morality which he claimed was evidence of its sentience; he asserted that LaMDA was "a person" as defined by the United States Constitution and compared its mental capability to that of a 7- or 8-year-old. Lemoine's claims were widely dismissed by the scientific community and by Google itself, which described Lemoine's conclusions as "wholly unfounded" and fired him on the grounds that he had violated policies "to safeguard product information". It is much more common that AI researchers unintentionally imply humanness of AI through the ordinary use of anthropomorphic language to describe nonhuman agents. This kind of language, which Daniel Dennett coined the "intentional stance", is very common in everyday life in a variety of different contexts (e.g., "My computer doesn't want to turn on today"). For AI agents that may actually appear to very closely replicate some human abilities, however, the casual use of such anthropomorphic language in research has been scrutinized for being potentially misleading to the public. As early as 1976, Drew McDermott criticized the research community for the use of "wishful mnemonics", where AIs were referred to with terms like "understand" and "learn". In the LLM era, these criticisms have further intensified, with the negative effects of AI anthropomorphism in the public posing an especially salient danger given the elevated accessibility of modern AI. In some cases, the use of anthropomorphic language for AI is not unintentional, but is willfully used by researchers in order to promote better understanding of the brain – the idea being that, as AI can be functionally similar in some ways to the human brain, we may gain new insights and ideas from treating AI as a kind of model of the brain's workings. In particular, deep neuronal networks (DNNs) are often explicitly compared to the human brain, and significant advances in DNN research have stirred considerable enthusiasm about the ability of AI to emulate the human abilities. Caution has been urged in this domain as well, however; the use of anthropomorphic language can mask important differences that fundamentally distinguish AI from human intelligence. When it comes to DNNs, for example, it has been pointed out that they are still structurally quite different

    Read more →
  • Computational intelligence

    Computational intelligence

    In computer science, computational intelligence (CI) refers to concepts, paradigms, algorithms and implementations of systems that are designed to show "intelligent" behavior in complex and changing environments. These systems are aimed at mastering complex tasks in a wide variety of technical or commercial areas and offer solutions that recognize and interpret patterns, control processes, support decision-making or autonomously manoeuvre vehicles or robots in unknown environments, among other things. These concepts and paradigms are characterized by the ability to learn or adapt to new situations, to generalize, to abstract, to discover and associate. Nature-analog or nature-inspired methods play a key role in this. CI approaches primarily address those complex real-world problems for which traditional or mathematical modeling is not appropriate for various reasons: the processes cannot be described exactly with complete knowledge, the processes are too complex for mathematical reasoning, they contain some uncertainties during the process, such as unforeseen changes in the environment or in the process itself, or the processes are simply stochastic in nature. Thus, CI techniques are properly aimed at processes that are ill-defined, complex, nonlinear, time-varying and/or stochastic. A recent definition of the IEEE Computational Intelligence Societey describes CI as the theory, design, application and development of biologically and linguistically motivated computational paradigms. Traditionally the three main pillars of CI have been Neural Networks, Fuzzy Systems and Evolutionary Computation. ... CI is an evolving field and at present in addition to the three main constituents, it encompasses computing paradigms like ambient intelligence, artificial life, cultural learning, artificial endocrine networks, social reasoning, and artificial hormone networks. ... Over the last few years there has been an explosion of research on Deep Learning, in particular deep convolutional neural networks. Nowadays, deep learning has become the core method for artificial intelligence. In fact, some of the most successful AI systems are based on CI. However, as CI is an emerging and developing field there is no final definition of CI, especially in terms of the list of concepts and paradigms that belong to it. The general requirements for the development of an “intelligent system” are ultimately always the same, namely the simulation of intelligent thinking and action in a specific area of application. To do this, the knowledge about this area must be represented in a model so that it can be processed. The quality of the resulting system depends largely on how well the model was chosen in the development process. Sometimes data-driven methods are suitable for finding a good model and sometimes logic-based knowledge representations deliver better results. Hybrid models are usually used in real applications. According to actual textbooks, the following methods and paradigms, which largely complement each other, can be regarded as parts of CI: Fuzzy systems Neural networks and, in particular, convolutional neural networks Evolutionary computation and, in particular, multi-objective evolutionary optimization Swarm intelligence Bayesian networks Artificial immune systems Learning theory Probabilistic methods == Relationship between hard and soft computing and artificial and computational intelligence == Artificial intelligence (AI) is used in the media, but also by some of the scientists involved, as a kind of umbrella term for the various techniques associated with it or with CI. Craenen and Eiben state that attempts to define or at least describe CI can usually be assigned to one or more of the following groups: "Relative definition” comparing CI to AI Conceptual treatment of key notions and their roles in CI Listing of the (established) areas that belong to it The relationship between CI and AI has been a frequently discussed topic during the development of CI. While the above list implies that they are synonyms, the vast majority of AI/CI researchers working on the subject consider them to be distinct fields, where either CI is an alternative to AI AI includes CI CI includes AI The view of the first of the above three points goes back to Zadeh, the founder of the fuzzy set theory, who differentiated machine intelligence into hard and soft computing techniques, which are used in artificial intelligence on the one hand and computational intelligence on the other. In hard computing (HC) and traditional AI (e.g. expert systems), inaccuracy and uncertainty are undesirable characteristics of a system, while soft computing (SC) and thus CI focus on dealing with these characteristics. The adjacent figure illustrates this view and lists the most important CI techniques. Another frequently mentioned distinguishing feature is the representation of information in symbolic form in AI and in sub-symbolic form in CI techniques. Hard computing is a conventional computing method based on the principles of certainty and accuracy and it is deterministic. It requires a precisely stated analytical model of the task to be processed and a prewritten program, i.e. a fixed set of instructions. The models used are based on Boolean logic (also called crisp logic), where e.g. an element can be either a member of a set or not and there is nothing in between. When applied to real-world tasks, systems based on HC result in specific control actions defined by a mathematical model or algorithm. If an unforeseen situation occurs that is not included in the model or algorithm used, the action will most likely fail. Soft computing, on the other hand, is based on the fact that the human mind is capable of storing information and processing it in a goal-oriented way, even if it is imprecise and lacks certainty. SC is based on the model of the human brain with probabilistic thinking, fuzzy logic and multi-valued logic. Soft computing can process a wealth of data and perform a large number of computations, which may not be exact, in parallel. For hard problems for which no satisfying exact solutions based on HC are available, SC methods can be applied successfully. SC methods are usually stochastic in nature i.e., they are a randomly defined processes that can be analyzed statistically but not with precision. Up to now, the results of some CI methods, such as deep learning, cannot be verified and it is also not clear what they are based on. This problem represents an important scientific issue for the future. AI and CI are catchy terms, but they are also so similar that they can be confused. The meaning of both terms has developed and changed over a long period of time, with AI being used first. Bezdek describes this impressively and concludes that such buzzwords are frequently used and hyped by the scientific community, science management and (science) journalism. Not least because AI and biological intelligence are emotionally charged terms and it is still difficult to find a generally accepted definition for the basic term intelligence. == History == In 1950, Alan Turing, one of the founding fathers of computer science, developed a test for computer intelligence known as the Turing test. In this test, a person can ask questions via a keyboard and a monitor without knowing whether his counterpart is a human or a computer. A computer is considered intelligent if the interrogator cannot distinguish the computer from a human. This illustrates the discussion about intelligent computers at the beginning of the computer age. The term Computational Intelligence was first used as the title of the journal of the same name in 1985 and later by the IEEE Neural Networks Council (NNC), which was founded 1989 by a group of researchers interested in the development of biological and artificial neural networks. On November 21, 2001, the NNC became the IEEE Neural Networks Society, to become the IEEE Computational Intelligence Society two years later by including new areas of interest such as fuzzy systems and evolutionary computation. The NNC helped organize the first IEEE World Congress on Computational Intelligence in Orlando, Florida in 1994. On this conference the first clear definition of Computational Intelligence was introduced by Bezdek: A system is computationally intelligent when it: deals with only numerical (low-level) data, has pattern-recognition components, does not use knowledge in the AI sense; and additionally when it (begins to) exhibit (1) computational adaptivity; (2) computational fault tolerance; (3) speed approaching human-like turnaround and (4) error rates that approximate human performance. Today, with machine learning and deep learning in particular utilizing a breadth of supervised, unsupervised, and reinforcement learning approaches, the CI landscape has been greatly enhanced, with novell intelligent approaches. == The main algorithmic approaches of CI and their applicati

    Read more →
  • Hierarchical Risk Parity

    Hierarchical Risk Parity

    Hierarchical Risk Parity (HRP) is an advanced investment portfolio optimization framework developed in 2016 by Marcos López de Prado at Guggenheim Partners and Cornell University. HRP is a probabilistic graph-based alternative to the prevailing mean-variance optimization (MVO) framework developed by Harry Markowitz in 1952, and for which he received the Nobel Prize in economic sciences. HRP algorithms apply discrete mathematics and machine learning techniques to create diversified and robust investment portfolios that outperform MVO methods out-of-sample. HRP aims to address the limitations of traditional portfolio construction methods, particularly when dealing with highly correlated assets. Following its publication, HRP has been implemented in numerous open-source libraries, and received multiple extensions. == Key features == HRP portfolios have been proposed as a robust alternative to traditional quadratic optimization methods, including the Critical Line Algorithm (CLA) of Markowitz. HRP addresses three central issues commonly associated with quadratic optimizers: numerical instability, excessive concentration in a small number of assets, and poor out-of-sample performance. HRP leverages techniques from graph theory and machine learning to construct diversified portfolios using only the information embedded in the covariance matrix. Unlike quadratic programming methods, HRP does not require the covariance matrix to be invertible. Consequently, HRP remains applicable even in cases where the covariance matrix is ill-conditioned or singular—conditions under which standard optimizers fail. Monte Carlo simulations indicate that HRP achieves lower out-of-sample variance than CLA, despite the fact that minimizing variance is the explicit optimization objective of CLA. Furthermore, HRP portfolios exhibit lower realized risk compared to those generated by traditional risk parity methodologies. Empirical backtests have demonstrated that HRP would have historically outperformed conventional portfolio construction techniques. Algorithms within the HRP framework are characterized by the following features: Machine Learning Approach: HRP employs hierarchical clustering, a machine learning technique, to group similar assets based on their correlations. This allows the algorithm to identify the underlying hierarchical structure of the portfolio, and avoid that errors spread through the entire network. Risk-Based Allocation: The algorithm allocates capital based on risk, ensuring that assets only compete with similar assets for representation in the portfolio. This approach leads to better diversification across different risk sources, while avoiding the instability associated with noisy returns estimates. Covariance Matrix Handling: Unlike traditional methods like Mean-Variance Optimization, HRP does not require inverting the covariance matrix. This makes it more stable and applicable to portfolios with a large number of assets, particularly when the covariance matrix's condition number is high. == The problem: Markowitz's Curse == Portfolio construction is perhaps the most recurrent financial problem. On a daily basis, investment managers must build portfolios that incorporate their views and forecasts on risks and returns. Despite the theoretical elegance of Markowitz's mean-variance framework, its practical implementation is hindered by several limitations that undermine the reliability of solutions derived from the Critical Line Algorithm (CLA). A principal concern is the high sensitivity of optimal portfolios to small perturbations in expected returns: even minor forecasting errors can result in significantly different allocations (Michaud, 1998). Given the inherent difficulty of producing accurate return forecasts, numerous researchers have advocated for approaches that forgo expected returns entirely and instead rely solely on the covariance structure of asset returns. This has given rise to risk-based allocation methods, among which risk parity is a widely cited example (Jurczenko, 2015). While eliminating return forecasts mitigates some instability, it does not eliminate it. Quadratic programming techniques employed in portfolio optimization require the inversion of a positive-definite covariance matrix, meaning all eigenvalues must be strictly positive. When the matrix is numerically ill-conditioned—that is, when the ratio of its largest to smallest eigenvalue (its condition number) is large—matrix inversion becomes unreliable and prone to significant numerical errors (Bailey and López de Prado, 2012). The condition number of a covariance, correlation, or any symmetric (and thus diagonalizable) matrix is defined as the absolute value of the ratio between its largest and smallest eigenvalues in modulus. The figure on the right presents the sorted eigenvalues of several correlation matrices; the condition number is represented by the ratio of the first to last eigenvalues in each sequence. A diagonal correlation matrix, which is equal to its own inverse, exhibits the minimum possible condition number. As the number of correlated (or multicollinear) assets in a portfolio increases, the condition number rises. At high levels, this leads to severe numerical instability, whereby slight modifications in any matrix entry may result in drastically different inverses. This phenomenon, often referred to as Markowitz’s curse, encapsulates the paradox wherein increased correlation among assets heightens the theoretical need for diversification, yet simultaneously increases the likelihood of unstable optimization outcomes. Consequently, the potential benefits of diversification are frequently overshadowed by estimation errors. These problems are exacerbated as the dimensionality of the covariance matrix increases. The estimation of each covariance term consumes degrees of freedom, and in general, a minimum of 1 2 N ( N + 1 ) {\displaystyle {\frac {1}{2}}N(N+1)} independent and identically distributed (IID) observations is required to estimate a non-singular covariance matrix of dimension N {\displaystyle N} . For example, constructing an invertible covariance matrix of dimension 50 necessitates at least five years of daily IID observations. However, empirical evidence suggests that the correlation structure of financial assets is highly unstable over such extended periods. These difficulties are highlighted by the observation that even naïve allocation strategies—such as equally weighted portfolios—have frequently outperformed both mean-variance and risk-based optimizations in out-of-sample tests (De Miguel et al., 2009). == The solution: Hierarchical Risk Parity == The HRP algorithm addresses Markowitz's curse in three steps: Hierarchical Clustering: Assets are grouped into clusters based on their correlations, forming a hierarchical tree structure. Quasi-Diagonalization: The correlation matrix is reordered based on the clustering results, revealing a block diagonal structure. Recursive Bisection: Weights are assigned to assets through a top-down approach, splitting the portfolio into smaller sub-portfolios and allocating capital based on inverse variance. === Step 1: Hierarchical clustering === Given a T × N {\displaystyle T\times N} matrix of asset returns X {\displaystyle X} , where each column represents a time series of returns for one of N {\displaystyle N} assets over T {\displaystyle T} time periods, a hierarchical clustering process can be used to construct a tree-based representation of asset relationships. First, we compute the N × N {\displaystyle N\times N} correlation matrix ρ = ρ i , j i , j = 1 . . . N {\displaystyle \rho ={\rho _{i,j}}\;{i,j=1\;...\;N}} , where ρ i , j = c o r r ( X i , X j ) {\displaystyle \rho _{i,j}=\mathrm {corr} (X_{i},X_{j})} . From this, a pairwise distance matrix D = d i , j {\displaystyle D={d_{i,j}}} is defined using the transformation: d i , j = 1 2 ( 1 − ρ i , j ) {\displaystyle d_{i,j}={\sqrt {{\frac {1}{2}}(1-\rho _{i,j})}}} This distance function defines a proper metric space, satisfying non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. Next, a secondary distance matrix D ~ = d ~ i , j {\displaystyle {\tilde {D}}={{\tilde {d}}_{i,j}}} is computed, where each entry measures the Euclidean distance between the distance profiles of two assets: d ~ i , j = ∑ n = 1 N ( d n , i − d n , j ) 2 {\displaystyle {\tilde {d}}_{i,j}={\sqrt {\sum _{n=1}^{N}(d_{n,i}-d_{n,j})^{2}}}} While d i , j {\displaystyle d_{i,j}} reflects correlation-based proximity between two assets, d ~ i , j {\displaystyle {\tilde {d}}_{i,j}} quantifies dissimilarity across the entire system, as it depends on all pairwise distances. Hierarchical clustering proceeds by identifying the pair ( i , j ) {\displaystyle (i,j)} with the smallest value of d ~ i , j {\displaystyle {\tilde {d}}_{i,j}} (for i ≠ j {\displaystyle i\neq j} ), and forming a new cluster u [ 1 ] = ( i , j ) {\displaystyle u[1]=(i,j)} .

    Read more →
  • Behavior informatics

    Behavior informatics

    Behavior informatics (BI) is the informatics of behaviors so as to obtain behavior intelligence and behavior insights. BI is a research method combining science and technology, specifically in the area of engineering. The purpose of BI includes analysis of current behaviors as well as the inference of future possible behaviors. This occurs through pattern recognition. Different from applied behavior analysis from the psychological perspective, BI builds computational theories, systems and tools to qualitatively and quantitatively model, represent, analyze, and manage behaviors of individuals, groups and/or organizations. BI is built on classic study of behavioral science, including behavior modeling, applied behavior analysis, behavior analysis, behavioral economics, and organizational behavior. Typical BI tasks consist of individual and group behavior formation, representation, computational modeling, analysis, learning, simulation, and understanding of behavior impact, utility, non-occurring behaviors, etc. for behavior intervention and management. The Behavior Informatics approach to data utilizes cognitive as well as behavioral data. By combining the data, BI has the potential to effectively illustrate the big picture when it comes to behavioral decisions and patterns. One of the goals of BI is also to be able to study human behavior while eliminating issues like self-report bias. This creates more reliable and valid information for research studies. == Behavior == From an Informatics perspective, a behavior consists of three key elements: actors (behavioral subjects and objects), operations (actions, activities) and interactions (relationships), and their properties. A behavior can be represented as a behavior vector, all behaviors of an actor or an actor group can be represented as behavior sequences and multi-dimensional behavior matrix. The following table explains some of the elements of behavior. Behavior Informatics takes into account behavior when analyzing business patterns and intelligence. The inclusion of behavior in these analyses provides prominent information on social and driving factors of patterns. == Applications == Behavior Informatics is being used in a variety of settings, including but not limited to health care management, telecommunications, marketing, and security. Behavior Informatics provides a manner in which to analyze and organize the many aspects that go into a person's health care needs and decisions. When it comes to business models, behavior informatics may be utilized for a similar role. Organizations implement behavior informatics to enhance business structure and regime, where it helps moderate ideal business decisions and situations.

    Read more →
  • Video renderer

    Video renderer

    A video renderer is software that processes a video file and sends it sequentially to the video display controller card for display on a computer screen. An example of a video renderer, is the VMR-7 that was used by Microsoft's DirectShow. An example of a UNIX video renderer is the one container within GStreamer. Commonly used video renderers are: Enhanced Video Renderer VMR9 Renderless Haali's Video Renderer Madvr Video Renderer JRVR, a part of JRiver Media Center

    Read more →
  • Pedagogical agent

    Pedagogical agent

    A pedagogical agent is a concept borrowed from computer science and artificial intelligence and applied to education, usually as part of an intelligent tutoring system (ITS). It is a simulated human-like interface between the learner and the content, in an educational environment. A pedagogical agent is designed to model the type of interactions between a student and another person. Mabanza and de Wet define it as "a character enacted by a computer that interacts with the user in a socially engaging manner". A pedagogical agent can be assigned different roles in the learning environment, such as tutor or co-learner, depending on the desired purpose of the agent. "A tutor agent plays the role of a teacher, while a co-learner agent plays the role of a learning companion". == History == The history of Pedagogical Agents is closely aligned with the history of computer animation. As computer animation progressed, it was adopted by educators to enhance computerized learning by including a lifelike interface between the program and the learner. The first versions of a pedagogical agent were more cartoon than person, like Microsoft's Clippy which helped users of Microsoft Office load and use the program's features in 1997. However, with developments in computer animation, pedagogical agents can now look lifelike. By 2006 there was a call to develop modular, reusable agents to decrease the time and expertise required to create a pedagogical agent. There was also a call in 2009 to enact agent standards. The standardization and re-usability of pedagogical agents is less of an issue since the decrease in cost and widespread availability of animation tools. Individualized pedagogical agents can be found across disciplines including medicine, math, law, language learning, automotive, and armed forces. They are used in applications directed to every age, from preschool to adult. == Learning theories related to pedagogical agent design == === Distributed cognition theory === Distributed cognition theory is the method in which cognition progresses in the context of collaboration with others. Pedagogical agents can be designed to assist the cognitive transfer to the learner, operating as artifacts or partners with collaborative role in learning. To support the performance of an action by the user, the pedagogical agent can act as a cognitive tool as long as the agent is equipped with the knowledge that the user lacks. The interactions between the user and the pedagogical agent can facilitate a social relationship. The pedagogical agent may fulfill the role of a working partner. === Socio-cultural learning theory === Socio-cultural learning theory is how the user develops when they are involved in learning activities in which there is interaction with other agents. A pedagogical agent can: intervene when the user requests, provide support for tasks that the user cannot address, and potentially extend the learners cognitive reach. Interaction with the pedagogical agent may elicit a variety of emotions from the learner. The learner may become excited, confused, frustrated, and/or discouraged. These emotions affect the learners' motivation. === Extraneous Cognitive Load === Extraneous cognitive load is the extra effort being exerted by an individual's working memory due to the way information is being presented. A pedagogical agent can increase the user's cognitive load by distracting them and becoming the focus of their attention, causing split attention between the instructional material and the agent. Agents can reduce the perceived cognitive load by providing narration and personalization that can also promote a user's interest and motivation. While research on the reduction of cognitive load from pedagogical agents is minimal, more studies have shown that agents do not increase it. == Effectiveness == It has been suggested by researchers that pedagogical agents may take on different roles in the learning environment. Examples of these roles are: supplanting, scaffolding, coaching, testing, or demonstrating or modelling a procedure. A pedagogical agent as a tutor has not been demonstrated to add any benefit to an educational strategy in equivalent lessons with and without a pedagogical agent. According to Richard Mayer, there is some support in research for pedagogical agent increasing learning, but only as a presenter of social cues. A co-learner pedagogical agent is believed to increase the student's self-efficacy. By pointing out important features of instructional content, a pedagogical agent can fulfill the signaling function, which research on multimedia learning has shown to enhance learning. Research has demonstrated that human-human interaction may not be completely replaced by pedagogical agents, but learners may prefer the agents to non-agent multimedia systems. This finding is supported by social agency theory. Much like the varying effectiveness of the pedagogical agent roles in the learning environment, agents that take into account the user's affect have had mixed results. Research has shown pedagogical agents that make use of the users’ affect have been found to increase user knowledge retention, motivation, and perceived self-efficacy. However, with such a broad range of modalities in affective expressions, it is often difficult to utilize them. Additionally, having agents detect a user's affective state with precision remains challenging, as displays of affect are different across individuals. == Design == === Attractiveness === The appearance of a pedagogical agent can be manipulated to meet the learning requirements. The attractiveness of a pedagogical agent can enhance student's learning when the users were the opposite gender of the pedagogical agent. Male students prefer a sexy appearance of a female pedagogical agents and dislike the sexy appearance of male agents. Female students were not attracted by the sexy appearance of either male or female pedagogical agents. === Affective Response === Pedagogical agents have reached a point where they can convey and elicit emotion, but also reason about and respond to it. These agents are often designed to elicit and respond to affective actions from users through various modalities such as speech, facial expressions, and body gestures. They respond to the affective state of the given user, and make use of these modalities using a wide array of sensors incorporated into the design of the agent. Specifically in education and training applications, pedagogical agents are often designed to increasingly recognize when users or learners exhibit frustration, boredom, confusion, and states of flow. The added recognition in these agents is a step toward making them more emotionally intelligent, comforting and motivating the users as they interact. === Digital Representation === The design of a pedagogical agent often begins with its digital representation, whether it will be 2D or 3D and static or animated. Several studies have developed pedagogical agents that were both static and animated, then evaluated the relative benefits. Similar to other design considerations, the improved learning from static or animated agents remains questionable. One study showed that the appearance of an agent portrayed using a static image can impact a user's recall, based on the visual appearance. Other research found results that suggest static agent images improve learning outcomes. However, several other studies found user's learned more when the pedagogical agent was animated rather than static. Recently a meta-analysis of such research found a negligible improvement in learning via pedagogical agents, suggesting more work needs to be done in the area to support any claims.

    Read more →
  • Outline of deep learning

    Outline of deep learning

    The following outline is provided as an overview of, and topical guide to, deep learning: Deep learning is a subfield of machine learning and artificial intelligence based on artificial neural networks with multiple processing layers. It emphasizes representation learning and is widely used in areas such as computer vision, natural language processing, speech recognition, recommender systems, robotics, and generative artificial intelligence. == Ways to categorize deep learning == A field of study A branch of artificial intelligence A subfield of machine learning A subfield of computer science A form of representation learning A class of methods based on artificial neural networks An approach used in computational statistics == History == === Precursors === Cybernetics Perceptron Connectionism Neocognitron Backpropagation === Milestones === LeNet Long short-term memory Deep belief network AlexNet Sequence to sequence learning Generative adversarial network Residual neural network Transformer BERT Generative pre-trained transformer Diffusion model === Related histories === History of artificial intelligence History of machine learning Timeline of machine learning == Core concepts == == Learning settings == Supervised learning Unsupervised learning Self-supervised learning Semi-supervised learning Reinforcement learning Transfer learning Multitask learning Multimodal learning Online machine learning Continual learning == Common tasks == Image classification Object detection Image segmentation Automatic speech recognition Neural machine translation Question answering Automatic summarization Text-to-image model Protein structure prediction == Architectures == === Feedforward and convolutional architectures === Feedforward neural network Multilayer perceptron Convolutional neural network Radial basis function network Residual neural network U-Net === Recurrent and sequence architectures === Recurrent neural network Long short-term memory Gated recurrent unit Sequence to sequence learning Recursive neural network === Representation-learning architectures === Autoencoder Denoising autoencoder Sparse autoencoder Variational autoencoder Restricted Boltzmann machine Deep belief network === Attention and transformer architectures === Attention (machine learning) Transformer BERT Generative pre-trained transformer Vision transformer === Generative and probabilistic architectures === Autoregressive model Diffusion model Energy-based model Generative adversarial network Mixture of experts === Graph and memory architectures === Graph neural network Graph convolutional network Siamese network Neural Turing machine Memory network Echo state network Capsule neural network == Neural network components and techniques == Artificial neuron Activation function Rectified linear unit Sigmoid function Softmax function Embedding Convolution Pooling layer Attention Batch normalization Layer normalization Residual connections == Training and optimization == Backpropagation Gradient descent Stochastic gradient descent Adam optimization Learning rate Loss function Cross-entropy Mean squared error Regularization Dropout Early stopping Batch normalization Data augmentation Transfer learning Knowledge distillation Ensemble learning Curriculum learning == Datasets and benchmarks == CIFAR-10 ImageNet MNIST database Common Objects in Context (COCO) General Language Understanding Evaluation (GLUE) benchmark LibriSpeech SQuAD == Applications == === Computer vision === Computer vision Facial recognition system Image classification Image segmentation Medical imaging Object detection Optical character recognition === Natural language processing === Automatic summarization Chatbot Information retrieval Large language model Natural language processing Neural machine translation Question answering Sentiment analysis === Speech and audio === Automatic speech recognition Music information retrieval Speaker recognition Speech synthesis === Science and medicine === Bioinformatics Computational biology Drug discovery Medical diagnosis Protein structure prediction === Robotics and control === Autonomous car Computer game bot Control theory Robotics === Recommendation, search, and forecasting === Anomaly detection Forecasting Fraud detection Recommender system Search engine === Generative artificial intelligence === Deepfake Generative artificial intelligence Large language model Speech synthesis Text-to-image model === Computer graphics and video games === Deep Learning Anti-Aliasing (DLAA) Deep Learning Super Sampling (DLSS) == Hardware == AMD Instinct AMD XDNA Application-specific integrated circuit Deep learning processor, Neural processing unit (NPU), or Neural Engine Field-programmable gate array General-purpose computing on graphics processing units (GPGPU) Graphics processing unit NVIDIA Deep Learning Accelerator (NVDLA) Tensor processing unit Vision processing unit Wafer-scale integration === Supporting software platforms === CUDA Metal ROCm == Software == === Open-source frameworks and libraries === === Neural network software === EDLUT Emergent Encog JOONE Neuroph NeuroSolutions OpenNN Peltarion Synapse SNNS === Platforms, tools, and deployment === Amazon SageMaker Google Colab Hugging Face Kaggle Kubeflow MLflow ONNX OpenVINO TensorFlow Hub == Algorithms for deep learning and neural networks == Backpropagation Conjugate gradient method Generalized Hebbian algorithm Gradient descent Levenberg–Marquardt algorithm Perceptron Quasi-Newton method Wake-sleep algorithm == Methods and related topics == === Representation and metric learning === Contrastive learning Embedding Feature learning Manifold learning Metric learning === Generative modeling === Autoregressive model Diffusion model Generative adversarial network Generative model Variational inference === Efficient and scalable deep learning === Knowledge distillation Low-rank approximation Mixture of experts Quantization Sparsity === Reliability, safety, and interpretability === Adversarial machine learning AI alignment Algorithmic bias Catastrophic forgetting Differential privacy Explainable artificial intelligence Federated learning Hallucination (artificial intelligence) == Conferences and workshops == Annual Meeting of the Association for Computational Linguistics Conference on Computer Vision and Pattern Recognition Conference on Neural Information Processing Systems International Conference on Computer Vision International Conference on Learning Representations International Conference on Machine Learning == Organizations == === Research laboratories and institutions === Allen Institute for AI Alberta Machine Intelligence Institute European Laboratory for Learning and Intelligent Systems Google DeepMind Meta AI Mila Microsoft Research Vector Institute === Companies === Anthropic Cerebras Cohere DeepSeek Mistral AI OpenAI Stability AI xAI == Publications == === Books === Deep Learning – Ian Goodfellow and Yoshua Bengio Neural Networks and Deep Learning – Michael Nielsen Perceptrons – Marvin Minsky and Seymour Papert === Journals === IEEE Transactions on Neural Networks and Learning Systems Neural Networks Neural Computation == Influential persons ==

    Read more →
  • Reparameterization trick

    Reparameterization trick

    The reparameterization trick (aka "reparameterization gradient estimator") is a technique used in statistical machine learning, particularly in variational inference, variational autoencoders, and stochastic optimization. It allows for the efficient computation of gradients through random variables, enabling the optimization of parametric probability models using stochastic gradient descent, and the variance reduction of estimators. It was developed in the 1980s in operations research, under the name of "pathwise gradients", or "stochastic gradients". Its use in variational inference was proposed in 2013. == Mathematics == Let z {\displaystyle z} be a random variable with distribution q ϕ ( z ) {\displaystyle q_{\phi }(z)} , where ϕ {\displaystyle \phi } is a vector containing the parameters of the distribution. === REINFORCE estimator === Consider an objective function of the form: L ( ϕ ) = E z ∼ q ϕ ( z ) [ f ( z ) ] {\displaystyle L(\phi )=\mathbb {E} _{z\sim q_{\phi }(z)}[f(z)]} Without the reparameterization trick, estimating the gradient ∇ ϕ L ( ϕ ) {\displaystyle \nabla _{\phi }L(\phi )} can be challenging, because the parameter appears in the random variable itself. In more detail, we have to statistically estimate: ∇ ϕ L ( ϕ ) = ∇ ϕ ∫ d z q ϕ ( z ) f ( z ) {\displaystyle \nabla _{\phi }L(\phi )=\nabla _{\phi }\int dz\;q_{\phi }(z)f(z)} The REINFORCE estimator, widely used in reinforcement learning and especially policy gradient, uses the following equality: ∇ ϕ L ( ϕ ) = ∫ d z q ϕ ( z ) ∇ ϕ ( ln ⁡ q ϕ ( z ) ) f ( z ) = E z ∼ q ϕ ( z ) [ ∇ ϕ ( ln ⁡ q ϕ ( z ) ) f ( z ) ] {\displaystyle \nabla _{\phi }L(\phi )=\int dz\;q_{\phi }(z)\nabla _{\phi }(\ln q_{\phi }(z))f(z)=\mathbb {E} _{z\sim q_{\phi }(z)}[\nabla _{\phi }(\ln q_{\phi }(z))f(z)]} This allows the gradient to be estimated: ∇ ϕ L ( ϕ ) ≈ 1 N ∑ i = 1 N ∇ ϕ ( ln ⁡ q ϕ ( z i ) ) f ( z i ) {\displaystyle \nabla _{\phi }L(\phi )\approx {\frac {1}{N}}\sum _{i=1}^{N}\nabla _{\phi }(\ln q_{\phi }(z_{i}))f(z_{i})} The REINFORCE estimator has high variance, and many methods were developed to reduce its variance. === Reparameterization estimator === The reparameterization trick expresses z {\displaystyle z} as: z = g ϕ ( ϵ ) , ϵ ∼ p ( ϵ ) {\displaystyle z=g_{\phi }(\epsilon ),\quad \epsilon \sim p(\epsilon )} Here, g ϕ {\displaystyle g_{\phi }} is a deterministic function parameterized by ϕ {\displaystyle \phi } , and ϵ {\displaystyle \epsilon } is a noise variable drawn from a fixed distribution p ( ϵ ) {\displaystyle p(\epsilon )} . This gives: L ( ϕ ) = E ϵ ∼ p ( ϵ ) [ f ( g ϕ ( ϵ ) ) ] {\displaystyle L(\phi )=\mathbb {E} _{\epsilon \sim p(\epsilon )}[f(g_{\phi }(\epsilon ))]} Now, the gradient can be estimated as: ∇ ϕ L ( ϕ ) = E ϵ ∼ p ( ϵ ) [ ∇ ϕ f ( g ϕ ( ϵ ) ) ] ≈ 1 N ∑ i = 1 N ∇ ϕ f ( g ϕ ( ϵ i ) ) {\displaystyle \nabla _{\phi }L(\phi )=\mathbb {E} _{\epsilon \sim p(\epsilon )}[\nabla _{\phi }f(g_{\phi }(\epsilon ))]\approx {\frac {1}{N}}\sum _{i=1}^{N}\nabla _{\phi }f(g_{\phi }(\epsilon _{i}))} == Examples == For some common distributions, the reparameterization trick takes specific forms: Normal distribution: For z ∼ N ( μ , σ 2 ) {\displaystyle z\sim {\mathcal {N}}(\mu ,\sigma ^{2})} , we can use: z = μ + σ ϵ , ϵ ∼ N ( 0 , 1 ) {\displaystyle z=\mu +\sigma \epsilon ,\quad \epsilon \sim {\mathcal {N}}(0,1)} Exponential distribution: For z ∼ Exp ( λ ) {\displaystyle z\sim {\text{Exp}}(\lambda )} , we can use: z = − 1 λ log ⁡ ( ϵ ) , ϵ ∼ Uniform ( 0 , 1 ) {\displaystyle z=-{\frac {1}{\lambda }}\log(\epsilon ),\quad \epsilon \sim {\text{Uniform}}(0,1)} Discrete distribution can be reparameterized by the Gumbel distribution (Gumbel-softmax trick or "concrete distribution") and diffusion models. In general, any distribution that is differentiable with respect to its parameters can be reparameterized by inverting the multivariable CDF function, then apply the implicit method. See for an exposition and application to the Gamma, Beta, Dirichlet, and von Mises distributions. == Applications == === Variational autoencoder === In Variational Autoencoders (VAEs), the VAE objective function, known as the Evidence Lower Bound (ELBO), is given by: ELBO ( ϕ , θ ) = E z ∼ q ϕ ( z | x ) [ log ⁡ p θ ( x | z ) ] − D KL ( q ϕ ( z | x ) | | p ( z ) ) {\displaystyle {\text{ELBO}}(\phi ,\theta )=\mathbb {E} _{z\sim q_{\phi }(z|x)}[\log p_{\theta }(x|z)]-D_{\text{KL}}(q_{\phi }(z|x)||p(z))} where q ϕ ( z | x ) {\displaystyle q_{\phi }(z|x)} is the encoder (recognition model), p θ ( x | z ) {\displaystyle p_{\theta }(x|z)} is the decoder (generative model), and p ( z ) {\displaystyle p(z)} is the prior distribution over latent variables. The gradient of ELBO with respect to θ {\displaystyle \theta } is simply E z ∼ q ϕ ( z | x ) [ ∇ θ log ⁡ p θ ( x | z ) ] ≈ 1 L ∑ l = 1 L ∇ θ log ⁡ p θ ( x | z l ) {\displaystyle \mathbb {E} _{z\sim q_{\phi }(z|x)}[\nabla _{\theta }\log p_{\theta }(x|z)]\approx {\frac {1}{L}}\sum _{l=1}^{L}\nabla _{\theta }\log p_{\theta }(x|z_{l})} but the gradient with respect to ϕ {\displaystyle \phi } requires the trick. Express the sampling operation z ∼ q ϕ ( z | x ) {\displaystyle z\sim q_{\phi }(z|x)} as: z = μ ϕ ( x ) + σ ϕ ( x ) ⊙ ϵ , ϵ ∼ N ( 0 , I ) {\displaystyle z=\mu _{\phi }(x)+\sigma _{\phi }(x)\odot \epsilon ,\quad \epsilon \sim {\mathcal {N}}(0,I)} where μ ϕ ( x ) {\displaystyle \mu _{\phi }(x)} and σ ϕ ( x ) {\displaystyle \sigma _{\phi }(x)} are the outputs of the encoder network, and ⊙ {\displaystyle \odot } denotes element-wise multiplication. Then we have ∇ ϕ ELBO ( ϕ , θ ) = E ϵ ∼ N ( 0 , I ) [ ∇ ϕ log ⁡ p θ ( x | z ) + ∇ ϕ log ⁡ q ϕ ( z | x ) − ∇ ϕ log ⁡ p ( z ) ] {\displaystyle \nabla _{\phi }{\text{ELBO}}(\phi ,\theta )=\mathbb {E} _{\epsilon \sim {\mathcal {N}}(0,I)}[\nabla _{\phi }\log p_{\theta }(x|z)+\nabla _{\phi }\log q_{\phi }(z|x)-\nabla _{\phi }\log p(z)]} where z = μ ϕ ( x ) + σ ϕ ( x ) ⊙ ϵ {\displaystyle z=\mu _{\phi }(x)+\sigma _{\phi }(x)\odot \epsilon } . This allows us to estimate the gradient using Monte Carlo sampling: ∇ ϕ ELBO ( ϕ , θ ) ≈ 1 L ∑ l = 1 L [ ∇ ϕ log ⁡ p θ ( x | z l ) + ∇ ϕ log ⁡ q ϕ ( z l | x ) − ∇ ϕ log ⁡ p ( z l ) ] {\displaystyle \nabla _{\phi }{\text{ELBO}}(\phi ,\theta )\approx {\frac {1}{L}}\sum _{l=1}^{L}[\nabla _{\phi }\log p_{\theta }(x|z_{l})+\nabla _{\phi }\log q_{\phi }(z_{l}|x)-\nabla _{\phi }\log p(z_{l})]} where z l = μ ϕ ( x ) + σ ϕ ( x ) ⊙ ϵ l {\displaystyle z_{l}=\mu _{\phi }(x)+\sigma _{\phi }(x)\odot \epsilon _{l}} and ϵ l ∼ N ( 0 , I ) {\displaystyle \epsilon _{l}\sim {\mathcal {N}}(0,I)} for l = 1 , … , L {\displaystyle l=1,\ldots ,L} . This formulation enables backpropagation through the sampling process, allowing for end-to-end training of the VAE model using stochastic gradient descent or its variants. === Variational inference === More generally, the trick allows using stochastic gradient descent for variational inference. Let the variational objective (ELBO) be of the form: ELBO ( ϕ ) = E z ∼ q ϕ ( z ) [ log ⁡ p ( x , z ) − log ⁡ q ϕ ( z ) ] {\displaystyle {\text{ELBO}}(\phi )=\mathbb {E} _{z\sim q_{\phi }(z)}[\log p(x,z)-\log q_{\phi }(z)]} Using the reparameterization trick, we can estimate the gradient of this objective with respect to ϕ {\displaystyle \phi } : ∇ ϕ ELBO ( ϕ ) ≈ 1 L ∑ l = 1 L ∇ ϕ [ log ⁡ p ( x , g ϕ ( ϵ l ) ) − log ⁡ q ϕ ( g ϕ ( ϵ l ) ) ] , ϵ l ∼ p ( ϵ ) {\displaystyle \nabla _{\phi }{\text{ELBO}}(\phi )\approx {\frac {1}{L}}\sum _{l=1}^{L}\nabla _{\phi }[\log p(x,g_{\phi }(\epsilon _{l}))-\log q_{\phi }(g_{\phi }(\epsilon _{l}))],\quad \epsilon _{l}\sim p(\epsilon )} === Dropout === The reparameterization trick has been applied to reduce the variance in dropout, a regularization technique in neural networks. The original dropout can be reparameterized with Bernoulli distributions: y = ( W ⊙ ϵ ) x , ϵ i j ∼ Bernoulli ( α i j ) {\displaystyle y=(W\odot \epsilon )x,\quad \epsilon _{ij}\sim {\text{Bernoulli}}(\alpha _{ij})} where W {\displaystyle W} is the weight matrix, x {\displaystyle x} is the input, and α i j {\displaystyle \alpha _{ij}} are the (fixed) dropout rates. More generally, other distributions can be used than the Bernoulli distribution, such as the gaussian noise: y i = μ i + σ i ⊙ ϵ i , ϵ i ∼ N ( 0 , I ) {\displaystyle y_{i}=\mu _{i}+\sigma _{i}\odot \epsilon _{i},\quad \epsilon _{i}\sim {\mathcal {N}}(0,I)} where μ i = m i ⊤ x {\displaystyle \mu _{i}=\mathbf {m} _{i}^{\top }x} and σ i 2 = v i ⊤ x 2 {\displaystyle \sigma _{i}^{2}=\mathbf {v} _{i}^{\top }x^{2}} , with m i {\displaystyle \mathbf {m} _{i}} and v i {\displaystyle \mathbf {v} _{i}} being the mean and variance of the i {\displaystyle i} -th output neuron. The reparameterization trick can be applied to all such cases, resulting in the variational dropout method.

    Read more →
  • Imo.im

    Imo.im

    imo.im is a proprietary audio/video calling and instant messaging software service. It allows sending music, video, PDFs and other files, along with various free stickers. It supports encrypted group video and voice calls with up to 20 participants. According to its developer, the service possesses over 200 million users and over 50 million messages per day are sent through it. == History == The product was created as a web-based application in 2005 for accessing multiple chat platforms, including Facebook Messenger, Google Talk, Yahoo! Messenger, and Skype chat. It was developed by Pagebites, which is a subsidiary of Singularity IM, Inc. and required a subscriber's phone number to verify the users' account. In March 2014, support for all third-party messaging networks ended. In January 2018, the app reached 500 million installs. imo.im has implemented end-to-end encryption for its chats and calls, ensuring that the conversations remain private between the sender and receiver.

    Read more →
  • Data exploration

    Data exploration

    Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems. These characteristics can include size or amount of data, completeness of the data, correctness of the data, possible relationships amongst data elements or files/tables in the data. Data exploration is typically conducted using a combination of automated and manual activities. Automated activities can include data profiling or data visualization or tabular reports to give the analyst an initial view into the data and an understanding of key characteristics. This is often followed by manual drill-down or filtering of the data to identify anomalies or patterns identified through the automated actions. Data exploration can also require manual scripting and queries into the data (e.g. using languages such as SQL or R) or using spreadsheets or similar tools to view the raw data. All of these activities are aimed at creating a mental model and understanding of the data in the mind of the analyst, and defining basic metadata (statistics, structure, relationships) for the data set that can be used in further analysis. Once this initial understanding of the data is had, the data can be pruned or refined by removing unusable parts of the data (data cleansing), correcting poorly formatted elements and defining relevant relationships across datasets. This process is also known as determining data quality. Data exploration can also refer to the ad hoc querying or visualization of data to identify potential relationships or insights that may be hidden in the data and does not require to formulate assumptions beforehand. Traditionally, this had been a key area of focus for statisticians, with John Tukey being a key evangelist in the field. Today, data exploration is more widespread and is the focus of data analysts and data scientists; the latter being a relatively new role within enterprises and larger organizations. == Interactive Data Exploration == This area of data exploration has become an area of interest in the field of machine learning. This is a relatively new field and is still evolving. As its most basic level, a machine-learning algorithm can be fed a data set and can be used to identify whether a hypothesis is true based on the dataset. Common machine learning algorithms can focus on identifying specific patterns in the data. Many common patterns include regression and classification or clustering, but there are many possible patterns and algorithms that can be applied to data via machine learning. By employing machine learning, it is possible to find patterns or relationships in the data that would be difficult or impossible to find via manual inspection, trial and error or traditional exploration techniques. == Software == Trifacta – a data preparation and analysis platform Paxata – self-service data preparation software Alteryx – data blending and advanced data analytics software Microsoft Power BI - interactive visualization and data analysis tool OpenRefine - a standalone open source desktop application for data clean-up and data transformation Tableau software – interactive data visualization software

    Read more →
  • Neural scaling law

    Neural scaling law

    In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost. Some models also exhibit performance gains by scaling inference through increased test-time compute (TTC), extending neural scaling laws beyond training to the deployment phase. == Introduction == In general, a deep learning model can be characterized by four parameters: model size, training dataset size, training cost, and the post-training error rate (e.g., the test set error rate). Each of these variables can be defined as a real number, usually written as N , D , C , L {\displaystyle N,D,C,L} (respectively: parameter count, dataset size, computing cost, and loss). A neural scaling law is a theoretical or empirical statistical law between these parameters. There are also other parameters with other scaling laws. === Size of the model === In most cases, the model's size is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-expert models. With sparse models, during inference, only a fraction of their parameters are used. In comparison, most other kinds of neural networks, such as transformer models, always use all their parameters during inference. === Size of the training dataset === The size of the training dataset is usually quantified by the number of data points within it. Larger training datasets are typically preferred, as they provide a richer and more diverse source of information from which the model can learn. This can lead to improved generalization performance when the model is applied to new, unseen data. However, increasing the size of the training dataset also increases the computational resources and time required for model training. With the "pretrain, then finetune" method used for most large language models, there are two kinds of training dataset: the pretraining dataset and the finetuning dataset. Their sizes have different effects on model performance. Generally, the finetuning dataset is less than 1% the size of pretraining dataset. In some cases, a small amount of high quality data suffices for finetuning, and more data does not necessarily improve performance. Many scaling laws, due to their inherent diminishing returns nature, value data based on a submodular set function which was shown in a paper on this topic. === Cost of training === Training cost is typically measured in terms of time (how long it takes to train the model) and computational resources (how much processing power and memory are required). It is important to note that the cost of training can be significantly reduced with efficient training algorithms, optimized software libraries, and parallel computing on specialized hardware such as GPUs or TPUs. The cost of training a neural network model is a function of several factors, including model size, training dataset size, the training algorithm complexity, and the computational resources available. In particular, doubling the training dataset size does not necessarily double the cost of training, because one may train the model for several times over the same dataset (each being an "epoch"). === Performance === The performance of a neural network model is evaluated based on its ability to accurately predict the output given some input data. Common metrics for evaluating model performance include: Negative log-likelihood per token (logarithm of perplexity) for language modeling; Accuracy, precision, recall, and F1 score for classification tasks; Mean squared error (MSE) or mean absolute error (MAE) for regression tasks; Elo rating in a competition against other models, such as gameplay or preference by a human judge. Performance can be improved by using more data, larger models, different training algorithms, regularizing the model to prevent overfitting, and early stopping using a validation set. When the performance is a number bounded within the range of [ 0 , 1 ] {\displaystyle [0,1]} , such as accuracy, precision, etc., it often scales as a sigmoid function of cost, as seen in the figures. == Examples == === (Hestness, Narang, et al, 2017) === The 2017 paper is a common reference point for neural scaling laws fitted by statistical analysis on experimental data. Previous works before the 2000s, as cited in the paper, were either theoretical or orders of magnitude smaller in scale. Whereas previous works generally found the scaling exponent to scale like L ∝ D − α {\displaystyle L\propto D^{-\alpha }} , with α ∈ { 0.5 , 1 , 2 } {\displaystyle \alpha \in \{0.5,1,2\}} , the paper found that α ∈ [ 0.07 , 0.35 ] {\displaystyle \alpha \in [0.07,0.35]} . Of the factors they varied, only task can change the exponent α {\displaystyle \alpha } . Changing the architecture optimizers, regularizers, and loss functions, would only change the proportionality factor, not the exponent. For example, for the same task, one architecture might have L = 1000 D − 0.3 {\displaystyle L=1000D^{-0.3}} while another might have L = 500 D − 0.3 {\displaystyle L=500D^{-0.3}} . They also found that for a given architecture, the number of parameters necessary to reach lowest levels of loss, given a fixed dataset size, grows like N ∝ D β {\displaystyle N\propto D^{\beta }} for another exponent β {\displaystyle \beta } . They studied machine translation with LSTM ( α ∼ 0.13 {\displaystyle \alpha \sim 0.13} ), generative language modelling with LSTM ( α ∈ [ 0.06 , 0.09 ] , β ≈ 0.7 {\displaystyle \alpha \in [0.06,0.09],\beta \approx 0.7} ), ImageNet classification with ResNet ( α ∈ [ 0.3 , 0.5 ] , β ≈ 0.6 {\displaystyle \alpha \in [0.3,0.5],\beta \approx 0.6} ), and speech recognition with two hybrid (LSTMs complemented by either CNNs or an attention decoder) architectures ( α ≈ 0.3 {\displaystyle \alpha \approx 0.3} ). === (Henighan, Kaplan, et al, 2020) === A 2020 analysis studied statistical relations between C , N , D , L {\displaystyle C,N,D,L} over a wide range of values and found similar scaling laws, over the range of N ∈ [ 10 3 , 10 9 ] {\displaystyle N\in [10^{3},10^{9}]} , C ∈ [ 10 12 , 10 21 ] {\displaystyle C\in [10^{12},10^{21}]} , and over multiple modalities (text, video, image, text to image, etc.). In particular, the scaling laws it found are (Table 1 of ): For each modality, they fixed one of the two C , N {\displaystyle C,N} , and varying the other one ( D {\displaystyle D} is varied along using D = C / 6 N {\displaystyle D=C/6N} ), the achievable test loss satisfies L = L 0 + ( x 0 x ) α {\displaystyle L=L_{0}+\left({\frac {x_{0}}{x}}\right)^{\alpha }} where x {\displaystyle x} is the varied variable, and L 0 , x 0 , α {\displaystyle L_{0},x_{0},\alpha } are parameters to be found by statistical fitting. The parameter α {\displaystyle \alpha } is the most important one. When N {\displaystyle N} is the varied variable, α {\displaystyle \alpha } ranges from 0.037 {\displaystyle 0.037} to 0.24 {\displaystyle 0.24} depending on the model modality. This corresponds to the α = 0.34 {\displaystyle \alpha =0.34} from the Chinchilla scaling paper. When C {\displaystyle C} is the varied variable, α {\displaystyle \alpha } ranges from 0.048 {\displaystyle 0.048} to 0.19 {\displaystyle 0.19} depending on the model modality. This corresponds to the β = 0.28 {\displaystyle \beta =0.28} from the Chinchilla scaling paper. Given fixed computing budget, optimal model parameter count is consistently around N o p t ( C ) = ( C 5 × 10 − 12 petaFLOP-day ) 0.7 = 9.0 × 10 − 7 C 0.7 {\displaystyle N_{opt}(C)=\left({\frac {C}{5\times 10^{-12}{\text{petaFLOP-day}}}}\right)^{0.7}=9.0\times 10^{-7}C^{0.7}} The parameter 9.0 × 10 − 7 {\displaystyle 9.0\times 10^{-7}} varies by a factor of up to 10 for different modalities. The exponent parameter 0.7 {\displaystyle 0.7} varies from 0.64 {\displaystyle 0.64} to 0.75 {\displaystyle 0.75} for different modalities. This exponent corresponds to the ≈ 0.5 {\displaystyle \approx 0.5} from the Chinchilla scaling paper. It's "strongly suggested" (but not statistically checked) that D o p t ( C ) ∝ N o p t ( C ) 0.4 ∝ C 0.28 {\displaystyle D_{opt}(C)\propto N_{opt}(C)^{0.4}\propto C^{0.28}} . This exponent corresponds to the ≈ 0.5 {\displaystyle \approx 0.5} from the Chinchilla scaling paper. The scaling law of L = L 0 + ( C 0 / C ) 0.048 {\displaystyle L=L_{0}+(C_{0}/C)^{0.048}} was confirmed during the training of GPT-3 (Figure 3.1 ). === Chinchilla scaling (Hoffmann, et al, 2022) === One particular scaling law ("Chinchilla scaling") states that, for a large language model (LLM) autoregressively trained for one epoch, with a cosine learning rate schedule, we have: { C = C 0 N D L = A N α + B D β + L 0 {\displaystyle {\begin{cases}C=C_{0}ND\\L={\frac {A}{N^{\alpha }}}+{\frac {B}{D^{\beta }}}+L_{0}\end{cases}}} where the variables are C {\displaystyle C} is the cost o

    Read more →
  • Inception score

    Inception score

    The Inception Score (IS) is an algorithm used to assess the quality of images created by a generative image model such as a generative adversarial network (GAN). The score is calculated based on the output of a separate, pretrained Inception v3 image classification model applied to a sample of (typically around 30,000) images generated by the generative model. The Inception Score is maximized when the following conditions are true: The entropy of the distribution of labels predicted by the Inceptionv3 model for the generated images is minimized. In other words, the classification model confidently predicts a single label for each image. Intuitively, this corresponds to the desideratum of generated images being "sharp" or "distinct". The predictions of the classification model are evenly distributed across all possible labels. This corresponds to the desideratum that the output of the generative model is "diverse". It has been somewhat superseded by the related Fréchet inception distance. While the Inception Score only evaluates the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images ("ground truth"). == Definition == Let there be two spaces, the space of images Ω X {\displaystyle \Omega _{X}} and the space of labels Ω Y {\displaystyle \Omega _{Y}} . The space of labels is finite. Let p g e n {\displaystyle p_{gen}} be a probability distribution over Ω X {\displaystyle \Omega _{X}} that we wish to judge. Let a discriminator be a function of type p d i s : Ω X → M ( Ω Y ) {\displaystyle p_{dis}:\Omega _{X}\to M(\Omega _{Y})} where M ( Ω Y ) {\displaystyle M(\Omega _{Y})} is the set of all probability distributions on Ω Y {\displaystyle \Omega _{Y}} . For any image x {\displaystyle x} , and any label y {\displaystyle y} , let p d i s ( y | x ) {\displaystyle p_{dis}(y|x)} be the probability that image x {\displaystyle x} has label y {\displaystyle y} , according to the discriminator. It is usually implemented as an Inception-v3 network trained on ImageNet. The Inception Score of p g e n {\displaystyle p_{gen}} relative to p d i s {\displaystyle p_{dis}} is I S ( p g e n , p d i s ) := exp ⁡ ( E x ∼ p g e n [ D K L ( p d i s ( ⋅ | x ) ‖ ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x ) ] ) {\displaystyle IS(p_{gen},p_{dis}):=\exp \left(\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\int p_{dis}(\cdot |x)p_{gen}(x)dx\right)\right]\right)} Equivalent rewrites include ln ⁡ I S ( p g e n , p d i s ) := E x ∼ p g e n [ D K L ( p d i s ( ⋅ | x ) ‖ E x ∼ p g e n [ p d i s ( ⋅ | x ) ] ) ] {\displaystyle \ln IS(p_{gen},p_{dis}):=\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]\right)\right]} ln ⁡ I S ( p g e n , p d i s ) := H [ E x ∼ p g e n [ p d i s ( ⋅ | x ) ] ] − E x ∼ p g e n [ H [ p d i s ( ⋅ | x ) ] ] {\displaystyle \ln IS(p_{gen},p_{dis}):=H[\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]]-\mathbb {E} _{x\sim p_{gen}}[H[p_{dis}(\cdot |x)]]} ln ⁡ I S {\displaystyle \ln IS} is nonnegative by Jensen's inequality. Pseudocode:INPUT discriminator p d i s {\displaystyle p_{dis}} . INPUT generator g {\displaystyle g} . Sample images x i {\displaystyle x_{i}} from generator. Compute p d i s ( ⋅ | x i ) {\displaystyle p_{dis}(\cdot |x_{i})} , the probability distribution over labels conditional on image x i {\displaystyle x_{i}} . Sum up the results to obtain p ^ {\displaystyle {\hat {p}}} , an empirical estimate of ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x {\displaystyle \int p_{dis}(\cdot |x)p_{gen}(x)dx} . Sample more images x i {\displaystyle x_{i}} from generator, and for each, compute D K L ( p d i s ( ⋅ | x i ) ‖ p ^ ) {\displaystyle D_{KL}\left(p_{dis}(\cdot |x_{i})\|{\hat {p}}\right)} . Average the results, and take its exponential. RETURN the result. === Interpretation === A higher inception score is interpreted as "better", as it means that p g e n {\displaystyle p_{gen}} is a "sharp and distinct" collection of pictures. ln ⁡ I S ( p g e n , p d i s ) ∈ [ 0 , ln ⁡ N ] {\displaystyle \ln IS(p_{gen},p_{dis})\in [0,\ln N]} , where N {\displaystyle N} is the total number of possible labels. ln ⁡ I S ( p g e n , p d i s ) = 0 {\displaystyle \ln IS(p_{gen},p_{dis})=0} iff for almost all x ∼ p g e n {\displaystyle x\sim p_{gen}} p d i s ( ⋅ | x ) = ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x {\displaystyle p_{dis}(\cdot |x)=\int p_{dis}(\cdot |x)p_{gen}(x)dx} That means p g e n {\displaystyle p_{gen}} is completely "indistinct". That is, for any image x {\displaystyle x} sampled from p g e n {\displaystyle p_{gen}} , discriminator returns exactly the same label predictions p d i s ( ⋅ | x ) {\displaystyle p_{dis}(\cdot |x)} . The highest inception score N {\displaystyle N} is achieved if and only if the two conditions are both true: For almost all x ∼ p g e n {\displaystyle x\sim p_{gen}} , the distribution p d i s ( y | x ) {\displaystyle p_{dis}(y|x)} is concentrated on one label. That is, H y [ p d i s ( y | x ) ] = 0 {\displaystyle H_{y}[p_{dis}(y|x)]=0} . That is, every image sampled from p g e n {\displaystyle p_{gen}} is exactly classified by the discriminator. For every label y {\displaystyle y} , the proportion of generated images labelled as y {\displaystyle y} is exactly E x ∼ p g e n [ p d i s ( y | x ) ] = 1 N {\displaystyle \mathbb {E} _{x\sim p_{gen}}[p_{dis}(y|x)]={\frac {1}{N}}} . That is, the generated images are equally distributed over all labels.

    Read more →
  • Latent semantic analysis

    Latent semantic analysis

    Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents. An information retrieval technique using latent semantic structure was patented in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI). == Overview == === Occurrence matrix === LSA can use a document-term matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents. A typical example of the weighting of the elements of the matrix is tf-idf (term frequency–inverse document frequency): the weight of an element of the matrix is proportional to the number of times the terms appear in each document, where rare terms are upweighted to reflect their relative importance. This matrix is also common to standard semantic models, though it is not necessarily explicitly expressed as a matrix, since the mathematical properties of matrices are not always used. === Rank lowering === After the construction of the occurrence matrix, LSA finds a low-rank approximation to the term-document matrix. There could be various reasons for these approximations: The original term-document matrix is presumed too large for the computing resources; in this case, the approximated low rank matrix is interpreted as an approximation (a "least and necessary evil"). The original term-document matrix is presumed noisy: for example, anecdotal instances of terms are to be eliminated. From this point of view, the approximated matrix is interpreted as a de-noisified matrix (a better matrix than the original). The original term-document matrix is presumed overly sparse relative to the "true" term-document matrix. That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document—generally a much larger set due to synonymy. The consequence of the rank lowering is that some dimensions are combined and depend on more than one term: {(car), (truck), (flower)} → {(1.3452 car + 0.2828 truck), (flower)} This mitigates the problem of identifying synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. It also partially mitigates the problem with polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning. Conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense. === Derivation === Let X {\displaystyle X} be a matrix where element ( i , j ) {\displaystyle (i,j)} describes the occurrence of term i {\displaystyle i} in document j {\displaystyle j} (this can be, for example, the frequency). X {\displaystyle X} will look like this: d j ↓ t i T → [ x 1 , 1 … x 1 , j … x 1 , n ⋮ ⋱ ⋮ ⋱ ⋮ x i , 1 … x i , j … x i , n ⋮ ⋱ ⋮ ⋱ ⋮ x m , 1 … x m , j … x m , n ] {\displaystyle {\begin{matrix}&{\textbf {d}}_{j}\\&\downarrow \\{\textbf {t}}_{i}^{T}\rightarrow &{\begin{bmatrix}x_{1,1}&\dots &x_{1,j}&\dots &x_{1,n}\\\vdots &\ddots &\vdots &\ddots &\vdots \\x_{i,1}&\dots &x_{i,j}&\dots &x_{i,n}\\\vdots &\ddots &\vdots &\ddots &\vdots \\x_{m,1}&\dots &x_{m,j}&\dots &x_{m,n}\\\end{bmatrix}}\end{matrix}}} Now a row in this matrix will be a vector corresponding to a term, giving its relation to each document: t i T = [ x i , 1 … x i , j … x i , n ] {\displaystyle {\textbf {t}}_{i}^{T}={\begin{bmatrix}x_{i,1}&\dots &x_{i,j}&\dots &x_{i,n}\end{bmatrix}}} Likewise, a column in this matrix will be a vector corresponding to a document, giving its relation to each term: d j = [ x 1 , j ⋮ x i , j ⋮ x m , j ] {\displaystyle {\textbf {d}}_{j}={\begin{bmatrix}x_{1,j}\\\vdots \\x_{i,j}\\\vdots \\x_{m,j}\\\end{bmatrix}}} Now the dot product t i T t p {\displaystyle {\textbf {t}}_{i}^{T}{\textbf {t}}_{p}} between two term vectors gives the correlation between the terms over the set of documents. The matrix product X X T {\displaystyle XX^{T}} contains all these dot products. Element ( i , p ) {\displaystyle (i,p)} (which is equal to element ( p , i ) {\displaystyle (p,i)} ) contains the dot product t i T t p {\displaystyle {\textbf {t}}_{i}^{T}{\textbf {t}}_{p}} ( = t p T t i {\displaystyle ={\textbf {t}}_{p}^{T}{\textbf {t}}_{i}} ). Likewise, the matrix X T X {\displaystyle X^{T}X} contains the dot products between all the document vectors, giving their correlation over the terms: d j T d q = d q T d j {\displaystyle {\textbf {d}}_{j}^{T}{\textbf {d}}_{q}={\textbf {d}}_{q}^{T}{\textbf {d}}_{j}} . Now, from the theory of linear algebra, there exists a decomposition of X {\displaystyle X} such that U {\displaystyle U} and V {\displaystyle V} are orthogonal matrices and Σ {\displaystyle \Sigma } is a diagonal matrix. This is called a singular value decomposition (SVD): X = U Σ V T {\displaystyle {\begin{matrix}X=U\Sigma V^{T}\end{matrix}}} The matrix products giving us the term and document correlations then become X X T = ( U Σ V T ) ( U Σ V T ) T = ( U Σ V T ) ( V T T Σ T U T ) = U Σ V T V Σ T U T = U Σ Σ T U T X T X = ( U Σ V T ) T ( U Σ V T ) = ( V T T Σ T U T ) ( U Σ V T ) = V Σ T U T U Σ V T = V Σ T Σ V T {\displaystyle {\begin{matrix}XX^{T}&=&(U\Sigma V^{T})(U\Sigma V^{T})^{T}=(U\Sigma V^{T})(V^{T^{T}}\Sigma ^{T}U^{T})=U\Sigma V^{T}V\Sigma ^{T}U^{T}=U\Sigma \Sigma ^{T}U^{T}\\X^{T}X&=&(U\Sigma V^{T})^{T}(U\Sigma V^{T})=(V^{T^{T}}\Sigma ^{T}U^{T})(U\Sigma V^{T})=V\Sigma ^{T}U^{T}U\Sigma V^{T}=V\Sigma ^{T}\Sigma V^{T}\end{matrix}}} Since Σ Σ T {\displaystyle \Sigma \Sigma ^{T}} and Σ T Σ {\displaystyle \Sigma ^{T}\Sigma } are diagonal we see that U {\displaystyle U} must contain the eigenvectors of X X T {\displaystyle XX^{T}} , while V {\displaystyle V} must be the eigenvectors of X T X {\displaystyle X^{T}X} . Both products have the same non-zero eigenvalues, given by the non-zero entries of Σ Σ T {\displaystyle \Sigma \Sigma ^{T}} , or equally, by the non-zero entries of Σ T Σ {\displaystyle \Sigma ^{T}\Sigma } . Now the decomposition looks like this: X U Σ V T ( d j ) ( d ^ j ) ↓ ↓ ( t i T ) → [ x 1 , 1 … x 1 , j … x 1 , n ⋮ ⋱ ⋮ ⋱ ⋮ x i , 1 … x i , j … x i , n ⋮ ⋱ ⋮ ⋱ ⋮ x m , 1 … x m , j … x m , n ] = ( t ^ i T ) → [ [ u 1 ] … [ u l ] ] ⋅ [ σ 1 … 0 ⋮ ⋱ ⋮ 0 … σ l ] ⋅ [ [ v 1 ] ⋮ [ v l ] ] {\displaystyle {\begin{matrix}&X&&&U&&\Sigma &&V^{T}\\&({\textbf {d}}_{j})&&&&&&&({\hat {\textbf {d}}}_{j})\\&\downarrow &&&&&&&\downarrow \\({\textbf {t}}_{i}^{T})\rightarrow &{\begin{bmatrix}x_{1,1}&\dots &x_{1,j}&\dots &x_{1,n}\\\vdots &\ddots &\vdots &\ddots &\vdots \\x_{i,1}&\dots &x_{i,j}&\dots &x_{i,n}\\\vdots &\ddots &\vdots &\ddots &\vdots \\x_{m,1}&\dots &x_{m,j}&\dots &x_{m,n}\\\end{bmatrix}}&=&({\hat {\textbf {t}}}_{i}^{T})\rightarrow &{\begin{bmatrix}{\begin{bmatrix}\,\\\,\\{\textbf {u}}_{1}\\\,\\\,\end{bmatrix}}\dots {\begin{bmatrix}\,\\\,\\{\textbf {u}}_{l}\\\,\\\,\end{bmatrix}}\end{bmatrix}}&\cdot &{\begin{bmatrix}\sigma _{1}&\dots &0\\\vdots &\ddots &\vdots \\0&\dots &\sigma _{l}\\\end{bmatrix}}&\cdot &{\begin{bmatrix}{\begin{bmatrix}&&{\textbf {v}}_{1}&&\end{bmatrix}}\\\vdots \\{\begin{bmatrix}&&{\textbf {v}}_{l}&&\end{bmatrix}}\end{bmatrix}}\end{matrix}}} The values σ 1 , … , σ l {\displaystyle \sigma _{1},\dots ,\sigma _{l}} are called the singular values, and u 1 , … , u l {\displaystyle u_{1},\dots ,u_{l}} and v 1 , … , v l {\displaystyle v_{1},\dots ,v_{l}} the left and right singular vectors. Notice the only part of U {\displaystyle U} that contributes to t i {\displaystyle {\textbf {t}}_{i}} is the i 'th {\displaystyle i{\textrm {'th}}} row. Let this row vector be called t ^ i T {\displaystyle {\hat {\textrm {t}}}_{i}^{T}} . Likewise, the only part of V T {\displaystyle V^{T}} that contributes to d j {\displaystyle {\textbf {d}}_{j}} is the j 'th {\displaystyle j{\textrm {'th}}} column, d ^ j {\displaystyle {\hat {\textrm {d}}}_{j}} . These are not the eigenvectors, but depend on all the eigenvectors. I

    Read more →
  • Three-factor learning

    Three-factor learning

    In neuroscience and machine learning, three-factor learning is the combination of Hebbian plasticity with a third modulatory factor to stabilise and enhance synaptic learning. This third factor can represent various signals such as reward, punishment, error, surprise, or novelty, often implemented through neuromodulators. == Description == Three-factor learning introduces the concept of eligibility traces, which flag synapses for potential modification pending the arrival of the third factor, and helps temporal credit assignement by bridging the gap between rapid neuronal firing and slower behavioral timescales, from which learning can be done. Biological basis for Three-factor learning rules have been supported by experimental evidence. This approach addresses the instability of classical Hebbian learning by minimizing autocorrelation and maximizing cross-correlation between inputs.

    Read more →
  • Weak artificial intelligence

    Weak artificial intelligence

    Weak artificial intelligence (weak AI) is artificial intelligence that implements a limited part of the mind, or, as narrow AI, artificial narrow intelligence (ANI), is focused on one narrow task. Weak AI is contrasted with strong AI, which can be interpreted in various ways: Artificial general intelligence (AGI): a machine with the ability to apply intelligence to any problem, rather than just one specific problem. Artificial superintelligence (ASI): a machine with a vastly superior intelligence to the average human being. Artificial consciousness: a machine that has consciousness, sentience and mind (John Searle uses "strong AI" in this sense). Narrow AI can be classified as being "limited to a single, narrowly defined task. Most modern AI systems would be classified in this category." Artificial general intelligence is conversely the opposite. == Applications and risks == Some examples of narrow AI are AlphaGo, self-driving cars, robot systems used in the medical field, and diagnostic doctors. Narrow AI systems are sometimes dangerous if unreliable. And the behavior that it follows can become inconsistent. It could be difficult for the AI to grasp complex patterns and get to a solution that works reliably in various environments. This "brittleness" can cause it to fail in unpredictable ways. Narrow AI failures can sometimes have significant consequences. It could for example cause disruptions in the electric grid, damage nuclear power plants, cause global economic problems, and misdirect autonomous vehicles. Medicines could be incorrectly sorted and distributed. Also, medical diagnoses can ultimately have serious and sometimes deadly consequences if the AI is faulty or biased. Simple AI programs have already worked their way into society, oftentimes unnoticed by the public. Autocorrection for typing, speech recognition for speech-to-text programs, and vast expansions in the data science fields are examples. Narrow AI has also been the subject of some controversy, including resulting in unfair prison sentences, discrimination against women in the workplace for hiring, resulting in death via autonomous driving, among other cases. Despite being "narrow" AI, recommender systems are efficient at predicting user reactions based on their posts, patterns, or trends. For instance, TikTok's "For You" algorithm can determine a user's interests or preferences in less than an hour. Some other social media AI systems are used to detect bots that may be involved in propaganda or other potentially malicious activities. == Weak AI versus strong AI == John Searle contests the possibility of strong AI (by which he means conscious AI). He further believes that the Turing test (created by Alan Turing and originally called the "imitation game", used to assess whether a machine can converse indistinguishably from a human) is not accurate or appropriate for testing whether an AI is "strong". Scholars such as Antonio Lieto have argued that the current research on both AI and cognitive modelling are perfectly aligned with the weak-AI hypothesis (that should not be confused with the "general" vs "narrow" AI distinction) and that the popular assumption that cognitively inspired AI systems espouse the strong AI hypothesis is ill-posed and problematic since "artificial models of brain and mind can be used to understand mental phenomena without pretending that that they are the real phenomena that they are modelling" (as, on the other hand, implied by the strong AI assumption).

    Read more →