AI Apply Reddit

AI Apply Reddit — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • CEITON

    CEITON

    CEITON is a web-based software system for facilitating and automating business processes such as planning, scheduling, and payroll using workflow technologies. The system is used by several media companies such as MDR, Yle, RAI and Red Bull Media House. In December 2018, the first CEITON User Group Meeting took place in Leipzig, Germany. == Architecture == The software runs on a server (on premises) or in the cloud and is scalable on parallel servers. Data security is warranted by role-based access control (RBAC). The software is used via web-browsers and not dependent on particular system software. == Structure and Features == CEITON combines the two classical approaches of production planning and control and workflow management. === Project Management === The scheduling system plans, manages, bills, and analyzes projects or tasks. It manages human and technical resources, material, and locations on a single GUI. The system uses a gantt chart to assign tasks to be done to available and eligible resources (i.e. staff), automatically or by drag-and-drop. The scheduling module includes material management, resource management/ human resource management, integration of freelancers, clients and suppliers, long-term budget planning, time-tracking, shift scheduling, quality management, delivery and logistics, document management, archive, analysis and controlling, business reporting, as well as all accounting and documentation processes. === Workflow === The workflow management system module coordinates business processes. Processes are defined once as a workflow and then repeatedly executed. Human resources are automatically assigned to steps (tasks) and integrated in workflow forms. Systems are integrated with an EAI/SOAP module, allowing data exchange with arbitrary external systems which are also involved in the business process. It also features a 3-D workflow overview in which the status of each project step can be determined by its color in the overview. === Process Management === For project and order processing management, business processes are designed as workflows, and coordinate communication automatically. Different user interfaces for staff, customers or suppliers can be created so each gets only relevant information. Different workflow forms are associated with different log-ins. The main application for the system is knowledge-based business processes, in which many people are involved and virtual results are produced, e.g. in research, or development of media products, such as TV and movies. Broadcasters and media companies such as MDR and Yle use CEITON to control their production processes for products and services and coordinate complex workflows with all kinds of resources. === Integrations === An integrated EAI module allows CEITON to integrate every external system in any business process without programming, using SOAP and similar technologies. Aspera and FileCatalyst were integrated for faster data transfer, yet complex ERP systems and numerous SAP modules have also been integrated, for example, to extract working times to payroll. === Mobile Working === Since Version 7, released in 2015, CEITON includes a time-tracking module allowing employees to enter their times from mobile devices such as tablets running Android, iPhones etc. == History == Ceiton Technologies (SME tech firm), the company developing CEITON, was founded in Leipzig, Germany in 2000, staffing solutions for the Bureau of Internal Revenue in Manila, Philippines, were implemented in 2000 together with the Deutsche Gesellschaft für Technische Zusammenarbeit of the German government. The first version (1.0) of the software was released in July 2001. The product was originally developed for German broadcasting companies. CEITON is named after the Japanese concept Seiton, one of the principles of Japanese workplace design methodology known as 5S. Since version 7, released in 2015, CEITON includes a time-tracking module allowing employees to enter their times from mobile devices such as tablets running Android, iPhones etc. In May 2005 CEITON won the IQ innovation award, sponsored by Siemens, in the category Excellent innovation in the IT-sector. Since 2007, CEITON has been present at the broadcast trade fairs NAB in Las Vegas and IBC in Amsterdam. In 2020, the company celebrated its 20th anniversary.

    Read more →
  • Explanation-based learning

    Explanation-based learning

    Explanation-based learning (EBL) is a form of machine learning that exploits a very strong, or even perfect, domain theory (i.e. a formal theory of an application domain akin to a domain model in ontology engineering, not to be confused with Scott's domain theory) in order to make generalizations or form concepts from training examples. It is also linked with Encoding (memory) to help with Learning. == Details == An example of EBL using a perfect domain theory is a program that learns to play chess through example. A specific chess position that contains an important feature such as "Forced loss of black queen in two moves" includes many irrelevant features, such as the specific scattering of pawns on the board. EBL can take a single training example and determine what are the relevant features in order to form a generalization. A domain theory is perfect or complete if it contains, in principle, all information needed to decide any question about the domain. For example, the domain theory for chess is simply the rules of chess. Knowing the rules, in principle, it is possible to deduce the best move in any situation. However, actually making such a deduction is impossible in practice due to combinatoric explosion. EBL uses training examples to make searching for deductive consequences of a domain theory efficient in practice. In essence, an EBL system works by finding a way to deduce each training example from the system's existing database of domain theory. Having a short proof of the training example extends the domain-theory database, enabling the EBL system to find and classify future examples that are similar to the training example very quickly. The main drawback of the method—the cost of applying the learned proof macros, as these become numerous—was analyzed by Minton. === Basic formulation === EBL software takes four inputs: a hypothesis space (the set of all possible conclusions) a domain theory (axioms about a domain of interest) training examples (specific facts that rule out some possible hypothesis) operationality criteria (criteria for determining which features in the domain are efficiently recognizable, e.g. which features are directly detectable using sensors) == Application == An especially good application domain for an EBL is natural language processing (NLP). Here a rich domain theory, i.e., a natural language grammar—although neither perfect nor complete, is tuned to a particular application or particular language usage, using a treebank (training examples). Rayner pioneered this work. The first successful industrial application was to a commercial NL interface to relational databases. The method has been successfully applied to several large-scale natural language parsing systems, where the utility problem was solved by omitting the original grammar (domain theory) and using specialized LR-parsing techniques, resulting in huge speed-ups, at a cost in coverage, but with a gain in disambiguation. EBL-like techniques have also been applied to surface generation, the converse of parsing. When applying EBL to NLP, the operationality criteria can be hand-crafted, or can be inferred from the treebank using either the entropy of its or-nodes or a target coverage/disambiguation trade-off (= recall/precision trade-off = f-score). EBL can also be used to compile grammar-based language models for speech recognition, from general unification grammars. Note how the utility problem, first exposed by Minton, was solved by discarding the original grammar/domain theory, and that the quoted articles tend to contain the phrase grammar specialization—quite the opposite of the original term explanation-based generalization. Perhaps the best name for this technique would be data-driven search space reduction. Other people who worked on EBL for NLP include Guenther Neumann, Aravind Joshi, Srinivas Bangalore, and Khalil Sima'an.

    Read more →
  • Learnable function class

    Learnable function class

    In statistical learning theory, a learnable function class is a set of functions for which an algorithm can be devised to asymptotically minimize the expected risk, uniformly over all probability distributions. The concept of learnable classes are closely related to regularization in machine learning, and provides large sample justifications for certain learning algorithms. == Definition == === Background === Let Ω = X × Y = { ( x , y ) } {\displaystyle \Omega ={\mathcal {X}}\times {\mathcal {Y}}=\{(x,y)\}} be the sample space, where y {\displaystyle y} are the labels and x {\displaystyle x} are the covariates (predictors). F = { f : X ↦ Y } {\displaystyle {\mathcal {F}}=\{f:{\mathcal {X}}\mapsto {\mathcal {Y}}\}} is a collection of mappings (functions) under consideration to link x {\displaystyle x} to y {\displaystyle y} . L : Y × Y ↦ R {\displaystyle L:{\mathcal {Y}}\times {\mathcal {Y}}\mapsto \mathbb {R} } is a pre-given loss function (usually non-negative). Given a probability distribution P ( x , y ) {\displaystyle P(x,y)} on Ω {\displaystyle \Omega } , define the expected risk I P ( f ) {\displaystyle I_{P}(f)} to be: I P ( f ) = ∫ L ( f ( x ) , y ) d P ( x , y ) {\displaystyle I_{P}(f)=\int L(f(x),y)dP(x,y)} The general goal in statistical learning is to find the function in F {\displaystyle {\mathcal {F}}} that minimizes the expected risk. That is, to find solutions to the following problem: f ^ = arg ⁡ min f ∈ F I P ( f ) {\displaystyle {\hat {f}}=\arg \min _{f\in {\mathcal {F}}}I_{P}(f)} But in practice the distribution P {\displaystyle P} is unknown, and any learning task can only be based on finite samples. Thus we seek instead to find an algorithm that asymptotically minimizes the empirical risk, i.e., to find a sequence of functions { f ^ n } n = 1 ∞ {\displaystyle \{{\hat {f}}_{n}\}_{n=1}^{\infty }} that satisfies lim n → ∞ P ( I P ( f ^ n ) − inf f ∈ F I P ( f ) > ϵ ) = 0 {\displaystyle \lim _{n\rightarrow \infty }\mathbb {P} (I_{P}({\hat {f}}_{n})-\inf _{f\in {\mathcal {F}}}I_{P}(f)>\epsilon )=0} One usual algorithm to find such a sequence is through empirical risk minimization. === Learnable function class === We can make the condition given in the above equation stronger by requiring that the convergence is uniform for all probability distributions. That is: The intuition behind the more strict requirement is as such: the rate at which sequence { f ^ n } {\displaystyle \{{\hat {f}}_{n}\}} converges to the minimizer of the expected risk can be very different for different P ( x , y ) {\displaystyle P(x,y)} . Because in real world the true distribution P {\displaystyle P} is always unknown, we would want to select a sequence that performs well under all cases. However, by the no free lunch theorem, such a sequence that satisfies (1) does not exist if F {\displaystyle {\mathcal {F}}} is too complex. This means we need to be careful and not allow too "many" functions in F {\displaystyle {\mathcal {F}}} if we want (1) to be a meaningful requirement. Specifically, function classes that ensure the existence of a sequence { f ^ n } {\displaystyle \{{\hat {f}}_{n}\}} that satisfies (1) are known as learnable classes. It is worth noting that at least for supervised classification and regression problems, if a function class is learnable, then the empirical risk minimization automatically satisfies (1). Thus in these settings not only do we know that the problem posed by (1) is solvable, we also immediately have an algorithm that gives the solution. == Interpretations == If the true relationship between y {\displaystyle y} and x {\displaystyle x} is y ∼ f ∗ ( x ) {\displaystyle y\sim f^{}(x)} , then by selecting the appropriate loss function, f ∗ {\displaystyle f^{}} can always be expressed as the minimizer of the expected loss across all possible functions. That is, f ∗ = arg ⁡ min f ∈ F ∗ I P ( f ) {\displaystyle f^{}=\arg \min _{f\in {\mathcal {F}}^{}}I_{P}(f)} Here we let F ∗ {\displaystyle {\mathcal {F}}^{}} be the collection of all possible functions mapping X {\displaystyle {\mathcal {X}}} onto Y {\displaystyle {\mathcal {Y}}} . f ∗ {\displaystyle f^{}} can be interpreted as the actual data generating mechanism. However, the no free lunch theorem tells us that in practice, with finite samples we cannot hope to search for the expected risk minimizer over F ∗ {\displaystyle {\mathcal {F}}^{}} . Thus we often consider a subset of F ∗ {\displaystyle {\mathcal {F}}^{}} , F {\displaystyle {\mathcal {F}}} , to carry out searches on. By doing so, we risk that f ∗ {\displaystyle f^{}} might not be an element of F {\displaystyle {\mathcal {F}}} . This tradeoff can be mathematically expressed as In the above decomposition, part ( b ) {\displaystyle (b)} does not depend on the data and is non-stochastic. It describes how far away our assumptions ( F {\displaystyle {\mathcal {F}}} ) are from the truth ( F ∗ {\displaystyle {\mathcal {F}}^{}} ). ( b ) {\displaystyle (b)} will be strictly greater than 0 if we make assumptions that are too strong ( F {\displaystyle {\mathcal {F}}} too small). On the other hand, failing to put enough restrictions on F {\displaystyle {\mathcal {F}}} will cause it to be not learnable, and part ( a ) {\displaystyle (a)} will not stochastically converge to 0. This is the well-known overfitting problem in statistics and machine learning literature. == Example: Tikhonov regularization == A good example where learnable classes are used is the so-called Tikhonov regularization in reproducing kernel Hilbert space (RKHS). Specifically, let F ∗ {\displaystyle {\mathcal {F^{}}}} be an RKHS, and | | ⋅ | | 2 {\displaystyle ||\cdot ||_{2}} be the norm on F ∗ {\displaystyle {\mathcal {F^{}}}} given by its inner product. It is shown in that F = { f : | | f | | 2 ≤ γ } {\displaystyle {\mathcal {F}}=\{f:||f||_{2}\leq \gamma \}} is a learnable class for any finite, positive γ {\displaystyle \gamma } . The empirical minimization algorithm to the dual form of this problem is arg ⁡ min f ∈ F ∗ { ∑ i = 1 n L ( f ( x i ) , y i ) + λ | | f | | 2 } {\displaystyle \arg \min _{f\in {\mathcal {F}}^{}}\left\{\sum _{i=1}^{n}L(f(x_{i}),y_{i})+\lambda ||f||_{2}\right\}} This was first introduced by Tikhonov to solve ill-posed problems. Many statistical learning algorithms can be expressed in such a form (for example, the well-known ridge regression). The tradeoff between ( a ) {\displaystyle (a)} and ( b ) {\displaystyle (b)} in (2) is geometrically more intuitive with Tikhonov regularization in RKHS. We can consider a sequence of { F γ } {\displaystyle \{{\mathcal {F}}_{\gamma }\}} , which are essentially balls in F ∗ {\displaystyle {\mathcal {F^{}}}} with centers at 0. As γ {\displaystyle \gamma } gets larger, F γ {\displaystyle {\mathcal {F}}_{\gamma }} gets closer to the entire space, and ( b ) {\displaystyle (b)} is likely to become smaller. However we will also suffer smaller convergence rates in ( a ) {\displaystyle (a)} . The way to choose an optimal γ {\displaystyle \gamma } in finite sample settings is usually through cross-validation. == Relationship to empirical process theory == Part ( a ) {\displaystyle (a)} in (2) is closely linked to empirical process theory in statistics, where the empirical risk { ∑ i = 1 n L ( y i , f ( x i ) ) , f ∈ F } {\displaystyle \{\sum _{i=1}^{n}L(y_{i},f(x_{i})),f\in {\mathcal {F}}\}} are known as empirical processes. In this field, the function class F {\displaystyle {\mathcal {F}}} that satisfies the stochastic convergence are known as uniform Glivenko–Cantelli classes. It has been shown that under certain regularity conditions, learnable classes and uniformly Glivenko-Cantelli classes are equivalent. Interplay between ( a ) {\displaystyle (a)} and ( b ) {\displaystyle (b)} in statistics literature is often known as the bias-variance tradeoff. However, note that in the authors gave an example of stochastic convex optimization for General Setting of Learning where learnability is not equivalent with uniform convergence.

    Read more →
  • Socially assistive robot

    Socially assistive robot

    A socially assistive robot (SAR) aids users through social engagement and support rather than through physical tasks and interactions. == Background == The field of socially assistive robotics emerged in the early 2000s, following the emergence of the field of social robots. In contrast to social robots, SARs aid users with specific goals related to behavior change rather than serving as purely social entities. The term "Socially assistive robot" was initially defined by Maja Matarić and David Feil-Seifer in 2005. Since its inception, the field has gained substantial recognition, featuring numerous research projects, a wealth of global research publications, startup companies, and a growing array of products on the consumer market. The COVID-19 pandemic has underscored the immense potential of socially assistive robots, particularly in addressing the needs of large user populations, including children engaged in remote learning, elderly individuals grappling with loneliness, and those affected by social isolation and its associated negative consequences. == Characteristics of interaction == SARs rely on artificial intelligence (AI) to generate real-time, responsive, natural, and meaningful robot behaviors during interactions with humans. The robots employ various forms of communication, such as facial expressions, gestures, body movements, and speech. In contrast to robots intended for physical tasks, SARs are designed to support and motivate users to perform their own tasks. The tasks a user engages in can be physical (e.g., rehabilitation exercises for post-stroke users), cognitive (e.g., dementia screening for elderly users), or social (e.g., turn-taking for users with autism spectrum disorders). This complex interaction involves detecting and interpreting the user's movement, behavior, intent, goals, speech, and preferences. Machine learning and robot learning techniques are frequently employed to enhance the robot's understanding of the user, predict user preferences, and provide effective assistance. The effectiveness of socially assistive robots is assessed based on objective measurements of user performance and improvement resulting from the robot’s assistance and support. Unlike other branches of robotics, where effectiveness depends on the robot's physical task completion, SAR measures the success of the robot based on the user's progress and achievements. This evaluation is carried out using quantitative objective metrics, such as time spent on tasks, accuracy, retention, and verbalization, as well as quantitative subjective metrics, such as user survey tools. SAR is based on the large body of evidence showing that users tend to respond more positively to interactions with physical robots compared to interactions with screens. Interaction with physical robots also encourages users to learn and retain more information than screen-based interactions. This fundamental insight underlines why physical robots in SAR applications are more effective, as opposed to interactions solely involving screens, tablets, or computers. == Uses and applications == SARs have been developed and validated in a wide array of applications, including healthcare, elder care, education, and training. For example, SARs have been developed to support children on the autism spectrum in acquiring and practicing social and cognitive skills, to motivate and coach stroke patients throughout their rehabilitation exercises, monitoring individuals health (ex. fall detection), and to encourage elderly users to be more physically and socially active. There is a concern that technophobia and lack of trust in robots will pose a barrier to the effectiveness of SARs in older adults.

    Read more →
  • Micro stuttering

    Micro stuttering

    Micro stuttering is a visual artifact in real-time computer graphics in which the time intervals between consecutively displayed frames are uneven, even though the average frame rate reported by benchmarking software appears adequate. Tools such as 3DMark typically compute frame rates over intervals of one second or more, which can conceal momentary drops in the instantaneous frame rate that the viewer perceives as hitching or jerking of on-screen motion. At low frame rates the effect is visible as a stutter in moving images, degrading the experience in interactive applications such as video games. In severe cases a lower but more consistent frame rate can appear smoother than a higher but more erratic one. The term gained prominence in the late 2000s in discussions of multi-GPU rendering (see History), but micro stuttering also affects single-GPU systems. Common causes on modern hardware include real-time shader compilation, asset streaming from storage, VRAM exhaustion, and driver bugs. == Causes == === Shader compilation === A common cause of micro stuttering on modern PCs is real-time shader compilation. Shaders are small programs that instruct the GPU on how to render visual effects such as lighting, shadows, and reflections. On consoles, developers can pre-compile all shaders for the known, fixed hardware. On PCs, the variety of GPU architectures means shaders must often be compiled at run time, either when the game launches or during gameplay itself. When the rendering engine encounters a shader that has not yet been compiled, the CPU must finish the compilation before the GPU can draw the affected object. This causes a spike in frame time that the player perceives as a hitch. The problem has been particularly associated with games built on Unreal Engine 4 running under DirectX 12, because DX12 shifts more shader management responsibility to the application. Several techniques exist to reduce shader compilation stutter. Pipeline State Object (PSO) pre-caching records the shader permutations used at runtime so that they can be compiled in advance on subsequent launches. Asynchronous shader compilation moves the work to background CPU threads to avoid blocking the main rendering thread. Platform-level services such as Steam's shader pre-caching distribute previously compiled shaders to users with matching GPU hardware. The Steam Deck, which contains a single fixed GPU, benefits from pre-compiled shader caches because all units share the same hardware configuration. === Other causes === Micro stuttering on single-GPU systems can have several additional causes. CPU bottlenecks or scheduling interruptions from background tasks can prevent the processor from preparing frames at regular intervals. Asset streaming during gameplay (loading textures, geometry, or audio from storage) can produce hitches sometimes called traversal stutter; the use of solid-state drives and technologies such as DirectStorage has reduced but not eliminated this. VRAM exhaustion forces data to be swapped between video memory and system memory over the PCI Express bus, which is slower. Graphics driver bugs can also introduce stutter; Nvidia released hotfix driver 551.46 in February 2024 to correct intermittent micro stuttering when V-Sync was enabled. == Measurement == Micro stuttering drew attention to the limitations of average frame rate as a performance metric. In 2013, Scott Wasson at The Tech Report published a series of articles advocating frame time analysis, in which the delivery time of every individual frame is recorded and plotted rather than collapsed into a single frames-per-second figure. This approach was adopted by other hardware review publications in the following years. GPU reviews now routinely report 1% low and 0.1% low frame rates alongside the average. The 1% low is the average frame rate of the slowest 1% of frames in a sample; it serves as an indicator of worst-case smoothness. A large gap between the average and the 1% low suggests poor frame pacing. Tools for capturing per-frame timing data include FRAPS, PresentMon, OCAT, CapFrameX, and MSI Afterburner with RivaTuner Statistics Server. == Mitigation == === Frame pacing === Frame pacing is a software technique that regulates the timing of frame delivery to produce even intervals between displayed frames. Game engines, GPU drivers, and platform libraries all implement frame pacing strategies to varying degrees. On mobile platforms, Google provides the Android Frame Pacing library (Swappy) as part of the Android Game Development Kit. In December 2025, the Khronos Group published the VK_EXT_present_timing Vulkan extension, giving developers explicit control over presentation timing in a cross-platform graphics API for the first time. === Variable refresh rate === Variable refresh rate (VRR) display technologies allow a monitor's refresh rate to change to match the GPU's frame output. Implementations include Nvidia G-Sync (2013), AMD FreeSync (2015), and the VESA Adaptive-Sync standard built into DisplayPort 1.2a and later. VRR eliminates the screen tearing that results from a mismatch between frame rate and refresh rate, and avoids the frame-holding behaviour of V-Sync that can itself cause stutter. It is effective at smoothing moderate frame rate fluctuations but cannot compensate for large sudden spikes in frame time such as those caused by shader compilation or heavy asset streaming. VRR support has become standard in gaming monitors, televisions (via HDMI 2.1), and the Xbox Series X/S and PlayStation 5 consoles. === Frame generation === Beginning with DLSS 3 on the GeForce RTX 40 series in 2022, Nvidia introduced AI-based frame generation, which uses dedicated optical flow hardware and a neural network to create new frames between traditionally rendered ones. AMD followed with FSR 3 in 2023, using an algorithmic approach, and the AI-based FSR 4 for the Radeon RX 9000 series in 2025. DLSS 4, released in January 2025 for the GeForce RTX 50 series, can generate up to three frames per rendered frame using a technique called Multi Frame Generation. Frame generation increases the displayed frame rate but introduces its own frame pacing concerns. If the underlying rendered frames are unevenly timed, the interpolated frames can make the unevenness more apparent rather than less. DLSS 4 addresses this with hardware-level flip metering on the GPU's display engine, which controls the timing of frame presentation more precisely than the CPU-based pacing used in DLSS 3. Both vendors pair frame generation with latency-reduction features (Nvidia Reflex and AMD Anti-Lag+) to offset the additional input latency that results from inserting synthetic frames into the pipeline. === Frame rate limiters === Capping the frame rate below the display's maximum refresh rate, using tools such as RivaTuner Statistics Server, in-game limiters, or driver-level settings, is a common way to improve frame pacing. Preventing the GPU from running ahead of the display reduces variability in frame delivery times and can produce a smoother result than an uncapped but more irregular frame rate. == History == === Multi-GPU configurations === Micro stuttering was first widely documented in the late 2000s as a side effect of multi-GPU configurations using Alternate Frame Rendering (AFR), in which consecutive frames are assigned to alternating GPUs. Because each GPU may take a different amount of time to complete its assigned frame — due to varying scene complexity, driver scheduling, or inter-GPU communication overhead — the resulting frame delivery is irregular even when the average frame rate is high. Both Nvidia SLI and AMD CrossFireX were affected, with dual-GPU setups exhibiting the worst frame pacing irregularities. In 2012 benchmarks using Battlefield 3, dual Radeon HD 7970 cards in CrossFire showed 85% variation in frame delivery times compared with 7% for a single card, while dual GeForce GTX 680 cards in SLI showed only 7% variation compared with 5% for a single card. Multi-GPU micro stuttering became a significant factor in the eventual decline and discontinuation of consumer multi-GPU gaming. Nvidia restricted SLI to a handful of enthusiast-class cards from the GeForce 10 series onward, then replaced it with NVLink on the GeForce RTX 20 series, which saw limited gaming adoption. AMD ceased active CrossFire development around 2017. By the mid-2020s, neither vendor's current consumer GPUs support multi-GPU rendering for games. Other factors that contributed to the decline include DirectX 12 placing multi-GPU support in the hands of game developers rather than driver authors, the incompatibility of temporal anti-aliasing and other temporal rendering techniques with AFR, and the increasing size, power draw, and cost of individual GPUs. The third-party utility RadeonPro could reduce CrossFire micro stuttering through dynamic V-Sync and frame pacing adjustments, and AMD later introduced a driver-level frame paci

    Read more →
  • Contrastive Language-Image Pre-training

    Contrastive Language-Image Pre-training

    Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. This method has enabled broad applications across multiple domains, including cross-modal retrieval, text-to-image generation, and aesthetic ranking. == Algorithm == The CLIP method trains a pair of models contrastively. One model takes in a piece of text as input and outputs a single vector representing its semantic content. The other model takes in an image and similarly outputs a single vector representing its visual content. The models are trained so that the vectors corresponding to semantically similar text-image pairs are close together in the shared vector space, while those corresponding to dissimilar pairs are far apart. To train a pair of CLIP models, one would start by preparing a large dataset of image-caption pairs. During training, the models are presented with batches of N {\displaystyle N} image-caption pairs. Let the outputs from the text and image models be respectively v 1 , . . . , v N , w 1 , . . . , w N {\displaystyle v_{1},...,v_{N},w_{1},...,w_{N}} . Two vectors are considered "similar" if their dot product is large. The loss incurred on this batch is the multi-class N-pair loss, which is a symmetric cross-entropy loss over similarity scores: − 1 N ∑ i ln ⁡ e v i ⋅ w i / T ∑ j e v i ⋅ w j / T − 1 N ∑ j ln ⁡ e v j ⋅ w j / T ∑ i e v i ⋅ w j / T {\displaystyle -{\frac {1}{N}}\sum _{i}\ln {\frac {e^{v_{i}\cdot w_{i}/T}}{\sum _{j}e^{v_{i}\cdot w_{j}/T}}}-{\frac {1}{N}}\sum _{j}\ln {\frac {e^{v_{j}\cdot w_{j}/T}}{\sum _{i}e^{v_{i}\cdot w_{j}/T}}}} In essence, this loss function encourages the dot product between matching image and text vectors ( v i ⋅ w i {\displaystyle v_{i}\cdot w_{i}} ) to be high, while discouraging high dot products between non-matching pairs. The parameter T > 0 {\displaystyle T>0} is the temperature, which is parameterized in the original CLIP model as T = e − τ {\displaystyle T=e^{-\tau }} where τ ∈ R {\displaystyle \tau \in \mathbb {R} } is a learned parameter. Other loss functions are possible. For example, Sigmoid CLIP (SigLIP) proposes the following loss function: L = 1 N ∑ i , j ∈ 1 : N f ( ( 2 δ i , j − 1 ) ( e τ w i ⋅ v j + b ) ) {\displaystyle L={\frac {1}{N}}\sum _{i,j\in 1:N}f((2\delta _{i,j}-1)(e^{\tau }w_{i}\cdot v_{j}+b))} where f ( x ) = ln ⁡ ( 1 + e − x ) {\displaystyle f(x)=\ln(1+e^{-x})} is the negative log sigmoid loss, and the Dirac delta symbol δ i , j {\displaystyle \delta _{i,j}} is 1 if i = j {\displaystyle i=j} else 0. == CLIP models == While the original model was developed by OpenAI, subsequent models have been trained by other organizations as well. === Image model === The image encoding models used in CLIP are typically vision transformers (ViT). The naming convention for these models often reflects the specific ViT architecture used. For instance, "ViT-L/14" means a "vision transformer large" (compared to other models in the same series) with a patch size of 14, meaning that the image is divided into 14-by-14 pixel patches before being processed by the transformer. The size indicator ranges from B, L, H, G (base, large, huge, giant), in that order. Other than ViT, the image model is typically a convolutional neural network, such as ResNet (in the original series by OpenAI), or ConvNeXt (in the OpenCLIP model series by LAION). Since the output vectors of the image model and the text model must have exactly the same length, both the image model and the text model have fixed-length vector outputs, which in the original report is called "embedding dimension". For example, in the original OpenAI model, the ResNet models have embedding dimensions ranging from 512 to 1024, and for the ViTs, from 512 to 768. Its implementation of ViT was the same as the original one, with one modification: after position embeddings are added to the initial patch embeddings, there is a LayerNorm. Its implementation of ResNet was the same as the original one, with 3 modifications: In the start of the CNN (the "stem"), they used three stacked 3x3 convolutions instead of a single 7x7 convolution, as suggested by. There is an average pooling of stride 2 at the start of each downsampling convolutional layer (they called it rect-2 blur pooling according to the terminology of ). This has the effect of blurring images before downsampling, for antialiasing. The final convolutional layer is followed by a multiheaded attention pooling. ALIGN a model with similar capabilities, trained by researchers from Google used EfficientNet, a kind of convolutional neural network. === Text model === The text encoding models used in CLIP are typically Transformers. In the original OpenAI report, they reported using a Transformer (63M-parameter, 12-layer, 512-wide, 8 attention heads) with lower-cased byte pair encoding (BPE) with 49152 vocabulary size. Context length was capped at 76 for efficiency. Like GPT, it was decoder-only, with only causally-masked self-attention. Its architecture is the same as GPT-2. Like BERT, the text sequence is bracketed by two special tokens [SOS] and [EOS] ("start of sequence" and "end of sequence"). Take the activations of the highest layer of the transformer on the [EOS], apply LayerNorm, then a final linear map. This is the text encoding of the input sequence. The final linear map has output dimension equal to the embedding dimension of whatever image encoder it is paired with. These models all had context length 77 and vocabulary size 49408. ALIGN used BERT of various sizes. == Dataset == === WebImageText === The CLIP models released by OpenAI were trained on a dataset called "WebImageText" (WIT) containing 400 million pairs of images and their corresponding captions scraped from the internet. The total number of words in this dataset is similar in scale to the WebText dataset used for training GPT-2, which contains about 40 gigabytes of text data. The dataset contains 500,000 text-queries, with up to 20,000 (image, text) pairs per query. The text-queries were generated by starting with all words occurring at least 100 times in English Wikipedia, then extended by bigrams with high mutual information, names of all Wikipedia articles above a certain search volume, and WordNet synsets. The dataset is private and has not been released to the public, and there is no further information on it. ==== Data preprocessing ==== For the CLIP image models, the input images are preprocessed by first dividing each of the R, G, B values of an image by the maximum possible value, so that these values fall between 0 and 1, then subtracting by [0.48145466, 0.4578275, 0.40821073], and dividing by [0.26862954, 0.26130258, 0.27577711]. The rationale was that these are the mean and standard deviations of the images in the WebImageText dataset, so this preprocessing step roughly whitens the image tensor. These numbers slightly differ from the standard preprocessing for ImageNet, which uses [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225]. If the input image does not have the same resolution as the native resolution (224×224 for all except ViT-L/14@336px, which has 336×336 resolution), then the input image is first scaled by bicubic interpolation, so that its shorter side is the same as the native resolution, then the central square of the image is cropped out. === Others === ALIGN used over one billion image-text pairs, obtained by extracting images and their alt-tags from online crawling. The method was described as similar to how the Conceptual Captions dataset was constructed, but instead of complex filtering, they only applied a frequency-based filtering. Later models trained by other organizations had published datasets. For example, LAION trained OpenCLIP with published datasets LAION-400M, LAION-2B, and DataComp-1B. == Training == In the original OpenAI CLIP report, they reported training 5 ResNet and 3 ViT (ViT-B/32, ViT-B/16, ViT-L/14). Each was trained for 32 epochs. The largest ResNet model took 18 days to train on 592 V100 GPUs. The largest ViT model took 12 days on 256 V100 GPUs. All ViT models were trained on 224×224 image resolution. The ViT-L/14 was then boosted to 336×336 resolution by FixRes, resulting in a model. They found this was the best-performing model. In the OpenCLIP series, the ViT-L/14 model was trained on 384 A100 GPUs on the LAION-2B dataset, for 160 epochs for a total of 32B samples seen. == Applications == === Cross-modal retrieval === CLIP's cross-modal retrieval enables the alignment of visual and textual data in a shared latent space, allowing users to retrieve images based on text descriptions and vice versa, without the need for explicit image annotations. In text-to-image retrieval, users input descriptive text, and CLIP retrieves images with matching embeddings. In image-to-text retrieval, images are used to find related text content. CLIP’s ability to connect vis

    Read more →
  • Knowledge integration

    Knowledge integration

    Knowledge integration is the process of synthesizing multiple knowledge models (or representations) into a common model (representation). Compared to information integration, which involves merging information having different schemas and representation models, knowledge integration focuses more on synthesizing the understanding of a given subject from different perspectives. For example, multiple interpretations are possible of a set of student grades, typically each from a certain perspective. An overall, integrated view and understanding of this information can be achieved if these interpretations can be put under a common model, say, a student performance index. The Web-based Inquiry Science Environment (WISE), from the University of California at Berkeley has been developed along the lines of knowledge integration theory. Knowledge integration has also been studied as the process of incorporating new information into a body of existing knowledge with an interdisciplinary approach. This process involves determining how the new information and the existing knowledge interact, how existing knowledge should be modified to accommodate the new information, and how the new information should be modified in light of the existing knowledge. A learning agent that actively investigates the consequences of new information can detect and exploit a variety of learning opportunities; e.g., to resolve knowledge conflicts and to fill knowledge gaps. By exploiting these learning opportunities the learning agent is able to learn beyond the explicit content of the new information. The machine learning program KI, developed by Murray and Porter at the University of Texas at Austin, was created to study the use of automated and semi-automated knowledge integration to assist knowledge engineers constructing a large knowledge base. A possible technique which can be used is semantic matching. More recently, a technique useful to minimize the effort in mapping validation and visualization has been presented which is based on Minimal Mappings. Minimal mappings are high quality mappings such that i) all the other mappings can be computed from them in time linear in the size of the input graphs, and ii) none of them can be dropped without losing property i). The University of Waterloo operates a Bachelor of Knowledge Integration undergraduate degree program as an academic major or minor. The program started in 2008.

    Read more →
  • Belief–desire–intention model

    Belief–desire–intention model

    For popular psychology, the belief–desire–intention (BDI) model of human practical reasoning was developed by Michael Bratman as a way of explaining future-directed intention. BDI is fundamentally reliant on folk psychology (the 'theory theory'), which is the notion that our mental models of the world are theories. It was used as a basis for developing the belief–desire–intention software model. == Applications == BDI was part of the inspiration behind the BDI software architecture, which Bratman was also involved in developing. Here, the notion of intention was seen as a way of limiting time spent on deliberating about what to do, by eliminating choices inconsistent with current intentions. BDI has also aroused some interest in psychology. BDI formed the basis for a computational model of childlike reasoning CRIBB.

    Read more →
  • Multimodal representation learning

    Multimodal representation learning

    Multimodal representation learning is a subfield of representation learning focused on integrating and interpreting information from different modalities, such as text, images, audio, or video, by projecting them into a shared latent space. This allows for semantically similar content across modalities to be mapped to nearby points within that space, facilitating a unified understanding of diverse data types. By automatically learning meaningful features from each modality and capturing their inter-modal relationships, multimodal representation learning enables a unified representation that enhances performance in cross-media analysis tasks such as video classification, event detection, and sentiment analysis. It also supports cross-modal retrieval and translation, including image captioning, video description, and text-to-image synthesis. == Motivation == The primary motivations for multimodal representation learning arise from the inherent nature of real-world data and the limitations of unimodal approaches. Since multimodal data offers complementary and supplementary information about an object or event from different perspectives, it is more informative than relying on a single modality. A key motivation is to narrow the heterogeneity gap that exists between different modalities by projecting their features into a shared semantic subspace. This allows semantically similar content across modalities to be represented by similar vectors, facilitating the understanding of relationships and correlations between them. Multimodal representation learning aims to leverage the unique information provided by each modality to achieve a more comprehensive and accurate understanding of concepts. These unified representations are crucial for improving performance in various cross-media analysis tasks such as video classification, event detection, and sentiment analysis. They also enable cross-modal retrieval, allowing users to search and retrieve content across different modalities. Additionally, it facilitates cross-modal translation, where information can be converted from one modality to another, as seen in applications like image captioning and text-to-image synthesis. The abundance of ubiquitous multimodal data in real-world applications, including understudied areas like healthcare, finance, and human-computer interaction (HCI), further motivates the development of effective multimodal representation learning techniques. == Approaches and methods == === Canonical-correlation analysis based methods === Canonical-correlation analysis (CCA) was first introduced in 1936 by Harold Hotelling and is a fundamental approach for multimodal learning. CCA aims to find linear relationships between two sets of variables. Given two data matrices X ∈ R n × p {\displaystyle X\in \mathbb {R} ^{n\times p}} and Y ∈ R n × q {\displaystyle Y\in \mathbb {R} ^{n\times q}} representing different modalities, CCA finds projection vectors w x ∈ R p {\displaystyle w_{x}\in \mathbb {R} ^{p}} and w y ∈ R q {\displaystyle w_{y}\in \mathbb {R} ^{q}} that maximizes the correlation between the projected variables: ρ = max w x , w y w x ⊤ Σ x y w y w x ⊤ Σ x x w x w y ⊤ Σ y y w y {\displaystyle \rho =\max _{w_{x},w_{y}}{\frac {w_{x}^{\top }\Sigma _{xy}w_{y}}{{\sqrt {w_{x}^{\top }\Sigma _{xx}w_{x}}}{\sqrt {w_{y}^{\top }\Sigma _{yy}w_{y}}}}}} such that Σ x x {\displaystyle \Sigma _{xx}} and Σ y y {\displaystyle \Sigma _{yy}} are the within-modality covariance matrices, and Σ x y {\displaystyle \Sigma _{xy}} is the between-modality covariance matrix. However, standard CCA is limited by its linearity, which led to the development of nonlinear extensions, such as kernel CCA and deep CCA. ==== Kernel CCA ==== Kernel canonical correlation analysis (KCCA) extends traditional CCA to capture nonlinear relationships between modalities by implicitly mapping the data into high dimensional feature spaces using kernel functions. Given kernel functions K x {\displaystyle K_{x}} and K y {\displaystyle K_{y}} with corresponding Gram matrices K x ∈ R n × n {\displaystyle K_{x}\in \mathbb {R} ^{n\times n}} and K y ∈ R n × n {\displaystyle K_{y}\in \mathbb {R} ^{n\times n}} , KCCA seeks coefficients α {\displaystyle \alpha } and β {\displaystyle \beta } that maximize: ρ = max α , β α ⊤ K x K y β α ⊤ K x 2 α β ⊤ K y 2 β {\displaystyle \rho =\max _{\alpha ,\beta }{\frac {\alpha ^{\top }K_{x}Ky\beta }{{\sqrt {\alpha ^{\top }K_{x}^{2}\alpha }}{\sqrt {\beta ^{\top }K_{y}^{2}\beta }}}}} To prevent overfitting, regularization terms are typically added, resulting in: ρ = max α , β α T K x K y β α T ( K x 2 + λ x K x ) α β T ( K y 2 + λ y K y ) β {\displaystyle \rho =\max _{\alpha ,\beta }{\frac {\alpha ^{T}K_{x}K_{y}\beta }{{\sqrt {\alpha ^{T}\left(K_{x}^{2}+\lambda _{x}K_{x}\right)\alpha }}{\sqrt {\;\beta ^{T}\left(K_{y}^{2}+\lambda _{y}K_{y}\right)\beta }}}}} where λ x {\displaystyle \lambda _{x}} and λ y {\displaystyle \lambda _{y}} are regularization parameters. KCCA has proven effective for tasks such as cross-modal retrieval and semantic analysis, though it faces computational challenges with large datasets due to its O ( n 2 ) {\displaystyle O(n^{2})} memory requirement for sorting kernel matrices. KCCA was proposed independently by several researchers. ==== Deep CCA ==== Deep canonical correlation analysis (DCCA), introduced in 2013, employs neural networks to learn nonlinear transformations for maximizing the correlation between modalities. DCCA uses separate neural networks f x {\displaystyle f_{x}} and f y {\displaystyle f_{y}} for each modality to transform the original data before applying CCA: max W x , W y , θ x , θ y corr ⁡ ( f x ( X ; θ x ) , f y ( Y ; θ y ) ) {\displaystyle \max _{W_{x},W_{y},\theta _{x},\theta _{y}}\operatorname {corr} \left(f_{x}(X;\theta _{x}),f_{y}(Y;\theta _{y})\right)} where θ x {\displaystyle \theta _{x}} and θ y {\displaystyle \theta _{y}} represent the parameters of the neural networks, and W x {\displaystyle W_{x}} and W y {\displaystyle W_{y}} are the CCA projection matrices. The correlation objective is computed as: corr ⁡ ( H x , H y ) = tr ⁡ ( T − 1 / 2 H x T H y S − 1 / 2 ) {\displaystyle \operatorname {corr} (H_{x},H_{y})=\operatorname {tr} \left(T^{-1/2}H_{x}^{T}H_{y}S^{-1/2}\right)} where H x = f x ( X ) {\displaystyle H_{x}=f_{x}(X)} and H y = f y ( Y ) {\displaystyle H_{y}=f_{y}(Y)} are the network outputs, T = H x T H x + r x I {\displaystyle T=H_{x}^{T}H_{x}+r_{x}I} , S = H y T H y + r y I {\displaystyle S=H_{y}^{T}H_{y}+r_{y}I} and r x , r y {\displaystyle r_{x},r_{y}} are the regularization parameters. DCCA overcomes the limitations of linear CCA and kernel CCA by learning complex nonlinear relationships while maintaining computational efficiency for large datasets through mini-batch optimization. === Graph-based methods === Graph-based approaches for multimodal representation learning leverage graph structure to model relationships between entities across different modalities. These methods typically represent each modality as a graph and then learn embedding that preserve cross-modal similarities, enabling more effective joint representation of heterogeneous data. One such method is cross-modal graph neural networks (CMGNNs) that extend traditional graph neural networks (GNNs) to handle data from multiple modalities by constructing graphs that capture both intra-modal and inter-modal relationships. These networks model interactions across modalities by representing them as nodes and their relationships as edges. Other graph-based methods include Probabilistic Graphical Models (PGMs) such as deep belief networks (DBN) and deep Boltzmann machines (DBM). These models can learn a joint representation across modalities, for instance, a multimodal DBN achieves this by adding a shared restricted Boltzmann Machine (RBM) hidden layer on top of modality-specific DBNs. Additionally, the structure of data in some domains like Human-Computer Interaction (HCI), such as the view hierarchy of app screens, can potentially be modeled using graph-like structures. The field of graph representation learning is also relevant, with ongoing progress in developing evaluation benchmarks. === Diffusion maps === Another set of methods relevant to multimodal representation learning are based on diffusion maps and their extensions to handle multiple modalities. ==== Multi-view diffusion maps ==== Multi-view diffusion maps address the challenge of achieving multi-view dimensionality reduction by effectively utilizing the availability of multiple views to extract a coherent low-dimensional representation of the data. The core idea is to exploit both the intrinsic relations within each view and the mutual relations between the different views, defining a cross-view model where a random walk process implicitly hops between objects in different views. A multi-view kernel matrix is constructed by combining these relations, defining a cross-view diffusion process and associ

    Read more →
  • Wumpus world

    Wumpus world

    Wumpus world is a simple world use in artificial intelligence for which to represent knowledge and to reason. Wumpus world was introduced by Michael Genesereth, and is discussed in the Russell-Norvig Artificial Intelligence book Artificial Intelligence: A Modern Approach. Wumpus World is loosely inspired by the 1972 video game Hunt the Wumpus. == Problem description == In Artificial Intelligence: A Modern Approach, the wumpus world features a 4x4 grid, containing a monster called a wumpus, multiple bottomless pits and hidden gold. The agent starts at (1,1) and has to find the gold and return to the starting position. The agent loses 1 point for every move and gains 1000 points for bringing the gold to the starting position. The agent can sense pits by a breeze, stench indicates a wumpus, and sparkle indicates gold. The wumpus can be killed by an arrow but costs 10 points.

    Read more →
  • Artificial intelligence controversies

    Artificial intelligence controversies

    The controversies surrounding artificial intelligence encompass a broad range of public, academic, and political debates regarding the societal effects of artificial intelligence (AI). These debates intensified particularly in the late 2010s and 2020s, coinciding with an accelerated period of development known as the AI boom. While advocates emphasize the technology's potential to solve complex problems and enhance human quality of life, detractors highlight a wide array of dangers and challenges. These include concerns over ethics, plagiarism and theft, fraud, safety and alignment, environmental impacts, technological unemployment, and the spread of misinformation. It also covers severe future or theoretical challenges, such as the emergence of artificial superintelligence and existential risks. == 2016 == === Microsoft Tay chatbot (2016) === On March 23, 2016, Microsoft released Tay, a chatbot designed to mimic the language patterns of a 19-year-old American girl and learn from interactions with Twitter users. Soon after its launch, Tay began posting racist, sexist, and otherwise inflammatory tweets after Twitter users deliberately taught it offensive phrases and exploited its "repeat after me" capability. Examples of controversial outputs included Holocaust denial and calls for genocide using racial slurs. Within 16 hours of its release, Microsoft suspended the Twitter account, deleted the offensive tweets, and stated that Tay had suffered from a "coordinated attack by a subset of people" that "exploited a vulnerability." Tay was briefly and accidentally re-released on March 30 during testing, after which it was permanently shut down. Microsoft CEO Satya Nadella later stated that Tay "has had a great influence on how Microsoft is approaching AI" and taught the company the importance of taking accountability. == 2022 == === Voiceverse NFT plagiarism scandal (2022) === On January 14, 2022, voice actor Troy Baker announced a partnership with Voiceverse, a blockchain-based company that marketed proprietary AI voice cloning technology as non-fungible tokens (NFT), triggering immediate backlash over environmental concerns, fears that AI could displace human voice actors, and concerns about fraud. Later that same day, the pseudonymous creator of 15.ai—a free, non-commercial AI voice synthesis research project—revealed through server logs that Voiceverse had used 15.ai to generate voice samples, pitch-shifted them to make them unrecognizable, and falsely marketed them as their own proprietary technology before selling them as NFTs; the developer of 15.ai had previously stated that they had no interest in incorporating NFTs into their work. Voiceverse confessed within an hour and stated that their marketing team had used 15.ai without attribution while rushing to create a demo. News publications and AI watchdog groups universally characterized the incident as theft stemming from generative artificial intelligence. === Théâtre D'opéra Spatial (2022) === On August 29, 2022, Jason Michael Allen won first place in the "emerging artist" (non-professional) division of the "Digital Arts/Digitally-Manipulated Photography" category of the Colorado State Fair's fine arts competition with Théâtre D'opéra Spatial, a digital artwork created using the AI image generator Midjourney, Adobe Photoshop, and AI upscaling tools, becoming one of the first images made using generative AI to win such a prize. Allen disclosed his use of Midjourney when submitting, though the judges did not know it was an AI tool but stated they would have awarded him first place regardless. While there was little contention about the image at the fair, reactions to the win on social media were negative. On September 5, 2023, the United States Copyright Office ruled that the work was not eligible for copyright protection as the human creative input was de minimis and that copyright rules "exclude works produced by non-humans." == 2023 == === Statements on AI risk (2023) === On March 22, 2023, the Future of Life Institute published an open letter calling on "all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4", citing risks such as AI-generated propaganda, extreme automation of jobs, human obsolescence, and a society-wide loss of control. The letter, published a week after the release of OpenAI's GPT-4, asserted that current large language models were "becoming human-competitive at general tasks". It received more than 30,000 signatures, including academic AI researchers and industry CEOs such as Yoshua Bengio, Stuart Russell, Elon Musk, Steve Wozniak and Yuval Noah Harari. The letter was criticized for diverting attention from more immediate societal risks such as algorithmic biases, with Timnit Gebru and others arguing that it amplified "some futuristic, dystopian sci-fi scenario" instead of current problems with AI. On May 30, 2023, the Center for AI Safety released a one-sentence statement signed by hundreds of artificial intelligence experts and other notable figures: "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war." Signatories included Turing laureates Geoffrey Hinton and Yoshua Bengio, as well as the scientific and executive leaders of several major AI companies, including Sam Altman, Demis Hassabis, and Bill Gates. The statement prompted responses from political leaders, including UK Prime Minister Rishi Sunak, who retweeted it with a statement that the UK government would look carefully into it, and White House Press Secretary Karine Jean-Pierre, who commented that AI "is one of the most powerful technologies that we see currently in our time." Skeptics, including from Human Rights Watch, argued that scientists should focus on known risks of AI instead of speculative future risks. === Removal of Sam Altman from OpenAI (2023) === On November 17, 2023, OpenAI's board of directors ousted co-founder and chief executive Sam Altman, stating that "the board no longer has confidence in his ability to continue leading OpenAI." The removal was precipitated by employee concerns about his handling of artificial intelligence safety and allegations of abusive behavior. Altman was reinstated on November 22 after pressure from employees and investors, including a letter signed by 745 of OpenAI's 770 employees threatening mass resignations if the board did not resign. The removal and subsequent reinstatement caused widespread reactions, including Microsoft's stock falling nearly three percent following the initial announcement and then rising over two percent to an all-time high after Altman was hired to lead a Microsoft AI research team before his reinstatement. The incident also prompted investigations from the Competition and Markets Authority and the Federal Trade Commission into Microsoft's relationship with OpenAI. == 2024 == === Taylor Swift deepfake pornography controversy (2024) === In late January 2024, sexually explicit AI-generated deepfake images of Taylor Swift were proliferated on X, with one post reported to have been seen over 47 million times before its removal. Disinformation research firm Graphika traced the images back to 4chan, while members of a Telegram group had discussed ways to circumvent censorship safeguards of AI image generators to create pornographic images of celebrities. The images prompted responses from anti-sexual assault advocacy groups, US politicians, and Swifties. Microsoft CEO Satya Nadella called the incident "alarming and terrible." X briefly blocked searches of Swift's name on January 27, 2024, and Microsoft enhanced its text-to-image model safeguards to prevent future abuse. On January 30, US senators Dick Durbin, Lindsey Graham, Amy Klobuchar, and Josh Hawley introduced a bipartisan bill that would allow victims to sue individuals who produced or possessed "digital forgeries" with intent to distribute, or those who received the material knowing it was made without consent. === Google Gemini image generation controversy (2024) === In February 2024, social media users reported that Google's Gemini chatbot was generating images that featured people of color and women in historically inaccurate contexts—such as Vikings, Nazi soldiers, and the Founding Fathers—and refusing prompts to generate images of white people. The images were derided on social media, including by conservatives who cited them as evidence of Google's "wokeness", and criticized by Elon Musk, who denounced Google's products as biased and racist. In response, Google paused Gemini's ability to generate images of people. Google executive Prabhakar Raghavan released a statement explaining that Gemini had "overcompensate[d]" in its efforts to strive for diversity and acknowledging that the images were "embarrassing and wrong". Google CEO Sundar Pichai called the incident offensive and unacceptable in an internal memo, promising struc

    Read more →
  • Smart object

    Smart object

    A smart object is an object that enhances the interaction with not only people but also with other smart objects. Also known as smart connected products or smart connected things (SCoT), they are products, assets and other things embedded with processors, sensors, software and connectivity that allow data to be exchanged between the product and its environment, manufacturer, operator/user, and other products and systems. Connectivity also enables some capabilities of the product to exist outside the physical device, in what is known as the product cloud. The data collected from these products can be then analysed to inform decision-making, enable operational efficiencies and continuously improve the performance of the product. It can not only refer to interaction with physical world objects but also to interaction with virtual (computing environment) objects. A smart physical object may be created either as an artifact or manufactured product or by embedding electronic tags such as RFID tags or sensors into non-smart physical objects. Smart virtual objects are created as software objects that are intrinsic when creating and operating a virtual or cyber world simulation or game. The concept of a smart object has several origins and uses, see History. There are also several overlapping terms, see also smart device, tangible object or tangible user interface and Thing as in the Internet of things. == History == In the early 1990s, Mark Weiser, from whom the term ubiquitous computing originated, referred to a vision "When almost every object either contains a computer or can have a tab attached to it, obtaining information will be trivial", Although Weiser did not specifically refer to an object as being smart, his early work did imply that smart physical objects are smart in the sense that they act as digital information sources. Hiroshi Ishii and Brygg Ullmer refer to tangible objects in terms of tangibles bits or tangible user interfaces that enable users to "grasp & manipulate" bits in the center of users' attention by coupling the bits with everyday physical objects and architectural surfaces. The smart object concept was introduced by Marcelo Kallman and Daniel Thalmann as an object that can describe its own possible interactions. The main focus here is to model interactions of smart virtual objects with virtual humans, agents, in virtual worlds. The opposite approach to smart objects is 'plain' objects that do not provide this information. The additional information provided by this concept enables far more general interaction schemes, and can greatly simplify the planner of an artificial intelligence agent. In contrast to smart virtual objects used in virtual worlds, Lev Manovich focuses on physical space filled with electronic and visual information. Here, "smart objects" are described as "objects connected to the Net; objects that can sense their users and display smart behaviour". More recently in the early 2010s, smart objects are being proposed as a key enabler for the vision of the Internet of things. The combination of the Internet and emerging technologies such as near field communications, real-time localization, and embedded sensors enables everyday objects to be transformed into smart objects that can understand and react to their environment. Such objects are building blocks for the Internet of things and enable novel computing applications. In 2018, one of the world's first smart houses was built in Klaukkala, Finland in the form of a five-floor apartment block, using the Kone Residential Flow solution created by KONE, allowing even a smartphone to act as a home key. == Characteristics == Although we can view interaction with physical smart object in the physical world as distinct from interaction with virtual smart objects in a virtual simulated world, these can be related. Poslad considers the progression of: how humans use models of smart objects situated in the physical world to enhance human to physical world interaction; versus how smart physical objects situated in the physical world can model human interaction in order to lessen the need for human to physical world interaction; versus how virtual smart objects by modelling both physical world objects and modelling humans as objects and their subsequent interactions can form a predominantly smart virtual object environment. === Smart physical objects === The concept smart for a smart physical object simply means that it is active, digital, networked, can operate to some extent autonomously, is reconfigurable and has local control of the resources it needs such as energy, data storage, etc. Note, a smart object does not necessarily need to be intelligent as in exhibiting a strong essence of artificial intelligence—although it can be designed to also be intelligent. Physical world smart objects can be described in terms of three properties: Awareness: is a smart object's ability to understand (that is, sense, interpret, and react to) events and human activities occurring in the physical world. Representation: refers to a smart object's application and programming model—in particular, programming abstractions. Interaction: denotes the object's ability to converse with the user in terms of input, output, control, and feedback. Based upon these properties, these have been classified into three types: Activity-Aware Smart Objects: Are objects that can record information about work activities and its own use. Policy-Aware Smart Objects: Are objects that are activity-aware Objects can interpret events and activities with respect to predefined organizational policies. Process-Aware Smart Objects: Processes play a fundamental role in industrial work management and operation. A process is a collection of related activities or tasks that are ordered according to their position in time and space. === Smart virtual objects === For the virtual object in a virtual world case, an object is called smart when it has the ability to describe its possible interactions. This focuses on constructing a virtual world using only virtual objects that contain their own interaction information. There are four basic elements to constructing such a smart virtual object framework. Object properties: physical properties and a text description Interaction information: position of handles, buttons, grips, and the like Object behavior: different behaviors based on state variables Agent behaviors: description of the behavior an agent should follow when using the object Some versions of smart objects also include animation information in the object information, but this is not considered to be an efficient approach, since this can make objects inappropriately oversized. === Categorization === The terms smart, connected product or smart product can be confusing as it is used to cover a broad range of different products, ranging from smart home appliances (e.g., smart bathroom scales or smart light bulbs) to smart cars (e.g., Tesla). While these products share certain similarities, they often differ substantially in their capabilities. Raff et al. developed a conceptual framework that distinguishes different smart products based on their capabilities, which features 4 types of smart product archetypes (in ascending order of "smartness"). Digital Connected Responsive Intelligent == Advantages == Smart, connected products have three primary components: Physical – made up of the product's mechanical and electrical parts. Smart – made up of sensors, microprocessors, data storage, controls, software, and an embedded operating system with enhanced user interface. Connectivity – made up of ports, antennae, and protocols enabling wired/wireless connections that serve two purposes, it allows data to be exchanged with the product and enables some functions of the product to exist outside the physical device. Each component expands the capabilities of one another resulting in "a virtuous cycle of value improvement". First, the smart components of a product amplify the value and capabilities of the physical components. Then, connectivity amplifies the value and capabilities of the smart components. These improvements include: Monitoring of the product's conditions, its external environment, and its operations and usage. Control of various product functions to better respond to changes in its environment, as well as to personalize the user experience. Optimization of the product's overall operations based on actual performance data, and reduction of downtimes through predictive maintenance and remote service. Autonomous product operation, including learning from their environment, adapting to users' preferences and self-diagnosing and service. === The Internet of things (IoT) === The Internet of things is the network of physical objects that contain embedded technology to communicate and sense or interact with their internal states or the external environment. The phrase "Internet of things" reflects the gro

    Read more →
  • List of color palettes

    List of color palettes

    The following is a list that contains color palettes for notable computer graphics, terminals and video game consoles. Only a simulated image using a palette and its name are given. Main articles are linked from the name of each palette, test charts, sample colours, simulated images, and further technical details (including references). During older eras of computing, manufacturers developed many different display systems often in a competitive, non-collaborative basis (with a few exceptions in the VESA consortium), creating many proprietary, non-standard different instances of display hardware. Often, as with early personal and home computers, a given machine employed its unique display subsystem, also with its unique color palette. Furthermore, software developers had made use of the color abilities of distinct display systems in many different ways. The result is that there is no single common standard nomenclature or classification taxonomy which can encompass every computer color palette. In order to organize the material, color palettes have been grouped following certain criteria. First, generic monochrome and full RGB repertories common to various computer display systems are listed. Then, usual color repertories used for display systems that employ indexed color techniques. And finally, specific manufacturers' color palettes implemented in many representative early personal computers and video game consoles of various brands. The list for personal computer palettes is split into two categories: 8-bit and 16-bit machines. This is not intended as a true strict categorization of such machines, because mixed architectures also exist (16-bit processors with an 8-bit data bus or 32-bit processors with a 16-bit data bus, among others). The distinction is based more on broad 8-bit and 16-bit computer ages or generations (around 1975–1985 and 1985–1995, respectively) and their associated state of the art in color display capabilities. The following is the common color test chart and sample image used to render each palette in this list: See further details in the summary paragraph of the corresponding article. == List of monochrome and RGB palettes == In this article, the term monochrome palette means a set of intensities for a monochrome display, and the term RGB palette is defined as the complete set of combinations a given RGB display can offer by mixing all the possible intensities of the red, green, and blue primaries available in its hardware. These are generic complete repertories of colors to produce black and white and RGB color pictures by the display hardware, not necessarily the total number of such colors that can be simultaneously displayed in a given text or graphic mode of any machine. RGB is the most common method to produce colors for displays; so these complete RGB color repertories have every possible combination of R-G-B triplets within any given maximum number of levels per component. For specific hardware and different methods to produce colors than RGB, see the List of computer hardware palettes and the List of video game consoles sections. For various software arrangements and sorts of colors, including other possible full RGB arrangements within 8-bit depth displays, see the List of software palettes section. === Monochrome palettes === These palettes only have shades of gray. === Dichrome palettes === Each permuted pair of red, green, and blue (16-bit color palette, with 65,536 colors). For example, "additive red green" has zero blue and "subtractive red green" has full blue. === Regular RGB palettes === These full RGB palettes employ the same number of bits to store the relative intensity for the red, green and blue components of every image's pixel color. Thus, they have the same number of levels per channel and the total number of possible colors is always the cube of a power of two. It should be understood that 'when developed' many of these formats were directly related to the size of some host computers 'natural word length' in bytes—the amount of memory in bits held by a single memory address such that the CPU can grab or put it in one operation. === Non-regular RGB palettes === These are also RGB palettes, in the sense defined above (except for 4-bit RGBI, which has an intensity bit that affects all channels at once), but either they do not have the same number of levels for each primary channel, or the numbers are not powers of two, so are not represented as separate bit fields. All of these have been used in popular personal computers. == List of software palettes == Systems that use a 4-bit or 8-bit pixel depth can display up to 16 or 256 colors simultaneously. Many personal computers in the later 1980s and early 1990s displayed at most 256 different colors, freely selected by software (either by the user or by a program) from their wider hardware's color palette. Usual selections of colors in limited subsets (generally 16 or 256) of the full palette includes some RGB level arrangements commonly used with the 8 bpp palettes as master palettes or universal palettes (i.e., palettes for multipurpose uses). These are some representative software palettes, but any selection can be made in such types of systems. === System specific palettes === These are selections of colors officially employed as system palettes in some popular operating systems for personal computers that feature 8-bit displays. === RGB arrangements === These are selections of colors based on evenly ordered RGB levels, mainly used as master palettes to display any kind of image within the limitations of the 8-bit pixel depth. === Other common uses of software palettes === == List of computer hardware palettes == In old personal computers and terminals that offered color displays, some color palettes were chosen algorithmically to provide the most diverse set of colors for a given palette size, and others were chosen to assure the availability of certain colors. In many early home computers, especially when the palette choices were determined at the hardware level by resistor combinations, the palette was determined by the manufacturer. Many early models output composite video colors. When seen on TV devices, the perception of the colors may not correspond with the value levels for the color values employed (most noticeable with NTSC TV color system). For current RGB display systems for PCs (Super VGA, etc.), see the 16-bit RGB and 24-bit RGB for High Color (thousands) and True Color (millions of colors) modes. For video game consoles, see the List of video game consoles section. For every model, their main different graphical color modes are listed based exclusively in the way they handle colors on screen, not all their different screen modes. The list is organized roughly historically by video hardware, not by branch. They are listed according to the original model of each system, which means that extended versions, clones, and compatibles also support the original palette. === Terminals and 8-bit machines === === 16-bit machines === === Video game console palettes === Color palettes of some of the most popular video game consoles. The criteria are the same as those of the List of computer hardware palettes section.

    Read more →
  • Enterprise cognitive system

    Enterprise cognitive system

    Enterprise cognitive systems (ECS) are part of a broader shift in computing, from a programmatic to a probabilistic approach, called cognitive computing. An Enterprise Cognitive System makes a new class of complex decision support problems computable, where the business context is ambiguous, multi-faceted, and fast-evolving, and what to do in such a situation is usually assessed today by the business user. An ECS is designed to synthesize a business context and link it to the desired outcome. It recommends evidence-based actions to help the end-user achieve the desired outcome. It does so by finding past situations similar to the current situation, and extracting the repeated actions that best influence the desired outcome. While general-purpose cognitive systems can be used for different outputs, prescriptive, suggestive, instructive, or simply entertaining, an enterprise cognitive system is focused on action, not insight, to help in assessing what to do in a complex situation. == Key characteristics == ECS have to be: Adaptive: They must learn as information changes, and as goals and requirements evolve. They must resolve ambiguity and tolerate unpredictability. They must be engineered to feed on dynamic data in real time, or near real time. In the Enterprise, near-real time learning from data requires an agile information federation approach to ingest incremental data updates as they occur, and an unsupervised learning approach to ensure that new best practice is leveraged across the organization in a timely manner. Interactive: They must interact easily with users so that those users can define their needs comfortably. They may also interact with other processors, devices, and Cloud services, as well as with people. In the Enterprise, interactions are controlled via existing workflows and UIs. Therefore, embedding best practices directly into these existing interfaces, in the context of a specific step, is critical to ensure maximum end-user adoption. Iterative and stateful: They must aid in defining a problem by asking questions or finding additional source input if a problem statement is ambiguous or incomplete. They must “remember” previous interactions in a process and return information that is suitable for the specific application at that point in time. In the Enterprise, business context is often structured by a business process, and therefore sufficiently data-rich to make relevant recommendations without significant iterations from the end-user. A stateful memory of overall interactions across communication channels is critical for understanding of context, as a static profile will not capture intent and outcome potential the way behavior does. Contextual: They must understand, identify, and extract contextual elements such as meaning, syntax, time, location, appropriate domain, regulations, user's profile, process, task and goal. They may draw on multiple sources of information, including both structured and unstructured digital information, as well as sensory inputs (visual, gestural, auditory, or sensor-provided). In the Enterprise, Context is fragmented and must be aggregated across data types, sources, and locations. In most business environments, such data is captured in existing enterprise information systems, and the effort is linked to quickly source and unify such information. It is rare to have to directly process sensor, audio or visual data in real-time as direct input into the enterprise cognitive system. Instead, these data types are captured by Enterprise Applications and pre-processed into a binary or text format prior to consumption by the System. == Business applications powered by an ECS == Bottlenose – trends and brands monitoring Cybereason – security threat monitoring Dataminr – social media monitoring

    Read more →
  • Embodied agent

    Embodied agent

    In artificial intelligence, an embodied agent, also sometimes referred to as an interface agent, is an intelligent agent that interacts with the environment through a physical body within that environment. Agents that are represented graphically with a body, for example a human or a cartoon animal, are also called embodied agents, although they have only virtual, not physical, embodiment. A branch of artificial intelligence focuses on empowering such agents to interact autonomously with human beings and the environment. Mobile robots are one example of physically embodied agents; Ananova and Microsoft Agent are examples of graphically embodied agents. Embodied conversational agents are embodied agents (usually with a graphical front-end as opposed to a robotic body) that are capable of engaging in conversation with one another and with humans employing the same verbal and nonverbal means that humans do (such as gesture, facial expression, and so forth). == Embodied conversational agents == Embodied conversational agents are a form of intelligent user interface. Graphically embodied agents aim to unite gesture, facial expression and speech to enable face-to-face communication with users, providing a powerful means of human-computer interaction. == Advantages == Face-to-face communication allows communication protocols that give a much richer communication channel than other means of communicating. It enables pragmatic communication acts such as conversational turn-taking, facial expression of emotions, information structure and emphasis, visualization and iconic gestures, and orientation in a three-dimensional environment. This communication takes place through both verbal and non-verbal channels such as gaze, gesture, spoken intonation and body posture. Research has found that users prefer a non-verbal visual indication of an embodied system's internal state to a verbal indication, demonstrating the value of additional non-verbal communication channels. As well as this, the face-to-face communication involved in interacting with an embodied agent can be conducted alongside another task without distracting the human participants, instead improving the enjoyment of such an interaction. Furthermore, the use of an embodied presentation agent results in improved recall of the presented information. Embodied agents also provide a social dimension to the interaction. Humans willingly ascribe social awareness to computers, and thus interaction with embodied agents follows social conventions, similar to human to human interactions. This social interaction both raises the believably and perceived trustworthiness of agents, and increases the user's engagement with the system. Rickenberg and Reeves found that the presence of an embodied agent on a website increased the level of user trust in that website, but also increased users' anxiety and affected their performance, as if they were being watched by a real human. Another effect of the social aspect of agents is that presentations given by an embodied agent are perceived as being more entertaining and less difficult than similar presentations given without an agent. Research shows that perceived enjoyment, followed by perceived usefulness and ease of use, is the major factor influencing user adoption of embodied agents. A study in January 2004 by Byron Reeves at Stanford demonstrated how digital characters could "enhance online experiences" through explaining how virtual characters essentially add a sense of familiarity to the user experience and make it more approachable. This increase in likability in turn helps make the products better, which benefits both the end users and those creating the product. === Applications === The rich style of communication that characterizes human conversation makes conversational interaction with embodied conversational agents ideal for many non-traditional interaction tasks. A familiar application of graphically embodied agents is computer games; embodied agents are ideal for this setting because the richer communication style makes interacting with the agent enjoyable. Embodied conversational agents have also been used in virtual training environments, portable personal navigation guides, interactive fiction and storytelling systems, interactive online characters and automated presenters and commentators. Major virtual assistants like Siri, Amazon Alexa and Google Assistant do not come with any visual embodied representation, which is believed to limit the sense of human presence by users. The U.S. Department of Defense utilizes a software agent called SGT STAR on U.S. Army-run Web sites and Web applications for site navigation, recruitment and propaganda purposes. Sgt. Star is run by the Army Marketing and Research Group, a division operated directly from The Pentagon. Sgt. Star is based upon the ActiveSentry technology developed by Next IT, a Washington-based information technology services company. Other such bots in the Sgt. Star "family" are utilized by the Federal Bureau of Investigation and the Central Intelligence Agency for intelligence gathering purposes.

    Read more →